psql tab completion for \h with IMPORT FOREIGN SCHEMA

2021-03-17 Thread Michael Paquier
Hi all,

Well, as $subject says, I just found it confusing that \h does not
complete this command, the only one using IMPORT as a keyword,
so I'd like to do the following:
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1493,7 +1493,7 @@ psql_completion(const char *text, int start, int end)
 "ABORT", "ALTER", "ANALYZE", "BEGIN", "CALL", "CHECKPOINT", "CLOSE", "CLUSTER",
 "COMMENT", "COMMIT", "COPY", "CREATE", "DEALLOCATE", "DECLARE",
 "DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN",
-"FETCH", "GRANT", "IMPORT", "INSERT", "LISTEN", "LOAD", "LOCK",
+"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT", "LISTEN", "LOAD", "LOCK",
 "MOVE", "NOTIFY", "PREPARE",
 "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
 "RESET", "REVOKE", "ROLLBACK",

That's not the patch of the year.

Thanks,
--
Michael
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index ecdb8d752b..a9743f5c63 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1493,7 +1493,7 @@ psql_completion(const char *text, int start, int end)
 		"ABORT", "ALTER", "ANALYZE", "BEGIN", "CALL", "CHECKPOINT", "CLOSE", "CLUSTER",
 		"COMMENT", "COMMIT", "COPY", "CREATE", "DEALLOCATE", "DECLARE",
 		"DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN",
-		"FETCH", "GRANT", "IMPORT", "INSERT", "LISTEN", "LOAD", "LOCK",
+		"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT", "LISTEN", "LOAD", "LOCK",
 		"MOVE", "NOTIFY", "PREPARE",
 		"REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
 		"RESET", "REVOKE", "ROLLBACK",




Re: fdatasync performance problem with large number of DB files

2021-03-17 Thread Fujii Masao




On 2021/03/17 12:45, Thomas Munro wrote:

On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao  wrote:

On 2021/03/16 8:15, Thomas Munro wrote:

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea.  We already have a
running-with-scissors mode you could use for that: fsync=off.


I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.


Hopefully syncfs() will return quickly in that case, without doing any work?


Yes, in Linux.




They can skip that sync with fsync=off. But if they just want to skip only that
startup sync and have the subsequent recovery (or standby server) run with
fsync=on, they would need to shut down the server after that startup sync
finishes, enable fsync, and restart the server. In this case, since the server
is restarted with state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
would not be performed. This procedure is tricky, so IMO supporting
sync_after_crash=none would be helpful and simple for this case.


I still do not like this footgun :-)  However, perhaps I am being
overly dogmatic.  Consider the change in d8179b00, which decided that
I/O errors in this phase should be reported at LOG level rather than
ERROR.  In contrast, my "sync_after_crash=wal" mode (which I need to
rebase over this) will PANIC in this case, because any syncing will be
handled through the usual checkpoint codepaths.

Do you think it would be OK to commit this feature with just "fsync"
and "syncfs", and then to continue to consider adding "none" as a
possible separate commit?


+1. The "syncfs" feature is useful whether or not we also support a "none" mode.
It's a good idea to commit "syncfs" first.
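
For reference, what the proposed syncfs mode boils down to on Linux is roughly
the following (an illustrative sketch only, not the actual patch; the helper
name and error handling are made up):

    #define _GNU_SOURCE             /* syncfs() is a GNU extension */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Sync the whole file system containing "path" with one syscall. */
    static void
    sync_filesystem_of(const char *path)
    {
        int         fd = open(path, O_RDONLY);

        if (fd < 0)
        {
            perror("open");
            return;
        }
        if (syncfs(fd) < 0)
            perror("syncfs");   /* expected to be cheap if nothing is dirty */
        close(fd);
    }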

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: a verbose option for autovacuum

2021-03-17 Thread Michael Paquier
On Wed, Mar 17, 2021 at 08:50:26AM -0300, Euler Taveira wrote:
> Since commit 5f8727f5a6, this patch doesn't apply anymore. Fortunately, it is
> just a small hunk. I reviewed this patch and it looks good to me. There is just
> a small issue (double space after 'if') that I fixed in the attached version.

No major objections to what you are proposing here.

>   /* and log the action if appropriate */
> - if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
> + if (IsAutoVacuumWorkerProcess())
>   {
> - TimestampTz endtime = GetCurrentTimestamp();
> + TimestampTz endtime = 0;
> + int i;
>  
> - if (params->log_min_duration == 0 ||
> - TimestampDifferenceExceeds(starttime, endtime,
> -        params->log_min_duration))
> + if (params->log_min_duration >= 0)
> + endtime = GetCurrentTimestamp();
> +
> + if (endtime > 0 &&
> + (params->log_min_duration == 0 ||
> +  TimestampDifferenceExceeds(starttime, endtime,

Why is there any need to actually change this part?  If I am following
the patch correctly, the reason you are doing things this way is
to always free the set of N statistics for autovacuum.  However,
we could make that much simpler, and your patch is already half-way
there by adding the stats of the N indexes to LVRelStats.  Here
is the idea:
- Allocation of the N items for IndexBulkDeleteResult at the beginning
of heap_vacuum_rel().  It seems to me that we are going to need the N
index names within LVRelStats, to be able to still call
vac_close_indexes() *before* logging the stats.
- No need to pass down indstats for the two callers of
lazy_vacuum_all_indexes().
- Clean up of vacrelstats once heap_vacuum_rel() is done with it.
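
For the first item, a minimal sketch of the allocation (the vacrelstats field
names are assumptions that the patch would have to add; IndexBulkDeleteResult
and palloc0 are the existing primitives):

    /* In heap_vacuum_rel(), before the heap scan: one stats slot per index. */
    vacrelstats->nindexes = nindexes;
    vacrelstats->indstats = (IndexBulkDeleteResult **)
        palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
    /* Each index's ambulkdelete()/amvacuumcleanup() result fills its slot,
     * and everything is freed when vacrelstats itself is cleaned up. */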
--
Michael




Re: New IndexAM API controlling index vacuum strategies

2021-03-17 Thread Peter Geoghegan
On Wed, Mar 17, 2021 at 11:23 PM Masahiko Sawada  wrote:
> Attached the updated patch that can be applied on top of your v3 patches.

Some feedback on this:

* I think that we can afford to be very aggressive here, because we're
checking dynamically. And we're concerned about extremes only. So an
age as high as 1 billion transactions seems like a better approach.
What do you think?

* I think that you need to remember that we have decided not to do any
more index vacuuming, rather than calling
check_index_cleanup_xid_limit() each time -- maybe store that
information in a state variable.

This seems like a good idea because we should try to avoid changing
back to index vacuuming having decided to skip it once. Also, we need
to refer to this in lazy_scan_heap(), so that we avoid index cleanup
having also avoided index vacuuming. This is like the INDEX_CLEANUP =
off case, which is also only for emergencies. It is not like the
SKIP_VACUUM_PAGES_RATIO case, which is just an optimization.
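
A rough sketch of what that state variable could look like (all names here are
hypothetical, purely to illustrate the idea):

    /* hypothetical field in vacuumlazy.c's per-relation state struct */
    bool        skipping_index_vacuuming;   /* emergency mode engaged? */

    /* ... and wherever the XID-age check is made: */
    if (!vacrelstats->skipping_index_vacuuming && xid_age_is_dangerous)
        vacrelstats->skipping_index_vacuuming = true;   /* sticky: never unset */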

Thanks
-- 
Peter Geoghegan




Re: [PATCH] Identify LWLocks in tracepoints

2021-03-17 Thread Craig Ringer
On Thu, 11 Mar 2021 at 15:57, Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 10.03.21 06:38, Craig Ringer wrote:
> > On Wed, 3 Mar 2021 at 20:50, David Steele  > > wrote:
> >
> > On 1/22/21 6:02 AM, Peter Eisentraut wrote:
> >
> > This patch set no longer applies:
> > http://cfbot.cputube.org/patch_32_2927.log
> > .
> >
> > Can we get a rebase? Also marked Waiting on Author.
> >
> >
> > Rebased as requested.
>
> In patch 0001, why was the TRACE_POSTGRESQL_LWLOCK_RELEASE() call moved?
>   Is there some correctness issue?  If so, we should explain that (at
> least in the commit message, or as a separate patch).
>

If you want I can split it out, or drop that change. I thought it was
sufficiently inconsequential, but you're right to check.

The current tracepoint TRACE_POSTGRESQL_LWLOCK_RELEASE really means
"releaseD". It's appropriate to emit this as soon as the lock could be
acquired by anything else. By deferring it until we'd processed the
waitlist and woken other backends the window during which the lock was
reported as "held" was longer than it truly was, and it was easy to see one
backend acquire the lock while another still appeared to hold it.

It'd possibly make more sense to have a separate
TRACE_POSTGRESQL_LWLOCK_RELEASING just before the `pg_atomic_sub_fetch_u32`
call. But I didn't want to spam the tracepoints too hard, and there's
always going to be some degree of overlap because tracing tools cannot
intercept and act during the atomic swap, so they'll always see a slightly
premature or slightly delayed release. This window should be as short as
possible though, hence moving the tracepoint.

Side note:

The main reason I didn't want to add more tracepoints than were strictly
necessary is that Pg doesn't enable the systemtap semaphores feature, so
right now we do a T_NAME(lock) evaluation each time we pass a tracepoint if
--enable-dtrace is compiled in, whether or not anything is tracing. This
was fine on pg11 where it was just:

#define T_NAME(lock) \
(LWLockTrancheArray[(lock)->tranche])

but since pg13 it instead expands to

GetLWTrancheName((lock)->tranche)

where GetLWTrancheName isn't especially trivial. We'll run that function
every single time we pass any of these tracepoints and then discard the
result, which is ... not ideal. That applies so long as Pg is compiled with
--enable-dtrace. I've been meaning to look at enabling the systemtap
semaphores feature in our build so these can be wrapped in
unlikely(TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED()) guards, but I wanted to
wrap this patch set up first as there are some complexities around enabling
the semaphores feature.
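
For example, a guarded call site could look roughly like this (a sketch,
assuming the generated _ENABLED() macros become available once the semaphores
feature is enabled):

    /* T_NAME(lock) is only evaluated when a tracer has enabled the probe */
    if (unlikely(TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED()))
        TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));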


Re: New IndexAM API controlling index vacuum strategies

2021-03-17 Thread Masahiko Sawada
On Thu, Mar 18, 2021 at 12:23 PM Peter Geoghegan  wrote:
>
> On Wed, Mar 17, 2021 at 7:16 PM Masahiko Sawada  wrote:
> > Since I was thinking that always skipping index vacuuming on
> > anti-wraparound autovacuum is legitimate, skipping index vacuuming
> > only when we're really close to the point of going into read-only mode
> > seems a bit conservative, but maybe a good start. I've attached a PoC
> > patch to disable index vacuuming if the table's relfrozenxid is too much
> > older than autovacuum_freeze_max_age (more than 1.5x of
> > autovacuum_freeze_max_age).
>
> Most anti-wraparound VACUUMs are really not emergencies, though. So
> treating them as special simply because they're anti-wraparound
> vacuums doesn't seem like the right thing to do. I think that we
> should dynamically decide to do this when (antiwraparound) VACUUM has
> already been running for some time. We need to delay the decision
> until it is almost certainly true that we really have an emergency.

Delaying the decision until two_pass_strategy() is a good idea.

>
> Can you take what you have here, and make the decision dynamic? Delay
> it until we're done with the first heap scan? This will require
> rebasing on top of the patch I posted. And then adding a third patch,
> a little like the second patch -- but not too much like it.

Attached the updated patch that can be applied on top of your v3 patches.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v3-0003-Skip-index-vacuuming-when-there-is-a-risk-of-wrap.patch
Description: Binary data


Re: Detecting File Damage & Inconsistencies

2021-03-17 Thread Craig Ringer
On Mon, 15 Mar 2021 at 21:01, David Steele  wrote:

> On 11/18/20 5:23 AM, Simon Riggs wrote:
> > On Wed, 18 Nov 2020 at 06:42, Craig Ringer
> >  wrote:
> >>
> >> On Fri, Nov 13, 2020 at 7:24 PM Simon Riggs 
> wrote:
> >>>
> >>>
> >>> What I'm proposing is an option to add 16 bytes onto each COMMIT
> >>> record
> >>
> >>
> >> Would it make sense to write this at the time we write a topxid
> assignment to WAL instead?
> >>
> >> Otherwise it won't be accessible to streaming-mode logical decoding.
> >
> > Do you mean extend the xl_xact_assignment record? My understanding is
> > that is not sent in all cases, so not sure what you mean by "instead".
>
> Craig, can you clarify?
>

Right. Or write a separate WAL record when the feature is enabled. But it's
probably sufficient to write it as an optional chunk on xl_xact_assignment
records. We often defer writing them so we can optimise away xacts that
never actually wrote anything, but IIRC we still write one before we write
any WAL that references the xid. That'd be fine, since we don't need the
info any sooner than that during decoding. I'd have to double check that we
write it in all cases and won't get to that too soon, but I'm pretty sure
we do...


Re: fdatasync performance problem with large number of DB files

2021-03-17 Thread Fujii Masao




On 2021/03/17 12:02, Thomas Munro wrote:

On Tue, Mar 16, 2021 at 9:10 PM Fujii Masao  wrote:

Thanks for the patch!

+When set to fsync, which is the default,
+PostgreSQL will recursively open and fsync
+all files in the data directory before crash recovery begins.

Isn't this a bit misleading? This may lead users to think that
such fsync can happen only in the case of crash recovery.


If I insert the following extra sentence after that one, is it better?
  "This applies whenever starting a database cluster that did not shut
down cleanly, including copies created with pg_basebackup."


Yes. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Get memory contexts of an arbitrary backend process

2021-03-17 Thread Fujii Masao




On 2021/03/17 22:24, torikoshia wrote:

I remade the patch and introduced a function
pg_print_backend_memory_contexts(PID) which prints the memory contexts of
the specified PID to elog.


Thanks for the patch!



   =# SELECT pg_print_backend_memory_contexts(450855);

   ** log output **
   2021-03-17 15:21:01.942 JST [450855] LOG:  Printing memory contexts of PID 450855
   2021-03-17 15:21:01.942 JST [450855] LOG:  level: 0 TopMemoryContext: 68720 total in 5 blocks; 16312 free (15 chunks); 52408 used
   2021-03-17 15:21:01.942 JST [450855] LOG:  level: 1 Prepared Queries: 65536 total in 4 blocks; 35088 free (14 chunks); 30448 used
   2021-03-17 15:21:01.942 JST [450855] LOG:  level: 1 pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 chunks); 6784 used
   ..(snip)..
   2021-03-17 15:21:01.942 JST [450855] LOG:  level: 2 CachedPlanSource: 4096 total in 3 blocks; 680 free (0 chunks); 3416 used: PREPARE hoge_200 AS SELECT * FROM pgbench_accounts WHERE aid = 1...
   2021-03-17 15:21:01.942 JST [450855] LOG:  level: 3 CachedPlanQuery: 4096 total in 3 blocks; 464 free (0 chunks); 3632 used
   ..(snip)..
   2021-03-17 15:21:01.945 JST [450855] LOG:  level: 1 Timezones: 104128 total in 2 blocks; 2584 free (0 chunks); 101544 used
   2021-03-17 15:21:01.945 JST [450855] LOG:  level: 1 ErrorContext: 8192 total in 1 blocks; 7928 free (5 chunks); 264 used
   2021-03-17 15:21:01.945 JST [450855] LOG:  Grand total: 2802080 bytes in 1399 blocks; 480568 free (178 chunks); 2321512 used


As above, the output is almost the same as MemoryContextStatsPrint()
except for how the level is expressed:
MemoryContextStatsPrint() uses indentation, but
pg_print_backend_memory_contexts() writes it as "level: %d".


This format looks better to me.



Since there was discussion that enlarging StringInfo may cause
errors on OOM[1], this patch calls elog for each context.

As with MemoryContextStatsPrint(), each context shows at most 100
children.
I once thought this should be configurable, but something like
pg_print_backend_memory_contexts(PID, num_children) would need to send
'num_children' from the requestor to the dumper, and that seems to
require additional infrastructure.
Creating a new GUC for this seems overkill.
If showing at most 100 children, as MemoryContextStatsPrint() does, is
enough, this hard limit may be acceptable.


Can't this number be passed via shared memory?



Only superusers can call pg_print_backend_memory_contexts().


+   /* Only allow superusers to signal superuser-owned backends. */
+   if (superuser_arg(proc->roleId) && !superuser())

The patch seems to allow even a non-superuser to request printing the memory
contexts if the target backend is owned by a non-superuser. Is this intentional?
I think that only a superuser should be allowed to execute
pg_print_backend_memory_contexts(), whoever owns the target backend,
because that function can produce lots of log messages.
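
Something along these lines, for example (a sketch; the error message wording
is made up):

    if (!superuser())
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("must be a superuser to print memory contexts of other backends")));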



I'm going to add documentation and regression tests.


Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [HACKERS] logical decoding of two-phase transactions

2021-03-17 Thread Amit Kapila
On Wed, Mar 17, 2021 at 11:27 AM Amit Kapila  wrote:
>
> 5. I have modified the comments atop worker.c to explain the design
> and some of the problems clearly. See attached. If you are fine with
> this, please include it in the next version of the patch.
>

I have further expanded these comments to explain the handling of
prepared transactions for multiple subscriptions on the same server
especially when the same prepared transaction operates on tables for
those subscriptions. See attached, this applies atop the patch sent by
me in the last email. I am not sure but I think it might be better to
add something on those lines in user-facing docs. What do you think?

Another comment:
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\" 2PC is %s.",
+ MySubscription->name,
+ MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+ MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+ MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+ "?")));

I don't think this is required at LOG level, maybe at some DEBUG level,
because users can check this in pg_subscription. If we keep this
message, there will be two consecutive messages like the below in the logs for
subscriptions that have the two_pc option enabled, which looks a bit odd.
LOG:  logical replication apply worker for subscription "mysub" has started
LOG:  logical replication apply worker for subscription "mysub" 2PC is ENABLED.

-- 
With Regards,
Amit Kapila.


change_two_phase_desc_2.patch
Description: Binary data


Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.

2021-03-17 Thread Fujii Masao




On 2021/03/18 11:59, kuroda.hay...@fujitsu.com wrote:

Dear Ikeda-san,

I confirmed the new patch and no problem was found. Thanks.
(I'm not a native English speaker, so I cannot check your comments correctly, 
sorry)


One user-visible side effect of this change is that, with the patch, the stats are
cleared when only the stats collector is accidentally killed (with SIGQUIT)
and restarted by the postmaster later. On the other hand, currently the stats are
written in that case and the subsequently-restarted stats collector can use
that stats file. I'm not sure if we need to keep supporting this behavior,
though.

When only the stats collector exits on SIGQUIT, with the patch
FreeWaitEventSet() is also skipped. Is this OK?

-* Loop to process messages until we get SIGQUIT or detect ungraceful
-* death of our parent postmaster.
+* Loop to process messages until we get SIGTERM or SIGQUIT of our parent
+* postmaster.

"SIGTERM or SIGQUIT of our parent postmaster" should be
"SIGTERM, SIGQUIT, or detect ungraceful death of our parent postmaster"?

+SignalHandlerForUnsafeExit(SIGNAL_ARGS)

I don't think SignalHandlerForUnsafeExit is a good name, because that's not an
"unsafe" exit, no? Even after this signal handler is triggered, the server is
still running normally, and a process that exits will be restarted later. What
about SignalHandlerForNonCrashExit or SignalHandlerForNonFatalExit?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pl/pgsql feature request: shorthand for argument and local variable references

2021-03-17 Thread Pavel Stehule
On Wed, 17 Mar 2021 at 23:16, Hannu Krosing  wrote:

> why are you using yet another special syntax for this ?
>
> would it not be better to do something like this:
> CREATE FUNCTION a_reall_long_and_winding_function_name(i int, out o int)
> LANGUAGE plpgsql AS $plpgsql$
> DECLARE
>args function_name_alias
> BEGIN
>   args.o = 2 * args.i
> END;
> $plpgsql$;
>
> or at least do something based on block labels (see end of
> https://www.postgresql.org/docs/13/plpgsql-structure.html )
>
> CREATE FUNCTION somefunc() RETURNS integer AS $$
> << outerblock >>
> DECLARE
> ...
>
> maybe extend this to
>
> CREATE FUNCTION somefunc() RETURNS integer AS $$
> << functionlabel:args >>
> << outerblock >>
> DECLARE
> ...
>
> or replace 'functionlabel' with something less ugly :)
>
> maybe <<< args >>>
>

There are a few main reasons:

a) compile options are available already - so I don't need to invent new
syntax - #OPTION DUMP, #PRINT_STRICT ON, #VARIABLE_CONFLICT ERROR

b) compile options are processed before any other operations, so the
implementation can be very simple

c) the implementation is very simple

I don't like the idea of <> because it mixes label syntax with
some functionality, and <> and <<>> are visually similar

"#routine_label xxx" is maybe not visually nice (like other compile options
in plpgsql), but it has good readability, simplicity, and verbosity

Because we already have this functionality (compile options), and because
this functionality works well (there are no limits from this
syntax), I strongly prefer to use it. We should not invent new syntax when
some already available syntax works well. We can talk about the keyword
ROUTINE_LABEL - maybe #OPTION ROUTINE_LABEL xxx would look better, but I
think using the compile option is a good solution.

Regards

Pavel



> Cheers
> Hannu
>
>
>
> On Wed, Mar 17, 2021 at 5:05 PM Pavel Stehule 
> wrote:
> >
> >
> >
> > On Wed, 17 Mar 2021 at 9:20, Michael Paquier 
> wrote:
> >>
> >> On Wed, Mar 17, 2021 at 02:06:57PM +0800, Julien Rouhaud wrote:
> >> > I also think that there should be a single usable top label,
> otherwise it will
> >> > lead to confusing code and it can be a source of bug.
> >>
> >> Okay, that's fine by me.  Could it be possible to come up with an
> >> approach that does not hijack the namespace list construction and the
> >> lookup logic as much as it does now?  I get it that the patch is done
> >> this way because of the ordering of operations done for the initial ns
> >> list creation and the grammar parsing that adds the routine label on
> >> top of the root one, but I'd like to believe that there are more solid
> >> ways to achieve that, like for example something that decouples the
> >> root label and its alias, or something that associates an alias
> >> directly with its PLpgSQL_nsitem entry?
> >
> >
> > I am checking it now, and I don't see any easy solution. The namespace
> is a one-direction tree - only backward iteration from leaves to roots is
> supported. At the moment, when I try to replace the name of the root ns_item,
> this root ns_item is referenced by the function's arguments (NS_TYPE_VAR
> type). So in any case I need some helper ns_item node that can be a
> persistent target for these nodes. In this case the memory overhead is
> just one ns_item.
> >
> > orig_label <- argument1 <- argument2
> >
> > The patch has to save the original orig_label because it can be
> referenced from argument1 or by the "FOUND" variable. The new graph looks like
> >
> > new_label <- invalidated orig_label <- argument1 <- ...
> >
> > This tree has a different direction than is usual, and then replacing
> the root node is not simple.
> >
> > A second solution (significantly simpler) is an additional pointer in the
> ns_item structure. In almost all cases this pointer will not be used.
> Because ns_item uses a flexible array, a union cannot be used. I
> implemented this in a patch marked as "alias-implementation".
> >
> > What do you think about it?
> >
> > Pavel
> >
> >
> >
> >
> >
> >
> >>
> >> --
> >> Michael
>


Re: pg_amcheck contrib application

2021-03-17 Thread Tom Lane
Mark Dilger  writes:
>> On Mar 16, 2021, at 12:52 PM, Robert Haas  wrote:
>> Since we now know that shutting autovacuum off makes the problem go
>> away, I don't see a reason to commit 0001. We should fix pg_amcheck
>> instead, if, as presently seems to be the case, that's where the
>> problem is.

> If you get unlucky, autovacuum will process one of the tables that the test
> intentionally corrupted, with bad consequences, ultimately causing intermittent
> build farm test failures.

Um, yeah, the test had better shut off autovacuum on any table that
it intentionally corrupts.

regards, tom lane




Re: pl/pgsql feature request: shorthand for argument and local variable references

2021-03-17 Thread Pavel Stehule
On Wed, 17 Mar 2021 at 23:32, Michael Paquier  wrote:

> On Wed, Mar 17, 2021 at 05:04:48PM +0100, Pavel Stehule wrote:
> > This tree has a different direction than is usual, and then replacing the
> > root node is not simple.
>
> Yeah, it is not like we should redesign this whole part just for the
> feature discussed here, and that may impact performance as the current
> list handling is cheap now.
>
> > A second solution (significantly simpler) is an additional pointer in the
> > ns_item structure. In almost all cases this pointer will not be used.
> > Because ns_item uses a flexible array, a union cannot be used. I
> > implemented this in a patch marked as "alias-implementation".
> >
> > What do you think about it?
>
> I am not sure that it is necessary or good to entirely replace the
> root label.  So associating a new field directly with it sounds like a
> promising approach?
>

I have no problem with this.

Pavel

--
> Michael
>


Re: pg_amcheck contrib application

2021-03-17 Thread Mark Dilger


> On Mar 16, 2021, at 12:52 PM, Robert Haas  wrote:
> 
> On Mon, Mar 15, 2021 at 10:10 PM Mark Dilger
>  wrote:
>> It is unfortunate that the failing test only runs pg_amcheck after creating 
>> numerous corruptions, as we can't know if pg_amcheck would have complained 
>> about pg_statistic before the corruptions were created in other tables, or 
>> if it only does so after.  The attached patch v7-0003 adds a call to 
>> pg_amcheck after all tables are created and populated, but before any 
>> corruptions are caused.  This should help narrow down what is happening, and 
>> doesn't hurt to leave in place long-term.
>> 
>> I don't immediately see anything wrong with how pg_statistic uses a 
>> pseudo-type, but it leads me to want to poke a bit more at pg_statistic on 
>> hornet and tern, though I don't have any regression tests specifically for 
>> doing so.
>> 
>> Tests v7-0001 and v7-0002 are just repeats of the tests posted previously.
> 
> Since we now know that shutting autovacuum off makes the problem go
> away, I don't see a reason to commit 0001. We should fix pg_amcheck
> instead, if, as presently seems to be the case, that's where the
> problem is.

If you get unlucky, autovacuum will process one of the tables that the test
intentionally corrupted, with bad consequences, ultimately causing intermittent
build farm test failures.  We could wait to see if this ever happens before
fixing it, if you prefer.  I'm not sure what that buys us, though.

The right approach, I think, is to extend the contrib/amcheck tests to include
a regression test for this particular case.  I've written that test and
verified that it fails against the old code and passes against the new.

> I just committed 0002.

Thanks!

> I think 0003 is probably a good idea, but I haven't committed it yet.

It won't do anything for us in this particular case, but it might make 
debugging failures quicker in the future.

> As for 0004, it seems to me that we might want to do a little more
> rewording of these messages and perhaps we should try to do it all at
> once. Like, for example, your other change to print out the toast
> value ID seems like a good idea, and could apply to any new messages
> as well as some existing ones. Maybe there are also more fields in the
> TOAST pointer for which we could add checks.

Of the toast pointer fields:

int32   va_rawsize; /* Original data size (includes header) */
int32   va_extsize; /* External saved size (doesn't) */
Oid va_valueid; /* Unique ID of value within TOAST table */
Oid va_toastrelid;  /* RelID of TOAST table containing it */

all seem worth getting as part of any toast error message, even if these fields 
themselves are not corrupt.  It just makes it easier to understand the context 
of the error you're looking at.  At first I tried putting these into each 
message, but it is very wordy to say things like "toast pointer with rawsize %u 
and extsize %u pointing at relation with oid %u" and such.  It made more sense 
to just add these four fields to the verify_heapam tuple format.  That saves 
putting them in the message text itself, and has the benefit that you could 
filter the rows coming from verify_heapam() for ones where valueid is or is not 
null, for example.  This changes the external interface of verify_heapam, but I 
didn't bother with an amcheck--1.3--1.4.sql because amcheck--1.2--1.3.sql was 
added as part of the v14 development work and has not yet been released.  My 
assumption is that I can just change it, rather than making a new upgrade file.
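
For context, those four fields come straight out of the TOAST pointer; a
minimal sketch of where they would be picked up (attr is assumed to be the
external datum being checked):

    struct varatt_external toast_pointer;

    VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
    /* toast_pointer.va_rawsize, .va_extsize, .va_valueid and .va_toastrelid
     * can then be reported alongside each corruption message */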

These patches fix the visibility rules and add extra toast checking.  Some of 
the previous patch material is not included, since it is not clear to me if you 
wanted to commit any of it.




v9-0001-Fixing-amcheck-tuple-visibility-rules.patch
Description: Binary data


v9-0002-pg_amcheck-provide-additional-toast-corruption-in.patch
Description: Binary data
 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-17 Thread Michael Paquier
On Wed, Mar 17, 2021 at 05:09:50PM +0900, Michael Paquier wrote:
> Currently with HEAD and back branches, nothing would be broken as
> logical contexts cannot exist in recovery.  Still it would be easy
> to miss the new behavior for anybody attempting to work more on this
> feature in the future if we change read_local_xlog_page to not update
> ThisTimeLineID anymore.  Resetting the value of ThisTimeLineID in
> read_local_xlog_page() does not seem completely right either with this
argument, as there could be some custom code relying on the existing
> behavior of read_local_xlog_page() to maintain ThisTimeLineID.

I was looking at uses of ThisTimeLineID in the wild, and could not
actually find it being checked or used in backend-side code that
involves the WAL reader facility.  Even if that brings some confidence, it
does not mean that it is not used somewhere :/

> Hmmm.  I am wondering whether the best answer for the moment would not
> be to save and reset ThisTimeLineID just in XlogReadTwoPhaseData(), as
> that's the local change that uses read_local_xlog_page().

And attached is the patch able to achieve that.  At least it is
simple, and does not break the actual assumptions this callback relies
on.  This is rather weak if errors happen here, though, as this is outside
of critical sections; still, the disease is worse.  I have double-checked
all the existing backend code that uses XLogReadRecord(), and did not
notice any code paths with issues similar to this one.

> The state of the code is really confusing on HEAD, and I'd like to
> think that the best thing we could do in the long-term is to make the
> logical decoding path not rely on ThisTimeLineID at all and decouple
> all that, putting the code in a state sane enough so as we don't
> finish with similar errors as what is discussed on this thread.  That
> would be a work for a different patch though, not for stable
> branches.  And seeing some slot and advancing functions update
> directly ThisTimeLineID is no good either..

However, I'd like to think that we should completely untie the
dependency on ThisTimeLineID in any page read callbacks in core in the
long-term, and potentially clean up any assumptions behind timeline
jumps while in recovery for logical contexts as that cannot happen.
At this stage of the 14 dev cycle, that would be material for 15~, but
I also got to wonder if there is work going on to support logical
decoding on standbys, in particular if this would really rely on
ThisTimeLineID.

Thoughts are welcome.
--
Michael
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c16f..80be6cc7a4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1316,7 +1316,11 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
  * twophase files and ReadTwoPhaseFile should be used instead.
  *
  * Note clearly that this function can access WAL during normal operation,
- * similarly to the way WALSender or Logical Decoding would do.
+ * similarly to the way WALSender or Logical Decoding would do.  While
+ * accessing WAL, read_local_xlog_page() may change ThisTimeLineID,
+ * particularly if this routine is called for the end-of-recovery checkpoint
+ * in the checkpointer itself, so save its existing value and restore it
+ * once done.
  */
 static void
 XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
@@ -1324,6 +1328,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
+	TimeLineID	save_currtli = ThisTimeLineID;
 
 	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
 	XL_ROUTINE(.page_read = &read_local_xlog_page,
@@ -1338,6 +1343,14 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 
 	XLogBeginRead(xlogreader, lsn);
 	record = XLogReadRecord(xlogreader, &errormsg);
+
+	/*
+	 * Restore immediately the timeline where it was previously, as
+	 * read_local_xlog_page() could have changed it if the record was read
+	 * while recovery was finishing or if the timeline has jumped in-between.
+	 */
+	ThisTimeLineID = save_currtli;
+
 	if (record == NULL)
 		ereport(ERROR,
 (errcode_for_file_access(),
diff --git a/src/test/recovery/t/022_pitr_prepared_xact.pl b/src/test/recovery/t/022_pitr_prepared_xact.pl
new file mode 100644
index 00..3a7907b2a0
--- /dev/null
+++ b/src/test/recovery/t/022_pitr_prepared_xact.pl
@@ -0,0 +1,86 @@
+# Test for point-in-time-recovery (PITR) with prepared transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use File::Compare;
+
+# Initialize and start primary node with WAL archiving
+my $node_primary = get_new_node('primary');
+$node_primary->init(has_archiving => 1);
+$node_primary->append_conf('postgresql.conf', qq{
+max_wal_senders = 10
+wal_level = 'replica'
+max_prepared_transactions = 10});
+$node_primary->start;
+
+# Take backup
+my $backup_name = 'my_backup';

RE: libpq debug log

2021-03-17 Thread tsunakawa.ta...@fujitsu.com
From: Alvaro Herrera 
> In pqTraceOutputString(), you can use the return value from fprintf to
> move the cursor -- no need to count chars.

Yes, precisely; the 2 bytes for the double quotes need to be subtracted as follows:

len = fprintf(...);
*cursor += (len - 2);


> I still think that the message-type specific functions should print the
> message type, rather than having the string arrays.

I sort of think so to remove the big arrays.  But to minimize duplicate code, I 
think the code structure will look like:

fprintf(timestamp, length);
switch (type)
{
case '?':
pqTraceOutput?(...);
break;
case '?':
/* No message content */
fprintf("message_type_name");
break;
}   

void
pqTraceOutput?(...)
{
fprintf("message_type_name");
print message content;  
}

The order of length and message type is reversed.  The .sgml should also be 
changed accordingly.  What do you think?

Iwata-san, 
Why don't you submit the v27 patch with the current arrays kept, and then v28 with 
the arrays removed soon after?


From: Kyotaro Horiguchi 
> It would help when the value is "255", which is confusing between -1
> (or 255) in byte or 255 in 2-byte word. Or when the value looks like
> broken. I'd suggest "b"yte for 1 byte, "s"hort for 2 bytes, "l"ong for
> 4 bytes ('l' is confusing with '1', but anyway it is not needed)..

I don't have a strong opinion on this.  (I kind of think I want to see 
unprefixed numbers; readers will look at the protocol reference anyway.)  I'd 
like to leave this up to Iwata-san and Alvaro-san.


Regards
Takayuki Tsunakawa



RE: Parallel INSERT (INTO ... SELECT ...)

2021-03-17 Thread houzj.f...@fujitsu.com
> >  If a table parameter value is set and the
> >  equivalent toast. parameter is not, the TOAST table
> >  will use the table's parameter value.
> > -Specifying these parameters for partitioned tables is not supported,
> > -but you may specify them for individual leaf partitions.
> > +These parameters, with the exception of
> > +parallel_insert_enabled,
> > +are not supported on partitioned tables, but may be specified for
> > +individual leaf partitions.
> > 
> >
> 
> Your patch looks good to me. While checking this, I notice a typo in the
> previous patch:
> -  planner parameter parallel_workers.
> +  planner parameter parallel_workers and
> +  parallel_insert_enabled.
> 
> Here, it should be /planner parameter/planner parameters.

Thanks Amit and Justin for pointing this out!
The changes look good to me.

Best regards,
houzj



Re: Parallel INSERT (INTO ... SELECT ...)

2021-03-17 Thread Amit Kapila
On Thu, Mar 18, 2021 at 8:30 AM Justin Pryzby  wrote:
>
> diff --git a/doc/src/sgml/ref/create_table.sgml 
> b/doc/src/sgml/ref/create_table.sgml
> index ff1b642722..d5d356f2de 100644
> --- a/doc/src/sgml/ref/create_table.sgml
> +++ b/doc/src/sgml/ref/create_table.sgml
> @@ -1338,8 +1338,10 @@ WITH ( MODULUS  class="parameter">numeric_literal, REM
>  If a table parameter value is set and the
>  equivalent toast. parameter is not, the TOAST table
>  will use the table's parameter value.
> -Specifying these parameters for partitioned tables is not supported,
> -but you may specify them for individual leaf partitions.
> +These parameters, with the exception of
> +parallel_insert_enabled,
> +are not supported on partitioned tables, but may be specified for
> +individual leaf partitions.
> 
>

Your patch looks good to me. While checking this, I noticed a typo in
the previous patch:
-  planner parameter parallel_workers.
+  planner parameter parallel_workers and
+  parallel_insert_enabled.

Here, it should be /planner parameter/planner parameters.

-- 
With Regards,
Amit Kapila.




Re: New IndexAM API controlling index vacuum strategies

2021-03-17 Thread Peter Geoghegan
On Wed, Mar 17, 2021 at 7:16 PM Masahiko Sawada  wrote:
> Since I was thinking that always skipping index vacuuming on
> anti-wraparound autovacuum is legitimate, skipping index vacuuming
> only when we're really close to the point of going into read-only mode
> seems a bit conservative, but maybe a good start. I've attached a PoC
> patch to disable index vacuuming if the table's relfrozenxid is too much
> older than autovacuum_freeze_max_age (more than 1.5x of
> autovacuum_freeze_max_age).

Most anti-wraparound VACUUMs are really not emergencies, though. So
treating them as special simply because they're anti-wraparound
vacuums doesn't seem like the right thing to do. I think that we
should dynamically decide to do this when (antiwraparound) VACUUM has
already been running for some time. We need to delay the decision
until it is almost certainly true that we really have an emergency.

Can you take what you have here, and make the decision dynamic? Delay
it until we're done with the first heap scan? This will require
rebasing on top of the patch I posted. And then adding a third patch,
a little like the second patch -- but not too much like it.

In the second/SKIP_VACUUM_PAGES_RATIO patch I posted today, the
function two_pass_strategy() (my new name for the main entry point for
calling lazy_vacuum_all_indexes() and lazy_vacuum_heap()) is only
willing to perform the "skip index vacuuming" optimization when the
call to two_pass_strategy() is the first call and the last call for
that entire VACUUM (plus we test the number of heap blocks with
LP_DEAD items using SKIP_VACUUM_PAGES_RATIO, of course). It works this
way purely because I don't think that we should be aggressive when
we've already run out of maintenance_work_mem. That's a bad time to
apply a performance optimization.

But what you're talking about now isn't a performance optimization
(the mechanism is similar or the same, but the underlying reasons are
totally different) -- it's a safety/availability thing. I don't think
that you need to be concerned about running out of
maintenance_work_mem in two_pass_strategy() when applying logic that
is concerned about keeping the database online by avoiding XID
wraparound. You just need to have high confidence that it is a true
emergency. I think that we can be ~99% sure that we're in a real
emergency by using dynamic information about how old relfrozenxid is
*now*, and by rechecking a few times during VACUUM. Probably by
rechecking every time we call two_pass_strategy().

I now believe that there is no fundamental correctness issue with
teaching two_pass_strategy() to skip index vacuuming when we're low on
memory -- it is 100% a matter of costs and benefits. The core
skip-index-vacuuming mechanism is very flexible. If we can be sure
that it's a real emergency, I think that we can justify behaving very
aggressively (letting indexes get bloated is after all very
aggressive). We just need to be 99%+ sure that continuing with
vacuuming will be worse than ending vacuuming. That seems possible by
making the decision dynamic (and revisiting it at least a few times
during a very long VACUUM, in case things change).

-- 
Peter Geoghegan




Re: Getting better results from valgrind leak tracking

2021-03-17 Thread Tom Lane
Andres Freund  writes:
> The most glaring case is the RelationInitTableAccessMethod() call in
> RelationBuildLocalRelation(). Seems like the best fix is to just move
> the MemoryContextSwitchTo() to just before the
> RelationInitTableAccessMethod().  Although I wonder if we shouldn't go
> further, and move it to much earlier, somewhere after the rd_rel
> allocation.

Yeah, same thing I did locally.  Not sure if it's worth working harder.

> There are plenty of other hits, but I think I should get back to working on
> making the shared memory stats patch committable. I really wouldn't want
> it to slip yet another year.

+1, so far there's little indication that we're finding any serious leaks
here.  Might be best to set it all aside till there's more time.

regards, tom lane




Re: Getting better results from valgrind leak tracking

2021-03-17 Thread Andres Freund
On 2021-03-17 22:33:59 -0400, Tom Lane wrote:
> >> I've not yet tried to grok the comment that purports to justify it,
> >> but I fail to see why it'd ever be useful to drop values inherited
> >> from the postmaster.
> 
> > I can't really make sense of it either. I think it may be trying to
> > restore the GUC state to what it would have been in postmaster,
> > disregarding all the settings that were set as part of PostgresInit()
> > etc?
> 
> At least in a non-EXEC_BACKEND build, the most reliable way to reproduce
> the postmaster's settings is to do nothing whatsoever.  And I think the
> same is true for EXEC_BACKEND, really, because the guc.c subsystem is
> responsible for restoring what would have been the inherited-via-fork
> settings.  So I'm really not sure what this is on about, and I'm too
> tired to try to figure it out tonight.

The restore thing runs after we've already set and initialized GUCs,
including things like user/database default GUCs. I see things like

==2251779== 4,560 bytes in 1 blocks are definitely lost in loss record 383 of 
406
==2251779==at 0x483877F: malloc (vg_replace_malloc.c:307)
==2251779==by 0x714A45: ConvertTimeZoneAbbrevs (datetime.c:4556)
==2251779==by 0x88DE95: load_tzoffsets (tzparser.c:465)
==2251779==by 0x88511D: check_timezone_abbreviations (guc.c:11565)
==2251779==by 0x884633: call_string_check_hook (guc.c:11232)
==2251779==by 0x87CB57: parse_and_validate_value (guc.c:7012)
==2251779==by 0x87DD5F: set_config_option (guc.c:7630)
==2251779==by 0x88397F: ProcessGUCArray (guc.c:10784)
==2251779==by 0x32BCCF: ApplySetting (pg_db_role_setting.c:256)
==2251779==by 0x874CA2: process_settings (postinit.c:1163)
==2251779==by 0x874A0B: InitPostgres (postinit.c:1048)
==2251779==by 0x60129A: BackgroundWorkerInitializeConnectionByOid 
(postmaster.c:5681)
==2251779==by 0x2BB283: ParallelWorkerMain (parallel.c:1374)
==2251779==by 0x5EBA6A: StartBackgroundWorker (bgworker.c:864)
==2251779==by 0x6014FE: do_start_bgworker (postmaster.c:5802)
==2251779==by 0x6018D2: maybe_start_bgworkers (postmaster.c:6027)
==2251779==by 0x600811: sigusr1_handler (postmaster.c:5190)
==2251779==by 0x4DD513F: ??? (in 
/usr/lib/x86_64-linux-gnu/libpthread-2.31.so)
==2251779==by 0x556E865: select (select.c:41)
==2251779==by 0x5FC0CB: ServerLoop (postmaster.c:1700)
==2251779==by 0x5FBA68: PostmasterMain (postmaster.c:1408)
==2251779==by 0x4F8BFD: main (main.c:209)

The BackgroundWorkerInitializeConnectionByOid() in that trace is before
the RestoreGUCState() later in ParallelWorkerMain().  But that doesn't
really explain the approach.


> In other news, I found that there's a genuine leak in
> RelationBuildLocalRelation: RelationInitTableAccessMethod
> was added in a spot where CurrentMemoryContext is CacheMemoryContext,
> which is bad because it does a syscache lookup that can lead to
> a table access which can leak some memory.  Seems easy to fix though.

Yea, that's the one I was talking about re
ParallelBlockTableScanWorkerData not being freed. While I think we
should also add that pfree, I think you're right, and we shouldn't do
RelationInitTableAccessMethod() while in CacheMemoryContext.


> The plpgsql situation looks like a mess.  As a short-term answer,
> I'm inclined to recommend adding an exclusion that will ignore anything
> allocated within plpgsql_compile().  Doing better will require a fair
> amount of rewriting.  (Although I suppose we could also consider adding
> an on_proc_exit callback that runs through and frees all the function
> cache entries.)

The error variant of this one seems like it might actually be a
practically relevant leak? As well as increasing memory usage for
compiled plpgsql functions unnecessarily, of course. The latter would be
good to fix, but the former seems like it might be a practical issue for
poolers and the like?

So I think we should do at least the reparenting thing to address that?

Greetings,

Andres Freund




Re: Getting better results from valgrind leak tracking

2021-03-17 Thread Andres Freund
Hi,

On 2021-03-17 00:01:55 -0400, Tom Lane wrote:
> As for the particular point about ParallelBlockTableScanWorkerData,
> I agree with your question to David about why that's in TableScanDesc
> not HeapScanDesc, but I can't get excited about it not being freed in
> heap_endscan. That's mainly because I do not believe that anything as
> complex as a heap or indexscan should be counted on to be zero-leakage.
> The right answer is to not do such operations in long-lived contexts.
> So if we're running such a thing while switched into CacheContext,
> *that* is the bug, not that heap_endscan didn't free this particular
> allocation.

I agree that it's a bad idea to do scans in non-transient contexts. It
does however seem like there are a number of places that do...

I added the following hacky definition of "permanent" contexts

/*
 * NB: Only for assertion use.
 *
 * TopMemoryContext itself obviously is permanent. Treat CacheMemoryContext
 * and all its children as permanent too.
 *
 * XXX: Might be worth adding this as an explicit flag on the context?
 */
bool
MemoryContextIsPermanent(MemoryContext c)
{
if (c == TopMemoryContext)
return true;

while (c)
{
if (c == CacheMemoryContext)
return true;
c = c->parent;
}

return false;
}

and checked that the CurrentMemoryContext is not permanent in
SearchCatCacheInternal() and systable_beginscan(). Hit a number of
times.
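
The check itself is just a one-line assertion at the top of those functions
(sketch):

    Assert(!MemoryContextIsPermanent(CurrentMemoryContext));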

The most glaring case is the RelationInitTableAccessMethod() call in
RelationBuildLocalRelation(). Seems like the best fix is to just move
the MemoryContextSwitchTo() to just before the
RelationInitTableAccessMethod().  Although I wonder if we shouldn't go
further, and move it to much earlier, somewhere after the rd_rel
allocation.
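
In other words, something like this inside RelationBuildLocalRelation()
(simplified sketch, assuming that function's usual local relation and
saved-context variables):

    /* leave CacheMemoryContext before anything that can do catalog access */
    MemoryContextSwitchTo(oldcxt);      /* moved up, before the call below */
    RelationInitTableAccessMethod(rel);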

There are plenty of other hits, but I think I should get back to working on
making the shared memory stats patch committable. I really wouldn't want
it to slip yet another year.

But I think it might make sense to add a flag indicating contexts that
shouldn't be used for non-transient data. Seems like we fairly regularly
have "bugs" around this?

Greetings,

Andres Freund




Re: Permission failures with WAL files in 13~ on Windows

2021-03-17 Thread Michael Paquier
On Wed, Mar 17, 2021 at 07:30:04PM -0700, Andres Freund wrote:
> I suspect it might be easier to reproduce the issue with smaller WAL
> segments, a short checkpoint_timeout, and multiple jobs generating WAL
> and then sleeping for random amounts of time. Not sure if that's the
> sole ingredient, but consider what happens there's processes that
> XLogWrite()s some WAL and then sleeps. Typically such a process'
> openLogFile will still point to the WAL segment. And they may still do
> that when the next checkpoint finishes and we recycle the WAL file.

Yep.  That's basically the kind of scenario I have been testing to
stress the recycling/removing, with pgbench putting some load into the
server.  This has worked for me.  Once.  But I have little idea why it
gets easier to reproduce in the environments of others, so there may
be an OS-version dependency in the equation here.

> I wonder if we actually fail to unlink() the file in
> durable_link_or_rename(), and then end up recycling the same old file
> into multiple "future" positions in the WAL stream.

You actually mean durable_rename_excl() as of 13~, right?  Yeah, this
matches my impression that it is a two-step failure:
- Failure in one of the steps of durable_rename_excl().
- Fallback to segment removal, where we get the complaint about
renaming.

> 1) and 2) seems problematic for restore_command use. I wonder if there's
> a chance that some of the reports ended up hitting 3), and that windows
> doesn't handle that well.

Yeap.  I was thinking about 3) being the actual problem while going
through those docs two days ago.

> If you manage to reproduce, could you check what the link count of the
> all the segments is? Apparently sysinternal's findlinks can do that.
> 
> Or perhaps even better, add an error check that the number of links of
> WAL segments is 1 in a bunch of places (recycling, opening them, closing
> them, maybe?).
> 
> Plus error reporting for unlink failures, of course.

Yep, that's actually something I wrote for my own setups, with
log_checkpoints enabled to catch all concurrent checkpoint activity
and some LOGs.  Still no luck unfortunately :(
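
For the record, the link-count check suggested above would be something as
simple as this (sketch only, not actual patch code; "path" being the segment
path, and assuming <sys/stat.h>):

    struct stat st;

    if (stat(path, &st) == 0 && st.st_nlink > 1)
        elog(LOG, "WAL segment \"%s\" has unexpected link count %u",
             path, (unsigned int) st.st_nlink);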
--
Michael




Re: Parallel INSERT (INTO ... SELECT ...)

2021-03-17 Thread Justin Pryzby
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index ff1b642722..d5d356f2de 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1338,8 +1338,10 @@ WITH ( MODULUS numeric_literal, REM
 If a table parameter value is set and the
 equivalent toast. parameter is not, the TOAST table
 will use the table's parameter value.
-Specifying these parameters for partitioned tables is not supported,
-but you may specify them for individual leaf partitions.
+These parameters, with the exception of
+parallel_insert_enabled,
+are not supported on partitioned tables, but may be specified for
+individual leaf partitions.

 

>From 55c7326e56f9b49710a564e91c46212a17f12b24 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 17 Mar 2021 21:47:27 -0500
Subject: [PATCH] Fix for parallel_insert_enabled

---
 doc/src/sgml/ref/create_table.sgml | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index ff1b642722..d5d356f2de 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1338,8 +1338,10 @@ WITH ( MODULUS numeric_literal, REM
 If a table parameter value is set and the
 equivalent toast. parameter is not, the TOAST table
 will use the table's parameter value.
-Specifying these parameters for partitioned tables is not supported,
-but you may specify them for individual leaf partitions.
+These parameters, with the exception of
+parallel_insert_enabled,
+are not supported on partitioned tables, but may be specified for
+individual leaf partitions.

 

-- 
2.17.0



RE: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.

2021-03-17 Thread kuroda.hay...@fujitsu.com
Dear Ikeda-san,

I confirmed the new patch and no problem was found. Thanks.
(I'm not a native English speaker, so I cannot check your comments correctly, 
sorry)

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




Re: New IndexAM API controlling index vacuum strategies

2021-03-17 Thread Peter Geoghegan
On Mon, Mar 15, 2021 at 4:11 PM Andres Freund  wrote:
> I kinda wonder whether this case should just be handled by just gotoing
> back to the start of the blkno loop, and redoing the pruning. The only
> thing that makes that a bit more complicatd is that we've already
> incremented vacrelstats->{scanned_pages,vacrelstats->tupcount_pages}.
>
> We really should put the per-page work (i.e. the blkno loop body) of
> lazy_scan_heap() into a separate function, same with the
> too-many-dead-tuples branch.

Attached patch series splits everything up. There is now a large patch
that removes the tupgone special case, and a second patch that
actually adds code that dynamically decides to not do index vacuuming
in cases where (for whatever reason) it doesn't seem useful.

Here are some key points about the first patch:

* Eliminates the "tupgone = true" special case by putting pruning, the
HTSV() call, as well as tuple freezing into a new, dedicated function
-- the function is prepared to restart pruning in those rare cases
where the vacuumlazy.c HTSV() call indicates that a tuple is dead.
Restarting pruning again prunes again, rendering the DEAD tuple with
storage an LP_DEAD line pointer stub.

The restart thing is based on Andres' suggestion.

This patch enables incremental VACUUM (the second patch, and likely
other variations) by allowing us to make a uniform assumption that it
is never strictly necessary to reach lazy_vacuum_all_indexes() or
lazy_vacuum_heap(). It is now possible to "end VACUUM early" while
still advancing relfrozenxid. Provided we've finished the first scan
of the heap, that should be safe.

* In principle we could visit and revisit the question of whether or
not vacuuming should continue or end early many times, as new
information comes to light. For example, maybe Masahiko's patch from
today could be taught to monitor how old relfrozenxid is again and
again, before finally giving up early when the risk of wraparound
becomes very severe -- but not before then.

* I've added three structs that replace a blizzard of local variables
we used in lazy_scan_heap() with just three (one variable for each of
the three structs). I've also moved several chunks of logic to other
new functions (in addition to one that does pruning and freezing).

I think that I have the data structures roughly right here -- but I
would appreciate some feedback on that. Does this seem like the right
direction?

-- 
Peter Geoghegan


v3-0002-Skip-index-vacuuming-dynamically.patch
Description: Binary data


v3-0001-Remove-tupgone-special-case-from-vacuumlazy.c.patch
Description: Binary data


RE: libpq debug log

2021-03-17 Thread tsunakawa.ta...@fujitsu.com
I'll look at the comments from Alvaro-san and Horiguchi-san.  Here are my 
review comments:


(23)
+   /* Trace message only when there is first 1 byte */
+   if (conn->Pfdebug && conn->outCount < conn->outMsgStart)
+   {
+       if (conn->outCount < conn->outMsgStart)
+           pqTraceOutputMessage(conn, conn->outBuffer + conn->outCount, true);
+       else
+           pqTraceOutputNoTypeByteMessage(conn,
+                                          conn->outBuffer + conn->outMsgStart);
+   }

The inner else path doesn't seem to be reached because both the outer and inner 
if contain the same condition.  I think you want to remove the second condition 
from the outer if.


(24) pqTraceOutputNoTypeByteMessage
+   case 16:        /* CancelRequest */
+       fprintf(conn->Pfdebug, "%s\t>\tCancelRequest\t%d", timestr, length);
+       pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug);
+       pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug);
+       break;

Another int32 data needs to be output as follows:

--
Int32(80877102)
The cancel request code. The value is chosen to contain 1234 in the most 
significant 16 bits, and 5678 in the least significant 16 bits. (To avoid 
confusion, this code must not be the same as any protocol version number.)

Int32
The process ID of the target backend.

Int32
The secret key for the target backend.
--
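
In other words, the trace output presumably needs a third Int32 read,
along these lines (sketch only):

   case 16:/* CancelRequest */
       fprintf(conn->Pfdebug, "%s\t>\tCancelRequest\t%d", timestr, length);
       pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug); /* request code */
       pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug); /* process ID */
       pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug); /* secret key */
       break;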


(25)
+   case 8 :/* GSSENRequest or SSLRequest */

GSSENRequest -> GSSENCRequest


(26)
+static void
+pqTraceOutputByte1(const char *data, int *cursor, FILE *pfdebug)
+{
+   const char *v = data + *cursor;
+   /*

+static void
+pqTraceOutputf(const char *message, int end, FILE *f)
+{
+   int cursor = 0;
+   pqTraceOutputString(message, &cursor, end, f);
+}

Put an empty line to separate the declaration part and execution part.
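
That is, something like:

   static void
   pqTraceOutputf(const char *message, int end, FILE *f)
   {
       int cursor = 0;

       pqTraceOutputString(message, &cursor, end, f);
   }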


(27)
+   const char  *message_type = "UnkownMessage";
+
+   id = message[LogCursor++];
+
+   memcpy(&length, message + LogCursor , 4);
+   length = (int) pg_ntoh32(length);
+   LogCursor += 4;
+   LogEnd = length - 4;

+   /* Get a protocol type from first byte identifier */
+   if (toServer &&
+   id < lengthof(protocol_message_type_f) &&
+   protocol_message_type_f[(unsigned char)id] != NULL)
+   message_type = protocol_message_type_f[(unsigned char)id];
+   else if (!toServer &&
+   id < lengthof(protocol_message_type_b) &&
+   protocol_message_type_b[(unsigned char)id] != NULL)
+   message_type = protocol_message_type_b[(unsigned char)id];
+   else
+   {
+   fprintf(conn->Pfdebug, "Unknown message: %02x\n", id);
+   return;
+   }
+

The initial value "UnkownMessage" is not used.  So, you can initialize
message_type with NULL and do something like:

+if (...)
+   ...
+   else if (...)
+   ...
+
+   if (message_type == NULL)
+   {
+   fprintf(conn->Pfdebug, "Unknown message: %02x\n", id);
+   return;
+

Plus, I think this should be done before looking at the message length.


(28)
pqTraceOutputBinary() is only used in pqTraceOutputNchar().  Do they have to be 
separated?


Regards
Takayuki Tsunakawa





Re: Parallel INSERT (INTO ... SELECT ...)

2021-03-17 Thread Justin Pryzby
On Fri, Mar 12, 2021 at 04:05:09PM +0900, Amit Langote wrote:
> On Fri, Mar 12, 2021 at 6:10 AM Justin Pryzby  wrote:
> > Note also this CF entry
> > https://commitfest.postgresql.org/32/2987/
> > | Allow setting parallel_workers on partitioned tables

I'm rebasing that other patch on top of master (with this patch), and I noticed
that it adds this bit, and this patch should have done something similar:

--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1340,4 +1340,5 @@ WITH ( MODULUS numeric_literal, REM
 will use the table's parameter value.
-Specifying these parameters for partitioned tables is not supported,
-but you may specify them for individual leaf partitions.
+These parameters, with the exception of 
parallel_workers,
+are not supported on partitioned tables, but you may specify them for
+individual leaf partitions.


-- 
Justin




Re: Getting better results from valgrind leak tracking

2021-03-17 Thread Tom Lane
Andres Freund  writes:
> On 2021-03-17 21:30:48 -0400, Tom Lane wrote:
>> I believe I've just tracked down the cause of that.  Those errors
>> are (as far as I've seen) only happening in parallel workers, and
>> the reason is this gem in RestoreGUCState: ...

> Ouch. At least it's a short lived issue rather than something permanently
> leaking memory on every SIGHUP...

Yeah.  This leak is really insignificant in practice, although I'm
suspicious that randomly init'ing GUCs like this might have semantic
issues that we've not detected yet.

>> I've not yet tried to grok the comment that purports to justify it,
>> but I fail to see why it'd ever be useful to drop values inherited
>> from the postmaster.

> I can't really make sense of it either. I think it may be trying to
> restore the GUC state to what it would have been in postmaster,
> disregarding all the settings that were set as part of PostgresInit()
> etc?

At least in a non-EXEC_BACKEND build, the most reliable way to reproduce
the postmaster's settings is to do nothing whatsoever.  And I think the
same is true for EXEC_BACKEND, really, because the guc.c subsystem is
responsible for restoring what would have been the inherited-via-fork
settings.  So I'm really not sure what this is on about, and I'm too
tired to try to figure it out tonight.

In other news, I found that there's a genuine leak in
RelationBuildLocalRelation: RelationInitTableAccessMethod
was added in a spot where CurrentMemoryContext is CacheMemoryContext,
which is bad because it does a syscache lookup that can lead to
a table access which can leak some memory.  Seems easy to fix though.

The plpgsql situation looks like a mess.  As a short-term answer,
I'm inclined to recommend adding an exclusion that will ignore anything
allocated within plpgsql_compile().  Doing better will require a fair
amount of rewriting.  (Although I suppose we could also consider adding
an on_proc_exit callback that runs through and frees all the function
cache entries.)

regards, tom lane




Re: Permission failures with WAL files in 13~ on Windows

2021-03-17 Thread Andres Freund
Hi,

On 2021-03-18 09:55:46 +0900, Michael Paquier wrote:
> Let's see how it goes from this point, but, FWIW, I have not been able
> to reproduce again my similar problem with the archive command :/ --

I suspect it might be easier to reproduce the issue with smaller WAL
segments, a short checkpoint_timeout, and multiple jobs generating WAL
and then sleeping for random amounts of time. Not sure if that's the
sole ingredient, but consider what happens when there are processes that
XLogWrite() some WAL and then sleep. Typically such a process'
openLogFile will still point to the WAL segment. And they may still do
that when the next checkpoint finishes and we recycle the WAL file.

I wonder if we actually fail to unlink() the file in
durable_link_or_rename(), and then end up recycling the same old file
into multiple "future" positions in the WAL stream.

There's also these interesting notes at
https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-createhardlinka

1)
> The security descriptor belongs to the file to which a hard link
> points. The link itself is only a directory entry, and does not have a
> security descriptor. Therefore, when you change the security
> descriptor of a hard link, you change the security descriptor of the
> underlying file, and all hard links that point to the file allow the
> newly specified access. You cannot give a file different security
> descriptors on a per-hard-link basis.

2)
> Flags, attributes, access, and sharing that are specified in
> CreateFile operate on a per-file basis. That is, if you open a file
> that does not allow sharing, another application cannot share the file
> by creating a new hard link to the file.

3)
> The maximum number of hard links that can be created with this
> function is 1023 per file. If more than 1023 links are created for a
> file, an error results.


1) and 2) seem problematic for restore_command use. I wonder if there's
a chance that some of the reports ended up hitting 3), and that Windows
doesn't handle that well.


If you manage to reproduce, could you check what the link count of
all the segments is? Apparently sysinternals' findlinks can do that.

Or perhaps even better, add an error check that the number of links of
WAL segments is 1 in a bunch of places (recycling, opening them, closing
them, maybe?).

Plus error reporting for unlink failures, of course.

Greetings,

Andres Freund




Re: New IndexAM API controlling index vacuum strategies

2021-03-17 Thread Masahiko Sawada
On Wed, Mar 17, 2021 at 7:21 AM Peter Geoghegan  wrote:
>
> On Tue, Mar 16, 2021 at 6:40 AM Masahiko Sawada  wrote:
> > > Note that I've merged multiple existing functions in vacuumlazy.c into
> > > one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap()
> > > into a single function named vacuum_indexes_mark_unused()
>
> > I agree to create a function like vacuum_indexes_mark_unused() that
> > makes a decision and does index and heap vacumming accordingly. But
> > what is the point of removing both lazy_vacuum_all_indexes() and
> > lazy_vacuum_heap()? I think we can simply have
> > vacuum_indexes_mark_unused() call those functions. I'm concerned that
> > if we add additional criteria to vacuum_indexes_mark_unused() in the
> > future the function will become very large.
>
> I agree now. I became overly excited about advertising the fact that
> these two functions are logically one thing. This is important, but it
> isn't necessary to go as far as actually making everything into one
> function. Adding some comments would also make that point clear, but
> without making vacuumlazy.c even more spaghetti-like. I'll fix it.
>
> > > I wonder if we can add some kind of emergency anti-wraparound vacuum
> > > logic to what I have here, for Postgres 14.
>
> > +1
> >
> > I think we can set VACOPT_TERNARY_DISABLED to
> > tab->at_params.index_cleanup in table_recheck_autovac() or increase
> > the thresholds used to not skipping index vacuuming.
>
> I was worried about the "tupgone = true" special case causing problems
> when we decide to do some index vacuuming and some heap
> vacuuming/LP_UNUSED-marking but then later decide to end the VACUUM.
> But I now think that getting rid of "tupgone = true" gives us total
> freedom to
> choose what to do, including the freedom to start with index vacuuming
> and then give up on it later -- even after we do some amount of
> LP_UNUSED-marking (during a VACUUM with multiple index passes, perhaps
> due to a low maintenance_work_mem setting). That isn't actually
> special at all, because everything will be 100% decoupled.
>
> Whether or not it's a good idea to either not start index vacuuming or
> to end it early (e.g. due to XID wraparound pressure) is a complicated
> question, and the right approach will be debatable in each case/when
> thinking about each issue. However, deciding on the best behavior to
> address these problems should have nothing to do with implementation
> details and everything to do with the costs and benefits to users.
> Which now seems possible.
>
> A sophisticated version of the "XID wraparound pressure"
> implementation could increase reliability without ever being
> conservative, just by evaluating the situation regularly and being
> prepared to change course when necessary -- until it is truly a matter
> of shutting down new XID allocations/the server. It should be possible
> to decide to end VACUUM early and advance relfrozenxid (provided we
> have reached the point of finishing all required pruning and freezing,
> of course). Highly agile behavior seems quite possible, even if it
> takes a while to agree on a good design.

Since I was thinking that always skipping index vacuuming on
anti-wraparound autovacuum is legitimate, skipping index vacuuming
only when we're really close to the point of going into read-only mode
seems a bit conservative, but maybe a good start. I've attached a PoC
patch to disable index vacuuming if the table's relfrozenxid is much
older than autovacuum_freeze_max_age (older than 1.5x
autovacuum_freeze_max_age).
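
Conceptually the check is along these lines (an illustrative sketch of
the idea only, not the attached patch; the names and the exact XID
arithmetic are simplified):

    TransactionId limit_xid;

    limit_xid = ReadNextTransactionId() -
        (TransactionId) (autovacuum_freeze_max_age * 1.5);

    if (TransactionIdIsNormal(relfrozenxid) &&
        TransactionIdPrecedes(relfrozenxid, limit_xid))
    {
        /* dangerously far behind on freezing -- skip index vacuuming */
        params->index_cleanup = VACOPT_TERNARY_DISABLED;
    }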

>
> > Aside from whether it's good or bad, strictly speaking, it could
> > change the index AM API contract. The documentation of
> > amvacuumcleanup() says:
> >
> > ---
> > stats is whatever the last ambulkdelete call returned, or NULL if
> > ambulkdelete was not called because no tuples needed to be deleted.
> > ---
> >
> > With this change, we could pass NULL to amvacuumcleanup even though
> > the index might have tuples needed to be deleted.
>
> I think that we should add a "Note" box to the documentation that
> notes the difference here. Though FWIW my interpretation of the words
> "no tuples needed to be deleted" was always "no tuples needed to be
> deleted because vacuumlazy.c didn't call ambulkdelete()". After all,
> VACUUM can add to tups_vacuumed through pruning inside
> heap_prune_chain(). It is already possible (though only barely) to not
> call ambulkdelete() for indexes (to only call amvacuumcleanup() during
> cleanup) despite the fact that heap vacuuming did "delete tuples".

Agreed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


poc_skip_index_cleanup_at_anti_wraparound.patch
Description: Binary data


Re: Getting better results from valgrind leak tracking

2021-03-17 Thread Andres Freund
Hi,

On 2021-03-17 21:30:48 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2021-03-17 18:07:36 -0400, Tom Lane wrote:
> >> Huh, interesting.  I wonder why that makes the ident problem go away?
> 
> > I spent a bit of time looking at valgrind, and it looks to be explicit
> > behaviour:
> >
> >// Our goal is to construct a set of chunks that includes every
> >// mempool chunk, and every malloc region that *doesn't* contain a
> >// mempool chunk.
> 
> Ugh.
> 
> > I guess that makes sense, but would definitely be nice to have had
> > documented...
> 
> Indeed.  So this started happening to us when we merged the aset
> header with its first allocation block.

Yea.

I have the feeling that valgrind's error detection capability in our
codebase used to be higher too. I wonder if that could be related
somehow. Or maybe it's just a figment of my imagination.


> I believe I've just tracked down the cause of that.  Those errors
> are (as far as I've seen) only happening in parallel workers, and
> the reason is this gem in RestoreGUCState:
> 
>   /* See comment at can_skip_gucvar(). */
>   for (i = 0; i < num_guc_variables; i++)
>   if (!can_skip_gucvar(guc_variables[i]))
>   InitializeOneGUCOption(guc_variables[i]);
> 
> where InitializeOneGUCOption zeroes out that GUC variable, causing
> it to lose track of any allocations it was responsible for.

Ouch. At least it's a short lived issue rather than something permanently
leaking memory on every SIGHUP...



> At minimum, somebody didn't understand the difference between "initialize"
> and "reset".  But I suspect we could just nuke this loop altogether.
> I've not yet tried to grok the comment that purports to justify it,
> but I fail to see why it'd ever be useful to drop values inherited
> from the postmaster.

I can't really make sense of it either. I think it may be trying to
restore the GUC state to what it would have been in postmaster,
disregarding all the settings that were set as part of PostgresInit()
etc?

Greetings,

Andres Freund




Re: EXPLAIN/EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW

2021-03-17 Thread Japin Li



On Tue, 16 Mar 2021 at 20:13, Bharath Rupireddy 
 wrote:
> On Tue, Mar 16, 2021 at 1:15 AM Tom Lane  wrote:
>>
>> [ Sorry for not looking at this thread sooner ]
>>
>> Bharath Rupireddy  writes:
>> > Currently, $subject is not allowed. We do plan the mat view query
>> > before every refresh. I propose to show the explain/explain analyze of
>> > the select part of the mat view in case of Refresh Mat View(RMV).
>>
>> TBH, I think we should reject this.  The problem with it is that it
>> binds us to the assumption that REFRESH MATERIALIZED VIEW has an
>> explainable plan.  There are various people poking at ideas like
>> incremental matview updates, which might rely on some implementation
>> that doesn't exactly equate to a SQL query.  Incremental updates are
>> hard enough already; they'll be even harder if they also have to
>> maintain compatibility with a pre-existing EXPLAIN behavior.
>>
>> I don't really see that this feature buys us anything you can't
>> get by explaining the view's query, so I think we're better advised
>> to keep our options open about how REFRESH might be implemented
>> in future.
>
> That makes sense to me. Thanks for the comments. I'm fine to withdraw the 
> patch.
>
> I would like to see if the 0001 patch(attaching here) will be useful
> at all. It just splits up the existing ExecRefreshMatView into a few
> functions to make the code readable.

+1.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: replication slot stats memory bug

2021-03-17 Thread Andres Freund
Hi,

On 2021-03-17 19:36:46 -0400, Tom Lane wrote:
> > But it seems like we just shouldn't allocate it dynamically at all?
> > max_replication_slots doesn't change during postmaster lifetime, so it
> > seems like it should just be allocated once?
> 
> Meh.  I don't see a need to wire in such an assumption here.

It does make it easier for the shared memory stats patch, because if
there's a fixed number + location, the relevant stats reporting doesn't
need to go through a hashtable with the associated locking.  I guess
that may have colored my perception that it's better to just have a
statically sized memory allocation for this.  Noteworthy that SLRU stats
are done in a fixed size allocation as well...

Greetings,

Andres Freund




Re: Permission failures with WAL files in 13~ on Windows

2021-03-17 Thread Andres Freund
Hi,

On 2021-03-16 16:20:37 +0900, Michael Paquier wrote:
> Fujii-san has mentioned that on twitter, but one area that has changed
> during the v13 cycle is aaa3aed, where the code recycling segments has
> been switched from a pgrename() (with a retry loop) to a
> CreateHardLinkA()+pgunlink() (with a retry loop for the second).  One
> theory that I got in mind here is the case where we create the hard
> link, but fail to finish the pgunlink() on the xlogtemp.N file,
> though after some testing it did not seem to have any impact.

A related question: What on earth is the point of using the unlink
approach on any operating system? We use durable_rename_excl() (and
its predecessor, durable_link_or_rename(), and in turn its open-coded
predecessors) for things like recycling WAL files at checkpoints.

Now imagine that durable_rename_excl() fails to unlink the old
file. We'll still have the old file, but there's a second link to a new
WAL file, which will be used. No error will be thrown, because we don't
check unlink()'s return code (but if we did, we'd still have similar
issues).
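
For context, the shape being criticized is roughly this (a simplified
sketch, not the actual source):

    if (link(oldfile, newfile) < 0)
        ereport(elevel,
                (errcode_for_file_access(),
                 errmsg("could not link file \"%s\" to \"%s\": %m",
                        oldfile, newfile)));

    unlink(oldfile);    /* return value not checked; on failure the same
                         * physical file stays linked at two WAL positions */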

And then imagine that that happens again, during the next checkpoint,
because the permission or whatever issue is not yet resolved. We now
will have the same physical file in two locations in the future WAL
stream.

Welcome impossible to debug issues.

And all of this with the sole justification of "paranoidly trying to
avoid overwriting an existing file (there shouldn't be one).". A few
lines after we either unlinked the target filename, or used stat() to
find an unused filename.

Isn't the whole idea of durable_rename_excl() bad? There's not the same
danger of using it when we start from a temp filename, e.g. during
creation of new segments, or timeline files or whatnot. But I still
don't see what the whole thing is supposed to protect us against
realistically.

Greetings,

Andres Freund




Re: [HACKERS] Custom compression methods

2021-03-17 Thread Justin Pryzby
On Wed, Mar 17, 2021 at 02:50:34PM -0700, Andres Freund wrote:
> On 2021-03-17 16:01:58 -0400, Robert Haas wrote:
> > > - why is HIDE_TOAST_COMPRESSION useful? Doesn't quite seem to be
> > >   comparable to HIDE_TABLEAM?
> > 
> > Andres, what do you mean by this exactly? It's exactly the same issue:
> > without this, if you change the default compression method, every test
> > that uses \d+ breaks. If you want to be able to run the whole test
> > suite with either compression method and get the same results, you
> > need this. Now, maybe you don't, because perhaps that doesn't seem so
> > important with compression methods as with table AMs.

Arguably, it's more important, since it affects every column in \d+, not just a
"footer" line.

> I think that latter part is why I wasn't sure such an option is
> warranted. Given it's a builtin feature, I didn't really forsee a need
> to be able to run all the tests with a different compression method. And
> it looked a like it could just have been copied from the tableam logic,
> without a clear need. But if it's useful, then ...

This was one of my suggestions and contributions.
I copied it from tableam specifically, not incidentally.
https://www.postgresql.org/message-id/20210214184940.GL1793%40telsasoft.com

-- 
Justin




Re: replication slot stats memory bug

2021-03-17 Thread Amit Kapila
On Thu, Mar 18, 2021 at 4:55 AM Andres Freund  wrote:
>
> Hi,
>
> On 2021-03-17 16:04:47 -0700, Andres Freund wrote:
> > I'll push the minimal fix of forcing the allocation to happen in
> > pgStatLocalContext and setting it to NULL in pgstat_clear_snapshot().
>
> Done: 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5f79580ad69f6e696365bdc63bc265f45bd77211
>

Thank you!

-- 
With Regards,
Amit Kapila.




Re: Getting better results from valgrind leak tracking

2021-03-17 Thread Tom Lane
Andres Freund  writes:
> On 2021-03-17 18:07:36 -0400, Tom Lane wrote:
>> Huh, interesting.  I wonder why that makes the ident problem go away?

> I spent a bit of time looking at valgrind, and it looks to be explicit
> behaviour:
>
>// Our goal is to construct a set of chunks that includes every
>// mempool chunk, and every malloc region that *doesn't* contain a
>// mempool chunk.

Ugh.

> I guess that makes sense, but would definitely be nice to have had
> documented...

Indeed.  So this started happening to us when we merged the aset
header with its first allocation block.

>>> There are a few leak warnings around guc.c that look like they might be
>>> real, not false positives, and thus a bit concerning. Looks like several
>>> guc check hooks don't bother to free the old *extra before allocating a
>>> new one.

>> I'll take a look, but I'm pretty certain that guc.c, not the hooks, is
>> responsible for freeing those.

I believe I've just tracked down the cause of that.  Those errors
are (as far as I've seen) only happening in parallel workers, and
the reason is this gem in RestoreGUCState:

/* See comment at can_skip_gucvar(). */
for (i = 0; i < num_guc_variables; i++)
if (!can_skip_gucvar(guc_variables[i]))
InitializeOneGUCOption(guc_variables[i]);

where InitializeOneGUCOption zeroes out that GUC variable, causing
it to lose track of any allocations it was responsible for.

At minimum, somebody didn't understand the difference between "initialize"
and "reset".  But I suspect we could just nuke this loop altogether.
I've not yet tried to grok the comment that purports to justify it,
but I fail to see why it'd ever be useful to drop values inherited
from the postmaster.

regards, tom lane




Re: Calendar support in localization

2021-03-17 Thread Vik Fearing
On 3/17/21 3:48 PM, Tom Lane wrote:
> Also, the SQL spec says in so many words
> that the SQL-defined datetime types follow the Gregorian calendar.

We already don't follow the SQL spec for timestamps (and I, for one,
think we are better for it) so I don't think we should worry about that.
-- 
Vik Fearing




Re: Permission failures with WAL files in 13~ on Windows

2021-03-17 Thread Michael Paquier
On Tue, Mar 16, 2021 at 11:40:12AM +0100, Magnus Hagander wrote:
> If we can provide a new .EXE built with exactly the same flags as the
> EDB downloads that they can just drop into a directory, I think it's a
> lot easier to get that done.

Yeah, multiple people have been complaining about that bug, so I have
just produced two builds that people with those sensitive environments
can use, and sent some private links to get the builds.  Let's see how
it goes from this point, but, FWIW, I have not been able to reproduce
again my similar problem with the archive command :/
--
Michael


signature.asc
Description: PGP signature


Re: Getting better results from valgrind leak tracking

2021-03-17 Thread Andres Freund
Hi,

On 2021-03-17 18:07:36 -0400, Tom Lane wrote:
> Andres Freund  writes:
> >> I found a way around that late last night. Need to mark the context
> >> itself as an allocation. But I made a mess on the way to that and need
> >> to clean the patch up before sending it (and need to drop my
> >> girlfriend off first).
> 
> > Unfortunately I didn't immediately find a way to do this while keeping
> > the MEMPOOL_CREATE/DESTROY in mcxt.c. The attached patch moves the pool
> > creation into the memory context implementations, "allocates" the
> > context itself as part of that pool, and changes the reset
> > implementation from MEMPOOL_DESTROY + MEMPOOL_CREATE to instead do
> > MEMPOOL_TRIM. That leaves the memory context itself valid (and thus
> > tracking ->ident etc), but marks all the other memory as freed.
> 
> Huh, interesting.  I wonder why that makes the ident problem go away?
> I'd supposed that valgrind would see the context headers as ordinary
> memory belonging to the global "malloc" pool, so that any pointers
> inside them ought to be considered valid.


I didn't quite understand it either at the time of writing the change. It
just seemed the only explanation for some of the behaviour I was seeing, so
I tried it. Just to be initially rebuffed by errors when
accessing the recycled sets...

I spent a bit of time looking at valgrind, and it looks to be explicit
behaviour:

memcheck/mc_leakcheck.c

static MC_Chunk**
find_active_chunks(Int* pn_chunks)
{
   // Our goal is to construct a set of chunks that includes every
   // mempool chunk, and every malloc region that *doesn't* contain a
   // mempool chunk.
...
   // Then we loop over the mempool tables. For each chunk in each
   // pool, we set the entry in the Bool array corresponding to the
   // malloc chunk containing the mempool chunk.
   VG_(HT_ResetIter)(MC_(mempool_list));
   while ( (mp = VG_(HT_Next)(MC_(mempool_list))) ) {
  VG_(HT_ResetIter)(mp->chunks);
  while ( (mc = VG_(HT_Next)(mp->chunks)) ) {

 // We'll need to record this chunk.
 n_chunks++;

 // Possibly invalidate the malloc holding the beginning of this chunk.
 m = find_chunk_for(mc->data, mallocs, n_mallocs);
 if (m != -1 && malloc_chunk_holds_a_pool_chunk[m] == False) {
tl_assert(n_chunks > 0);
n_chunks--;
malloc_chunk_holds_a_pool_chunk[m] = True;
 }

I think that means it explicitly ignores the entire malloced allocation
whenever there's *any* mempool chunk in it, instead considering only the
mempool chunks. So once aset allocates something in the init block, the
context itself is ignored for leak checking purposes. But that wouldn't
be the case if we didn't have the init block...

I guess that makes sense, but would definitely be nice to have had
documented...



> Anyway, I don't have a problem with rearranging the responsibility
> like this.  It gives the individual allocators more freedom to do
> odd stuff, at the cost of very minor duplication of valgrind calls.

Yea, similar.


> I agree we need more comments -- would you like me to have a go at
> writing them?

Gladly!


> One thing I was stewing over last night is that a MemoryContextReset
> will mess up any context identifier assigned with
> MemoryContextCopyAndSetIdentifier.

Valgrind should catch such problems. Not having the danger would be
better, of course.

We could also add an assertion at reset time that the identifier isn't
in the current context.


> The very simplest fix would be to allocate non-constant idents with
> malloc; which'd require adding a flag to track whether context->ident
> needs to be free()d.  We have room for another bool near the top of
> struct MemoryContextData (and at some point we could turn those
> bool fields into a flags word).  The only real cost here is one
> more free() while destroying a labeled context, which is probably
> negligible.

Hm. A separate malloc()/free() could conceivably actually show up in
profiles at some point.

What if we instead used that flag to indicate that MemoryContextReset()
needs to save the identifier? Then any cost would only be paid if the
context is actually reset.


> > There are a few leak warnings around guc.c that look like they might be
> > real, not false positives, and thus a bit concerning. Looks like several
> > guc check hooks don't bother to free the old *extra before allocating a
> > new one.
> 
> I'll take a look, but I'm pretty certain that guc.c, not the hooks, is
> responsible for freeing those.

Yea, I had misattributed the leak to the callbacks. One of the things I
saw definitely is a leak: if call_string_check_hook() ereport(ERRORs)
the guc_strdup() of newval->stringval is lost.

There's another set of them that seems to be related to parallelism. But
I've not hunted it down yet.

I think it might be worth adding a VALGRIND_DO_CHANGED_LEAK_CHECK() at
the end of a transaction or such? That way it'd not be quite as hard to
pinpoint the source.

Re: replication slot stats memory bug

2021-03-17 Thread Tom Lane
Andres Freund  writes:
> I saw a leak in pgstat_read_statsfiles(), more precisely:
>   /* Allocate the space for replication slot statistics */
>   replSlotStats = palloc0(max_replication_slots * 
> sizeof(PgStat_ReplSlotStats));

Yeah, I just found that myself.  I think your fix is good.

> But it seems like we just shouldn't allocate it dynamically at all?
> max_replication_slots doesn't change during postmaster lifetime, so it
> seems like it should just be allocated once?

Meh.  I don't see a need to wire in such an assumption here.

regards, tom lane




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-17 Thread Julien Rouhaud
On Wed, Mar 17, 2021 at 06:32:16PM -0400, Bruce Momjian wrote:
> On Wed, Mar 17, 2021 at 04:04:44PM -0400, Robert Haas wrote:
> > On Wed, Mar 17, 2021 at 12:48 PM Julien Rouhaud  wrote:
> > > I originally suggested to make it clearer by having an enum GUC rather 
> > > than a
> > > boolean, say compute_queryid = [ none | core | external ], and if set to
> > > external then a hook would be explicitely called.  Right now, "none" and
> > > "external" are binded with compute_queryid = off, and depends on whether 
> > > an
> > > extension is computing a queryid during post_parse_analyse_hook.
> > 
> > I would just make it a Boolean and have a hook. The Boolean controls
> > whether it gets computed at all, and the hook lets an external module
> > override the way it gets computed.
> 
> OK, is that what everyone wants?  I think that is what the patch already
> does.

Not exactly.  Right now a custom queryid can be computed even if
compute_queryid is off, if some extension does that in post_parse_analyze_hook.

I'm assuming that what Robert was thinking was more like:

if (compute_queryid)
{
if (queryid_hook)
queryId = queryid_hook(...);
else
queryId = JumbeQuery(...);
}
else
queryId = 0;

And that should be done *after* post_parse_analyse_hook so that it's clear that
this hook is no longer the place to compute queryid.

Is that what should be done?

> I think having multiple queryids used in a single cluster is much too
> confusing to support.  You would have to label and control which queryid
> is displayed by pg_stat_activity and log_line_prefix, and that seems too
> confusing and not useful.

I agree.




Re: replication slot stats memory bug

2021-03-17 Thread Andres Freund
Hi,

On 2021-03-17 16:04:47 -0700, Andres Freund wrote:
> I'll push the minimal fix of forcing the allocation to happen in
> pgStatLocalContext and setting it to NULL in pgstat_clear_snapshot().

Done: 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5f79580ad69f6e696365bdc63bc265f45bd77211

Greetings,

Andres Freund




replication slot stats memory bug

2021-03-17 Thread Andres Freund
Hi,

in the course of https://postgr.es/m/3471359.1615937770%40sss.pgh.pa.us
I saw a leak in pgstat_read_statsfiles(), more precisely:
/* Allocate the space for replication slot statistics */
replSlotStats = palloc0(max_replication_slots * 
sizeof(PgStat_ReplSlotStats));

the issue is that the current memory context is not set by
pgstat_read_statsfiles().

In some cases CurrentMemoryContext is going to be a long-lived context,
accumulating those allocations over time. In other cases it will be a
too-short-lived context, e.g. an ExprContext from the pg_stat_*
invocation in the query. A reproducer for the latter:

postgres[2252294][1]=# SELECT pg_create_logical_replication_slot('test', 
'test_decoding');
┌┐
│ pg_create_logical_replication_slot │
├┤
│ (test,0/456C1878)  │
└┘
(1 row)

postgres[2252294][1]=# BEGIN ;
BEGIN

postgres[2252294][1]*=# SELECT * FROM pg_stat_replication_slots ;
┌───┬┬─┬─┬─┬──┬──┬─┐
│ slot_name │ spill_txns │ spill_count │ spill_bytes │ stream_txns │ 
stream_count │ stream_bytes │ stats_reset │
├───┼┼─┼─┼─┼──┼──┼─┤
│ test  │  0 │   0 │   0 │   0 │
0 │0 │ (null)  │
└───┴┴─┴─┴─┴──┴──┴─┘
(1 row)

postgres[2252294][1]*=# SELECT * FROM pg_stat_replication_slots ;
┌─>
│   
  >
├─>
│ 
\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F>
└─>
(1 row)

I'll push the minimal fix of forcing the allocation to happen in
pgStatLocalContext and setting it to NULL in pgstat_clear_snapshot().
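
Roughly speaking, the minimal fix amounts to this (an illustrative
sketch; see the actual commit for the real change):

    MemoryContext oldcxt = MemoryContextSwitchTo(pgStatLocalContext);

    replSlotStats = palloc0(max_replication_slots *
                            sizeof(PgStat_ReplSlotStats));
    MemoryContextSwitchTo(oldcxt);

    /* ... and pgstat_clear_snapshot() additionally does: */
    replSlotStats = NULL;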


But it seems like we just shouldn't allocate it dynamically at all?
max_replication_slots doesn't change during postmaster lifetime, so it
seems like it should just be allocated once?

Greetings,

Andres Freund




Re: WIP: WAL prefetch (another approach)

2021-03-17 Thread Tomas Vondra
Hi,

On 3/17/21 10:43 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
>> Right, I was just going to point out the FPIs are not necessary - what
>> matters is the presence of long streaks of WAL records touching the same
>> set of blocks. But people with workloads where this is common likely
>> don't need the WAL prefetching at all - the replica can keep up just
>> fine, because it doesn't need to do much I/O anyway (and if it can't
>> then prefetching won't help much anyway). So just don't enable the
>> prefetching, and there'll be no overhead.
> 
> Isn't this exactly the common case though..?  Checkpoints happening
> every 5 minutes, the replay of the FPI happens first and then the record
> is updated and everything's in SB for the later changes?

Well, as I said before, the FPIs are not very significant - you'll have
mostly the same issue with any repeated changes to the same block. It
does not matter much if you do

FPI for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1

or just

WAL record for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1

In both cases some of the prefetches are probably unnecessary. But the
frequency of checkpoints does not really matter, the important bit is
repeated changes to the same block(s).

If you have an active set much larger than RAM, this is quite unlikely. And
we know from the pgbench tests that prefetching has a huge positive
effect in this case.

On smaller active sets, with frequent updates to the same block, we may
issue unnecessary prefetches - that's true. But (a) you have not shown
any numbers suggesting this is actually an issue, and (b) those cases
don't really need prefetching because all the data is already either in
shared buffers or in page cache. So if it happens to be an issue, the
user can simply disable it.

So how exactly would a problematic workload look like?

> You mentioned elsewhere that this would improve 80% of cases but that
> doesn't seem to be backed up by anything and certainly doesn't seem
> likely to be the case if we're talking about across all PG
> deployments.

Obviously, the 80% was just a figure of speech, illustrating my belief
that the proposed patch is beneficial for most users who currently have
issues with replication lag. That is based on my experience with support
customers who have such issues - it's almost invariably an OLTP workload
with large active set, and we know (from the benchmarks) that in these
cases it helps.

Users who don't have issues with replication lag can disable (or not
enable) the prefetching, and won't get any negative effects.

Perhaps there are users with weird workloads that have replication lag
issues but this patch won't help them - bummer, we can't solve
everything in one go. Also, no one actually demonstrated such workload
in this thread so far.

But as you're suggesting we don't have data to support the claim that
this actually helps many users (with no risk to others), I'd point out
you have not actually provided any numbers showing that it actually is
an issue in practice.


> I also disagree that asking the kernel to go do random I/O for us, 
> even as a prefetch, is entirely free simply because we won't
> actually need those pages.  At the least, it potentially pushes out
> pages that we might need shortly from the filesystem cache, no?
Where exactly did I say it's free? I said that workloads where this
happens a lot most likely don't need the prefetching at all, so it can
be simply disabled, eliminating all negative effects.

Moreover, looking at a limited number of recently prefetched blocks
won't eliminate this problem anyway - imagine a random OLTP workload on a
large data set that nevertheless fits into RAM. After a while no read I/O
needs to be done, but you'd need a pretty much infinite list of prefetched blocks
to eliminate that, and with smaller lists you'll still do 99% of the
prefetches.

Just disabling prefetching on such instances seems quite reasonable.


>> If it was up to me, I'd just get the patch committed as is. Delaying the
>> feature because of concerns that it might have some negative effect in
>> some cases, when that can be simply mitigated by disabling the feature,
>> is not really beneficial for our users.
> 
> I don't know that we actually know how many cases it might have a
> negative effect on or how large such a negative effect might
> be - that's really why we should probably try to actually benchmark
> it and get real numbers behind it, particularly when the chances of
> running into such a negative effect with the default configuration (that
> is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more
> likely to occur in the field than the cases where FPWs are disabled and
> someone's running on ZFS.
> 
> Perhaps more to the point, it'd be nice to see how this

Re: Calendar support in localization

2021-03-17 Thread Thomas Munro
On Thu, Mar 18, 2021 at 3:48 AM Tom Lane  wrote:
> Robert Haas  writes:
> > It's not very obvious how to scale this kind of approach to a wide
> > variety of calendar types, and as Thomas says, it would much cooler to
> > be able to handle all of the ones that ICU knows how to support rather
> > than just one. But, the problem I see with using timestamptz is that
> > it's not so obvious how to get a different output format ... unless, I
> > guess, we can cram it into DateStyle. And it's also much less obvious
> > how you get the other functions and operators to do what you want, if
> > it's different.
>
> Yeah, I'm afraid that it probably is different.  The most obvious
> example is in operations involving type interval:
> select now() + '1 month'::interval;
> That should almost certainly give a different answer when using a
> different calendar --- indeed the units of interest might not even
> be the same.  (Do all human calendars use the concept of months?)

Right, so if this is done by trying to extend Daniel Verite's icu_ext
extension (link given earlier) and Robert's idea of a fast-castable
type, I suppose you might want now()::icu_date + '1 month'::interval
to advance you by one Ethiopic month if you have done SET
icu_ext.ICU_LC_TIME = 'am_ET@calendar=traditional'.  Or if using my
first idea of just sticking with the core types, perhaps you'd have to
replace stuff via search path... I admit that sounds rather error
prone and fragile (I was thinking mainly of different functions, not
operators).  Either way, I suppose there'd also be more explicit
functions for various operations including ones that take an extra
argument if you want to use an explicit locale instead of relying on
the ICU_LC_TIME setting.  I dunno.

As for whether all calendars have months, it looks like ICU's model
has just the familiar looking standardised fields; whether some of
them make no sense in some calendars, I don't know, but it has stuff
like x.get(field, &error), x.set(field, &error), x.add(field, amount,
&error) and if it fails for some field on your particular calendar, or
for some value (you can't set a Gregorian date's month to 13
(apparently we call this month "undecember", hah), but you can for a
Hebrew or Ethiopic one) I suppose we'd just report the error?

> I don't feel like DateStyle is chartered to affect the behavior
> of datetime operators; it's understood as tweaking the I/O behavior
> only.  There might be more of a case for letting LC_TIME choose
> this behavior, but I bet the relevant standards only contemplate

About LC_TIME... I suppose in one possible future we eventually use
ICU for more core stuff, and someone proposes to merge hypothetical
icu_date etc types into the core date etc types, and then LC_TIME
controls that.  But then you might have a version of the problem that
Peter E ran into in attempts so far to use ICU collations as the
default: if you put ICU's funky extensible locale names into the
LC_XXX environment variables, then your libc will see it too, and
might get upset, since PostgreSQL uses the environment.  I suspect that ICU
will understand typical libc locale names, but common libcs won't
understand ICU's highly customisable syntax, but I haven't looked into
it.  If that's generally true, then perhaps the solution to both
problems is a kind of partial separation:  regular LC_XXX, and then
also ICU_LC_XXX which defaults to the same value but can be changed to
access more advanced stuff, and is used only for interacting with ICU.

> Gregorian calendars.  Also, the SQL spec says in so many words
> that the SQL-defined datetime types follow the Gregorian calendar.

:-(




Re: pl/pgsql feature request: shorthand for argument and local variable references

2021-03-17 Thread Michael Paquier
On Wed, Mar 17, 2021 at 05:04:48PM +0100, Pavel Stehule wrote:
> This tree has a different direction than is usual, and then replacing the
> root node is not simple.

Yeah, it is not like we should redesign this whole part just for the
feature discussed here, and that may impact performance as the current
list handling is cheap now.

> A second solution (significantly simpler) is an additional pointer in
> the ns_item structure. In almost all cases this pointer will not be used.
> Because ns_item uses a flexible array, a union cannot be used. I
> implemented this in a patch marked as "alias-implementation".
> 
> What do you think about it?

I am not sure that it is necessary or good to replace the root label
entirely.  So associating a new field directly with it sounds like a
promising approach?
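
For illustration, that second approach could look roughly like this
(the new field name is invented, and the existing fields are only
paraphrased):

    typedef struct PLpgSQL_nsitem
    {
        ...                             /* existing fields (itemtype, itemno, prev) */
        struct PLpgSQL_nsitem *alias;   /* hypothetical: alternate name for the
                                         * root label, NULL in almost all cases */
        char        name[FLEXIBLE_ARRAY_MEMBER];
    } PLpgSQL_nsitem;
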
--
Michael


signature.asc
Description: PGP signature


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-17 Thread Bruce Momjian
On Wed, Mar 17, 2021 at 04:04:44PM -0400, Robert Haas wrote:
> On Wed, Mar 17, 2021 at 12:48 PM Julien Rouhaud  wrote:
> > I originally suggested to make it clearer by having an enum GUC rather than 
> > a
> > boolean, say compute_queryid = [ none | core | external ], and if set to
> > external then a hook would be explicitely called.  Right now, "none" and
> > "external" are binded with compute_queryid = off, and depends on whether an
> > extension is computing a queryid during post_parse_analyse_hook.
> 
> I would just make it a Boolean and have a hook. The Boolean controls
> whether it gets computed at all, and the hook lets an external module
> override the way it gets computed.

OK, is that what everyone wants?  I think that is what the patch already
does.

I think having multiple queryids used in a single cluster is much too
confusing to support.  You would have to label and control which queryid
is displayed by pg_stat_activity and log_line_prefix, and that seems too
confusing and not useful.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: pl/pgsql feature request: shorthand for argument and local variable references

2021-03-17 Thread Hannu Krosing
why are you using yet another special syntax for this?

would it not be better to do something like this:
CREATE FUNCTION a_reall_long_and_winding_function_name(i int, out o int)
LANGUAGE plpgsql AS $plpgsql$
DECLARE
   args function_name_alias
BEGIN
  args.o = 2 * args.i
END;
$plpgsql$;

or at least do something based on block labels (see end of
https://www.postgresql.org/docs/13/plpgsql-structure.html )

CREATE FUNCTION somefunc() RETURNS integer AS $$
<< outerblock >>
DECLARE
...

maybe extend this to

CREATE FUNCTION somefunc() RETURNS integer AS $$
<< functionlabel:args >>
<< outerblock >>
DECLARE
...

or replace 'functionlabel' with something less ugly :)

maybe <<< args >>>

Cheers
Hannu



On Wed, Mar 17, 2021 at 5:05 PM Pavel Stehule  wrote:
>
>
>
> st 17. 3. 2021 v 9:20 odesílatel Michael Paquier  napsal:
>>
>> On Wed, Mar 17, 2021 at 02:06:57PM +0800, Julien Rouhaud wrote:
>> > I also think that there should be a single usable top label, otherwise it 
>> > will
>> > lead to confusing code and it can be a source of bug.
>>
>> Okay, that's fine by me.  Could it be possible to come up with an
>> approach that does not hijack the namespace list contruction and the
>> lookup logic as much as it does now?  I get it that the patch is done
>> this way because of the ordering of operations done for the initial ns
>> list creation and the grammar parsing that adds the routine label on
>> top of the root one, but I'd like to believe that there are more solid
>> ways to achieve that, like for example something that decouples the
>> root label and its alias, or something that associates an alias
>> directly with its PLpgSQL_nsitem entry?
>
>
> I am checking it now, and I don't see any easy solution. The namespace is a 
> one-directional tree - only backward iteration from leaves to the root is 
> supported. At the moment, when I try to replace the name of the root ns_item, 
> this root ns_item is referenced by the function's arguments (NS_TYPE_VAR 
> type). So in any case I need some helper ns_item node that can be a persistent 
> target for these nodes. In this case the memory overhead is just one ns_item.
>
> orig_label <- argument1 <- argument2
>
> The patch has to save the original orig_label because it can be referenced 
> from argument1 or by the "FOUND" variable. The new graph looks like
>
> new_label <- invalidated orig_label <- argument1 <- ...
>
> This tree has a different direction than is usual, and then replacing the 
> root node is not simple.
>
> A second solution (significantly simpler) is an additional pointer in 
> the ns_item structure. In almost all cases this pointer will not be used. Because 
> ns_item uses a flexible array, a union cannot be used. I implemented this 
> in a patch marked as "alias-implementation".
>
> What do you think about it?
>
> Pavel
>
>
>
>
>
>
>>
>> --
>> Michael




Re: Getting better results from valgrind leak tracking

2021-03-17 Thread Tom Lane
Andres Freund  writes:
>> I found a way around that late last night. Need to mark the context
>> itself as an allocation. But I made a mess on the way to that and need
>> to clean the patch up before sending it (and need to drop my
>> girlfriend off first).

> Unfortunately I didn't immediately find a way to do this while keeping
> the MEMPOOL_CREATE/DESTROY in mcxt.c. The attached patch moves the pool
> creation into the memory context implementations, "allocates" the
> context itself as part of that pool, and changes the reset
> implementation from MEMPOOL_DESTROY + MEMPOOL_CREATE to instead do
> MEMPOOL_TRIM. That leaves the memory context itself valid (and thus
> tracking ->ident etc), but marks all the other memory as freed.

Huh, interesting.  I wonder why that makes the ident problem go away?
I'd supposed that valgrind would see the context headers as ordinary
memory belonging to the global "malloc" pool, so that any pointers
inside them ought to be considered valid.

Anyway, I don't have a problem with rearranging the responsibility
like this.  It gives the individual allocators more freedom to do
odd stuff, at the cost of very minor duplication of valgrind calls.

I agree we need more comments -- would you like me to have a go at
writing them?

One thing I was stewing over last night is that a MemoryContextReset
will mess up any context identifier assigned with
MemoryContextCopyAndSetIdentifier.  I'd left that as a problem to
fix later, because we don't currently have a need to reset contexts
that use copied identifiers.  But that assumption obviously will bite
us someday, so maybe now is a good time to think about it.

The very simplest fix would be to allocate non-constant idents with
malloc; which'd require adding a flag to track whether context->ident
needs to be free()d.  We have room for another bool near the top of
struct MemoryContextData (and at some point we could turn those
bool fields into a flags word).  The only real cost here is one
more free() while destroying a labeled context, which is probably
negligible.
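
In other words, something along these lines (an illustrative sketch,
not actual code):

    struct MemoryContextData
    {
        ...
        bool        ident_is_copied;    /* hypothetical: does ident need free()? */
        ...
    };

    /* when destroying (or relabeling) a labeled context: */
    if (context->ident_is_copied)
        free((char *) context->ident);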

Other ideas are possible but they seem to require getting the individual
mcxt methods involved, and I doubt it's worth the complexity.

> There are a few leak warnings around guc.c that look like they might be
> real, not false positives, and thus a bit concerning. Looks like several
> guc check hooks don't bother to free the old *extra before allocating a
> new one.

I'll take a look, but I'm pretty certain that guc.c, not the hooks, is
responsible for freeing those.  Might be another case of valgrind not
understanding what's happening.

> I suspect we might get better results from valgrind, not just for leaks
> but also undefined value tracking, if we changed the way we represent
> pools to utilize VALGRIND_MEMPOOL_METAPOOL |
> VALGRIND_MEMPOOL_AUTO_FREE.

Don't really see why that'd help?  I mean, it could conceivably help
catch bugs in the allocators themselves, but I don't follow the argument
that it'd improve anything else.  Defined is defined, as far as I can
tell from the valgrind manual.

regards, tom lane




Re: [HACKERS] Custom compression methods

2021-03-17 Thread Andres Freund
Hi,

On 2021-03-17 16:01:58 -0400, Robert Haas wrote:
> > - why is HIDE_TOAST_COMPRESSION useful? Doesn't quite seem to be
> >   comparable to HIDE_TABLEAM?
> 
> Andres, what do you mean by this exactly? It's exactly the same issue:
> without this, if you change the default compression method, every test
> that uses \d+ breaks. If you want to be able to run the whole test
> suite with either compression method and get the same results, you
> need this. Now, maybe you don't, because perhaps that doesn't seem so
> important with compression methods as with table AMs.

I think that latter part is why I wasn't sure such an option is
warranted. Given it's a builtin feature, I didn't really foresee a need
to be able to run all the tests with a different compression method. And
it looked like it could just have been copied from the tableam logic,
without a clear need. But if it's useful, then ...

Greetings,

Andres Freund




Re: WIP: WAL prefetch (another approach)

2021-03-17 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> Right, I was just going to point out the FPIs are not necessary - what
> matters is the presence of long streaks of WAL records touching the same
> set of blocks. But people with workloads where this is common likely
> don't need the WAL prefetching at all - the replica can keep up just
> fine, because it doesn't need to do much I/O anyway (and if it can't
> then prefetching won't help much anyway). So just don't enable the
> prefetching, and there'll be no overhead.

Isn't this exactly the common case though..?  Checkpoints happening
every 5 minutes, the replay of the FPI happens first and then the record
is updated and everything's in SB for the later changes?  You mentioned
elsewhere that this would improve 80% of cases but that doesn't seem to
be backed up by anything and certainly doesn't seem likely to be the
case if we're talking about across all PG deployments.  I also disagree
that asking the kernel to go do random I/O for us, even as a prefetch,
is entirely free simply because we won't actually need those pages.  At
the least, it potentially pushes out pages that we might need shortly
from the filesystem cache, no?

> If it was up to me, I'd just get the patch committed as is. Delaying the
> feature because of concerns that it might have some negative effect in
> some cases, when that can be simply mitigated by disabling the feature,
> is not really beneficial for our users.

I don't know that we actually know how many cases it might have a
negative effect on or how large such a negative effect might
be - that's really why we should probably try to actually benchmark
it and get real numbers behind it, particularly when the chances of
running into such a negative effect with the default configuration (that
is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more
likely to occur in the field than the cases where FPWs are disabled and
someone's running on ZFS.

Perhaps more to the point, it'd be nice to see how this change actually
improves the cases where PG is running with more-or-less the defaults on
the more commonly deployed filesystems.  If it doesn't then maybe it
shouldn't be the default..?  Surely the folks running on ZFS and running
with FPWs disabled would be able to manage to enable it if they
wished to and we could avoid entirely the question of if this has a
negative impact on the more common cases.

Guess I'm just not a fan of pushing out a change that will impact
everyone by default, in a possibly negative way (or positive, though
that doesn't seem terribly likely, but who knows), without actually
measuring what that impact will look like in those more common cases.
Showing that it's a great win when you're on ZFS or running with FPWs
disabled is good and the expected best case, but we should be
considering the worst case too when it comes to performance
improvements.

Anyhow, ultimately I don't know that there's much more to discuss on
this thread with regard to this particular topic, at least.  As I said
before, if everyone else is on board and not worried about it then so be
it; I feel that at least the concern that I raised has been heard.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: PoC/WIP: Extended statistics on expressions

2021-03-17 Thread Tomas Vondra



On 3/17/21 9:58 PM, Dean Rasheed wrote:
> On Wed, 17 Mar 2021 at 20:48, Dean Rasheed  wrote:
>>
>> For reference, here is the test case I was using (which isn't really very 
>> good for
>> catching dependence between columns):
>>
> 
> And here's a test case with much more dependence between the columns:
> 
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (a int, b int, c int, d int);
> INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,10) x;
> SELECT COUNT(DISTINCT a) FROM foo; -- 2
> SELECT COUNT(DISTINCT b) FROM foo; -- 5
> SELECT COUNT(DISTINCT c) FROM foo; -- 10
> SELECT COUNT(DISTINCT d) FROM foo; -- 15
> SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6
> SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20
> SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10
> SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30
> 
> -- First case: stats on [(a+b),c]
> CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE
> SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
>   -- Estimate = 150, Actual = 30
>   -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15,
>   -- which is much better than ndistinct((a+b)) * ndistinct(c) *
> ndistinct(d) = 6*10*15 = 900
>   -- Estimate with no stats = 1500
> 
> -- Second case: stats on (c+d) as well
> CREATE STATISTICS s2 ON (c+d) FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE
> SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
>   -- Estimate = 120, Actual = 30
>   -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20
> 
> Again, I'd say the current behaviour is pretty good.
> 

Thanks!

I agree applying at least the [(a+b),c] stats is probably the right
approach, as it means we're considering at least the available
information about dependence between the columns.

I think to improve this, we'll need to teach the code to use overlapping
statistics, a bit like conditional probability. In this case we might do
something like this:

ndistinct((a+b),c) * (ndistinct((c+d)) / ndistinct(c))

Which in this case would be either, for the "less correlated" case

228 * 24 / 12 = 456   (actual = 478, current estimate = 480)

or, for the "more correlated" case

10 * 20 / 10 = 20 (actual = 30, current estimate = 120)

But that's clearly a matter for a future patch, and I'm sure there are
cases where this will produce worse estimates.


Anyway, I plan to go over the patches one more time, and start pushing
them sometime early next week. I don't want to leave it until the very
last moment in the CF.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PoC/WIP: Extended statistics on expressions

2021-03-17 Thread Dean Rasheed
On Wed, 17 Mar 2021 at 20:48, Dean Rasheed  wrote:
>
> For reference, here is the test case I was using (which isn't really very 
> good for
> catching dependence between columns):
>

And here's a test case with much more dependence between the columns:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (a int, b int, c int, d int);
INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,10) x;
SELECT COUNT(DISTINCT a) FROM foo; -- 2
SELECT COUNT(DISTINCT b) FROM foo; -- 5
SELECT COUNT(DISTINCT c) FROM foo; -- 10
SELECT COUNT(DISTINCT d) FROM foo; -- 15
SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6
SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20
SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10
SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30

-- First case: stats on [(a+b),c]
CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 150, Actual = 30
  -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15,
  -- which is much better than ndistinct((a+b)) * ndistinct(c) *
ndistinct(d) = 6*10*15 = 900
  -- Estimate with no stats = 1500

-- Second case: stats on (c+d) as well
CREATE STATISTICS s2 ON (c+d) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 120, Actual = 30
  -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20

Again, I'd say the current behaviour is pretty good.

Regards,
Dean




Re: WIP: BRIN multi-range indexes

2021-03-17 Thread John Naylor
On Wed, Mar 17, 2021 at 3:16 PM Tomas Vondra 
wrote:

> Ummm, no attachment ;-)

Oops, here it is.

--
John Naylor
EDB: http://www.enterprisedb.com


jcn-costing-test.sql
Description: Binary data


Re: non-HOT update not looking at FSM for large tuple update

2021-03-17 Thread John Naylor
On Fri, Mar 12, 2021 at 8:45 AM Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:
>
> If this case isn't added, the lower else branch will fail to find
> fitting pages for len > maxPaddedFsmRequest tuples; potentially
> extending the relation when there is actually still enough space
> available.

Okay, with that it looks better to go back to using Max().

Also in v4:

- With the separate constant you suggested, I split up the comment into two
parts.
- I've added a regression test to insert.sql similar to your earlier test,
but we cannot use vacuum, since in parallel tests there could still be
tuples visible to other transactions. It's still possible to test
almost-all-free by inserting a small tuple.
- Draft commit message

--
John Naylor
EDB: http://www.enterprisedb.com


v4-0001-Fix-bug-in-heap-space-management-that-was-overly-.patch
Description: Binary data


Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2021-03-17 Thread Simon Riggs
On Fri, 12 Mar 2021 at 22:16, Tomas Vondra
 wrote:
>
> On 1/28/21 2:33 PM, Simon Riggs wrote:
> > On Thu, 28 Jan 2021 at 12:53, Masahiko Sawada  wrote:
> >
> >> This entry has been "Waiting on Author" status and the patch has not
> >> been updated since Nov 30. Are you still planning to work on this?
> >
> > Yes, new patch version tomorrow. Thanks for the nudge and the review.
> >
>
> So, is it tomorrow already? ;-)

Been a long day. ;-)

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: PoC/WIP: Extended statistics on expressions

2021-03-17 Thread Dean Rasheed
On Wed, 17 Mar 2021 at 19:07, Tomas Vondra
 wrote:
>
> On 3/17/21 7:54 PM, Dean Rasheed wrote:
> >
> > it might have been better to estimate the first case as
> >
> >  ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> >
> > and the second case as
> >
> >  ndistinct((a+b)) * ndistinct((c+d))
>
> OK. I might be confused, but isn't that what the algorithm currently
> does? Or am I just confused about what the first/second case refers to?
>

No, it currently estimates the first case as ndistinct((a+b),c) *
ndistinct(d). Having said that, maybe that's OK after all. It at least
makes an effort to account for any correlation between (a+b) and
(c+d), using the known correlation between (a+b) and c. For reference,
here is the test case I was using (which isn't really very good for
catching dependence between columns):

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (a int, b int, c int, d int);
INSERT INTO foo SELECT x%10, x%11, x%12, x%13 FROM generate_series(1,10) x;
SELECT COUNT(DISTINCT a) FROM foo; -- 10
SELECT COUNT(DISTINCT b) FROM foo; -- 11
SELECT COUNT(DISTINCT c) FROM foo; -- 12
SELECT COUNT(DISTINCT d) FROM foo; -- 13
SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 20
SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 24
SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 228
SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 478

-- First case: stats on [(a+b),c]
CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 2964, Actual = 478
  -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 228*13

-- Second case: stats on (c+d) as well
CREATE STATISTICS s2 ON (c+d) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
  -- Estimate = 480, Actual = 478
  -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 20*24

I think that's probably pretty reasonable behaviour, given incomplete
stats (the estimate with no extended stats is capped at 1).

Regards,
Dean




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-17 Thread Robert Haas
On Wed, Mar 17, 2021 at 12:48 PM Julien Rouhaud  wrote:
> I originally suggested to make it clearer by having an enum GUC rather than a
> boolean, say compute_queryid = [ none | core | external ], and if set to
> external then a hook would be explicitly called.  Right now, "none" and
> "external" are bound to compute_queryid = off, and depend on whether an
> extension is computing a queryid during post_parse_analyse_hook.

I would just make it a Boolean and have a hook. The Boolean controls
whether it gets computed at all, and the hook lets an external module
override the way it gets computed.
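
For illustration only, that boolean-plus-hook arrangement could look roughly
like the sketch below; every name in it (compute_query_id,
compute_query_id_hook, core_compute_query_id) is an assumption made up for
the example, not the actual patch:

/* Sketch only -- illustrative names, assumes the usual backend headers. */
typedef void (*compute_query_id_hook_type) (ParseState *pstate, Query *query);

bool		compute_query_id = false;	/* GUC: compute a query ID at all? */
compute_query_id_hook_type compute_query_id_hook = NULL;

extern uint64 core_compute_query_id(ParseState *pstate, Query *query); /* hypothetical */

static void
maybe_compute_query_id(ParseState *pstate, Query *query)
{
	if (!compute_query_id)
		return;					/* no query ID computed at all */

	if (compute_query_id_hook != NULL)
		compute_query_id_hook(pstate, query);	/* extension overrides the method */
	else
		query->queryId = core_compute_query_id(pstate, query);	/* in-core default */
}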

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods

2021-03-17 Thread Robert Haas
On Mon, Mar 15, 2021 at 6:58 PM Andres Freund  wrote:
> - Adding all these indirect function calls via toast_compression[] just
>   for all of two builtin methods isn't fun either.

Yeah, it feels like this has too many layers of indirection now. Like,
toast_decompress_datum() first gets TOAST_COMPRESS_METHOD(attr). Then
it calls CompressionIdToMethod to convert one constant (like
TOAST_PGLZ_COMPRESSION_ID) to another constant with a slightly
different name (like TOAST_PGLZ_COMPRESSION). Then it calls
GetCompressionRoutines() to get hold of the function pointers. Then it
does an indirect functional call. That seemed like a pretty reasonable
idea when we were trying to support arbitrary compression AMs without
overly privileging the stuff that was built into core, but if we're
just doing stuff that's built into core, then we could just switch
(TOAST_COMPRESS_METHOD(attr)) and call the correct function. In fact,
we could even move the stuff from toast_compression.c into detoast.c,
which would allow the compiler to optimize better (e.g. by inlining,
if it wants).

The same applies to toast_decompress_datum_slice().
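
As a rough sketch of that direct-switch shape (the helper names here are
illustrative assumptions, not necessarily what the patch will end up using),
the decompression entry point could become something like:

/* Sketch only -- not the committed code. */
static struct varlena *
toast_decompress_datum(struct varlena *attr)
{
	switch (TOAST_COMPRESS_METHOD(attr))
	{
		case TOAST_PGLZ_COMPRESSION_ID:
			return pglz_decompress_datum(attr);
		case TOAST_LZ4_COMPRESSION_ID:
			return lz4_decompress_datum(attr);
		default:
			elog(ERROR, "invalid compression method id %d",
				 TOAST_COMPRESS_METHOD(attr));
			return NULL;		/* keep compiler quiet */
	}
}

and toast_compress_datum() could use the same kind of switch on cmethod to
pick the compressor.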

There's a similar issue in toast_get_compression_method() and the only
caller, pg_column_compression(). Here the multiple mapping layers and
the indirect function call are split across those two functions rather
than all in the same one, but here again one could presumably find a
place to just switch on TOAST_COMPRESS_METHOD(attr) or
VARATT_EXTERNAL_GET_COMPRESSION(attr) and return "pglz" or "lz4"
directly.

In toast_compress_datum(), I think we could have a switch that invokes
the appropriate compressor based on cmethod and sets a variable to the
value to be passed as the final argument of
TOAST_COMPRESS_SET_SIZE_AND_METHOD().

Likewise, I suppose CompressionNameToMethod could at least be
simplified to use constant strings rather than stuff like
toast_compression[TOAST_PGLZ_COMPRESSION_ID].cmname.

> - why is HIDE_TOAST_COMPRESSION useful? Doesn't quite seem to be
>   comparable to HIDE_TABLEAM?

Andres, what do you mean by this exactly? It's exactly the same issue:
without this, if you change the default compression method, every test
that uses \d+ breaks. If you want to be able to run the whole test
suite with either compression method and get the same results, you
need this. Now, maybe you don't, because perhaps that doesn't seem so
important with compression methods as with table AMs. I think that's a
defensible position. But, it is at the underlying level, the same
thing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods

2021-03-17 Thread Robert Haas
On Wed, Mar 17, 2021 at 2:17 PM Andres Freund  wrote:
> OTOH heap_form_flattened_tuple() has the advantage that we can optimize
> it further (e.g. to do the conversion to flattened values in fill_val())
> without changing the outside API.

Well, in my view, that does change the outside API, because either the
input values[] array is going to get scribbled on, or it's not. We
should either decide we're not OK with it and just do the fill_val()
thing now, or we should decide that we are and not worry about doing
the fill_val() thing later. IMHO, anyway.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: WIP: BRIN multi-range indexes

2021-03-17 Thread Tomas Vondra



On 3/17/21 7:59 PM, John Naylor wrote:
> On Thu, Mar 11, 2021 at 12:26 PM Tomas Vondra
> mailto:tomas.von...@enterprisedb.com>>
> wrote:
>>
>> Hi,
>>
>> Here is an updated version of the patch series.
>>
>> It fixes the assert failure (just remove the multiplication from it) and
>> adds a simple regression test that exercises this.
>>
>> Based on the discussion so far, I've decided to keep just the new
>> signature of the consistent function. That's a bit simpler than having
>> to support both 3 and 4 parameters, and it would not deal with the NULL
>> changes anyway (mostly harmless code duplication, but still). I've also
>> realized the API documentation in SGML needs updating.
>>
>> At this point, I think 0001-0006 parts are mostly committable.
> 
> I think so. I've marked it RFC for these six.
> 
>> As for the remaining two parts, the one dealing with correlation may not
>> be strictly necessary, but not having it (or something like it) may
>> result in not picking the BRIN index in some cases.
>>
>> But maybe it's not a major problem. I tried the example from [1] but it
>> no longer triggers the issue for me - I'm not entirely sure why, but the
>> costing changed for some reason. It used to look like this:
> 
>> ...
> 
>> The index scan cost is about the same, but the heap scan is about half
>> the cost. The row estimates are a bit different, perhaps that's related.
>> The seqscan cost (169248) and duration (~500ms) are still about the same,
>> so now we still pick the bitmap heap scan.
> 
> With only 0001-0006, I get a parallel bitmap scan in both cases:
> 
>                                                             QUERY PLAN
> ---
>  Gather  (cost=6542.42..52779.35 rows=10 width=4) (actual
> time=3.283..22.308 rows=10 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    ->  Parallel Bitmap Heap Scan on t0  (cost=5542.42..51778.35 rows=4
> width=4) (actual time=3.434..17.257 rows=3 loops=3)
>          Recheck Cond: (a = 1)
>          Rows Removed by Index Recheck: 83165
>          Heap Blocks: lossy=421
>          ->  Bitmap Index Scan on t0_a_idx  (cost=0.00..5542.42
> rows=381682 width=0) (actual time=2.996..2.996 rows=11040 loops=1)
>                Index Cond: (a = 1)
>  Planning Time: 0.088 ms
>  Execution Time: 22.341 ms
> (11 rows)
> 
>> Not sure we can rely on
>> this, though. Would be quite sad if we had new improved opclasses but
>> the planner often decided not to use them.
> 
> Yeah, I'm not sure what to do here. It might be okay to leave it out now
> and study it further as a PG14 open item or PG15 improvement.
> 

Yeah, that's definitely an option.

>> I had an idea of tweaking choose_bitmap_and to consider both the cost
>> and selectivity (similarly to how add_path considers startup/total cost),
>> and that did indeed resolve this particular case. This is what the 0008
>> part does.
>>
>> But it may also have negative consequence, as demonstrated by the
>> reproducer2.sql script. So maybe the logic would need to be more
>> complicated. Or maybe there's no reliable solution, considering how
>> tricky/unreliable BRIN estimates are.
> 
> Ok, so with 0008 in reproducer2, it chooses the more selective path,
> even though it has a higher total cost:
> 
> 0001-0007:
> 
>                                                      QUERY PLAN
> -
>  Bitmap Heap Scan on t2  (cost=29.03..24032.28 rows=1 width=8) (actual
> time=0.498..1.755 rows=1 loops=1)
>    Recheck Cond: (a = 1000)
>    Rows Removed by Index Recheck: 7167
>    Heap Blocks: lossy=128
>    ->  Bitmap Index Scan on idx_2  (cost=0.00..29.03 rows=7163 width=0)
> (actual time=0.278..0.278 rows=1280 loops=1)
>          Index Cond: (a = 1000)
>  Planning Time: 0.148 ms
>  Execution Time: 1.774 ms
> (8 rows)
> 
> DROP INDEX
>                                                     QUERY PLAN
> ---
>  Bitmap Heap Scan on t2  (cost=656.00..1531.00 rows=1 width=8) (actual
> time=9.695..9.708 rows=1 loops=1)
>    Recheck Cond: (a = 1000)
>    Rows Removed by Index Recheck: 223
>    Heap Blocks: lossy=4
>    ->  Bitmap Index Scan on idx_1  (cost=0.00..656.00 rows=224 width=0)
> (actual time=9.675..9.675 rows=40 loops=1)
>          Index Cond: (a = 1000)
>  Planning Time: 0.110 ms
>  Execution Time: 9.727 ms
> (8 rows)
> 
> and with 0008:
> 
>                                                     QUERY PLAN
> ---
>  Bitmap Heap Scan on t2  (cost=656.00..1531.00 rows=1 width=8) (actual
> time=8.540..8.577 rows=1 loops=1)
>    Recheck Cond: (a = 1000)
>    Rows Removed by I

Re: PoC/WIP: Extended statistics on expressions

2021-03-17 Thread Tomas Vondra



On 3/17/21 7:54 PM, Dean Rasheed wrote:
> On Wed, 17 Mar 2021 at 17:26, Tomas Vondra
>  wrote:
>>
>> My concern is that the current behavior (where we prefer expression
>> stats over multi-column stats to some extent) works fine as long as the
>> parts are independent, but once there's dependency it's probably more
>> likely to produce underestimates. I think underestimates for grouping
>> estimates were a risk in the past, so let's not make that worse.
>>
> 
> I'm not sure the current behaviour really is preferring expression
> stats over multi-column stats. In this example, where we're grouping
> by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of
> those multi-column stats actually match more than one
> column/expression. If anything, I'd go the other way and say that it
> was wrong to use the [(a+b),c] stats in the first case, where they
> were the only stats available, since those stats aren't really
> applicable to (c+d), which probably ought to be treated as
> independent. IOW, it might have been better to estimate the first case
> as
> 
>  ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> 
> and the second case as
> 
>  ndistinct((a+b)) * ndistinct((c+d))
> 

OK. I might be confused, but isn't that what the algorithm currently
does? Or am I just confused about what the first/second case refers to?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-03-17 Thread John Naylor
On Thu, Mar 11, 2021 at 12:26 PM Tomas Vondra 
wrote:
>
> Hi,
>
> Here is an updated version of the patch series.
>
> It fixes the assert failure (just remove the multiplication from it) and
> adds a simple regression test that exercises this.
>
> Based on the discussion so far, I've decided to keep just the new
> signature of the consistent function. That's a bit simpler than having
> to support both 3 and 4 parameters, and it would not deal with the NULL
> changes anyway (mostly harmless code duplication, but still). I've also
> realized the API documentation in SGML needs updating.
>
> At this point, I think 0001-0006 parts are mostly committable.

I think so. I've marked it RFC for these six.

> As for the remaining two parts, the one dealing with correlation may not
> be strictly necessary, but not having it (or something like it) may
> result in not picking the BRIN index in some cases.
>
> But maybe it's not a major problem. I tried the example from [1] but it
> no longer triggers the issue for me - I'm not entirely sure why, but the
> costing changed for some reason. It used to look like this:

> ...

> The index scan cost is about the same, but the heap scan is about half
> the cost. The row estimates are a bit different, perhaps that's related.
> The seqscan cost (169248) and duration (~500ms) are still about the same,
> so now we still pick the bitmap heap scan.

With only 0001-0006, I get a parallel bitmap scan in both cases:

QUERY PLAN
---
 Gather  (cost=6542.42..52779.35 rows=10 width=4) (actual
time=3.283..22.308 rows=10 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on t0  (cost=5542.42..51778.35 rows=4
width=4) (actual time=3.434..17.257 rows=3 loops=3)
 Recheck Cond: (a = 1)
 Rows Removed by Index Recheck: 83165
 Heap Blocks: lossy=421
 ->  Bitmap Index Scan on t0_a_idx  (cost=0.00..5542.42 rows=381682
width=0) (actual time=2.996..2.996 rows=11040 loops=1)
   Index Cond: (a = 1)
 Planning Time: 0.088 ms
 Execution Time: 22.341 ms
(11 rows)

> Not sure we can rely on
> this, though. Would be quite sad if we had new improved opclasses but
> the planner often decided not to use them.

Yeah, I'm not sure what to do here. It might be okay to leave it out now
and study it further as a PG14 open item or PG15 improvement.

> I had an idea of tweaking choose_bitmap_and to consider both the cost
> and selectivity (similarly to how add_path considers startup/total cost),
> and that did indeed resolve this particular case. This is what the 0008
> part does.
>
> But it may also have negative consequence, as demonstrated by the
> reproducer2.sql script. So maybe the logic would need to be more
> complicated. Or maybe there's no reliable solution, considering how
> tricky/unreliable BRIN estimates are.

Ok, so with 0008 in reproducer2, it chooses the more selective path, even
though it has a higher total cost:

0001-0007:

 QUERY PLAN
-
 Bitmap Heap Scan on t2  (cost=29.03..24032.28 rows=1 width=8) (actual
time=0.498..1.755 rows=1 loops=1)
   Recheck Cond: (a = 1000)
   Rows Removed by Index Recheck: 7167
   Heap Blocks: lossy=128
   ->  Bitmap Index Scan on idx_2  (cost=0.00..29.03 rows=7163 width=0)
(actual time=0.278..0.278 rows=1280 loops=1)
 Index Cond: (a = 1000)
 Planning Time: 0.148 ms
 Execution Time: 1.774 ms
(8 rows)

DROP INDEX
QUERY PLAN
---
 Bitmap Heap Scan on t2  (cost=656.00..1531.00 rows=1 width=8) (actual
time=9.695..9.708 rows=1 loops=1)
   Recheck Cond: (a = 1000)
   Rows Removed by Index Recheck: 223
   Heap Blocks: lossy=4
   ->  Bitmap Index Scan on idx_1  (cost=0.00..656.00 rows=224 width=0)
(actual time=9.675..9.675 rows=40 loops=1)
 Index Cond: (a = 1000)
 Planning Time: 0.110 ms
 Execution Time: 9.727 ms
(8 rows)

and with 0008:

QUERY PLAN
---
 Bitmap Heap Scan on t2  (cost=656.00..1531.00 rows=1 width=8) (actual
time=8.540..8.577 rows=1 loops=1)
   Recheck Cond: (a = 1000)
   Rows Removed by Index Recheck: 223
   Heap Blocks: lossy=4
   ->  Bitmap Index Scan on idx_1  (cost=0.00..656.00 rows=224 width=0)
(actual time=8.507..8.508 rows=40 loops=1)
 Index Cond: (a = 1000)
 Planning Time: 0.175 ms
 Execution Time: 8.601 ms
(8 rows)

DROP INDEX
   

Re: PoC/WIP: Extended statistics on expressions

2021-03-17 Thread Dean Rasheed
On Wed, 17 Mar 2021 at 17:26, Tomas Vondra
 wrote:
>
> My concern is that the current behavior (where we prefer expression
> stats over multi-column stats to some extent) works fine as long as the
> parts are independent, but once there's dependency it's probably more
> likely to produce underestimates. I think underestimates for grouping
> estimates were a risk in the past, so let's not make that worse.
>

I'm not sure the current behaviour really is preferring expression
stats over multi-column stats. In this example, where we're grouping
by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of
those multi-column stats actually match more than one
column/expression. If anything, I'd go the other way and say that it
was wrong to use the [(a+b),c] stats in the first case, where they
were the only stats available, since those stats aren't really
applicable to (c+d), which probably ought to be treated as
independent. IOW, it might have been better to estimate the first case
as

 ndistinct((a+b)) * ndistinct(c) * ndistinct(d)

and the second case as

 ndistinct((a+b)) * ndistinct((c+d))

Regards,
Dean




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-03-17 Thread Justin Pryzby
The v8 patch has the "broken constraint" problem.

Also, it "fails to avoid" adding duplicate constraints:

Check constraints:
"c" CHECK (i IS NOT NULL AND i > 1 AND i < 2)
"cc" CHECK (i IS NOT NULL AND i >= 1 AND i < 2)
"p1_check" CHECK (true)
"p1_i_check" CHECK (i IS NOT NULL AND i >= 1 AND i < 2)

> diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
> index 5c9f4af1d5..0cb846f408 100644
> --- a/doc/src/sgml/catalogs.sgml
> +++ b/doc/src/sgml/catalogs.sgml
> @@ -4485,6 +4485,16 @@ SCRAM-SHA-256$ count>:&l
> when using declarative partitioning.
>
>   
> +
> + 
> +  
> +   inhdetachpending bool
> +  
> +  
> +   Set to true for a partition that is in the process of being detached;
> +   false otherwise.
> +  
> + 

Remove "Set to" ?
And say true and false

Probably you'll hate the suggestion, but maybe it should be "pendingdetach".
We already have pg_settings.pending_restart.

> +  If CONCURRENTLY is specified, this process runs in 
> two
> +  transactions in order to avoid blocking other sessions that might be 
> accessing
> +  the partitioned table.  During the first transaction, a
> +  SHARE UPDATE EXCLUSIVE lock is taken on both parent 
> table and
> +  partition, and its partition is marked detached; at that point, the 
> transaction
> +  is committed and all transactions using the partitioned table are 
> waited for.
> +  Once all those transactions are gone, the second stage acquires

Instead of "gone", say "have completed" ?

> +/*
> + * MarkInheritDetached
> + *
> + * When a partition is detached from its parent concurrently, we don't
> + * remove the pg_inherits row until a second transaction; as a preparatory
> + * step, this function marks the entry as 'detached', so that other

*pending detached

> + * The strategy for concurrency is to first modify the partition catalog
> + * rows to make it visible to everyone that the partition is detached,

the inherits catalog?

> + /*
> +  * In concurrent mode, the partition is locked with 
> share-update-exclusive
> +  * in the first transaction.  This allows concurrent transactions to be
> +  * doing DML to the partition.

> + /*
> +  * Check inheritance conditions and either delete the pg_inherits row
> +  * (in non-concurrent mode) or just set the inhisdetached flag.

detachpending




Re: [HACKERS] Custom compression methods

2021-03-17 Thread Andres Freund
Hi,

On 2021-03-17 13:31:14 -0400, Robert Haas wrote:
> On Wed, Mar 17, 2021 at 7:41 AM Dilip Kumar  wrote:
> > 0002:
> > - Wrapper over heap_form_tuple and used in ExecEvalRow() and
> > ExecEvalFieldStoreForm()
> 
> Instead of having heap_form_flattened_tuple(), how about
> heap_flatten_values(tupleDesc, values, isnull) that is documented to
> modify the values array? Then instead of replacing the
> heap_form_tuple() calls with a call to heap_form_flattened_tuple(),
> you just insert a call to heap_flatten_values() before the call to
> heap_form_tuple(). I think that might be easier for people looking at
> this code in the future to understand what's happening.

OTOH heap_form_flattened_tuple() has the advantage that we can optimize
it further (e.g. to do the conversion to flattened values in fill_val())
without changing the outside API.

Greetings,

Andres Freund




Re: Getting better results from valgrind leak tracking

2021-03-17 Thread Andres Freund
Hi,

(really need to fix my mobile phone mail program to keep the CC list...)

On 2021-03-17 08:15:43 -0700, Andres Freund wrote:
> On Wed, Mar 17, 2021, at 07:16, Tom Lane wrote:
> > Andres Freund  writes:
> > > On 2021-03-16 20:50:17 -0700, Andres Freund wrote:
> > Meanwhile, I'm still trying to understand why valgrind is whining
> > about the rd_indexcxt identifier strings.  AFAICS it shouldn't.
> 
> I found a way around that late last night. Need to mark the context
> itself as an allocation. But I made a mess on the way to that and need
> to clean the patch up before sending it (and need to drop my
> girlfriend off first).

Unfortunately I didn't immediately find a way to do this while keeping
the MEMPOOL_CREATE/DESTROY in mcxt.c. The attached patch moves the pool
creation into the memory context implementations, "allocates" the
context itself as part of that pool, and changes the reset
implementation from MEMPOOL_DESTROY + MEMPOOL_CREATE to instead do
MEMPOOL_TRIM. That leaves the memory context itself valid (and thus
tracking ->ident etc), but marks all the other memory as freed.

This is just a first version, it probably needs more work, and
definitely a few comments...

After this, your changes, and the previously mentioned fixes, I get far
fewer false positives. Also found a crash / memory leak in pgstat.c due
to the new replication slot stats, but I'll start a separate thread.


There are a few leak warnings around guc.c that look like they might be
real, not false positives, and thus a bit concerning. Looks like several
guc check hooks don't bother to free the old *extra before allocating a
new one.


I suspect we might get better results from valgrind, not just for leaks
but also undefined value tracking, if we changed the way we represent
pools to utilize VALGRIND_MEMPOOL_METAPOOL |
VALGRIND_MEMPOOL_AUTO_FREE. E.g. aset.c would associate AllocBlock using
VALGRIND_MEMPOOL_ALLOC and then mcxt.c would use
VALGRIND_MALLOCLIKE_BLOCK for the individual chunk allocation.

https://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.mempools
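
A very rough sketch of what that could look like, assuming a valgrind new
enough to provide VALGRIND_CREATE_MEMPOOL_EXT and those two flags (untested,
just to show the shape):

/* aset.c: the context itself becomes a "metapool" */
VALGRIND_CREATE_MEMPOOL_EXT(set, 0, false,
							VALGRIND_MEMPOOL_METAPOOL | VALGRIND_MEMPOOL_AUTO_FREE);

/* when malloc()ing a new AllocBlock: the block is a superblock of the pool */
VALGRIND_MEMPOOL_ALLOC(set, block, blksize);

/* mcxt.c: each chunk handed out by palloc() */
VALGRIND_MALLOCLIKE_BLOCK(chunk, size, 0, false);

/* pfree() of an individual chunk */
VALGRIND_FREELIKE_BLOCK(chunk, 0);

/* releasing a whole block: AUTO_FREE makes valgrind drop the chunks inside */
VALGRIND_MEMPOOL_FREE(set, block);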


I played with naming the allocations underlying aset.c using
VALGRIND_CREATE_BLOCK(block, strlen(context->name), context->name).
That does produce better undefined-value warnings, but it seems that
e.g. the leak detector doesn't have that information around. Nor does it
seem to be usable for use-after-free. At least the latter is likely because
I had to VALGRIND_DISCARD by that point...

Greetings,

Andres Freund
>From c7e69e4bccfc4da97b0b999399732e3dc806b67e Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 17 Mar 2021 10:46:39 -0700
Subject: [PATCH] Make memory contexts themselves more visible to valgrind.

---
 src/include/utils/memdebug.h|  1 +
 src/backend/utils/mmgr/aset.c   | 22 ++
 src/backend/utils/mmgr/generation.c |  8 
 src/backend/utils/mmgr/mcxt.c   |  5 -
 src/backend/utils/mmgr/slab.c   |  8 
 5 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/src/include/utils/memdebug.h b/src/include/utils/memdebug.h
index e88b4c6e8ef..5988bff8839 100644
--- a/src/include/utils/memdebug.h
+++ b/src/include/utils/memdebug.h
@@ -23,6 +23,7 @@
 #define VALGRIND_CHECK_MEM_IS_DEFINED(addr, size)			do {} while (0)
 #define VALGRIND_CREATE_MEMPOOL(context, redzones, zeroed)	do {} while (0)
 #define VALGRIND_DESTROY_MEMPOOL(context)	do {} while (0)
+#define VALGRIND_MEMPOOL_TRIM(context, ptr, size)			do {} while (0)
 #define VALGRIND_MAKE_MEM_DEFINED(addr, size)do {} while (0)
 #define VALGRIND_MAKE_MEM_NOACCESS(addr, size)do {} while (0)
 #define VALGRIND_MAKE_MEM_UNDEFINED(addr, size)do {} while (0)
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index ec6c130d0fb..9793ddf4a3d 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -433,6 +433,12 @@ AllocSetContextCreateInternal(MemoryContext parent,
 		{
 			/* Remove entry from freelist */
 			set = freelist->first_free;
+
+			VALGRIND_CREATE_MEMPOOL(set, 0, false);
+			VALGRIND_MEMPOOL_ALLOC(set, set, sizeof(AllocSetContext));
+			/* the contents are still valid, but valgrind can't know that */
+			VALGRIND_MAKE_MEM_DEFINED(set, sizeof(AllocSetContext));
+
 			freelist->first_free = (AllocSet) set->header.nextchild;
 			freelist->num_free--;
 
@@ -477,6 +483,8 @@ AllocSetContextCreateInternal(MemoryContext parent,
 		   name)));
 	}
 
+	VALGRIND_CREATE_MEMPOOL(set, 0, false);
+	VALGRIND_MEMPOOL_ALLOC(set, set, sizeof(AllocSetContext));
 	/*
 	 * Avoid writing code that can fail between here and MemoryContextCreate;
 	 * we'd leak the header/initial block if we ereport in this stretch.
@@ -611,6 +619,8 @@ AllocSetReset(MemoryContext context)
 
 	Assert(context->mem_allocated == keepersize);
 
+	VALGRIND_MEMPOOL_TRIM(set, set, sizeof(AllocSetAlloc));
+
 	/* Reset block size allocation sequence, too */
 	set->nextBlockSize = set->initBlockSize;
 }
@@ -652,6 +662,16 @@ A

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-03-17 Thread Alvaro Herrera
On 2021-Mar-15, Alvaro Herrera wrote:

> Here's a fixup patch to do it that way.  I tried running the commands
> you showed and one of them immediately dies with the new error message;
> I can't cause the bogus constraint to show up anymore.

Actually, that was a silly fix that didn't actually work correctly, as I
discovered immediately after sending it.  The right fix is to forbid all
commands other than DETACH PARTITION FINALIZE in a partition that's in
the process of being detached.

In the attached v8, I did that; I also added a ton more tests that
hopefully show how the feature should work in concurrent cases,
including one case in which the transaction doing the detach is
cancelled.  I also renamed "inhdetached" to "inhdetachpending", per
previous discussion, including changing how it looks in psql.

I am not aware of any other loose end in this patch; I consider this
version final.  Barring further problem reports, I'll get this pushed
tomorrow morning.

psql completion is missing. If somebody would like to contribute that,
I'm grateful.

-- 
Álvaro Herrera    39°49'30"S 73°17'W
"If you have nothing to say, maybe you need just the right tool to help you
not say it."   (New York Times, about Microsoft PowerPoint)
>From 101a20d1ccf86a255af166e198c7ac691af58c99 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 13 Jul 2020 20:15:30 -0400
Subject: [PATCH v8 1/2] Let ALTER TABLE exec routines deal with the relation

This means that ATExecCmd relies on AlteredRelationInfo->rel instead of
keeping the relation as a local variable; this is useful if the
subcommand needs to modify the relation internally.  For example, a
subcommand that internally commits its transaction needs this.
---
 src/backend/commands/tablecmds.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ffb1308a0c..8e753f1efd 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -156,6 +156,8 @@ typedef struct AlteredTableInfo
 	Oid			relid;			/* Relation to work on */
 	char		relkind;		/* Its relkind */
 	TupleDesc	oldDesc;		/* Pre-modification tuple descriptor */
+	/* Transiently set during Phase 2, normally set to NULL */
+	Relation	rel;
 	/* Information saved by Phase 1 for Phase 2: */
 	List	   *subcmds[AT_NUM_PASSES]; /* Lists of AlterTableCmd */
 	/* Information saved by Phases 1/2 for Phase 3: */
@@ -353,7 +355,7 @@ static void ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 	  AlterTableUtilityContext *context);
 static void ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 			  AlterTableUtilityContext *context);
-static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
+static void ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 	  AlterTableCmd *cmd, LOCKMODE lockmode, int cur_pass,
 	  AlterTableUtilityContext *context);
 static AlterTableCmd *ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab,
@@ -4512,7 +4514,6 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 		{
 			AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab);
 			List	   *subcmds = tab->subcmds[pass];
-			Relation	rel;
 			ListCell   *lcmd;
 
 			if (subcmds == NIL)
@@ -4521,10 +4522,10 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 			/*
 			 * Appropriate lock was obtained by phase 1, needn't get it again
 			 */
-			rel = relation_open(tab->relid, NoLock);
+			tab->rel = relation_open(tab->relid, NoLock);
 
 			foreach(lcmd, subcmds)
-ATExecCmd(wqueue, tab, rel,
+ATExecCmd(wqueue, tab,
 		  castNode(AlterTableCmd, lfirst(lcmd)),
 		  lockmode, pass, context);
 
@@ -4536,7 +4537,11 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 			if (pass == AT_PASS_ALTER_TYPE)
 ATPostAlterTypeCleanup(wqueue, tab, lockmode);
 
-			relation_close(rel, NoLock);
+			if (tab->rel)
+			{
+relation_close(tab->rel, NoLock);
+tab->rel = NULL;
+			}
 		}
 	}
 
@@ -4562,11 +4567,12 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
  * ATExecCmd: dispatch a subcommand to appropriate execution routine
  */
 static void
-ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
+ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 		  AlterTableCmd *cmd, LOCKMODE lockmode, int cur_pass,
 		  AlterTableUtilityContext *context)
 {
 	ObjectAddress address = InvalidObjectAddress;
+	Relation	rel = tab->rel;
 
 	switch (cmd->subtype)
 	{
@@ -5669,6 +5675,7 @@ ATGetQueueEntry(List **wqueue, Relation rel)
 	 */
 	tab = (AlteredTableInfo *) palloc0(sizeof(AlteredTableInfo));
 	tab->relid = relid;
+	tab->rel = NULL;			/* set later */
 	tab->relkind = rel->rd_rel->relkind;
 	tab->oldDesc = CreateTupleDescCopyConstr(RelationGetDescr(rel));
 	tab->newrelpersistence = RELPERSISTENCE_PERMANENT;
-- 
2.20.1

>From 6e893b5f2e7abae9e55ba0d1487d4621272d2cdf Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date

Re: Index Skip Scan (new UniqueKeys)

2021-03-17 Thread Tomas Vondra



On 3/17/21 6:02 PM, Dmitry Dolgov wrote:
>> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote:
>> Hi,
>>
>> I took a look at the new patch series, focusing mostly on the uniquekeys
>> part. It'd be a bit tedious to explain all the review comments here, so
>> attached is a patch series with a "review" patch for some of the parts.
> 
> Great, thanks.
> 
>> Most of it is fairly small (corrections to comments etc.), I'll go over
>> the more serious part so that we can discuss it here. I'll keep it split
>> per parts of the original patch series.
>> I suggest looking for XXX and FIXME comments in all the review patches.
>>
>>
>> 0001
>> 
>>
>> 
>>
>> 0002
>> 
>>
> 
> In fact both 0001 & 0002 belong to another thread, which these days
> span [1], [2]. I've included them only because they happened to be a
> dependency for index skip scan following David suggestions, sorry if
> it's confusing.
> 
> At the same time the author behind 0001 & 0002 is present in this thread
> as well, maybe Andy can answer these comments right here and better than me.
> 

Ah, sorry for the confusion. In that case the review comments probably
belong to the other threads, so we should move the discussion there.
It's not clear to me which of the threads is the right one.

>> 0003
>> 
>>
>> Just some comments/whitespace.
>>
>>
>> 0004
>> 
>>
>> I wonder why we don't include this in explain TEXT format? Seems it
>> might make it harder to write regression tests for this? It's easier to
>> just check that we deduced the right unique key(s) than having to
>> construct an example where it actually changes the plan.
> 
> Yeah, good point. I believe originally it was like that to not make
> explain too verbose for skip scans, but displaying prefix definitely
> could be helpful for testing, so will do this (and address other
> comments as well).
> 

Cool. Thanks.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Custom compression methods

2021-03-17 Thread Robert Haas
On Wed, Mar 17, 2021 at 7:41 AM Dilip Kumar  wrote:
> 0002:
> - Wrapper over heap_form_tuple and used in ExecEvalRow() and
> ExecEvalFieldStoreForm()

Instead of having heap_form_flattened_tuple(), how about
heap_flatten_values(tupleDesc, values, isnull) that is documented to
modify the values array? Then instead of replacing the
heap_form_tuple() calls with a call to heap_form_flattened_tuple(),
you just insert a call to heap_flatten_values() before the call to
heap_form_tuple(). I think that might be easier for people looking at
this code in the future to understand what's happening.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: PoC/WIP: Extended statistics on expressions

2021-03-17 Thread Tomas Vondra
Hi,

On 3/17/21 4:55 PM, Dean Rasheed wrote:
> On Sun, 7 Mar 2021 at 21:10, Tomas Vondra  
> wrote:
>>
>> 2) ndistinct
>>
>> There's one thing that's bugging me, in how we handle "partial" matches.
>> For each expression we track both the original expression and the Vars
>> we extract from it. If we can't find a statistics matching the whole
>> expression, we try to match those individual Vars, and we remove the
>> matching ones from the list. And in the end we multiply the estimates
>> for the remaining Vars.
>>
>> This works fine with one matching ndistinct statistics. Consider for example
>>
>>  GROUP BY (a+b), (c+d)
>>
>> with statistics on [(a+b),c] - that is, expression and one column.
> 
> I've just been going over this example, and I think it actually works
> slightly differently from how you described, though I haven't worked
> out the full general implications of that.
> 
>> We parse the expressions into two GroupExprInfo
>>
>>  {expr: (a+b), vars: [a, b]}
>>  {expr: (c+d), vars: [c, d]}
>>
> 
> Here, I think what you actually get, in the presence of stats on
> [(a+b),c] is actually the following two GroupExprInfos:
> 
>   {expr: (a+b), vars: []}
>   {expr: (c+d), vars: [c, d]}
> 

Yeah, right. To be precise, we get

{expr: (a+b), vars: [(a+b)]}

because in the first case we pass NIL, so add_unique_group_expr treats
the whole expression as a var (a bit strange, but OK).

> because of the code immediately after this comment in estimate_num_groups():
> 
> /*
>  * If examine_variable is able to deduce anything about the GROUP BY
>  * expression, treat it as a single variable even if it's really more
>  * complicated.
>  */
> 
> As it happens, that makes no difference in this case, where there is
> just this one stats object, but it does change things when there are
> two stats objects.
> 
>> and the statistics matches the first item exactly (the expression). The
>> second expression is not in the statistics, but we match "c". So we end
>> up with an estimate for "(a+b), c" and have one remaining GroupExprInfo:
>>
>>  {expr: (c+d), vars: [d]}
> 
> Right.
> 
>> Without any other statistics we estimate that as ndistinct for "d", so
>> we end up with
>>
>>  ndistinct((a+b), c) * ndistinct(d)
>>
>> which mostly makes sense. It assumes ndistinct(c+d) is product of the
>> ndistinct estimates, but that's kinda what we've been always doing.
> 
> Yes, that appears to be what happens, and it's probably the best that
> can be done with the available stats.
> 
>> But now consider we have another statistics on just (c+d). In the second
>> loop we end up matching this expression exactly, so we end up with
>>
>>  ndistinct((a+b), c) * ndistinct((c+d))
> 
> In this case, with stats on (c+d) as well, the two GroupExprInfos
> built at the start change to:
> 
>   {expr: (a+b), vars: []}
>   {expr: (c+d), vars: []}
> 
> so it actually ends up not using any multivariate stats, but instead
> uses the two univariate expression stats, giving
> 
>  ndistinct((a+b)) * ndistinct((c+d))
> 
> which actually seems pretty good, and gave very good estimates in the
> simple test case I tried.
> 

Yeah, that works pretty well in this case.

I wonder if we'd be better off extracting the Vars and doing what I
mistakenly described as the current behavior. That would essentially mean
extracting the Vars even in the case where we now pass NIL.

My concern is that the current behavior (where we prefer expression
stats over multi-column stats to some extent) works fine as long as the
parts are independent, but once there's dependency it's probably more
likely to produce underestimates. I think underestimates for grouping
estimates were a risk in the past, so let's not make that worse.

>> i.e. we kinda use the "c" twice. Which is a bit unfortunate. I think
>> what we should do after the first loop is just discarding the whole
>> expression and "expand" into per-variable GroupExprInfo, so in the
>> second step we would not match the (c+d) statistics.
> 
> Not using the (c+d) stats would give either
> 
>  ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> 
> or
> 
>  ndistinct((a+b), c) * ndistinct(d)
> 
> depending on exactly how the algorithm was changed. In my test, these
> both gave worse estimates, but there are probably other datasets for
> which they might do better. It all depends on how much correlation
> there is between (a+b) and c.
> 
> I suspect that there is no optimal strategy for handling overlapping
> stats that works best for all datasets, but the current algorithm
> seems to do a pretty decent job.
> 
>> Of course, maybe there's a better way to pick the statistics, but I
>> think our conclusion so far was that people should just create
>> statistics covering all the columns in the query, to not have to match
>> multiple statistics like this.
> 
> Yes, I think that's always likely to work better, especially for
> ndistinct stats, where 

Re: pg_dump new feature: exporting functions only. Bad or good idea ?

2021-03-17 Thread Tom Lane
Vik Fearing  writes:
> On 3/17/21 6:00 PM, Lætitia Avrot wrote:
>> However, when I finally got the time to look at it in detail, I found out
>> there was no way to solve the dependencies in the functions and procedures,
>> so that the exported file, when re-played could lead to invalid objects.
>>  ...
>> So, my question is: what do you think about such a feature? is it worth it?

> Yes, it is absolutely worth it to be able to extract just the functions
> from a database.  I have wanted this several times.

Selective dumps always have a risk of not being restorable on their
own; I don't see that "functions only" is noticeably less safe than
"just these tables", or other cases that we support already.

What I'm wondering about is how this might interact with the
discussion at [1].

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/CAFj8pRB10wvW0CC9Xq=1XDs=zcqxer3cblcnza+qix4cuh-...@mail.gmail.com




Re: pg_dump new feature: exporting functions only. Bad or good idea ?

2021-03-17 Thread Fabrízio de Royes Mello
On Wed, Mar 17, 2021 at 2:00 PM Lætitia Avrot 
wrote:
>
> Hey hackers,
>
> I had this idea, that I raised and cherished like my baby to add a switch
in `pg_dump` to allow exporting stored functions (and procedures) only.
>
> However, when I finally got the time to look at it in detail, I found out
there was no way to solve the dependencies in the functions and procedures,
so that the exported file, when re-played could lead to invalid objects.
>
> So, I decided this would not make Postgres better and decide to walk off
this patch.
>
> Anyhow, when sharing my thoughts, several people told me to ask the
community about adding this feature because this could be useful in some
use cases. Another argument is that should you have all your functions in
one schema and your tables in another, exporting only the function schema
will lead to the same kind of file that could lead to invalid objects
created when the file is re-played against a database that does not have
the tables.
>
> Of course, the documentation would add a warning against object
invalidity should only the stored functions/procedures be exported.
>
> So, my question is: what do you think about such a feature? is it worth
it?
>

Makes total sense, since we already have --function=NAME(args) in pg_restore
and it doesn't solve dependencies... so we can add it to also export
function/procedure contents.

+1 for general idea.

--
   Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento


Re: [patch] [doc] Minor variable related cleanup and rewording of plpgsql docs

2021-03-17 Thread Tom Lane
I wrote:
> I took a stab at doing that, just to see what it might look like.
> I thought it comes out pretty well, really -- see what you think.

Hearing nothing further, I pushed that after another round of
copy-editing.  There's still plenty of time to revise it if
anybody has further comments.

regards, tom lane




Re: pg_dump new feature: exporting functions only. Bad or good idea ?

2021-03-17 Thread Vik Fearing
On 3/17/21 6:00 PM, Lætitia Avrot wrote:
> Hey hackers,
> 
> I had this idea, that I raised and cherished like my baby to add a switch
> in `pg_dump` to allow exporting stored functions (and procedures) only.
> 
> However, when I finally got the time to look at it in detail, I found out
> there was no way to solve the dependencies in the functions and procedures,
> so that the exported file, when re-played could lead to invalid objects.
> 
> So, I decided this would not make Postgres better and decide to walk off
> this patch.
> 
> Anyhow, when sharing my thoughts, several people told me to ask the
> community about adding this feature because this could be useful in some
> use cases. Another argument is that should you have all your functions in
> one schema and your tables in another, exporting only the function schema
> will lead to the same kind of file that could lead to invalid objects
> created when the file is re-played against a database that does not have
> the tables.
> 
> Of course, the documentation would add a warning against object invalidity
> should only the stored functions/procedures be exported.
> 
> So, my question is: what do you think about such a feature? is it worth it?

Yes, it is absolutely worth it to be able to extract just the functions
from a database.  I have wanted this several times.
-- 
Vik Fearing




pg_dump new feature: exporting functions only. Bad or good idea ?

2021-03-17 Thread Lætitia Avrot
Hey hackers,

I had this idea, that I raised and cherished like my baby to add a switch
in `pg_dump` to allow exporting stored functions (and procedures) only.

However, when I finally got the time to look at it in detail, I found out
there was no way to solve the dependencies in the functions and procedures,
so that the exported file, when re-played could lead to invalid objects.

So, I decided this would not make Postgres better and decide to walk off
this patch.

Anyhow, when sharing my thoughts, several people told me to ask the
community about adding this feature because this could be useful in some
use cases. Another argument is that should you have all your functions in
one schema and your tables in another, exporting only the function schema
will lead to the same kind of file that could lead to invalid objects
created when the file is re-played against a database that does not have
the tables.

Of course, the documentation would add a warning against object invalidity
should only the stored functions/procedures be exported.

So, my question is: what do you think about such a feature? is it worth it?

Have a nice day,

Lætitia


Re: Index Skip Scan (new UniqueKeys)

2021-03-17 Thread Dmitry Dolgov
> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote:
> Hi,
>
> I took a look at the new patch series, focusing mostly on the uniquekeys
> part. It'd be a bit tedious to explain all the review comments here, so
> attached is a patch series with a "review" patch for some of the parts.

Great, thanks.

> Most of it is fairly small (corrections to comments etc.), I'll go over
> the more serious part so that we can discuss it here. I'll keep it split
> per parts of the original patch series.
> I suggest looking for XXX and FIXME comments in all the review patches.
>
>
> 0001
> 
>
> 
>
> 0002
> 
>

In fact both 0001 & 0002 belong to another thread, which these days
span [1], [2]. I've included them only because they happened to be a
dependency for index skip scan following David suggestions, sorry if
it's confusing.

At the same time the author behind 0001 & 0002 is present in this thread
as well, maybe Andy can answer these comments right here and better than me.

> 0003
> 
>
> Just some comments/whitespace.
>
>
> 0004
> 
>
> I wonder why we don't include this in explain TEXT format? Seems it
> might make it harder to write regression tests for this? It's easier to
> just check that we deduced the right unique key(s) than having to
> construct an example where it actually changes the plan.

Yeah, good point. I believe originally it was like that to not make
explain too verbose for skip scans, but displaying prefix definitely
could be helpful for testing, so will do this (and address other
comments as well).

[1]: 
https://www.postgresql.org/message-id/flat/caku4awpqjaqjwq2x-ar9g3+zhrzu1k8hnp7a+_mluov-n5a...@mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/flat/caku4awru35c9g3ce15jmvwh6b2hzf4hf7czukrsiktv7akr...@mail.gmail.com




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-17 Thread Julien Rouhaud
On Wed, Mar 17, 2021 at 12:24:44PM -0400, Bruce Momjian wrote:
> On Wed, Mar 17, 2021 at 05:16:50PM +0100, Pavel Stehule wrote:
> > 
> > 
> > st 17. 3. 2021 v 17:03 odesílatel Tom Lane  napsal:
> > 
> > Bruce Momjian  writes:
> > > On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote:
> > >> I still say that it's a serious mistake to sanctify a query ID
> > calculation
> > >> method that was designed only for pg_stat_statement's needs as the 
> > one
> > >> true way to do it.  But that's what exposing it in a core view would 
> > do.
> > 
> > > OK, I am fine with creating a new method, and maybe having
> > > pg_stat_statements use it.  Is that the direction we should be going 
> > in?
> > 
> > The point is that we've understood Query.queryId as something that
> > different extensions might calculate differently for their own needs.
> > In particular it's easy to imagine extensions that want an ID that is
> > less fuzzy than what pg_stat_statements wants.  We never had a plan for
> > how two such extensions could co-exist, but at least it was possible
> > to use one if you didn't use another.  If this gets moved into core
> > then there will basically be only one way that anyone can do it.
> > 
> > Maybe what we need is a design for allowing more than one query ID.
> > 
> > 
> > Theoretically there can be a hook for calculation of queryid, that can be by
> > used extension. Default can be assigned with a method that is used by
> > pg_stat_statements.
> 
> Yes, that is what the code patch says it does.
> 
> > I don't think it is possible to use more different query id for
> > pg_stat_statements so this solution can be simple.
> 
> Agreed.

Actually, putting the query identifier computation in the core makes it way more
tunable, even if that's counterintuitive.  What it means is that you can now choose
to use the usual pgss algorithm or a different one for log_line_prefix and
pg_stat_activity.queryid, but also that you can now use pgss with a different
query id algorithm.  That's another thing that users have been asking for for a
long time.

I originally suggested to make it clearer by having an enum GUC rather than a
boolean, say compute_queryid = [ none | core | external ], and if set to
external then a hook would be explicitly called.  Right now, "none" and
"external" are bound to compute_queryid = off, and depend on whether an
extension is computing a queryid during post_parse_analyse_hook.

It could later be extended to suit other needs if we ever come to some
agreement (for instance "legacy", "logical_replication_stable" or whatever
better name we can find for something that doesn't depend on Oid).
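
FWIW the enum variant would just be an ordinary enum GUC; a minimal sketch,
using the names suggested above (nothing here is committed code):

/* Sketch only. */
typedef enum
{
	COMPUTE_QUERYID_NONE,
	COMPUTE_QUERYID_CORE,
	COMPUTE_QUERYID_EXTERNAL
} ComputeQueryIdType;

static const struct config_enum_entry compute_queryid_options[] = {
	{"none", COMPUTE_QUERYID_NONE, false},
	{"core", COMPUTE_QUERYID_CORE, false},
	{"external", COMPUTE_QUERYID_EXTERNAL, false},
	{NULL, 0, false}
};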




Re: libpq debug log

2021-03-17 Thread Alvaro Herrera
Hello

In pqTraceOutputString(), you can use the return value from fprintf to
move the cursor -- no need to count chars.
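
That is, something along these lines (illustrative only):

	/* fprintf() returns the number of bytes written, so the cursor can
	 * simply be advanced by its return value. */
	cursor += fprintf(f, "\"%s\"", str);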

I still think that the message-type specific functions should print the
message type, rather than having the string arrays.

-- 
Álvaro Herrera   Valdivia, Chile
"La gente vulgar sólo piensa en pasar el tiempo;
el que tiene talento, en aprovecharlo"




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-17 Thread Bruce Momjian
On Wed, Mar 17, 2021 at 05:16:50PM +0100, Pavel Stehule wrote:
> 
> 
> st 17. 3. 2021 v 17:03 odesílatel Tom Lane  napsal:
> 
> Bruce Momjian  writes:
> > On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote:
> >> I still say that it's a serious mistake to sanctify a query ID
> calculation
> >> method that was designed only for pg_stat_statement's needs as the one
> >> true way to do it.  But that's what exposing it in a core view would 
> do.
> 
> > OK, I am fine with creating a new method, and maybe having
> > pg_stat_statements use it.  Is that the direction we should be going in?
> 
> The point is that we've understood Query.queryId as something that
> different extensions might calculate differently for their own needs.
> In particular it's easy to imagine extensions that want an ID that is
> less fuzzy than what pg_stat_statements wants.  We never had a plan for
> how two such extensions could co-exist, but at least it was possible
> to use one if you didn't use another.  If this gets moved into core
> then there will basically be only one way that anyone can do it.
> 
> Maybe what we need is a design for allowing more than one query ID.
> 
> 
> Theoretically there can be a hook for calculation of queryid, that can be by
> used extension. Default can be assigned with a method that is used by
> pg_stat_statements.

Yes, that is what the code patch says it does.

> I don't think it is possible to use more different query id for
> pg_stat_statements so this solution can be simple.

Agreed.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-17 Thread Pavel Stehule
st 17. 3. 2021 v 17:03 odesílatel Tom Lane  napsal:

> Bruce Momjian  writes:
> > On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote:
> >> I still say that it's a serious mistake to sanctify a query ID
> calculation
> >> method that was designed only for pg_stat_statement's needs as the one
> >> true way to do it.  But that's what exposing it in a core view would do.
>
> > OK, I am fine with creating a new method, and maybe having
> > pg_stat_statements use it.  Is that the direction we should be going in?
>
> The point is that we've understood Query.queryId as something that
> different extensions might calculate differently for their own needs.
> In particular it's easy to imagine extensions that want an ID that is
> less fuzzy than what pg_stat_statements wants.  We never had a plan for
> how two such extensions could co-exist, but at least it was possible
> to use one if you didn't use another.  If this gets moved into core
> then there will basically be only one way that anyone can do it.
>
> Maybe what we need is a design for allowing more than one query ID.
>

Theoretically there can be a hook for calculation of the queryid that can be
used by an extension. The default can be assigned with the method that is
used by pg_stat_statements.

I don't think it is possible to use more than one query ID for
pg_stat_statements, so this solution can be simple.
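
To be concrete, I imagine something roughly like this (just a sketch with
hypothetical names -- none of this is taken from the actual patch):

/* sketch: the hook type and the default fallback are hypothetical names */
typedef uint64 (*compute_queryid_hook_type) (ParseState *pstate, Query *query);

compute_queryid_hook_type compute_queryid_hook = NULL;

static void
assign_query_id(ParseState *pstate, Query *query)
{
    if (compute_queryid_hook != NULL)
        query->queryId = compute_queryid_hook(pstate, query);
    else
        query->queryId = default_queryid(pstate, query);   /* in-core method */
}

An extension would then presumably install its own function into the hook
from its _PG_init().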

regards

Pavel





>
> > I do think we need _some_ method in core if we are going to be exposing
> > this value in pg_stat_activity and log_line_prefix.
>
> I'm basically objecting to the conclusion that we should do either
> of those.  There is no way around the fact that it will break every
> user of Query.queryId other than pg_stat_statements, unless they
> are okay with whatever definition pg_stat_statements is using (which
> is a moving target BTW).
>
> regards, tom lane
>
>
>


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-17 Thread Bruce Momjian
On Wed, Mar 17, 2021 at 12:01:38PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote:
> >> I still say that it's a serious mistake to sanctify a query ID calculation
> >> method that was designed only for pg_stat_statement's needs as the one
> >> true way to do it.  But that's what exposing it in a core view would do.
> 
> > OK, I am fine with creating a new method, and maybe having
> > pg_stat_statements use it.  Is that the direction we should be going in?
> 
> The point is that we've understood Query.queryId as something that
> different extensions might calculate differently for their own needs.
> In particular it's easy to imagine extensions that want an ID that is
> less fuzzy than what pg_stat_statements wants.  We never had a plan for
> how two such extensions could co-exist, but at least it was possible
> to use one if you didn't use another.  If this gets moved into core
> then there will basically be only one way that anyone can do it.

Well, the patch docs say:

    Enables or disables in core query identifier computation.  The
    pg_stat_statements extension requires a query
--> identifier to be computed.  Note that an external module can
--> alternatively be used if the in core query identifier computation
    specification doesn't suit your need.  In this case, in core
    computation must be disabled.  The default is off.

> Maybe what we need is a design for allowing more than one query ID.
> 
> > I do think we need _some_ method in core if we are going to be exposing
> > this value in pg_stat_activity and log_line_prefix.
> 
> I'm basically objecting to the conclusion that we should do either
> of those.  There is no way around the fact that it will break every
> user of Query.queryId other than pg_stat_statements, unless they
> are okay with whatever definition pg_stat_statements is using (which
> is a moving target BTW).

I thought the above doc patch feature avoided this problem because an
extension can override the built-in query ID.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: WIP: WAL prefetch (another approach)

2021-03-17 Thread Tomas Vondra
On 2/15/21 12:18 AM, Stephen Frost wrote:
> Greetings,
> 
> ...
>
 I think there's potential for some significant optimization going
 forward, but I think it's basically optimization over what we're doing
 today. As this is already a nontrivial patch, I'd argue for doing so
 separately.
>>>
>>> This seems like a great optimization, albeit a fair bit of code, for a
>>> relatively uncommon use-case, specifically where full page writes are
>>> disabled or very large checkpoints.  As that's the case though, I would
>>> think it's reasonable to ask that it go out of its way to avoid slowing
>>> down the more common configurations, particularly since it's proposed to
>>> have it on by default (which I agree with, provided it ends up improving
>>> the common cases, which I think the suggestions above would certainly
>>> make it more likely to do).
>>
>> I'm OK to do some benchmarking, but it's not quite clear to me why it matters
>> if the checkpoints are smaller than shared buffers? IMO what matters
>> is how "localized" the updates are, i.e. how likely it is to hit the same
>> page repeatedly (in a short amount of time). Regular pgbench is not very
>> suitable for that, but some non-uniform distribution should do the trick, I
>> think.
> 
> I suppose strictly speaking it'd be
> Min(wal_decode_buffer_size,checkpoint_size), but yes, you're right that
> it's more about the wal_decode_buffer_size than the checkpoint's size.
> Apologies for the confusion.  As suggested above, one way to benchmark
> this to really see if there's any issue would be to increase
> wal_decode_buffer_size to some pretty big size and then compare the
> performance vs. unpatched.  I'd think that could even be done with
> pgbench, so you're not having to arrange for the same pages to get
> updated over and over.
> 

What exactly would be the point of such a benchmark? I don't think the
patch does prefetching based on wal_decode_buffer_size - that just says
how far ahead we decode; the prefetch distance is defined by
maintenance_io_concurrency.

But it's not clear to me what exactly the result would say about the
necessity of the optimization at hand (skipping prefetches for blocks
with a recent FPI). If maintenance_io_concurrency is very high,
the probability that a block is evicted prematurely grows, making the
prefetch useless in general. How does this say anything about the
problem at hand? Sure, we'll do unnecessary I/O, causing issues, but
that's a bit like complaining the engine gets very hot when driving on a
highway in reverse.

AFAICS to measure the worst case, you'd need a workload with a lot of
FPIs and very little actual I/O. That means a data set that fits into
memory (either shared buffers or RAM), and short checkpoints. But that's
exactly the case where you don't need prefetching ...
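
Just so we're all talking about the same thing, by "the optimization at hand"
I mean something roughly like the following filter -- an illustrative sketch
only, not code from the patch, and the names are made up for the example:

/* Sketch: remember the last blocks that carried a full-page image and
 * skip prefetching them again shortly afterwards. */
#define RECENT_FPI_SLOTS 64

typedef struct RecentFpi
{
    RelFileNode rnode;
    BlockNumber blkno;
} RecentFpi;

static RecentFpi recent_fpi[RECENT_FPI_SLOTS];

static bool
had_recent_fpi(RelFileNode rnode, BlockNumber blkno)
{
    RecentFpi  *slot = &recent_fpi[blkno % RECENT_FPI_SLOTS];

    return RelFileNodeEquals(slot->rnode, rnode) && slot->blkno == blkno;
}

static void
remember_fpi(RelFileNode rnode, BlockNumber blkno)
{
    RecentFpi  *slot = &recent_fpi[blkno % RECENT_FPI_SLOTS];

    slot->rnode = rnode;
    slot->blkno = blkno;
}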

>>> Perhaps this already improves the common cases and is worth the extra
>>> code on that basis, but I don't recall seeing much in the way of
>>> benchmarking in this thread for that case- that is, where FPIs are
>>> enabled and checkpoints are smaller than shared buffers.  Jakub's
>>> testing was done with FPWs disabled and Tomas's testing used checkpoints
>>> which were much larger than the size of shared buffers on the system
>>> doing the replay.  While it's certainly good that this patch improves
>>> those cases, we should also be looking out for the worst case and make
>>> sure that the patch doesn't degrade performance in that case.
>>
>> I'm with Andres on this. It's fine to leave some possible optimizations on
>> the table for the future. And even if some workloads are affected
>> negatively, it's still possible to disable the prefetching.
> 
> While I'm generally in favor of this argument, that a feature is
> particularly important and that it's worth slowing down the common cases
> to enable it, I dislike that it's applied inconsistently.  I'd certainly

If you have a workload where this happens to cause issues, you can just
disable that. IMHO that's a perfectly reasonable engineering approach,
where we get something that significantly improves 80% of the cases,
allow disabling it for cases where it might cause issues, and then
improve it in the next version.


> feel better about it if we had actual performance numbers to consider.
> I don't doubt the possibility that the extra prefetch's just don't
> amount to enough to matter but I have a hard time seeing them as not
> having some cost and without actually measuring it, it's hard to say
> what that cost is.
> 
> Without looking farther back than the last record, we could end up
> repeatedly asking for the same blocks to be prefetched too-
> 
> FPI for block 1
> FPI for block 2
> WAL record for block 1
> WAL record for block 2
> WAL record for block 1
> WAL record for block 2
> WAL record for block 1
> WAL record for block 2
> 
> ... etc.
> 
> Entirely possible my math is off, but seems like the worst case
> situation right now might end up with some 4500 u

Re: pl/pgsql feature request: shorthand for argument and local variable references

2021-03-17 Thread Pavel Stehule
On Wed, Mar 17, 2021 at 9:20, Michael Paquier wrote:

> On Wed, Mar 17, 2021 at 02:06:57PM +0800, Julien Rouhaud wrote:
> > I also think that there should be a single usable top label, otherwise
> it will
> > lead to confusing code and it can be a source of bug.
>
> Okay, that's fine by me.  Could it be possible to come up with an
> approach that does not hijack the namespace list construction and the
> lookup logic as much as it does now?  I get it that the patch is done
> this way because of the ordering of operations done for the initial ns
> list creation and the grammar parsing that adds the routine label on
> top of the root one, but I'd like to believe that there are more solid
> ways to achieve that, like for example something that decouples the
> root label and its alias, or something that associates an alias
> directly with its PLpgSQL_nsitem entry?
>

I am checking it now, and I don't see any easy solution. The namespace is a
one-direction tree - only backward iteration from leaves to roots is
supported. At the moment when I try to replace the name of the root ns_item,
this root ns_item is already referenced by the function's arguments
(NS_TYPE_VAR type). So in any case I need some helper ns_item node that can
be a persistent target for these nodes. In this case the memory overhead is
just one ns_item.

orig_label <- argument1 <- argument2

The patch has to save the original orig_label because it can be referenced
from argument1 or by the "FOUND" variable. The new graph looks like

new_label <- invalidated orig_label <- argument1 <- ...

This tree has a different direction than usual, so replacing the
root node is not simple.

A second solution (significantly simpler) is an additional pointer in the
ns_item structure. In almost all cases this pointer will not be used.
Because ns_item uses a flexible array, a union cannot be used. I
implemented this in the patch marked as "alias-implementation".
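
In other words, the struct ends up looking roughly like this (a sketch only --
the real definition lives in plpgsql.h, and the patch may place the new field
differently):

typedef struct PLpgSQL_nsitem
{
    PLpgSQL_nsitem_type itemtype;
    int         itemno;
    struct PLpgSQL_nsitem *prev;
    char       *alias;      /* optional replacement name, NULL in almost
                             * all cases */
    char        name[FLEXIBLE_ARRAY_MEMBER];   /* must stay last, so a
                                                * union with it is not
                                                * possible */
} PLpgSQL_nsitem;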

What do you think about it?

Pavel







> --
> Michael
>
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
index 9242c54329..ed8e774899 100644
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -292,7 +292,9 @@ $$ LANGUAGE plpgsql;
   special variables such as FOUND (see
   ).  The outer block is
   labeled with the function's name, meaning that parameters and special
-  variables can be qualified with the function's name.
+  variables can be qualified with the function's name. The name of this outer
+  block can be changed by inserting special command 
+  #routine_label new_name at the start of the function.
  
 
 
@@ -435,6 +437,31 @@ $$ LANGUAGE plpgsql;
  
 
 
+
+ The function's argument can be qualified with the function name:
+
+CREATE FUNCTION sales_tax(subtotal real) RETURNS real AS $$
+BEGIN
+RETURN sales_tax.subtotal * 0.06;
+END;
+$$ LANGUAGE plpgsql;
+
+
+
+
+ Sometimes the function name is too long and it can be more practical to use
+ some shorter label. The top namespace label can be changed (along with
+ the functions' arguments) using the option routine_label:
+
+CREATE FUNCTION sales_tax(subtotal real) RETURNS real AS $$
+#routine_label s
+BEGIN
+RETURN s.subtotal * 0.06;
+END;
+$$ LANGUAGE plpgsql;
+
+
+
  
   Some more examples:
 
diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
index ce8d97447d..d32e050c32 100644
--- a/src/pl/plpgsql/src/pl_comp.c
+++ b/src/pl/plpgsql/src/pl_comp.c
@@ -378,6 +378,10 @@ do_compile(FunctionCallInfo fcinfo,
 	 */
 	plpgsql_ns_init();
 	plpgsql_ns_push(NameStr(procStruct->proname), PLPGSQL_LABEL_BLOCK);
+
+	/* save top ns for possibility to alter top label */
+	function->root_ns = plpgsql_ns_top();
+
 	plpgsql_DumpExecTree = false;
 	plpgsql_start_datums();
 
diff --git a/src/pl/plpgsql/src/pl_funcs.c b/src/pl/plpgsql/src/pl_funcs.c
index 919b968826..7132da35d1 100644
--- a/src/pl/plpgsql/src/pl_funcs.c
+++ b/src/pl/plpgsql/src/pl_funcs.c
@@ -101,6 +101,7 @@ plpgsql_ns_additem(PLpgSQL_nsitem_type itemtype, int itemno, const char *name)
 	nse->itemtype = itemtype;
 	nse->itemno = itemno;
 	nse->prev = ns_top;
+	nse->alias = NULL;
 	strcpy(nse->name, name);
 	ns_top = nse;
 }
@@ -141,7 +142,7 @@ plpgsql_ns_lookup(PLpgSQL_nsitem *ns_cur, bool localmode,
 			 nsitem->itemtype != PLPGSQL_NSTYPE_LABEL;
 			 nsitem = nsitem->prev)
 		{
-			if (strcmp(nsitem->name, name1) == 0)
+			if (strcmp(nsitem->alias ? nsitem->alias : nsitem->name, name1) == 0)
 			{
 if (name2 == NULL ||
 	nsitem->itemtype != PLPGSQL_NSTYPE_VAR)
@@ -155,13 +156,13 @@ plpgsql_ns_lookup(PLpgSQL_nsitem *ns_cur, bool localmode,
 
 		/* Check this level for qualified match to variable name */
 		if (name2 != NULL &&
-			strcmp(nsitem->name, name1) == 0)
+			strcmp(nsitem->alias ? nsitem->alias : nsitem->name, name1) == 0)
 		{
 			for (nsitem = ns_cur;
  nsitem->itemtype != PLPGSQL_NSTYPE_LABEL;
  nsitem = nsitem->prev)
 			{
-if (st

Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-17 Thread Tom Lane
Bruce Momjian  writes:
> On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote:
>> I still say that it's a serious mistake to sanctify a query ID calculation
>> method that was designed only for pg_stat_statement's needs as the one
>> true way to do it.  But that's what exposing it in a core view would do.

> OK, I am fine with creating a new method, and maybe having
> pg_stat_statements use it.  Is that the direction we should be going in?

The point is that we've understood Query.queryId as something that
different extensions might calculate differently for their own needs.
In particular it's easy to imagine extensions that want an ID that is
less fuzzy than what pg_stat_statements wants.  We never had a plan for
how two such extensions could co-exist, but at least it was possible
to use one if you didn't use another.  If this gets moved into core
then there will basically be only one way that anyone can do it.

Maybe what we need is a design for allowing more than one query ID.

> I do think we need _some_ method in core if we are going to be exposing
> this value in pg_stat_activity and log_line_prefix.

I'm basically objecting to the conclusion that we should do either
of those.  There is no way around the fact that it will break every
user of Query.queryId other than pg_stat_statements, unless they
are okay with whatever definition pg_stat_statements is using (which
is a moving target BTW).

regards, tom lane




Re: PoC/WIP: Extended statistics on expressions

2021-03-17 Thread Dean Rasheed
On Sun, 7 Mar 2021 at 21:10, Tomas Vondra  wrote:
>
> 2) ndistinct
>
> There's one thing that's bugging me, in how we handle "partial" matches.
> For each expression we track both the original expression and the Vars
> we extract from it. If we can't find a statistics matching the whole
> expression, we try to match those individual Vars, and we remove the
> matching ones from the list. And in the end we multiply the estimates
> for the remaining Vars.
>
> This works fine with one matching ndistinct statistics. Consider for example
>
>  GROUP BY (a+b), (c+d)
>
> with statistics on [(a+b),c] - that is, expression and one column.

I've just been going over this example, and I think it actually works
slightly differently from how you described, though I haven't worked
out the full general implications of that.

> We parse the expressions into two GroupExprInfo
>
>  {expr: (a+b), vars: [a, b]}
>  {expr: (c+d), vars: [c, d]}
>

Here, I think what you actually get, in the presence of stats on
[(a+b),c], is the following two GroupExprInfos:

  {expr: (a+b), vars: []}
  {expr: (c+d), vars: [c, d]}

because of the code immediately after this comment in estimate_num_groups():

/*
 * If examine_variable is able to deduce anything about the GROUP BY
 * expression, treat it as a single variable even if it's really more
 * complicated.
 */

As it happens, that makes no difference in this case, where there is
just this one stats object, but it does change things when there are
two stats objects.

> and the statistics matches the first item exactly (the expression). The
> second expression is not in the statistics, but we match "c". So we end
> up with an estimate for "(a+b), c" and have one remaining GroupExprInfo:
>
>  {expr: (c+d), vars: [d]}

Right.

> Without any other statistics we estimate that as ndistinct for "d", so
> we end up with
>
>  ndistinct((a+b), c) * ndistinct(d)
>
> which mostly makes sense. It assumes ndistinct(c+d) is a product of the
> ndistinct estimates, but that's kinda what we've always been doing.

Yes, that appears to be what happens, and it's probably the best that
can be done with the available stats.

> But now consider we have another statistics on just (c+d). In the second
> loop we end up matching this expression exactly, so we end up with
>
>  ndistinct((a+b), c) * ndistinct((c+d))

In this case, with stats on (c+d) as well, the two GroupExprInfos
built at the start change to:

  {expr: (a+b), vars: []}
  {expr: (c+d), vars: []}

so it actually ends up not using any multivariate stats, but instead
uses the two univariate expression stats, giving

 ndistinct((a+b)) * ndistinct((c+d))

which actually seems pretty good, and gave very good estimates in the
simple test case I tried.

> i.e. we kinda use the "c" twice. Which is a bit unfortunate. I think
> what we should do after the first loop is just discarding the whole
> expression and "expand" into per-variable GroupExprInfo, so in the
> second step we would not match the (c+d) statistics.

Not using the (c+d) stats would give either

 ndistinct((a+b)) * ndistinct(c) * ndistinct(d)

or

 ndistinct((a+b), c) * ndistinct(d)

depending on exactly how the algorithm was changed. In my test, these
both gave worse estimates, but there are probably other datasets for
which they might do better. It all depends on how much correlation
there is between (a+b) and c.

I suspect that there is no optimal strategy for handling overlapping
stats that works best for all datasets, but the current algorithm
seems to do a pretty decent job.

> Of course, maybe there's a better way to pick the statistics, but I
> think our conclusion so far was that people should just create
> statistics covering all the columns in the query, to not have to match
> multiple statistics like this.

Yes, I think that's always likely to work better, especially for
ndistinct stats, where all possible permutations of subsets of the
columns are included, so a single ndistinct stat can work well for a
range of different queries.

Regards,
Dean




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-17 Thread Bruce Momjian
On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > We are reaching the two-year mark on this feature, that everyone seems
> > to agree is needed.  Is any committer going to work on this to get it
> > into PG 14?  Should I take it?
> 
> I still say that it's a serious mistake to sanctify a query ID calculation
> method that was designed only for pg_stat_statement's needs as the one
> true way to do it.  But that's what exposing it in a core view would do.

OK, I am fine with creating a new method, and maybe having
pg_stat_statements use it.  Is that the direction we should be going in?
I do think we need _some_ method in core if we are going to be exposing
this value in pg_stat_activity and log_line_prefix.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-17 Thread Tom Lane
Bruce Momjian  writes:
> We are reaching the two-year mark on this feature, that everyone seems
> to agree is needed.  Is any committer going to work on this to get it
> into PG 14?  Should I take it?

I still say that it's a serious mistake to sanctify a query ID calculation
method that was designed only for pg_stat_statement's needs as the one
true way to do it.  But that's what exposing it in a core view would do.

regards, tom lane




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-17 Thread Bruce Momjian
On Sun, Mar 14, 2021 at 04:06:45PM +0800, Julien Rouhaud wrote:
> Recent conflict, thanks to cfbot.  v18 attached.

We are reaching the two-year mark on this feature, that everyone seems
to agree is needed.  Is any committer going to work on this to get it
into PG 14?  Should I take it?

I just read the thread and I didn't see any open issues.  Are there any?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Getting better results from valgrind leak tracking

2021-03-17 Thread Andres Freund
Hi,

On Wed, Mar 17, 2021, at 07:16, Tom Lane wrote:
> Andres Freund  writes:
> > On 2021-03-16 20:50:17 -0700, Andres Freund wrote:
> Meanwhile, I'm still trying to understand why valgrind is whining
> about the rd_indexcxt identifier strings.  AFAICS it shouldn't.

I found a way around that late last night. Need to mark the context itself as 
an allocation. But I made a mess on the way to that and need to clean the patch 
up before sending it (and need to drop my girlfriend off first).

Andres




RE: libpq debug log

2021-03-17 Thread iwata....@fujitsu.com
Hi Tsunakawa san, Alvaro san,

Thank you very much for your review. It helped me a lot to make the patch 
better.
I updated the patch to v26.
This patch has been updated according to Tsunakawa-san's and Alvaro-san's
review comments.

The output is following;
```
2021-03-17 14:43:16.411238  >   Terminate   4   
  
2021-03-17 14:43:16.441775  >   Query   155 "CREATE TABLESPACE 
regress_tblspacewith LOCATION 
'/home/iwata/pgsql/postgres/src/test/regress/testtablespace' WITH 
(some_nonexistent_parameter = true);"
2021-03-17 14:43:16.442935  <   ErrorResponse   124 S "ERROR" V "ERROR" C 
"22023" M "unrecognized parameter "some_nonexistent_parameter"" F 
"reloptions.c" L "1447" R "parseRelOptionsInternal" \x00 "Z"
2021-03-17 14:43:16.442977  <   ReadyForQuery   5 I 
  
2021-03-17 14:43:16.443042  >   Query   144 "CREATE TABLESPACE 
regress_tblspacewith LOCATION 
'/home/iwata/pgsql/postgres/src/test/regress/testtablespace' WITH 
(random_page_cost = 3.0);"
2021-03-17 14:43:16.444774  <   CommandComplete 22 "CREATE TABLESPACE"  
  
2021-03-17 14:43:16.444807  <   ReadyForQuery   5 I 
2021-03-17 14:43:16.444878  >   Query   81 "SELECT spcoptions FROM 
pg_tablespace WHERE spcname = 'regress_tblspacewith';"
2021-03-17 14:43:16.448117  <   RowDescription  35 1 "spcoptions" 1213 5 1009 
65535 -1 0
2021-03-17 14:43:16.448157  <   DataRow 32 1 22 '{random_page_cost=3.0}'
  
2021-03-17 14:43:16.448171  <   CommandComplete 13 "SELECT 1"   
```

> -Original Message-
> From: Tsunakawa, Takayuki/綱川 貴之 
> Sent: Tuesday, March 16, 2021 12:03 PM
 
 
> (1)
> -  Enables  tracing of the client/server communication to a debugging
> file stream.
> +  Enables tracing of the client/server communication to a debugging file
> +  stream.
> +  Only available when protocol version 3 or higher is used.
> 
> The last line is unnecessary now that v2 is not supported.

I removed that last line from the documentation file.

> (2)
> @@ -113,6 +114,7 @@ install: all installdirs install-lib
>   $(INSTALL_DATA) $(srcdir)/libpq-fe.h '$(DESTDIR)$(includedir)'
>   $(INSTALL_DATA) $(srcdir)/libpq-events.h '$(DESTDIR)$(includedir)'
>   $(INSTALL_DATA) $(srcdir)/libpq-int.h
> '$(DESTDIR)$(includedir_internal)'
> + $(INSTALL_DATA) $(srcdir)/libpq-trace.h
> '$(DESTDIR)$(includedir_internal)'
>   $(INSTALL_DATA) $(srcdir)/pqexpbuffer.h
> '$(DESTDIR)$(includedir_internal)'
>   $(INSTALL_DATA) $(srcdir)/pg_service.conf.sample
> '$(DESTDIR)$(datadir)/pg_service.conf.sample'
> 
> Why is this necessary?

I think it is not necessary. I removed it.

> (3) libpq-trace.h
> +#ifdef __cplusplus
> +extern "C"
> +{
> 
> This is unnecessary because pqTraceOutputMessage() is for libpq's internal
> use.  This extern is for the user's C++ application to call public libpq
> functions.
> 
> +#include "libpq-fe.h"
> +#include "libpq-int.h"
> 
> I think these can be removed by declaring the struct and function as follows:
> 
> struct pg_conn;
> extern void pqTraceOutputMessage(struct pg_conn *conn, const char
> *message, bool toServer);
> 
> But... after doing the above, only this declaration is left in libpq-trace.h. 
>  Why
> don't you put your original declaration using PGconn in libpq-int.h and remove
> libpq-trace.h?

I removed this file.
I was wondering whether to delete this file during the development of the v25
patch. I left it because it kept things extensible: if the tracing code were
extended later and gained a lot of extern functions, keeping this header file
would be meaningful.
However, I don't think that will happen, so I removed it.

> (4)
> + /* Trace message only when there is first 1 byte */
> + if (conn->Pfdebug && (conn->outCount < conn->outMsgStart))
> + pqTraceOutputMessage(conn, conn->outBuffer +
> conn->outCount, true);
> 
> () surrounding the condition after && can be removed.

I removed this (). Also, this if () statement has been modified according to
review comment (14).

> (5)
> +static const char *pqGetProtocolMsgType(unsigned char c,
> +
>   bool toServer);
> 
> This is unnecessary because the function definition precedes its only one call
> site.

Yes, I removed this definition.

> (6)
> + * We accumulate frontend message pieces in an array as the libpq code
> + writes
> + * them, and log the complete message when pqTraceOutputFeMsg is called.
> + * For backend, we print the pieces as soon as we receive them from the
> server.
> 
> The first paragraph is no longer true.  I think this entire comment is
> unnecessary.

I removed this explanation. 

> (7)
> +static const char *
> +pqGetProtocolMsgType(unsigned char c, bool toServer) {
> + if (toServer == true &&
> + c < lengthof(protocol_message_type_f) &&
> + protocol_message_type_f[c] != NULL)
> + return protocol_message_type_f[c];
> +
> + if (toServer == false &&
> + c < lengthof(protoco

Re: Calendar support in localization

2021-03-17 Thread Tom Lane
Robert Haas  writes:
> It's not very obvious how to scale this kind of approach to a wide
> variety of calendar types, and as Thomas says, it would be much cooler to
> be able to handle all of the ones that ICU knows how to support rather
> than just one. But, the problem I see with using timestamptz is that
> it's not so obvious how to get a different output format ... unless, I
> guess, we can cram it into DateStyle. And it's also much less obvious
> how you get the other functions and operators to do what you want, if
> it's different.

Yeah, I'm afraid that it probably is different.  The most obvious
example is in operations involving type interval:
select now() + '1 month'::interval;
That should almost certainly give a different answer when using a
different calendar --- indeed the units of interest might not even
be the same.  (Do all human calendars use the concept of months?)

I don't feel like DateStyle is chartered to affect the behavior
of datetime operators; it's understood as tweaking the I/O behavior
only.  There might be more of a case for letting LC_TIME choose
this behavior, but I bet the relevant standards only contemplate
Gregorian calendars.  Also, the SQL spec says in so many words
that the SQL-defined datetime types follow the Gregorian calendar.

So on the whole, new datatypes and operators seem like the way
to go.  I concur that if ICU has solved this problem, piggybacking
on it seems more promising than writing our own code.

regards, tom lane




Re: Getting better results from valgrind leak tracking

2021-03-17 Thread Tom Lane
Andres Freund  writes:
> On 2021-03-16 20:50:17 -0700, Andres Freund wrote:
>> What I meant was that I didn't understand how there's not a leak
>> danger when compilation fails halfway through, given that the context
>> in question is below TopMemoryContext and that I didn't see a relevant
>> TRY block. But that probably there is something cleaning it up that I
>> didn't see.

> Looks like it's an actual leak:

Yeah, I believe that.  On the other hand, I'm not sure that such cases
represent any real problem for production usage.  I'm inclined to focus
on non-error scenarios first.

(Having said that, we probably have the ability to fix such things
relatively painlessly now, by reparenting an initially-temporary
context once we're done parsing.)

Meanwhile, I'm still trying to understand why valgrind is whining
about the rd_indexcxt identifier strings.  AFAICS it shouldn't.

regards, tom lane




Re: Calendar support in localization

2021-03-17 Thread Robert Haas
On Wed, Mar 17, 2021 at 9:54 AM Surafel Temesgen  wrote:
> As you mention above, whatever the calendar type is, we end up storing an
> integer that represents the date. So rather than re-implementing every
> function and operation for every calendar, we can use the existing Gregorian
> implementation as a base: if a new calendar wants to perform the same
> function or operation, it translates to the Gregorian one, uses the existing
> function or operation, and translates back to the working calendar. In this
> approach the only functions we need for supporting a new calendar are a
> translator from the new calendar to the Gregorian one and from the Gregorian
> calendar to the new calendar, and maybe input/output functions. What do you
> think of this implementation?

I'm not sure what the best way of tackling this problem is, but I
wanted to mention another possible approach: instead of actually using
the timestamptz data type, have another data type that is
binary-compatible with timestamptz - that is, it's the same thing on
disk, so you can cast between the two data types for free. Then have
separate input/output functions for it, separate operators and
functions and so forth.

It's not very obvious how to scale this kind of approach to a wide
variety of calendar types, and as Thomas says, it would be much cooler to
be able to handle all of the ones that ICU knows how to support rather
than just one. But, the problem I see with using timestamptz is that
it's not so obvious how to get a different output format ... unless, I
guess, we can cram it into DateStyle. And it's also much less obvious
how you get the other functions and operators to do what you want, if
it's different.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Calendar support in localization

2021-03-17 Thread Surafel Temesgen
On Tue, Mar 16, 2021 at 12:20 PM Thomas Munro 
wrote:

> On Wed, Mar 17, 2021 at 6:31 AM Surafel Temesgen 
> wrote:
> > The Ethiopic calendar has 13 months so it cannot be stored as the date and
> timestamp types, and your approach seems more complicated; I suggest having
> this feature for the sake of PostgreSQL's popularity too, not only
> for my need
>
> I know, but the DATE and TIMESTAMPTZ datatypes don't intrinsically
> know anything about months or other calendar concepts.  Internally,
> they are just a single number that counts the number of days or
> seconds since an arbitrary epoch time.  We are all in agreement about
> how many times the Earth has rotated since then*.  The calendar
> concepts such as "day", "month", "year", whether Gregorian, Ethiopic,
> Islamic, ... are all derivable from those numbers, if you know the
> rules.
>
>
Okay


>
>
> > I think you are suggesting this expecting the implementation to be difficult,
> but it's not that difficult once you fully understand the Gregorian
> calendar and the calendar you work on
>
> Yeah, I am sure it's all just a bunch of simple integer maths.  But
> I'm talking about things like software architecture, maintainability,
> cohesion, and getting maximum impact for the work we do.
>
> I may be missing some key detail though: why do you think it should be
> a different type?  The two reasons I can think of are: (1) the
> slightly tricky detail that the date apparently changes at 1:00am
> (which I don't think is a show stopper for this approach, I could
> elaborate), (2) you may want dates to be formatted on the screen with
> the Ethiopic calendar in common software like psql and GUI clients,
> which may be easier to arrange with different types, but that seems to
> be a cosmetic thing that could eventually be handled with tighter
> locale integration with ICU.  In the early stages you'd access
> calendar logic though special functions with names like
> icu_format_date(), or whatever.
>
>
As you mention above, whatever the calendar type is, we end up storing an
integer that represents the date. So rather than re-implementing every
function and operation for every calendar, we can use the existing Gregorian
implementation as a base: if a new calendar wants to perform the same
function or operation, it translates to the Gregorian one, uses the existing
function or operation, and translates back to the working calendar.
In this approach the only functions we need for supporting a new calendar are
a translator from the new calendar to the Gregorian one and from the Gregorian
calendar to the new calendar, and maybe input/output functions. What do you
think of this implementation?

regards
Surafel


Re: pgsql: Add libpq pipeline mode support to pgbench

2021-03-17 Thread Daniel Verite
Fabien COELHO wrote:

> For consistency with the existing \if … \endif, ISTM that it could have 
> been named \batch … \endbatch or \pipeline … \endpipeline?

"start" mirrors "end". To me, the analogy with \if-\endif is not
obvious.
Grammatically \if is meant to introduce the expression after it,
whereas \startpipeline takes no argument.
Functionally \startpipeline can be thought as "open the valve"
and \endpipeline "close the valve". They're "call-to-action" kinds of
commands, and in that sense quite different from the \if-\endif pair.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite




Re: Get memory contexts of an arbitrary backend process

2021-03-17 Thread torikoshia

On 2021-03-05 14:22, Fujii Masao wrote:

On 2021/03/04 18:32, torikoshia wrote:

On 2021-01-14 19:11, torikoshia wrote:
Since pg_get_target_backend_memory_contexts() waits to dump memory,
it could lead to a deadlock as below.

  - session1
  BEGIN; TRUNCATE t;

  - session2
  BEGIN; TRUNCATE t; -- wait

  - session1
  SELECT * FROM pg_get_target_backend_memory_contexts(); --wait


Thanks for notifying me, Fujii-san.


Attached is a v8 patch that prohibits calling the function inside
transactions.


Regrettably, this modification could not cope with advisory locks, and
I haven't come up with a good way to deal with it.

It seems to me that the architecture of the requestor waiting for the
dumper leads to this problem and complicates things.


Considering the discussion on printing backtraces [1], it seems
reasonable that the requestor just sends a signal and the dumper dumps to
the log file.


+1


Thanks!

I remade the patch and introduced a function
pg_print_backend_memory_contexts(PID) which prints the memory contexts of
the specified PID to elog.

  =# SELECT pg_print_backend_memory_contexts(450855);

  ** log output **
  2021-03-17 15:21:01.942 JST [450855] LOG:  Printing memory contexts of 
PID 450855
  2021-03-17 15:21:01.942 JST [450855] LOG:  level: 0 TopMemoryContext: 
68720 total in 5 blocks; 16312 free (15 chunks); 52408 used
  2021-03-17 15:21:01.942 JST [450855] LOG:  level: 1 Prepared Queries: 
65536 total in 4 blocks; 35088 free (14 chunks); 30448 used
  2021-03-17 15:21:01.942 JST [450855] LOG:  level: 1 pgstat 
TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 
chunks); 6784 used

  ..(snip)..
  2021-03-17 15:21:01.942 JST [450855] LOG:  level: 2 CachedPlanSource: 
4096 total in 3 blocks; 680 free (0 chunks); 3416 used: PREPARE hoge_200 
AS SELECT * FROM pgbench_accounts WHERE aid = 
1...
  2021-03-17 15:21:01.942 JST [450855] LOG:  level: 3 CachedPlanQuery: 
4096 total in 3 blocks; 464 free (0 chunks); 3632 used

  ..(snip)..
  2021-03-17 15:21:01.945 JST [450855] LOG:  level: 1 Timezones: 104128 
total in 2 blocks; 2584 free (0 chunks); 101544 used
  2021-03-17 15:21:01.945 JST [450855] LOG:  level: 1 ErrorContext: 8192 
total in 1 blocks; 7928 free (5 chunks); 264 used
  2021-03-17 15:21:01.945 JST [450855] LOG:  Grand total: 2802080 bytes 
in 1399 blocks; 480568 free (178 chunks); 2321512 used



As above, the output is almost the same as MemoryContextStatsPrint()
except for how the level is expressed.
MemoryContextStatsPrint() uses indents, but
pg_print_backend_memory_contexts() writes it as "level: %d".

Since there was discussion that enlarging a StringInfo may cause
errors on OOM [1], this patch calls elog for each context.

As with MemoryContextStatsPrint(), each context shows 100
children at most.
I once thought it should be configurable, but something like
pg_print_backend_memory_contexts(PID, num_children) would need to send
'num_children' from the requestor to the dumper, and that seems to require
additional infrastructure.
Creating a new GUC for this seems overkill.
If MemoryContextStatsPrint()'s behavior, i.e. showing 100 children at most,
is enough, this hard limit may be acceptable.
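
For reference, on the dumper side the flag set in the signal handler is meant
to be consumed from the ordinary interrupt-processing path; a sketch assuming
the usual ProcessInterrupts() pattern (not an excerpt from the attached patch):

	/* inside ProcessInterrupts(), alongside the other pending flags */
	if (PrintMemoryContextPending)
		ProcessPrintMemoryContextInterrupt();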

Only superusers can call pg_print_backend_memory_contexts().

I'm going to add documentation and regression tests.


Any thoughts?


[1] 
https://www.postgresql.org/message-id/CAMsr%2BYGh%2Bsso5N6Q%2BFmYHLWC%3DBPCzA%2B5GbhYZSGruj2d0c7Vvg%40mail.gmail.com

"r_d/strengthen_perf/print_memcon.md" 110L, 5642C written


Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c6a8d4611e..e116f4a1be 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -30,6 +30,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/memutils.h"
 
 /*
  * The SIGUSR1 signal is multiplexed to support signaling multiple event
@@ -440,6 +441,20 @@ HandleProcSignalBarrierInterrupt(void)
 	/* latch will be set by procsignal_sigusr1_handler */
 }
 
+/*
+ * HandleProcSignalPrintMemoryContext
+ *
+ * Handle receipt of an interrupt indicating print memory context.
+ * Signal handler portion of interrupt handling.
+ */
+static void
+HandleProcSignalPrintMemoryContext(void)
+{
+	InterruptPending = true;
+	PrintMemoryContextPending = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
+
 /*
  * Perform global barrier related interrupt checking.
  *
@@ -580,6 +595,25 @@ ProcessProcSignalBarrier(void)
 	ConditionVariableBroadcast(&MyProcSignalSlot->pss_barrierCV);
 }
 
+/*
+ * ProcessPrintMemoryContextInterrupt
+ *		The portion of print memory context interrupt handling that runs
+ *		outside of the signal handler.
+ */
+void
+ProcessPrintMemoryContextInterrupt(void)
+{
+	PrintMemoryContextPending = false;
+
+	ereport(LOG,
+		(errmsg("Printing memory contexts of PID %d", MyProcPid)));
+
+	/* A har
