Re: Creating a function for exposing memory usage of backend process

2020-06-25 Thread Kasahara Tatsuhito
Hi!

On Thu, Jun 18, 2020 at 12:56 PM Fujii Masao
 wrote:
> Agreed. The feature to view how local memory contexts are used in
> each process is very useful!
+1

> >=# SELECT * FROM pg_stat_get_backend_memory_context(PID);
>
> I'm afraid that this interface is not convenient when we want to monitor
> the usages of local memory contexts for all the processes. For example,
> I'd like to monitor how much memory is totally used to store prepared
> statements information. For that purpose, I wonder if it's more convenient
> to provide the view displaying the memory context usages for
> all the processes.
How about separating this into a function that examines memory
consumption trends for all processes and a function that examines the
memory consumption of a particular phase of a particular process?

For the former, as Fujii said, the function would show limited
information for each context type. All processes would calculate and
output the information while they are idle.

I think the latter is useful for debugging and other purposes.
For example, imagine providing a registration function like the following.
=# SELECT pg_stat_get_backend_memory_context_regist (pid, context,
max_children, calc_point)

pid: the target process
context: the top-level context of interest
max_children: maximum number of children to output for the target context
 (similar to MemoryContextStatsInternal()'s max_children)
calc_point: one or more points at which to calculate and output the
context information
 (existing hooks such as planner_hook, executor_start, etc. could be used)

This function tells the target PID to output the information of the
specified context at the specified calc_point.
When the target process reaches the calc_point, it calculates and
outputs the context information once, to a file or DSM.
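
For example (the function is only a proposal and the values here are
made up), a registration could look like:

=# SELECT pg_stat_get_backend_memory_context_regist(12345, 'CacheMemoryContext',
                                                    10, 'executor_start');

which would ask process 12345 to dump the CacheMemoryContext subtree
(up to 10 children per level) each time it reaches executor start.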

(Currently PostgreSQL has no formal way of externally modifying the
parameters of a particular process, so that may need to be
implemented...)

Sometimes I want to know the memory usage in the planning phase, or in
other phases, together with the query_string and/or plan_tree, before the
target process goes back to idle status.
So it would be nice to be able to retrieve memory usage at an arbitrary
point in time!

Regards,

-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-25 Thread Amit Kapila
On Fri, Jun 26, 2020 at 10:39 AM Dilip Kumar  wrote:
>
> On Thu, Jun 25, 2020 at 7:10 PM Dilip Kumar  wrote:
> >
> > On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila  wrote:
> > >
> > > Comments on other patches:
> > > =
> > > 5.
> > > > 3. On concurrent abort we are truncating all the changes including
> > > > some incomplete changes,  so later when we get the complete changes we
> > > > don't have the previous changes,  e.g, if we had specinsert in the
> > > > last stream and due to concurrent abort detection if we delete that
> > > > changes later we will get spec_confirm without spec insert.  We could
> > > > have simply avoided deleting all the changes, but I think the better
> > > > fix is once we detect the concurrent abort for any transaction, then
> > > > why do we need to collect the changes for that, we can simply avoid
> > > > that.  So I have put that fix. (0006)
> > > >
> > >
> > > On similar lines, I think we need to skip processing message, see else
> > > part of code in ReorderBufferQueueMessage.
> >
> > Basically, ReorderBufferQueueMessage also calls the
> > ReorderBufferQueueChange internally for transactional changes.

Yes, that is correct, but I was thinking about the non-transactional
part, due to the below code there.

else
{
ReorderBufferTXN *txn = NULL;
volatile Snapshot snapshot_now = snapshot;

if (xid != InvalidTransactionId)
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

Even though we are using txn here, I think we don't need to skip it
for aborted xacts, because without the patch such messages also get
decoded irrespective of transaction status.  What do you think?
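
(For reference, the non-transactional path here is the one used for
messages emitted with the transactional flag set to false, for example:

SELECT pg_logical_emit_message(false, 'some-prefix', 'some payload');

Such messages are decoded as soon as they are read from WAL, without
waiting for the emitting transaction to commit.)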

> >  But,
> > having said that, I realize the idea of skipping the changes in
> > ReorderBufferQueueChange is not good,  because by then we have already
> > allocated the memory for the change and the tuple and it's not a
> > correct to ReturnChanges because it will update the memory accounting.
> > So I think we can do it at a more centralized place and before we
> > process the change,  maybe in LogicalDecodingProcessRecord, before
> > going to the switch we can call a function from the reorderbuffer.c
> > layer to see whether this transaction is detected as aborted or not.
> > But I have to think more on this line that can we skip all the
> > processing of that record or not.
> >
> > Your other comments look fine to me so I will send in the next patch
> > set and reply on them individually.
>
> I think we can not put this check, in the higher-level functions like
> LogicalDecodingProcessRecord or DecodeXXXOp because we need to process
> that xid at least for abort,  so I think it is good to keep the check,
> inside ReorderBufferQueueChange only and we can free the memory of the
> change if the abort is detected.  Also, if just skip those changes in
> ReorderBufferQueueChange then the effect will be localized to that
> particular transaction which is already aborted.
>

Fair enough, and for cases like the non-transactional part of
ReorderBufferQueueMessage, I think we need to process the message
anyway, irrespective of transaction status.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation

2020-06-25 Thread Maciek Sakrejda
On Wed, Jun 24, 2020 at 2:44 AM Amit Kapila  wrote:
> So, that leads to loops as 2 on "Parallel Seq Scan" and "Nested Loop" nodes.  
> Does this make sense now?

Yes, I think we're on the same page. Thanks for the additional details.

It turns out that the plan I sent at the top of the thread is actually
an older plan we had saved, all the way from April 2018. We're fairly
certain this was Postgres 10, but not sure what point release. I tried
to reproduce this on 10, 11, 12, and 13 beta, but I am now seeing
similar results to yours: Buffers and I/O Timings are rolled up into
the parallel leader, and that is propagated as expected to the Gather.
Sorry for the confusion.

On Wed, Jun 24, 2020 at 3:18 AM Maciek Sakrejda  wrote:
>So we should be seeing an average, not a sum, right?

And here I missed that the documentation specifies rows and actual
time as per-loop, but other metrics are not--they're just cumulative.
So actual time and rows are still per-"loop" values, but while rows
values are additive (the Gather combines rows from the parallel leader
and the workers), the actual time is not because the whole point is
that this work happens in parallel.

I'll report back if I can reproduce the weird numbers we saw in that
original plan or find out exactly what Postgres version it was from.

Thanks,
Maciek




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-25 Thread Masahiko Sawada
On Tue, 23 Jun 2020 at 13:26, Amit Kapila  wrote:
>
> On Tue, Jun 23, 2020 at 9:03 AM Masahiko Sawada
>  wrote:
> >
> >
> > I've attached the latest version patches. I've incorporated the review
> > comments I got so far and improved locking strategy.
> >
>
> Thanks for updating the patch.
>
> > Please review it.
> >
>
> I think at this stage it is important that we do some study of various
> approaches to achieve this work and come up with a comparison of the
> pros and cons of each approach (a) what this patch provides, (b) what
> is implemented in Global Snapshots patch [1], (c) if possible, what is
> implemented in Postgres-XL.  I fear that if go too far in spending
> effort on this and later discovered that it can be better done via
> some other available patch/work (maybe due to a reasons like that
> approach can easily extended to provide atomic visibility or the
> design is more robust, etc.) then it can lead to a lot of rework.

Yeah, I have no objection to that plan but I think we also need to
keep in mind that (b), (c), and whatever we are thinking about global
consistency are talking about only PostgreSQL (and postgres_fdw). On
the other hand, this patch needs to implement the feature that can
resolve the atomic commit problem more generically, because the
foreign server might be using oracle_fdw, mysql_fdw, or other FDWs
connecting to database systems that support 2PC.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-25 Thread Dilip Kumar
On Thu, Jun 25, 2020 at 7:10 PM Dilip Kumar  wrote:
>
> On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila  wrote:
> >
> > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar  wrote:
> > >
> > > Here is the POC patch to discuss the idea of a cleanup of shared
> > > fileset on proc exit.  As discussed offlist,  here I am maintaining
> > > the list of shared fileset.  First time when the list is NULL I am
> > > registering the cleanup function with on_proc_exit routine.  After
> > > that for subsequent fileset, I am just appending it to filesetlist.
> > > There is also an interface to unregister the shared file set from the
> > > cleanup list and that is done by the caller whenever we are deleting
> > > the shared fileset manually.  While explaining it here, I think there
> > > could be one issue if we delete all the element from the list will
> > > become NULL and on next SharedFileSetInit we will again register the
> > > function.  Maybe that is not a problem but we can avoid registering
> > > multiple times by using some flag in the file
> > >
> >
> > I don't understand what you mean by "using some flag in the file".
> >
> > Review comments on various patches.
> >
> > poc_shared_fileset_cleanup_on_procexit
> > =
> > 1.
> > - ent->subxact_fileset =
> > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> > + MemoryContext oldctx;
> >
> > + /* Shared fileset handle must be allocated in the persistent context */
> > + oldctx = MemoryContextSwitchTo(ApplyContext);
> > + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
> >   SharedFileSetInit(ent->subxact_fileset, NULL);
> > + MemoryContextSwitchTo(oldctx);
> >   fd = BufFileCreateShared(ent->subxact_fileset, path);
> >
> > Why is this change required for this patch and why we only cover
> > SharedFileSetInit in the Apply context and not BufFileCreateShared?
> > The comment is also not very clear on this point.
>
> Added the comments for the same.
>
> > 2.
> > +void
> > +SharedFileSetUnregister(SharedFileSet *input_fileset)
> > +{
> > + bool found = false;
> > + ListCell *l;
> > +
> > + Assert(filesetlist != NULL);
> > +
> > + /* Loop over all the pending shared fileset entry */
> > + foreach (l, filesetlist)
> > + {
> > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> > +
> > + /* remove the entry from the list and delete the underlying files */
> > + if (input_fileset->number == fileset->number)
> > + {
> > + SharedFileSetDeleteAll(fileset);
> > + filesetlist = list_delete_cell(filesetlist, l);
> >
> > Why are we calling SharedFileSetDeleteAll here when in the caller we
> > have already deleted the fileset as per below code?
> > BufFileDeleteShared(ent->stream_fileset, path);
> > + SharedFileSetUnregister(ent->stream_fileset);
>
> That's wrong I have removed this.
>
>
> > I think it will be good if somehow we can remove the fileset from
> > filesetlist during BufFileDeleteShared.  If that is possible, then we
> > don't need a separate API for SharedFileSetUnregister.
>
> I have done as discussed on later replies, basically called
> SharedFileSetUnregister from BufFileDeleteShared.
>
> > 3.
> > +static List * filesetlist = NULL;
> > +
> >  static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
> > +static void SharedFileSetOnProcExit(int status, Datum arg);
> >  static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
> > tablespace);
> >  static void SharedFilePath(char *path, SharedFileSet *fileset, const
> > char *name);
> >  static Oid ChooseTablespace(const SharedFileSet *fileset, const char 
> > *name);
> > @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment 
> > *seg)
> >   /* Register our cleanup callback. */
> >   if (seg)
> >   on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
> > + else
> > + {
> > + if (filesetlist == NULL)
> > + on_proc_exit(SharedFileSetOnProcExit, 0);
> >
> > We use NIL for list initialization and comparison.  See lock_files usage.
>
> Done
>
> > 4.
> > +SharedFileSetOnProcExit(int status, Datum arg)
> > +{
> > + ListCell *l;
> > +
> > + /* Loop over all the pending  shared fileset entry */
> > + foreach (l, filesetlist)
> > + {
> > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> > + SharedFileSetDeleteAll(fileset);
> > + }
> >
> > We can initialize filesetlist as NIL after the for loop as it will
> > make the code look clean.
>
> Right.
>
> > Comments on other patches:
> > =
> > 5.
> > > 3. On concurrent abort we are truncating all the changes including
> > > some incomplete changes,  so later when we get the complete changes we
> > > don't have the previous changes,  e.g, if we had specinsert in the
> > > last stream and due to concurrent abort detection if we delete that
> > > changes later we will get spec_confirm without spec insert.  We could
> > > have simply avoided deleting all the changes, but I think the better
> > > fix is once we detect the concurrent abort for any 

Re: min_safe_lsn column in pg_replication_slots view

2020-06-25 Thread Amit Kapila
On Fri, Jun 26, 2020 at 4:54 AM Alvaro Herrera  wrote:
>
> On 2020-Jun-26, Michael Paquier wrote:
>
> > On Thu, Jun 25, 2020 at 11:24:27AM -0400, Alvaro Herrera wrote:
> > > I don't understand the proposal.  Michael posted a patch that adds
> > > pg_wal_oldest_lsn(), and you say we should apply the patch except the
> > > part that adds that function -- so what part would be applying?
> >
> > I have sent last week a patch about only the removal of min_safe_lsn:
> > https://www.postgresql.org/message-id/20200619121552.gh453...@paquier.xyz
> > So this applies to this part.
>
> Well, I oppose that because it leaves us with no way to monitor slot
> limits.  In his opening email, Masao-san proposed to simply change the
> value by adding 1.  How you go from adding 1 to a column to removing
> the column completely with no recourse, is beyond me.
>
> Let me summarize the situation and possible ways forward as I see them.
> If I'm mistaken, please correct me.
>
> Problems:
> i)  pg_replication_slot.min_safe_lsn has a weird definition in that all
> replication slots show the same value
>

It is also not clear how the user can make use of that value.

> ii) min_safe_lsn cannot be used with pg_walfile_name, because it returns
> the name of the previous segment.
>
> Proposed solutions:
>
> a) Do nothing -- keep the min_safe_lsn column as is.  Warn users that
>pg_walfile_name should not be used with this column.
> b) Redefine min_safe_lsn to be lsn+1, so that pg_walfile_name can be used
>and return a useful value.
> c) Remove min_safe_lsn; add functions that expose the same value
> d) Remove min_safe_lsn; add a new view that exposes the same value and
>possibly others
>
> e) Replace min_safe_lsn with a "distance" column, which reports
>restart_lsn - oldest valid LSN
>(Note that you no longer have an LSN in this scenario, so you can't
>call pg_walfile_name.)
>

Can we consider an option to "Remove min_safe_lsn; document how a user
can monitor the distance"?  We have a function to get the current WAL
insert location, and the other things required are available either via
views or as GUC variable values.  The reason I am thinking of this
option is that it might be better to get some more feedback on what the
most appropriate value to display is.  However, I am okay if we can
reach a consensus on one of the above options.
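
As an illustration (not a concrete proposal), the distance can already
be monitored today with something like:

SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_behind,
       current_setting('max_slot_wal_keep_size') AS max_keep
FROM pg_replication_slots;

so documenting a recipe along these lines might be sufficient.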

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-25 Thread Amul Sul
On Wed, Jun 24, 2020 at 1:54 PM tushar  wrote:
>
> On 6/22/20 11:59 AM, Amul Sul wrote:
> > 2. Now skipping the startup checkpoint if the system is read-only mode, as
> > discussed [2].
>
> I am not able to perform pg_checksums o/p after shutting down my server
> in read only  mode .
>
> Steps -
>
> 1.initdb (./initdb -k -D data)
> 2.start the server(./pg_ctl -D data start)
> 3.connect to psql (./psql postgres)
> 4.Fire query (alter system read only;)
> 5.shutdown the server(./pg_ctl -D data stop)
> 6.pg_checksums
>
> [edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
> pg_checksums: error: cluster must be shut down
> [edb@tushar-ldap-docker bin]$
>
> Result - (when server is not in read only)
>
> [edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
> Checksum operation completed
> Files scanned:  916
> Blocks scanned: 2976
> Bad checksums:  0
> Data checksum version: 1
>
I think that's expected, since the server wasn't cleanly shut down; a similar
error can be seen with any server which has been shut down in immediate mode
(pg_ctl -D data_dir -m i).

Regards,
Amul




Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-06-25 Thread Michael Paquier
On Thu, Jun 25, 2020 at 10:14:00PM +0900, Inoue, Hiroshi wrote:
> I seem to have invalidated KEYSET-DRIVEN cursors used in positioned-update
> test.  It was introduced by the commit 4a272fd but was invalidated by the
> commit 2be35a6.
> 
> I don't object to the removal of currtid(2) because keyset-driven cursors in
> psqlodbc are changed into static cursors in many cases and I've hardly ever
> heard a complaint about it.

Hmm.  I am not sure that this completely answers my original concern
though.  In short, don't we still have corner cases where
keyset-driven cursors are not changed into static cursors, meaning
that currtid2() could get used?  The removal of the in-core functions
would hurt applications using that, meaning that we should at least
provide an equivalent of currtid2() in the worst case as a contrib
module, no?  If the code paths of currtid2() are reachable, shouldn't
we also make sure that they are still reached in the regression tests
of the driver, meaning that the driver code needs more coverage?  I
have been looking at the tests and tried to tweak them using
SQLSetPos() so that the code paths involving currtid2() get reached, but
I am not really able to do so.  It does not mean that currtid2()
never gets reached, it just means that I am not able to be sure that
this part can be safely removed from the Postgres backend code :(

From what I can see on this thread, we could just remove currtid(), per
the argument that the RETURNING ctid clause has been supported since PG
8.2, but it would make more sense to me to remove both currtid() and
currtid2() at once.
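
(That is, the new TID of an updated row can already be obtained
directly, e.g. with a hypothetical table t:

UPDATE t SET val = val + 1 WHERE id = 1 RETURNING ctid;

which is the kind of information currtid() was used to retrieve.)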
--
Michael




Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Bruce Momjian
On Fri, Jun 26, 2020 at 01:53:57AM +0200, Tomas Vondra wrote:
> I'm not saying it's not beneficial to use different limits for different
> nodes. Some nodes are less sensitive to the size (e.g. sorting often
> gets faster with smaller work_mem). But I think we should instead have a
> per-session limit, and the planner should "distribute" the memory to
> different nodes. It's a hard problem, of course.

Yeah, I am actually confused why we haven't developed a global memory
allocation strategy and continue to use per-session work_mem.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Backpatch b61d161c14 (Introduce vacuum errcontext ...)

2020-06-25 Thread Amit Kapila
On Fri, Jun 26, 2020 at 7:25 AM Justin Pryzby  wrote:
>
>
> > I have done some testing with both the patches and would like to do
> > more unless there are objections with these.
>
> Comments:
>
> > * The index name is saved only during this phase and restored 
> > immediately
>
> => I wouldn't say "only" since it's saved during lazy_vacuum: index AND 
> cleanup.
>
> >update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *oldpos, int 
> >phase,
>
> => You called your struct "LVSavedErrInfo" but the variables are still called
> "pos".  I would call it olderrinfo or just old.
>

Fixed both of the above comments.  I used saved_err_info as the variable name.

> Also, this doesn't (re)rename the "cbarg" stuff that Alvaro didn't like, which
> was my 0001 patch.
>

If I am not missing anything, that change was in
lazy_cleanup_index, and after this patch it won't be required because
we are using a different variable name.

I have combined both the patches now.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


0001-Improve-vacuum-error-context-handling.v1.patch
Description: Binary data


Re: Backpatch b61d161c14 (Introduce vacuum errcontext ...)

2020-06-25 Thread Justin Pryzby
On Thu, Jun 25, 2020 at 02:31:51PM +0530, Amit Kapila wrote:
> I have looked at both the patches (using separate variables (by
> Justin) and using a struct (by Andres)) and found the second one bit
> better.

Thanks for looking.

> I have improved some comments in the code and for now, kept as two
> patches (a) one for improving the error info for index (mostly
> Justin's patch based on Tom's idea) and (b) the other to generally
> improve the code in this area (mostly Andres's patch).

And thanks for separate patchen :)

> I have done some testing with both the patches and would like to do
> more unless there are objections with these.

Comments:

> * The index name is saved only during this phase and restored 
> immediately

=> I wouldn't say "only" since it's saved during lazy_vacuum: index AND cleanup.

>update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *oldpos, int 
>phase,

=> You called your struct "LVSavedErrInfo" but the variables are still called
"pos".  I would call it olderrinfo or just old.

Also, this doesn't (re)rename the "cbarg" stuff that Alvaro didn't like, which
was my 0001 patch.

-- 
Justin




Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

2020-06-25 Thread David Rowley
On Fri, 26 Jun 2020 at 04:35, Andres Freund  wrote:
> On 2020-06-25 13:50:37 +1200, David Rowley wrote:
> > In the attached, I've come up with a way that works. Basically I've
> > just added a function named errstart_cold() that is attributed with
> > __attribute__((cold)), which will hint to the compiler to keep
> > branches which call this function in a cold path.
>
> I recall you trying this before? Has that gotten easier because we
> evolved ereport()/elog(), or has gcc become smarter, or ...?

Yeah, I appear to have tried it before and found it not to work in [1]. I can
only assume GCC got smarter in regards to how it marks a path as cold.
Previously it seemed not to, due to the do/while(0). I'm pretty sure
back when I last tested that ditching the do while made it work; we
just can't really get rid of it.

> > To make the compiler properly mark just >= ERROR calls as cold, and
> > only when the elevel is a constant, I modified the ereport_domain
> > macro to do:
> >
> > if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
> > errstart_cold(elevel, domain) : \
> > errstart(elevel, domain)) \
>
> I think it'd be good to not just do this for ERROR, but also for <=
> DEBUG1. I recall seing quite a few debug elogs that made the code worse
> just by "being there".

I think that case is different. We don't want to move the entire elog
path into the cold path for that. We'd only want to hint that errstart
is unlikely to return true if elevel <= DEBUG1.

> > diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
> > index e976201030..8076e8af24 100644
> > --- a/src/backend/utils/error/elog.c
> > +++ b/src/backend/utils/error/elog.c
> > @@ -219,6 +219,19 @@ err_gettext(const char *str)
> >  #endif
> >  }
> >
> > +#if defined(HAVE_PG_ATTRIBUTE_HOT_AND_COLD) && 
> > defined(HAVE__BUILTIN_CONSTANT_P)
> > +/*
> > + * errstart_cold
> > + *   A simple wrapper around errstart, but hinted to be cold so 
> > that the
> > + *   compiler is more likely to put error code in a cold area away 
> > from the
> > + *   main function body.
> > + */
> > +bool
> > +pg_attribute_cold errstart_cold(int elevel, const char *domain)
> > +{
> > + return errstart(elevel, domain);
> > +}
> > +#endif
>
> Hm. Would it make sense to have this be a static inline?

I thought about that but didn't try it to ensure it still worked ok. I
didn't think it was that important to make sure we don't get the extra
function hop for ERRORs. It seemed like a case we'd not want to really
optimise for.

> >  /*
> >   * errstart --- begin an error-reporting cycle
> > diff --git a/src/include/c.h b/src/include/c.h
> > index d72b23afe4..087b8af6cb 100644
> > --- a/src/include/c.h
> > +++ b/src/include/c.h
> > @@ -178,6 +178,21 @@
> >  #define pg_noinline
> >  #endif
> >
> > +/*
> > + * Marking certain functions as "hot" or "cold" can be useful to assist the
> > + * compiler in arranging the assembly code in a more efficient way.
> > + * These are supported from GCC >= 4.3 and clang >= 3.2
> > + */
> > +#if (defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && 
> > __GNUC_MINOR__ >= 3))) || \
> > + (defined(__clang__) && (__clang_major__ > 3 || (__clang_major__ == 3 
> > && __clang_minor__ >= 2)))
> > +#define HAVE_PG_ATTRIBUTE_HOT_AND_COLD 1
> > +#define pg_attribute_hot __attribute__((hot))
> > +#define pg_attribute_cold __attribute__((cold))
> > +#else
> > +#define pg_attribute_hot
> > +#define pg_attribute_cold
> > +#endif
>
> Wonder if we should start using __has_attribute() for things like this.
>
> https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fattribute.html
>
> I.e. we could do something like
> #ifndef __has_attribute
> #define __has_attribute(attribute) 0
> #endif
>
> #if __has_attribute(hot)
> #define pg_attribute_hot __attribute__((hot))
> #else
> #define pg_attribute_hot
> #endif
>
> clang added __has_attribute in 2.9 (2010), gcc added it in 5 (2014), so
> I don't think we'd loose too much.

Thanks for pointing that out. Seems like a good idea to me. I don't
think we'll upset too many people running GCC 4.4 to 5.0. I can't
imagine many people serious about performance will be using a
PostgreSQL version that'll be released in 2021 with a pre-2014
compiler.

> >  #ifdef HAVE__BUILTIN_CONSTANT_P
> > +#ifdef HAVE_PG_ATTRIBUTE_HOT_AND_COLD
> > +#define ereport_domain(elevel, domain, ...)  \
> > + do { \
> > + pg_prevent_errno_in_scope(); \
> > + if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
> > +  errstart_cold(elevel, domain) : \
> > +  errstart(elevel, domain)) \
> > + __VA_ARGS__, errfinish(__FILE__, __LINE__, 
> > PG_FUNCNAME_MACRO); \
> > + if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
> > + pg_unreachable(); \
> > + } while(0)
> > +#else/* 
> > !HAVE_PG_ATTRIBUTE_HOT_AND_COLD */
> >  

Re: Failures with installcheck and low work_mem value in 13~

2020-06-25 Thread Michael Paquier
On Thu, Jun 25, 2020 at 12:15:54PM -0700, Peter Geoghegan wrote:
> The RMT met today. We determined that it wasn't worth adjusting this
> test to pass with non-standard work_mem values.

Okay, thanks for the feedback.  We'll see how it works out.
--
Michael




Re: xid wraparound danger due to INDEX_CLEANUP false

2020-06-25 Thread Peter Geoghegan
On Thu, Jun 25, 2020 at 8:28 AM Robert Haas  wrote:
> I'm not really convinced. I agree that from a theoretical point of
> view an index can have arbitrary needs and is the arbiter of its own
> needs, but when I pull the emergency break, I want the vehicle to
> stop, not think about stopping.

Making this theoretical argument in the context of this discussion was
probably a mistake. I agree that this is the emergency brake, and it
needs to work like one.

It might be worth considering some compromise in the event of using
the "vacuum_index_cleanup" reloption (i.e. when the user has set it to
'off'), provided there is good reason to believe that we're not in an
emergency -- I mentioned this to Masahiko just now. I admit that that
isn't very appealing for other reasons, but it's worth considering a
way of ameliorating the problem in back branches. We really ought to
change how recycling works, so that it happens at the tail end of the
same VACUUM operation that deleted the pages -- but that cannot be
backpatched.

It might be that the most appropriate mitigation in the back branches
is a log message that reports on the fact that we've probably leaked
pages due to this issue. Plus some documentation. Though even that
would require calling nbtree to check if that is actually true (by
checking the metapage), so it still requires backpatching something
close to Masahiko's draft patch.

> I don't think I believe this. All you need is one small range-deletion, right?

Right.

> > Then there's question 2. The intuition behind the approach from
> > Sawada-san's patch was that allowing wraparound here feels wrong -- it
> > should be in the AM's hands. However, it's not like I can point to
> > some ironclad guarantee about not leaking deleted pages that existed
> > before the INDEX_CLEANUP feature. We know that the FSM is not crash
> > safe, and that's that. Is this really all that different? Maybe it is,
> > but it seems like a quantitative difference to me.
>
> I don't think I believe this, either. In the real-world example to
> which you linked, the user ran REINDEX afterward to recover from index
> bloat, and we could advise other people who use this option that it
> may leak space that a subsequent VACUUM may fail to recover, and
> therefore they too should consider REINDEX.

I was talking about the intuition behind the design. I did not intend
to suggest that nbtree should ignore "INDEX_CLEANUP = off" regardless
of the consequences.

I am sure about this much: The design embodied by Masahiko's patch is
clearly a better one overall, even if it doesn't fix the problem on
its own. I agree that we cannot allow nbtree to ignore "INDEX_CLEANUP
= off", even if that means leaking pages that could otherwise be
recycled. I'm not sure what we should do about any of this in the back
branches, though. I wish I had a simple idea about what to do there.

--
Peter Geoghegan




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2020-06-25 Thread Tomas Vondra

On Thu, Jun 25, 2020 at 03:09:44PM -0700, Melanie Plageman wrote:

On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra 
wrote:


I started looking at the patch to refresh my knowledge both of this
patch and parallel hash join, but I think it needs a rebase. The
changes in 7897e3bb90 apparently touched some of the code.



Thanks so much for the review, Tomas!

I've attached a rebased patch which also contains updates discussed
below.



Thanks.




I assume
you're working on a patch addressing the remaining TODOS, right?



I wanted to get some feedback on the patch before working through the
TODOs to make sure I was on the right track.

Now that you are reviewing this, I will focus all my attention
on addressing your feedback. If there are any TODOs that you feel are
most important, let me know, so I can start with those.

Otherwise, I will prioritize parallel batch 0 spilling.


Feel free to work on the batch 0 spilling, please. I still need to get
familiar with various parts of the parallel hash join etc. so I don't
have any immediate feedback which TODOs to work on first.


David Kimura plans to do a bit of work on on parallel hash join batch 0
spilling tomorrow. Whatever is left after that, I will pick up next
week. Parallel hash join batch 0 spilling is the last large TODO that I
had.

My plan was to then focus on the feedback (either about which TODOs are
most important or outside of the TODOs I've identified) I get from you
and anyone else who reviews this.



OK.



I see you've switched to "stripe" naming - I find that a bit confusing,
because when I hear stripe I think about RAID, where it means pieces of
data interleaved and stored on different devices. But maybe that's just
me and it's a good name. Maybe it'd be better to keep the naming and
only tweak it at the end, not to disrupt reviews unnecessarily.



I hear you about "stripe". I still quite like it, especially as compared
to its predecessor (originally, I called them chunks -- which is
impossible given that SharedTuplestoreChunks are a thing).



I don't think using chunks in one place means we can't use it elsewhere
in a different context. I'm sure we have "chunks" in other places. But
let's not bikeshed on this too much.


For ease of review, as you mentioned, I will keep the name for now. I am
open to changing it later, though.

I've been soliciting ideas for alternatives and, so far, folks have
suggested "stride", "step", "flock", "herd", "cohort", and "school". I'm
still on team "stripe" though, as it stands.



;-)





nodeHash.c
--


1) MultiExecPrivateHash says this

   /*
    * Not subject to skew optimization, so either insert normally
    * or save to batch file if it belongs to another stripe
    */

I wonder what it means to "belong to another stripe". I understand what
that means for batches, which are identified by batchno computed from
the hash value. But I thought "stripes" are just work_mem-sized pieces
of a batch, so I don't quite understand this. Especially when the code
does not actually check "which stripe" the row belongs to.



I agree this was confusing.

"belongs to another stripe" meant here that if batch 0 falls back and we
are still loading it, once we've filled up work_mem, we need to start
saving those tuples to a spill file for batch 0. I've changed the
comment to this:

-* or save to batch file if it belongs to another stripe
+   * or save to batch file if batch 0 falls back and we have
+   * already filled the hashtable up to space_allowed.



OK. Silly question - what does "batch 0 falls back" mean? Does it mean
that we realized the hash table for batch 0 would not fit into work_mem,
so we switched to the "hashloop" strategy?




2) I find the fields hashloop_fallback rather confusing. We have one in
HashJoinTable (and it's array of BufFile items) and another one in
ParallelHashJoinBatch (this time just bool).

I think HashJoinTable should be renamed to hashloopBatchFile (similarly
to the other BufFile arrays).



I think you are right about the name. I've changed the name in
HashJoinTableData to hashloopBatchFile.

The array of BufFiles hashloop_fallback was only used by serial
hashjoin. The boolean hashloop_fallback variable is used only by
parallel hashjoin.

The reason I had them named the same thing is that I thought it would be
nice to have a variable with the same name to indicate if a batch "fell
back" for both parallel and serial hashjoin--especially since we check
it in the main hashjoin state machine used by parallel and serial
hashjoin.

In serial hashjoin, the BufFiles aren't identified by name, so I kept
them in that array. In parallel hashjoin, each ParallelHashJoinBatch has
the status saved (in the struct).
So, both represented the fall back status of a batch.

However, I agree with you, so I've renamed the serial one to
hashloopBatchFile.



OK



Although I'm not sure why we even need
this file, when we have innerBatchFile? BufFile(s) are not exactly free,
in fact 

Re: xid wraparound danger due to INDEX_CLEANUP false

2020-06-25 Thread Peter Geoghegan
On Thu, Jun 25, 2020 at 6:59 AM Masahiko Sawada
 wrote:
> I think that with the approach implemented in my patch, it could be a
> problem for the user that the user cannot easily know in advance
> whether vacuum with INDEX_CLEANUP false will perform index cleanup,
> even if page deletion doesn’t happen in most cases.

I was unclear. I agree that the VACUUM command with "INDEX_CLEANUP =
off" is an emergency mechanism that should be fully respected, even
when that means that we'll leak deleted pages.

Perhaps it would make sense to behave differently when the index is on
a table that has "vacuum_index_cleanup = off" set, and the vacuum is
started by autovacuum, and is not an anti-wraparound vacuum. That
doesn't seem all that appealing now that I write it down, though,
because it's a non-obvious behavioral difference among cases that
users probably expect to behave similarly. On the other hand, what
user knows that there is something called an aggressive vacuum, which
isn't exactly the same thing as an anti-wraparound vacuum?
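
(For clarity, the two ways this behavior is requested today are the
per-command option and the reloption. On a hypothetical table t:

VACUUM (INDEX_CLEANUP FALSE) t;
ALTER TABLE t SET (vacuum_index_cleanup = off);

with the latter also being respected by autovacuum.)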

I find it hard to decide what the least-worst thing is for the
backbranches. What do you think?

> I don’t come up with a good solution to keep us safe against XID
> wraparound yet but it seems to me that it’s better to have an option
> that forces index cleanup not to happen.

I don't think that there is a good solution that is suitable for
backpatching. The real solution is to redesign the recycling along the
lines I described.

I don't think that it's terrible that we can leak deleted pages,
especially considering the way that users are expected to use the
INDEX_CLEANUP feature. I would like to be sure that the problem is
well understood, though -- we should at least have a plan for Postgres
v14.

> I thought that btbulkdelete and/or btvacuumcleanup can register an
> autovacuum work item to recycle the page that gets deleted but it
> might not able to recycle those pages enough because the autovacuum
> work items could be taken just after vacuum. And if page deletion is
> relatively a rare case in practice, we might be able to take an
> optimistic approach that vacuum registers deleted pages to FSM on the
> deletion and a process who takes a free page checks if the page is
> really recyclable. Anyway, I’ll try to think more about this.

Right -- just putting the pages in the FSM immediately, and making it
a problem that we deal with within _bt_getbuf() is an alternative
approach that might be better.

--
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Tomas Vondra

On Thu, Jun 25, 2020 at 01:17:56PM -0400, Bruce Momjian wrote:

On Thu, Jun 25, 2020 at 11:46:54AM -0400, Robert Haas wrote:

On Wed, Jun 24, 2020 at 7:38 PM Bruce Momjian  wrote:
> I think my main point is that work_mem was not being honored for
> hash-agg before, but now that PG 13 can do it, we are again allowing
> work_mem not to apply in certain cases.  I am wondering if our hard
> limit for work_mem is the issue, and we should make that more flexible
> for all uses.

I mean, that's pretty much what we're talking about here, isn't it? It
seems like in your previous two replies you were opposed to separating
the plan-type limit from the execution-time limit, but that idea is
precisely a way of being more flexible (and extending it to other plan
nodes is a way of making it more flexible for more use cases).


I think it was Tom who was complaining about plan vs. execution time
control.


As I think you know, if you have a system where the workload varies a
lot, you may sometimes be using 0 copies of work_mem and at other
times 1000 or more copies, so the value has to be chosen
conservatively as a percentage of system memory, else you start
swapping or the OOM killer gets involved. On the other hand, some plan
nodes get a lot less efficient when the amount of memory available
falls below some threshold, so you can't just set this to a tiny value
and forget about it. Because the first problem is so bad, most people
set the value relatively conservatively and just live with the
performance consequences. But this also means that they have memory
left over most of the time, so the idea of letting a node burst above
its work_mem allocation when something unexpected happens isn't crazy:
as long as only a few nodes do that here and there, rather than, say,
all the nodes doing it all at the same time, it's actually fine. If we
had a smarter system that could dole out more work_mem to nodes that
would really benefit from it and less to nodes where it isn't likely
to make much difference, that would be similar in spirit but even
better.


I think the issue is that in PG 13 work_mem controls sorts and hashes
with a new hard limit for hash aggregation:


https://www.postgresql.org/docs/12/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY

Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash
tables are used in hash joins, hash-based aggregation, and hash-based
processing of IN subqueries.

In pre-PG 13, we "controlled" it by avoiding hash-based aggregation if
it was expected to exceed work_mem, but if we assumed it would be less
than work_mem and it was more, we exceeded the work_mem allocation for that
node.  In PG 13, we "limit" memory to work_mem and spill to disk if we
exceed it.

We should really have always documented that hash agg could exceed
work_mem for misestimation, and if we add a hash_agg work_mem
misestimation bypass setting we should document this setting in work_mem
as well.



I don't think that would change anything, really. For the users the
consequences would still be exactly the same, and they wouldn't even be
in a position to check if they are affected.

So just documenting that hashagg does not respect work_mem at runtime
would be nice, but it would not make any difference for v13, just like
documenting a bug is not really the same thing as fixing it.


But then the question is why do we allow this bypass only for hash agg?
Should work_mem have a settings for ORDER BY, merge join, hash join, and
hash agg, e.g.:

work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB'

Yeah, crazy syntax, but you get the idea.  I understand some nodes are
more sensitive to disk spill than others, so shouldn't we be controlling
this at the work_mem level, rather than for a specific node type like
hash agg?  We could allow for misestimation over allocation of hash agg
work_mem by splitting up the hash agg values:

work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB 
hash_agg_max=200MB'

but _avoiding_ hash agg if it is estimated to exceed work_mem and spill
to disk is not something to logically control at the work_mem level,
which leads to something like David Rowley suggested, but with different
names:

enable_hashagg = on | soft | avoid | off

where 'on' and 'off' are the current PG 13 behavior, 'soft' means to
treat work_mem as a soft limit and allow it to be exceeded on
misestimation, and 'avoid' means to avoid hash agg if it is estimated to
exceed work_mem.  Neither 'soft' nor 'avoid' spills to disk.

David's original terms of "trynospill" and "neverspill" were focused on
spilling, not on its interaction with work_mem, and I found that
confusing.

Frankly, if it took me this long to get my head around this, I am
unclear how many people will understand this tuning feature enough to
actually use it.



Yeah. I agree with Andres that this may be a real issue, and that adding
some sort of "escape hatch" for 

Re: Enabling B-Tree deduplication by default

2020-06-25 Thread Peter Geoghegan
On Thu, Jan 30, 2020 at 11:40 AM Peter Geoghegan  wrote:
> I think that I should commit the patch without the GUC tentatively.
> Just have the storage parameter, so that everyone gets the
> optimization without asking for it. We can then review the decision to
> enable deduplication generally after the feature has been in the tree
> for several months.

This is how things work in the committed patch (commit 0d861bbb):
There is a B-Tree storage parameter named deduplicate_items, which is
enabled by default. In general, users will get deduplication unless
they opt out, including in unique indexes (though note that we're more
selective about triggering a deduplication pass in unique indexes
[1]).
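
For anyone who wants to opt out for a particular index, that looks
like this (the table and index names are just placeholders):

CREATE INDEX t_a_idx ON t (a) WITH (deduplicate_items = off);
ALTER INDEX t_a_idx SET (deduplicate_items = off);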

> There is no need to make a final decision about whether or not the
> optimization gets enabled before committing the patch.

It's now time to make a final decision on this. Does anyone have any
reason to believe that leaving deduplication enabled by default is the
wrong way to go?

Note that using deduplication isn't strictly better than not using
deduplication for all indexes in all workloads; that's why it's
possible to disable the optimization. This thread has lots of
information about the reasons why enabling deduplication by default
seems appropriate, all of which still apply. Note that there have been
no bug reports involving deduplication since it was committed on
February 26th, with the exception of some minor issues that I reported
and fixed.

The view of the RMT is that the feature should remain enabled by
default (i.e. no changes are required). Of course, I am a member of
the RMT this year, as well as one of the authors of the patch. I am
hardly an impartial voice here. I believe that that did not sway the
decision making process of the RMT in this instance. If there are no
objections in the next week or so, then I'll close out the relevant
open item.

[1] 
https://www.postgresql.org/docs/devel/btree-implementation.html#BTREE-DEDUPLICATION
-- See "Tip"
-- 
Peter Geoghegan




Re: min_safe_lsn column in pg_replication_slots view

2020-06-25 Thread Alvaro Herrera
On 2020-Jun-26, Michael Paquier wrote:

> On Thu, Jun 25, 2020 at 11:24:27AM -0400, Alvaro Herrera wrote:
> > I don't understand the proposal.  Michael posted a patch that adds
> > pg_wal_oldest_lsn(), and you say we should apply the patch except the
> > part that adds that function -- so what part would be applying?
> 
> I have sent last week a patch about only the removal of min_safe_lsn:
> https://www.postgresql.org/message-id/20200619121552.gh453...@paquier.xyz
> So this applies to this part.

Well, I oppose that because it leaves us with no way to monitor slot
limits.  In his opening email, Masao-san proposed to simply change the
value by adding 1.  How you go from adding 1 to a column to removing
the column completely with no recourse, is beyond me.

Let me summarize the situation and possible ways forward as I see them.
If I'm mistaken, please correct me.

Problems:
i)  pg_replication_slot.min_safe_lsn has a weird definition in that all
replication slots show the same value
ii) min_safe_lsn cannot be used with pg_walfile_name, because it returns
the name of the previous segment.

Proposed solutions:

a) Do nothing -- keep the min_safe_lsn column as is.  Warn users that
   pg_walfile_name should not be used with this column.
b) Redefine min_safe_lsn to be lsn+1, so that pg_walfile_name can be used
   and return a useful value.
c) Remove min_safe_lsn; add functions that expose the same value
d) Remove min_safe_lsn; add a new view that exposes the same value and
   possibly others
e) Replace min_safe_lsn with a "distance" column, which reports
   restart_lsn - oldest valid LSN
   (Note that you no longer have an LSN in this scenario, so you can't
   call pg_walfile_name.)

The original patch implemented (e); it was changed to its current
definition because of this[1] comment.  My proposal is to put it back.

[1] https://postgr.es/m/20171106132050.6apzynxrqrzgh...@alap3.anarazel.de

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Tomas Vondra

On Thu, Jun 25, 2020 at 09:42:33AM -0700, Jeff Davis wrote:

On Wed, 2020-06-24 at 12:31 -0700, Andres Freund wrote:

nodeAgg.c already treats those separately:

void
hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
                    Size *mem_limit, uint64 *ngroups_limit,
                    int *num_partitions)
{
    int         npartitions;
    Size        partition_mem;

    /* if not expected to spill, use all of work_mem */
    if (input_groups * hashentrysize < work_mem * 1024L)
    {
        if (num_partitions != NULL)
            *num_partitions = 0;
        *mem_limit = work_mem * 1024L;
        *ngroups_limit = *mem_limit / hashentrysize;
        return;
    }


The reason this code exists is to decide how much of work_mem to set
aside for spilling (each spill partition needs an IO buffer).

The alternative would be to fix the number of partitions before
processing a batch, which didn't seem ideal. Or, we could just ignore
the memory required for IO buffers, like HashJoin.



I think the conclusion from the recent HashJoin discussions is that not
accounting for BufFiles is an issue, and we want to fix it. So repeating
that for HashAgg would be a mistake, IMHO.


Granted, this is an example where an underestimate can give an
advantage, but I don't think we want to extend the concept into other
areas.



I agree.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Tomas Vondra

On Thu, Jun 25, 2020 at 11:16:23AM -0700, Andres Freund wrote:

Hi,

On 2020-06-25 10:44:42 -0700, Jeff Davis wrote:

There are only two possible paths: HashAgg and Sort+Group, and we need
to pick one. If the planner expects one to spill, it is likely to
expect the other to spill. If one spills in the executor, then the
other is likely to spill, too. (I'm ignoring the case with a lot of
tuples and few groups because that doesn't seem relevant.)


There's also ordered index scan + Group. Which will often be vastly
better than Sort+Group, but still slower than HashAgg.



Imagine that there was only one path available to choose. Would you
suggest the same thing, that unexpected spills can exceed work_mem but
expected spills can't?


I'm not saying what I propose is perfect, but I've yet to hear a better
proposal. Given that there *are* different ways to implement
aggregation, and that we use expected costs to choose, I think the
assumed costs are relevant.



I share Jeff's opinion that this is quite counter-intuitive and we'll
have a hard time explaining it to users.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Tomas Vondra

On Thu, Jun 25, 2020 at 02:28:02PM -0700, Jeff Davis wrote:

On Thu, 2020-06-25 at 15:56 -0400, Bruce Momjian wrote:

It is my understanding that spill of sorts is mostly read sequentially,
while hash reads are random.  Is that right?  Is that not being costed
properly?


I don't think there's a major problem with the cost model, but it could
probably use some tweaking.

Hash writes are random. The hash reads should be mostly sequential (for
large partitions it will be 128-block extents, or 1MB). The cost model
assumes 50% sequential and 50% random.



The important bit here is that while the logical writes are random,
those are effectively combined in page cache and the physical writes are
pretty sequential. So I think the cost model is fairly reasonable.

Note: Judging by iosnoop stats shared in the thread linked by Jeff.


Sorts are written sequentially and read randomly, but there's
prefetching to keep the reads from being too random. The cost model
assumes 75% sequential and 25% random.

Overall, the IO pattern is better for Sort, but not dramatically so.
Tomas Vondra did some nice analysis here:


https://www.postgresql.org/message-id/20200525021045.dilgcsmgiu4l5jpa@development

That resulted in getting the prealloc and projection patches in.

Regards,
Jeff Davis




regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Alvaro Herrera
On 2020-Jun-25, Andres Freund wrote:

> >My point here is that maybe we don't need to offer a GUC to explicitly
> >turn spilling off; it seems sufficient to let users change work_mem so
> >that spilling will naturally not occur.  Why do we need more?
> 
> That's not really a useful escape hatch, because it'll often lead to
> other nodes using more memory.

Ah -- other nodes in the same query -- you're right, that's not good.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: should libpq also require TLSv1.2 by default?

2020-06-25 Thread Bruce Momjian
On Thu, Jun 25, 2020 at 06:44:05PM -0400, Tom Lane wrote:
> BTW, the server-side report of the problem looks like
> 
> LOG:  could not accept SSL connection: wrong version number

I like that one.  ;-)

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: min_safe_lsn column in pg_replication_slots view

2020-06-25 Thread Michael Paquier
On Thu, Jun 25, 2020 at 11:24:27AM -0400, Alvaro Herrera wrote:
> I don't understand the proposal.  Michael posted a patch that adds
> pg_wal_oldest_lsn(), and you say we should apply the patch except the
> part that adds that function -- so what part would be applying?

I have sent last week a patch about only the removal of min_safe_lsn:
https://www.postgresql.org/message-id/20200619121552.gh453...@paquier.xyz
So this applies to this part.

> If the proposal is to apply just the hunk in pg_get_replication_slots
> that removes min_safe_lsn, and do nothing else in pg13, then I don't like
> it.  The feature exposes a way to monitor slots w.r.t. the maximum slot
> size; I'm okay if you prefer to express that in a different way, but I
> don't like the idea of shipping pg13 without any way to monitor it.

From what I can see, it seems to me that we have a lot of views on how
to tackle the matter.  That gives the impression that we are not really
happy with the current state of things, which is usually a sign that we
may want to redesign it, coming back to this issue for v14.

My 2c.
--
Michael




Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Andres Freund
Hi, 

On June 25, 2020 3:44:22 PM PDT, Alvaro Herrera  
wrote:
>On 2020-Jun-25, Andres Freund wrote:
>
>> > What are people doing for those cases already?  Do we have an
>> > real-world queries that are a problem in PG 13 for this?
>> 
>> I don't know about real world, but it's pretty easy to come up with
>> examples.
>> 
>> query:
>> SELECT a, array_agg(b) FROM (SELECT generate_series(1, 1)) a(a),
>(SELECT generate_series(1, 1)) b(b) GROUP BY a HAVING
>array_length(array_agg(b), 1) = 0;
>> 
>> work_mem = 4MB
>> 
>> 12      18470.012 ms
>> HEAD    44635.210 ms
>> 
>> HEAD causes ~2.8GB of file IO, 12 doesn't cause any. If you're IO
>> bandwidth constrained, this could be quite bad.
>
>... however, you can pretty much get the previous performance back by
>increasing work_mem.  I just tried your example here, and I get 32
>seconds of runtime for work_mem 4MB, and 13.5 seconds for work_mem 1GB
>(this one spills about 800 MB); if I increase that again to 1.7GB I get
>no spilling and 9 seconds of runtime.  (For comparison, 12 takes 15.7
>seconds regardless of work_mem).
>
>My point here is that maybe we don't need to offer a GUC to explicitly
>turn spilling off; it seems sufficient to let users change work_mem so
>that spilling will naturally not occur.  Why do we need more?

That's not really a useful escape hatch, because it'll often lead to other nodes
using more memory.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Alvaro Herrera
On 2020-Jun-25, Andres Freund wrote:

> > What are people doing for those cases already?  Do we have an
> > real-world queries that are a problem in PG 13 for this?
> 
> I don't know about real world, but it's pretty easy to come up with
> examples.
> 
> query:
> SELECT a, array_agg(b) FROM (SELECT generate_series(1, 1)) a(a), (SELECT 
> generate_series(1, 1)) b(b) GROUP BY a HAVING array_length(array_agg(b), 
> 1) = 0;
> 
> work_mem = 4MB
> 
> 12      18470.012 ms
> HEAD    44635.210 ms
> 
> HEAD causes ~2.8GB of file IO, 12 doesn't cause any. If you're IO
> bandwidth constrained, this could be quite bad.

... however, you can pretty much get the previous performance back by
increasing work_mem.  I just tried your example here, and I get 32
seconds of runtime for work_mem 4MB, and 13.5 seconds for work_mem 1GB
(this one spills about 800 MB); if I increase that again to 1.7GB I get
no spilling and 9 seconds of runtime.  (For comparison, 12 takes 15.7
seconds regardless of work_mem).

My point here is that maybe we don't need to offer a GUC to explicitly
turn spilling off; it seems sufficient to let users change work_mem so
that spilling will naturally not occur.  Why do we need more?
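
Concretely, that escape hatch is just the ordinary session-level
setting, e.g.:

SET work_mem = '1GB';
-- run the affected query
RESET work_mem;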

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: should libpq also require TLSv1.2 by default?

2020-06-25 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 24 Jun 2020, at 10:46, Magnus Hagander  wrote:
>> It might also be worth noting that it's not really "any protocol version", 
>> it means it will be "whatever the openssl configuration says", I think? For 
>> example, debian buster sets:
>> 
>> [system_default_sect]
>> MinProtocol = TLSv1.2
>> 
>> Which I believe means that if your libpq app is running on debian buster, it 
>> will be min v1.2 already

> Correct, that being said I'm not sure how common it is for distributions to
> set a default protocol version.  The macOS versions I have handy don't
> enforce a default version, nor do Ubuntu 20.04, FreeBSD 12 or OpenBSD 6.5 AFAICT.

Yeah, this.  I experimented with connecting current libpq to a 9.2-vintage
server I'd built with openssl 0.9.8x, and was surprised to find I couldn't
do so unless I explicitly set "ssl_min_protocol_version=tlsv1".  After
some digging I verified that that's because RHEL8's openssl.cnf sets

MinProtocol = TLSv1.2
MaxProtocol = TLSv1.3

Interestingly, Fedora 32 is laxer:

MinProtocol = TLSv1
MaxProtocol = TLSv1.3

I concur with Daniel's finding that current macOS and FreeBSD don't
enforce anything in this area.  Nonetheless, for a pretty significant
fraction of the world, our claim that the default is to not enforce
any minimum protocol version is a lie.

My feeling now is that we'd be better off defaulting
ssl_min_protocol_version to something nonempty, just to make this
behavior platform-independent.  We certainly can't leave the docs
as they are.

Also, I confirm that the failure looks like

$ psql -h ... -d "dbname=postgres sslmode=require"
psql: error: could not connect to server: SSL error: unsupported protocol

While that's not *that* awful, if you realize that "protocol" means
TLS version, many people probably won't without a hint.  It does not
help any that the message doesn't mention either the offered TLS version
or the version limits being enforced.  I'm not sure we can do anything
about the former, but reducing the number of variables affecting the
latter seems like a smart idea.

BTW, the server-side report of the problem looks like

LOG:  could not accept SSL connection: wrong version number

regards, tom lane




Re: create database with template doesn't copy database ACL

2020-06-25 Thread Bruce Momjian
On Tue, Jun 16, 2020 at 06:10:54AM -0400, Bruce Momjian wrote:
> On Mon, Jun 15, 2020 at 10:10:32AM -0400, Bruce Momjian wrote:
> > On Mon, Jun 15, 2020 at 12:14:55AM -0400, Tom Lane wrote:
> > > Bruce Momjian  writes:
> > > > Well, I thought we copied everything except things that can be specified
> > > > as different in CREATE DATABASE, though I can see why we would not copy
> > > > them.  Should we document this or issue a notice about not copying
> > > > non-default database attributes?
> > > 
> > > We do not need a notice for behavior that (a) has stood for twenty years
> > > or so, and (b) is considerably less broken than any alternative would be.
> > > If you feel the docs need improvement, have at that.
> > 
> > Well, I realize it has been this way for a long time, and that no one
> > else has complained, but there should be a way for people to know what
> > is being copied from the template and what is not.  Do we have a clear
> > description of what is copied and skipped?
> 
> We already mentioned that ALTER DATABASE settings are not copied, so the
> attached patch adds a mention that GRANT-level permissions are not
> copied either.

Patch applied to all supported versions.  Thanks for the discussion.
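
For anyone skimming the thread, a minimal sketch of the behavior the added
sentence documents (role and database names are hypothetical):

CREATE DATABASE template_app;
GRANT CONNECT, TEMPORARY ON DATABASE template_app TO app_user;
CREATE DATABASE prod_app TEMPLATE template_app;
-- prod_app starts with default privileges; the GRANT above is not copied:
SELECT datname, datacl FROM pg_database
 WHERE datname IN ('template_app', 'prod_app');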

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Jeff Davis
On Thu, 2020-06-25 at 15:56 -0400, Bruce Momjian wrote:
> It is my understanding that spill of sorts is mostly read
> sequentially,
> while hash reads are random.  Is that right?  Is that not being
> costed
> properly?

I don't think there's a major problem with the cost model, but it could
probably use some tweaking.

Hash writes are random. The hash reads should be mostly sequential (for
large partitions it will be 128-block extents, or 1MB). The cost model
assumes 50% sequential and 50% random.

Sorts are written sequentially and read randomly, but there's
prefetching to keep the reads from being too random. The cost model
assumes 75% sequential and 25% random.
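
To put rough numbers on those mixes, with the default page costs
(seq_page_cost = 1.0, random_page_cost = 4.0) the assumed cost per spilled
page works out to approximately:

  HashAgg spill read:  0.50 * 1.0 + 0.50 * 4.0 = 2.50
  Sort spill read:     0.75 * 1.0 + 0.25 * 4.0 = 1.75

so the model already charges hash spills somewhat more per page, just not
dramatically more.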

Overall, the IO pattern is better for Sort, but not dramatically so.
Tomas Vondra did some nice analysis here:


https://www.postgresql.org/message-id/20200525021045.dilgcsmgiu4l5jpa@development

That resulted in getting the prealloc and projection patches in.

Regards,
Jeff Davis






Re: CUBE_MAX_DIM

2020-06-25 Thread Alastair McKinley
> From: Tom Lane 
> Sent: 25 June 2020 17:43
>  
> Alastair McKinley  writes:
> > I know that Cube in its current form isn't suitable for nearest-neighbour 
> > searching these vectors in their raw form (I have tried recompilation with 
> > higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using Cubes 
> > can be useful for these applications.  There are other pre-processing 
> > techniques that can be used to improve the speed of the search, but it 
> > still ends up with a kNN search in a high-ish dimensional space.
> 
> Is there a way to fix the numerical instability involved?  If we could do
> that, then we'd definitely have a use-case justifying the work to make
> cube toastable.

I am not that familiar with the nature of the numerical instability, but it 
might be worth noting for additional context that for the NN use case:

- The value of each dimension is likely to be between 0 and 1 
- The L1 distance is meaningful for high numbers of dimensions, which 
*possibly* suffers less from the numeric issues than euclidean distance.

The numerical stability isn't the only issue for high dimensional kNN, the GiST 
search performance currently degrades with increasing N towards sequential scan 
performance, although maybe they are related?

> regards, tom lane

Best regards, 
Alastair



Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Andres Freund
Hi,

On 2020-06-25 14:25:12 -0400, Bruce Momjian wrote:
> I am still trying to get my head around why the spill is going to be so
> much more work to adjust for hash agg than for our other spillable nodes.

Aggregates are the classical case used to process large amounts of
data. For larger amounts of data sorted input (be it via explicit sort
or ordered index scan) isn't an attractive option. IOW hash-agg is the
common case.  There's also fewer stats for halfway accurately estimating
the number of groups and the size of the transition state - a sort /
hash join doesn't have an equivalent to the variably sized transition
value.


> What are people doing for those cases already?  Do we have any
> real-world queries that are a problem in PG 13 for this?

I don't know about real world, but it's pretty easy to come up with
examples.

query:
SELECT a, array_agg(b) FROM (SELECT generate_series(1, 1)) a(a), (SELECT 
generate_series(1, 1)) b(b) GROUP BY a HAVING array_length(array_agg(b), 1) 
= 0;

work_mem = 4MB

12  18470.012 ms
HEAD    44635.210 ms

HEAD causes ~2.8GB of file IO, 12 doesn't cause any. If you're IO
bandwidth constrained, this could be quite bad.

Obviously this is contrived, and a pretty extreme case. But imagine this
happening on a system where disk IO isn't super fast (e.g. just about any
cloud provider).

An even more extreme version of the above is this:


query: SELECT a, array_agg(b) FROM (SELECT generate_series(1, 5)) a(a), 
(SELECT generate_series(1, 1)) b(b) GROUP BY a HAVING 
array_length(array_agg(b), 1) = 0;

work_mem = 16MB
12  81598.965 ms
HEAD    210772.360 ms

temporary tablespace on magnetic disk (raid 0 of two 7.2k server
spinning disks)

12  81136.530 ms
HEAD   225182.560 ms

The disks are busy in some periods, but still keep up. If I however make
the transition state a bit bigger:

query: SELECT a, array_agg(b), count(c), max(d),max(e) FROM (SELECT 
generate_series(1, 1)) a(a), (SELECT generate_series(1, 5000)::text, 
repeat(random()::text, 10), repeat(random()::text, 10), repeat(random()::text, 
10)) b(b,c,d,e) GROUP BY a HAVING array_length(array_agg(b), 1) = 0;

12  28164.865 ms

fast ssd:
HEAD    92520.680 ms

magnetic:
HEAD    183968.538 ms

(no reads, there's plenty enough memory. Just writes because the age /
amount thresholds for dirty data are reached)

In the magnetic case we're IO bottlenecked nearly the whole time.


Just to be clear: I think this is a completely over-the-top example. But
I do think it shows the problem to some degree at least.

Greetings,

Andres Freund




Re: should libpq also require TLSv1.2 by default?

2020-06-25 Thread Tom Lane
Robert Haas  writes:
> I wonder how much of a problem this really is.

Yeah.  I find Robert's points about that pretty persuasive: by now
needing to connect to a server without TLSv1.2 support, *and* needing to
do so with SSL on, ought to be a tiny niche use case (much smaller than
the number of people who would like a more secure default).  If we can
make the error message about this be reasonably clear then I don't have
an objection to changing libpq's default.

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Bruce Momjian
On Thu, Jun 25, 2020 at 03:24:42PM -0400, Bruce Momjian wrote:
> On Thu, Jun 25, 2020 at 11:10:58AM -0700, Jeff Davis wrote:
> > On Wed, 2020-06-24 at 13:29 -0400, Tom Lane wrote:
> > > If we feel we need something to let people have the v12 behavior
> > > back, let's have
> > > (1) enable_hashagg on/off --- controls planner, same as it ever was
> > > (2) enable_hashagg_spill on/off --- controls executor by disabling
> > > spill
> > > 
> > > But I'm not really convinced that we need (2).
> > 
> > If we're not going to have a planner GUC, one alternative is to just
> > penalize the disk costs of HashAgg for a release or two. It would only
> > affect the cost of HashAgg paths that are expected to spill, which
> > weren't even generated in previous releases.
> > 
> > In other words, multiply the disk costs by enough that the planner will
> > usually not choose HashAgg if expected to spill unless the average
> > group size is quite large (i.e. there are a lot more tuples than
> > groups, but still enough groups to spill).
> 
> Well, the big question is whether this costing is actually more accurate
> than what we have now.  What I am hearing is that spilling hash agg is
> expensive, so whatever we can do to reflect the actual costs seems like
> a win.  If it can be done, it certainly seems better than a cost setting
> few people will use.

It is my understanding that spill of sorts is mostly read sequentially,
while hash reads are random.  Is that right?  Is that not being costed
properly?

That doesn't fix the misestimation case, but increasing work mem does
allow pre-PG 13 behavior there.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Open Item: Should non-text EXPLAIN always show properties?

2020-06-25 Thread Tom Lane
James Coleman  writes:
> On Thu, Jun 25, 2020 at 12:33 PM Tom Lane  wrote:
>> I think the right way to think about this is that we are building
>> an output data structure according to a schema that should be fixed
>> for any particular plan shape.  If event X happened zero times in
>> a given execution, but it could have happened in a different execution
>> of the same plan, then we should print X with a zero count.  If X
>> could not happen period in this plan, we should omit X's entry.

> Do we print zeroes for memory usage when all sorts ended up spilling
> to disk then?

I did not claim that the pre-existing code adheres to this model
completely faithfully ;-).  But we ought to have a clear mental
picture of what it is we're trying to achieve.  If you don't like
the above design, propose a different one.

regards, tom lane




Re: should libpq also require TLSv1.2 by default?

2020-06-25 Thread Robert Haas
On Wed, Jun 24, 2020 at 9:50 PM Michael Paquier  wrote:
> Yeah, and I would not be surprised to see cases where people complain
> to us about that when attempting to reach one of their old boxes,
> breaking some stuff they have been relying on for years by forcing the
> addition of a tls_min_server_protocol in the connection string.  It is
> a more important step that we enforce TLSv1.2 on the server side
> actually, and libpq just follows up automatically with that.

I wonder how much of a problem this really is. A few quick Google
searches suggests that support for TLSv1.2 was added to OpenSSL in
v1.0.1, which was released in March 2012. If packagers adopted that
version for the following PostgreSQL release, they would have had
TLSv1.2 support from PostgreSQL 9.2 onward. Some people may have taken
longer to adopt it, but even if they waited a year or two, all
versions that they built with older OpenSSL versions would now be out
of support. It doesn't seem that likely that there are going to be
that many people using pg_dump to upgrade directly from PostgreSQL 9.2
-- or even 9.4 -- to PostgreSQL 13. Skipping six or eight major
versions in a single upgrade is a little unusual, in my experience.
And even if someone does want to do that, we haven't broken it; it'll
still work fine if they are connecting through a UNIX socket, and if
not, they can disable SSL or just specify that they're OK with an
older protocol version. That doesn't seem like a big deal, especially
if we can adopt Tom's suggestion of giving them a warning about what
went wrong.

Let's also consider the other side of this argument, which is that a
decent number of PostgreSQL users are operating in environments where
they are required for regulatory compliance to prohibit the use of
TLSv1.0 and TLSv1.1. Those people will be happy if that is the default
on both the client and the server side by default. They will probably
be somewhat happy anyway, because now we have an option for it, which
we didn't before. But they'll be more happy if it's the default. Now,
we can't please everybody here. Is it more important to please people
who would like insecure TLS versions disabled by default, or to please
people who want to use insecure TLS versions to back up old servers?
Seems debatable. Based on my own experience, I'd guess there are more
users who want to avoid insecure TLS versions than there are users who
want to use them to back up very old servers, so I'd tentatively favor
changing the default. However, I don't know whether my experiences are
representative.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Open Item: Should non-text EXPLAIN always show properties?

2020-06-25 Thread James Coleman
On Thu, Jun 25, 2020 at 12:33 PM Tom Lane  wrote:
>
> Robert Haas  writes:
> > On Thu, Jun 25, 2020 at 8:42 AM James Coleman  wrote:
> >> Yesterday I'd replied [1] to Justin's proposal for this WRT
> >> incremental sort and expressed my opinion that including both
> >> unnecessarily (i.e., including disk when an in-memory sort was used)
> >> is undesirable and confusing and leads to shortcuts I believe to be
> >> bad habits when using the data programmatically.
>
> > +1.
>
> I think the policy about non-text output formats is "all applicable
> fields should be included automatically".  But the key word there is
> "applicable".  Are disk-sort numbers applicable when no disk sort
> happened?
>
> I think the right way to think about this is that we are building
> an output data structure according to a schema that should be fixed
> for any particular plan shape.  If event X happened zero times in
> a given execution, but it could have happened in a different execution
> of the same plan, then we should print X with a zero count.  If X
> could not happen period in this plan, we should omit X's entry.
>
> So the real question here is whether the disk vs memory decision is
> plan time vs run time.  AFAIK it's run time, which leads me to think
> we ought to print the zeroes.

Do we print zeroes for memory usage when all sorts ended up spilling
to disk then? That might be the current behavior; I'd have to check.
Because that's a lie, but we don't have any better information
currently (which is unfortunate, but hardly in scope for fixing here.)

James




Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Bruce Momjian
On Thu, Jun 25, 2020 at 11:10:58AM -0700, Jeff Davis wrote:
> On Wed, 2020-06-24 at 13:29 -0400, Tom Lane wrote:
> > If we feel we need something to let people have the v12 behavior
> > back, let's have
> > (1) enable_hashagg on/off --- controls planner, same as it ever was
> > (2) enable_hashagg_spill on/off --- controls executor by disabling
> > spill
> > 
> > But I'm not really convinced that we need (2).
> 
> If we're not going to have a planner GUC, one alternative is to just
> penalize the disk costs of HashAgg for a release or two. It would only
> affect the cost of HashAgg paths that are expected to spill, which
> weren't even generated in previous releases.
> 
> In other words, multiply the disk costs by enough that the planner will
> usually not choose HashAgg if expected to spill unless the average
> group size is quite large (i.e. there are a lot more tuples than
> groups, but still enough groups to spill).

Well, the big question is whether this costing is actually more accurate
than what we have now.  What I am hearing is that spilling hash agg is
expensive, so whatever we can do to reflect the actual costs seems like
a win.  If it can be done, it certainly seems better than a cost setting
few people will use.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Failures with installcheck and low work_mem value in 13~

2020-06-25 Thread Peter Geoghegan
On Fri, Jun 19, 2020 at 7:48 PM Michael Paquier  wrote:
> We cared about such plan stability in the past FWIW, see for
> example c588df9 as work_mem is a setting that people like to change.
> Why should this be different?  work_mem is a popular configuration
> setting.

The RMT met today. We determined that it wasn't worth adjusting this
test to pass with non-standard work_mem values.

"make installcheck" also fails with lower random_page_cost settings.
There doesn't seem to be any reason to permit a non-standard setting
to cause installcheck to fail elsewhere, while not tolerating the same
issue here, with work_mem.

Thanks
--
Peter Geoghegan




Re: Why forbid "INSERT INTO t () VALUES ();"

2020-06-25 Thread Robert Haas
On Thu, Jun 25, 2020 at 12:07 PM Tom Lane  wrote:
> Yeah, the multi-insert case is a plausible reason that hadn't been
> mentioned before.  On the other hand, you can already do that pretty
> painlessly:

Sure, but it means if you're writing code to generate queries
programmatically, then you have to handle the 0-column case completely
differently from all the others. Seems like unnecessary pain for no
real reason.
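
To spell that out (hypothetical table t with columns a and b):

-- one uniform statement shape for any n >= 1 columns and any number of rows:
INSERT INTO t (a, b) VALUES (1, 2), (3, 4);
-- for zero columns the generator has to switch forms, and DEFAULT VALUES
-- only inserts a single row per statement:
INSERT INTO t DEFAULT VALUES;
INSERT INTO t DEFAULT VALUES;
-- whereas the rejected syntax would keep the same shape:
-- INSERT INTO t () VALUES (), ();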

I mean, I generally agree that if the standard says that syntax X
means Y, we should either make X mean Y, or not support X. But if the
standard says that X has no meaning at all, I don't think it's a
problem for us to make it mean something logical. If we thought
otherwise, we'd have to rip out support for indexes, which would
probably not be a winning move. Now, various people, including you and
I, have made the point that it's bad to give a meaning to some piece
of syntax that is not currently part of the standard but might become
part of the standard in the future, because then we might end up with
the standard saying that X means one thing and PostgreSQL thinking
that it means something else. However, that can quickly turn into an
argument against doing anything that we happen not to like, even if
the reason we don't like it has more to do with needing a Snickers bar
than any underlying engineering reality. In a case like this, it's
hard to imagine that () can reasonably mean anything other than a
0-column tuple. It's not impossible that someone could invent another
interpretation, and there's been much discussion on this list about
how the SQL standards committee is more likely than you'd think to
come up with unusual ideas, but I still don't think it's a bad gamble,
especially given the MySQL/MariaDB precedent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Weird failures on lorikeet

2020-06-25 Thread Andrew Dunstan


On 6/25/20 12:52 PM, Alvaro Herrera wrote:
> Is there something going on with lorikeet again?  I see this:
>
> 2020-06-25 01:55:13.380 EDT [5ef43c40.21e0:85] pg_regress/typed_table LOG:  
> statement: SELECT c.oid::pg_catalog.regclass
>   FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i
>   WHERE c.oid = i.inhparent AND i.inhrelid = '21420'
> AND c.relkind != 'p' AND c.relkind != 'I'
>   ORDER BY inhseqno;
> *** starting debugger for pid 4456, tid 2644
> 2020-06-25 01:55:13.393 EDT [5ef43c40.21e0:86] pg_regress/typed_table LOG:  
> statement: SELECT c.oid::pg_catalog.regclass, c.relkind, 
> pg_catalog.pg_get_expr(c.relpartbound, c.oid)
>   FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i
>   WHERE c.oid = i.inhrelid AND i.inhparent = '21420'
>   ORDER BY pg_catalog.pg_get_expr(c.relpartbound, c.oid) = 'DEFAULT', 
> c.oid::pg_catalog.regclass::pg_catalog.text;
>   1 [main] postgres 4456 try_to_debug: Failed to start debugger, Win32 
> error 2
> 2020-06-25 01:55:13.455 EDT [5ef43c40.21e0:87] pg_regress/typed_table LOG:  
> statement: CREATE TABLE persons3 OF person_type (
>   PRIMARY KEY (id),
>   name NOT NULL DEFAULT ''
>   );
> *** continuing pid 4456 from debugger call (0)
> *** starting debugger for pid 4456, tid 2644
>   48849 [main] postgres 4456 try_to_debug: Failed to start debugger, Win32 
> error 2
> *** continuing pid 4456 from debugger call (0)
> 2020-06-25 01:55:18.181 EDT [5ef43bd6.2824:4] LOG:  server process (PID 4456) 
> was terminated by signal 11: Segmentation fault
> 2020-06-25 01:55:18.181 EDT [5ef43bd6.2824:5] DETAIL:  Failed process was 
> running: drop table some_tab cascade;
>


I don't know what that's about. I'll reboot the machine presently and
see if it persists.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Bruce Momjian
On Thu, Jun 25, 2020 at 11:02:30AM -0700, Jeff Davis wrote:
> If we have the GUCs there, then at least if someone comes to the
> mailing list with a problem, we can offer them a temporary solution,
> and have time to try to avoid the problem in a future release (tweaking
> estimates, cost model, defaults, etc.).
> 
> One idea is to have undocumented GUCs. That way we don't have to
> support them forever, and we are more likely to hear problem reports.

Uh, our track record of adding GUCs just in case is not good, and
removing them is even harder.  Undocumented sounds interesting but then
how do we even document when we remove it?  I don't think we want to go
there.  Oracle has done that, and I don't think the user experience is
good.

Maybe we should just continue through beta, add an incompatibility item
to the PG 13 release notes, and see what feedback we get.  We know
increasing work_mem gets us the exceed-work_mem behavior, but that
affects other nodes too, and I can't think of a way to avoid it if spill is
predicted except to disable hash agg for that query.

I am still trying to get my head around why the spill is going to be so
much more work to adjust for hash agg than for our other spillable nodes.  What
are people doing for those cases already?  Do we have any real-world
queries that are a problem in PG 13 for this?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Andres Freund
Hi,

On 2020-06-25 10:44:42 -0700, Jeff Davis wrote:
> There are only two possible paths: HashAgg and Sort+Group, and we need
> to pick one. If the planner expects one to spill, it is likely to
> expect the other to spill. If one spills in the executor, then the
> other is likely to spill, too. (I'm ignoring the case with a lot of
> tuples and few groups because that doesn't seem relevant.)

There's also ordered index scan + Group. Which will often be vastly
better than Sort+Group, but still slower than HashAgg.


> Imagine that there was only one path available to choose. Would you
> suggest the same thing, that unexpected spills can exceed work_mem but
> expected spills can't?

I'm not saying what I propose is perfect, but I've yet to hear a better
proposal. Given that there *are* different ways to implement
aggregation, and that we use expected costs to choose, I think the
assumed costs are relevant.

Greetings,

Andres Freund




Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Jeff Davis
On Wed, 2020-06-24 at 13:29 -0400, Tom Lane wrote:
> If we feel we need something to let people have the v12 behavior
> back, let's have
> (1) enable_hashagg on/off --- controls planner, same as it ever was
> (2) enable_hashagg_spill on/off --- controls executor by disabling
> spill
> 
> But I'm not really convinced that we need (2).

If we're not going to have a planner GUC, one alternative is to just
penalize the disk costs of HashAgg for a release or two. It would only
affect the cost of HashAgg paths that are expected to spill, which
weren't even generated in previous releases.

In other words, multiply the disk costs by enough that the planner will
usually not choose HashAgg if expected to spill unless the average
group size is quite large (i.e. there are a lot more tuples than
groups, but still enough groups to spill).

As we learn more and optimize more, we can reduce or eliminate the
penalty in a future release. I'm not sure exactly what the penalty
would be, though.

Regards,
Jeff Davis






Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Jeff Davis
On Thu, 2020-06-25 at 13:17 -0400, Bruce Momjian wrote:
> Frankly, if it took me this long to get my head around this, I am
> unclear how many people will understand this tuning feature enough to
> actually use it.

The way I think about it is that v13 HashAgg is much more consistent
with the way we do everything else: the planner costs it (including any
spilling that is expected), and the executor executes it (including any
spilling that is required to obey work_mem).

In earlier versions, HashAgg was weird. If we add GUCs to get that
weird behavior back, then the GUCs will necessarily be weird; and
therefore hard to document.

I would feel more comfortable with some kind of GUC escape hatch (or
two). GROUP BY is just too common, and I don't think we can ignore the
potential for users experiencing a regression of some type (even if, in
principle, the v13 version is better).

If we have the GUCs there, then at least if someone comes to the
mailing list with a problem, we can offer them a temporary solution,
and have time to try to avoid the problem in a future release (tweaking
estimates, cost model, defaults, etc.).

One idea is to have undocumented GUCs. That way we don't have to
support them forever, and we are more likely to hear problem reports.

Regards,
Jeff Davis






Remove a redundant condition check

2020-06-25 Thread Ádám Balogh
Hello,

A one line change to remove a duplicate check. This duplicate check was 
detected during testing my contribution to a static code analysis tool. There 
is no functional change, no new tests needed.

Regards,

Ádám Balogh
CodeChecker Team
Ericsson Hungary


v1-0001-Remove-redundant-condition.patch
Description: v1-0001-Remove-redundant-condition.patch


Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Robert Haas
On Thu, Jun 25, 2020 at 1:15 PM Jeff Davis  wrote:
> Unexpected things (meaning underestimates) are not independent. All the
> queries are based on the same stats, so if you have a lot of similar
> queries, they will all get the same underestimate at once, and all be
> surprised when they need to spill at once, and then all decide they are
> entitled to ignore work_mem at once.

Yeah, that's a risk. But what is proposed is a configuration setting,
so people can adjust it depending on what they think is likely to
happen in their environment.

> That sounds more useful and probably not too hard to implement in a
> crude form. Just have a shared counter in memory representing GB. If a
> node is about to spill, it could try to decrement the counter by N, and
> if it succeeds, it gets to exceed work_mem by N more GB.

That's a neat idea, although GB seems too coarse.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Jeff Davis
On Thu, 2020-06-25 at 09:37 -0700, Andres Freund wrote:
> > Let's say you have work_mem=32MB and a query that's expected to use
> > 16MB of memory. In reality, it uses 64MB of memory. So you are
> > saying
> > this query would get to use all 64MB of memory, right?
> > 
> > But then you run ANALYZE. Now the query is (correctly) expected to
> > use
> > 64MB of memory. Are you saying this query, executed again with
> > better
> > stats, would only get to use 32MB of memory, and therefore run
> > slower?
> 
> Yes. I think that's ok, because it was taken into account from a
> costing
> perspective in the second case.

What do you mean by "taken into account"?

There are only two possible paths: HashAgg and Sort+Group, and we need
to pick one. If the planner expects one to spill, it is likely to
expect the other to spill. If one spills in the executor, then the
other is likely to spill, too. (I'm ignoring the case with a lot of
tuples and few groups because that doesn't seem relevant.)

Imagine that there was only one path available to choose. Would you
suggest the same thing, that unexpected spills can exceed work_mem but
expected spills can't?

Regards,
Jeff Davis






Re: [patch] demote

2020-06-25 Thread Jehan-Guillaume de Rorthais
Hi,

Here is a summary of my work during the last few days on this demote approach.

Please, find in attachment v2-0001-Demote-PoC.patch and the comments in the
commit message and as FIXME in code.

The patch is not finished or bug-free yet. I'm still not very happy with the
coding style, and it probably lacks some more code documentation, but a lot has
changed since v1. It's still a PoC to push the discussion a bit further after
having been silent myself for some days.

The patch is currently relying on a demote checkpoint. I understand the
overhead of a forced checkpoint can be massive and cause major wait/downtime,
but I keep this for a later step. Maybe we should be able to cancel a running
checkpoint? Or leave it to its syncing work but discard the result without
writing it to XLog?

I haven't had time to investigate Robert's concern about shared memory for
snapshots during recovery.

The patch doesn't deal with prepared xacts yet. Testing "start->demote->promote"
raises an assert if some prepared xacts exist. I suppose I will roll them back
during demote in the next patch version.

I'm not sure how to divide this patch into multiple small independent steps. I
suppose I can split it like:

1. add demote checkpoint
2. support demote: mostly postmaster, startup/xlog and checkpointer related
   code
3. cli using pg_ctl demote

...But I'm not sure it's worth it.

Regards,
>From 03c41dd706648cd20df90a128db64eee6b6dad97 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Fri, 10 Apr 2020 18:01:45 +0200
Subject: [PATCH] Demote PoC

Changes:

* creates a demote checkpoint
* use DB_DEMOTING state in controlfile
* try to handle subsystems init correctly during demote
* keep some sub-processes alive:
  stat collector, checkpointer, bgwriter and optionally archiver or wal
  senders
* add signal PMSIGNAL_DEMOTING to start the startup process after the
  demote checkpoint
* ShutdownXLOG takes a boolean arg to handle demote differently

Trivial manual tests:

* make check : OK
* make check-world : OK
* start in production -> demote -> demote: OK
* start in production -> demote -> stop : OK
* start in production -> demote -> promote : NOK (2PC, see TODO)
  but OK with no prepared xact.

Discuss/Todo:

* rollback prepared xact
* cancel/kill active/idle in xact R/W backends
  * pg_demote() function?
* some more code reviewing around StartupXlog
* investigate snapshots shmem needs/init during recovery compare to
  production
* add tap tests
* add doc
* how to handle checkpoint?
---
 src/backend/access/rmgrdesc/xlogdesc.c  |   9 +-
 src/backend/access/transam/xlog.c   | 287 +++-
 src/backend/postmaster/checkpointer.c   |  22 ++
 src/backend/postmaster/postmaster.c | 250 -
 src/backend/storage/ipc/procsignal.c|   4 +
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/bin/pg_ctl/pg_ctl.c | 111 +
 src/include/access/xlog.h   |  18 +-
 src/include/catalog/pg_control.h|   2 +
 src/include/libpq/libpq-be.h|   7 +-
 src/include/postmaster/bgwriter.h   |   1 +
 src/include/storage/pmsignal.h  |   1 +
 src/include/storage/procsignal.h|   1 +
 src/include/utils/pidfile.h |   1 +
 14 files changed, 537 insertions(+), 179 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 1cd97852e8..5aeaff18f8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -40,7 +40,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
 	if (info == XLOG_CHECKPOINT_SHUTDOWN ||
-		info == XLOG_CHECKPOINT_ONLINE)
+		info == XLOG_CHECKPOINT_ONLINE ||
+		info == XLOG_CHECKPOINT_DEMOTE)
 	{
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
@@ -65,7 +66,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		 checkpoint->oldestCommitTsXid,
 		 checkpoint->newestCommitTsXid,
 		 checkpoint->oldestActiveXid,
-		 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
+		 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" :
+			(info == XLOG_CHECKPOINT_DEMOTE)? "demote" : "online");
 	}
 	else if (info == XLOG_NEXTOID)
 	{
@@ -185,6 +187,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_CHECKPOINT_DEMOTE:
+			id = "CHECKPOINT_DEMOTE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e455384b5b..0e18e546ba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6301,6 +6301,13 @@ CheckRequiredParameterValues(void)
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
+/*
+ * FIXME demote: part of the code here assume there's no other active
+ * processes before signal PMSIGNAL_RECOVERY_STARTED is sent.
+ *
+ * FIXME demote: rollback prepared xact during demote?
+ 

Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Bruce Momjian
On Thu, Jun 25, 2020 at 11:46:54AM -0400, Robert Haas wrote:
> On Wed, Jun 24, 2020 at 7:38 PM Bruce Momjian  wrote:
> > I think my main point is that work_mem was not being honored for
> > hash-agg before, but now that PG 13 can do it, we are again allowing
> > work_mem not to apply in certain cases.  I am wondering if our hard
> > limit for work_mem is the issue, and we should make that more flexible
> > for all uses.
> 
> I mean, that's pretty much what we're talking about here, isn't it? It
> seems like in your previous two replies you were opposed to separating
> the plan-type limit from the execution-time limit, but that idea is
> precisely a way of being more flexible (and extending it to other plan
> nodes is a way of making it more flexible for more use cases).

I think it was Tom who was complaining about plan vs. execution time
control.

> As I think you know, if you have a system where the workload varies a
> lot, you may sometimes be using 0 copies of work_mem and at other
> times 1000 or more copies, so the value has to be chosen
> conservatively as a percentage of system memory, else you start
> swapping or the OOM killer gets involved. On the other hand, some plan
> nodes get a lot less efficient when the amount of memory available
> falls below some threshold, so you can't just set this to a tiny value
> and forget about it. Because the first problem is so bad, most people
> set the value relatively conservatively and just live with the
> performance consequences. But this also means that they have memory
> left over most of the time, so the idea of letting a node burst above
> its work_mem allocation when something unexpected happens isn't crazy:
> as long as only a few nodes do that here and there, rather than, say,
> all the nodes doing it all at the same time, it's actually fine. If we
> had a smarter system that could dole out more work_mem to nodes that
> would really benefit from it and less to nodes where it isn't likely
> to make much difference, that would be similar in spirit but even
> better.

I think the issue is that in PG 13 work_mem controls sorts and hashes
with a new hard limit for hash aggregation:


https://www.postgresql.org/docs/12/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY

Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash
tables are used in hash joins, hash-based aggregation, and hash-based
processing of IN subqueries.

In pre-PG 13, we "controlled" it by avoiding hash-based aggregation if it
was expected to exceed work_mem, but if we assumed it would be less
than work_mem and it was more, we exceeded work_mem allocation for that
node.  In PG 13, we "limit" memory to work_mem and spill to disk if we
exceed it.

We should really have always documented that hash agg could exceed
work_mem for misestimation, and if we add a hash_agg work_mem
misestimation bypass setting we should document this setting in work_mem
as well.

But then the question is why do we allow this bypass only for hash agg? 
Should work_mem have settings for ORDER BY, merge join, hash join, and
hash agg, e.g.:

work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB'

Yeah, crazy syntax, but you get the idea.  I understand some nodes are
more sensitive to disk spill than others, so shouldn't we be controlling
this at the work_mem level, rather than for a specific node type like
hash agg?  We could allow for misestimation over allocation of hash agg
work_mem by splitting up the hash agg values:

work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB 
hash_agg_max=200MB'

but _avoiding_ hash agg if it is estimated to exceed work mem and spill
to disk is not something to logically control at the work mem level,
which leads so something like David Rowley suggested, but with different
names:

enable_hashagg = on | soft | avoid | off

where 'on' and 'off' are the current PG 13 behavior, 'soft' means to
treat work_mem as a soft limit and allow it to exceed work mem for
misestimation, and 'avoid' means to avoid hash agg if it is estimated to
exceed work mem.  Both 'soft' and 'avoid' don't spill to disk.

David's original terms of "trynospill" and "neverspill" were focused on
spilling, not on its interaction with work_mem, and I found that
confusing.

Frankly, if it took me this long to get my head around this, I am
unclear how many people will understand this tuning feature enough to
actually use it.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Jeff Davis
On Thu, 2020-06-25 at 11:46 -0400, Robert Haas wrote:
> Because the first problem is so bad, most people
> set the value relatively conservatively and just live with the
> performance consequences. But this also means that they have memory
> left over most of the time, so the idea of letting a node burst above
> its work_mem allocation when something unexpected happens isn't
> crazy:
> as long as only a few nodes do that here and there, rather than, say,
> all the nodes doing it all at the same time, it's actually fine.

Unexpected things (meaning underestimates) are not independent. All the
queries are based on the same stats, so if you have a lot of similar
queries, they will all get the same underestimate at once, and all be
surprised when they need to spill at once, and then all decide they are
entitled to ignore work_mem at once.

>  If we
> had a smarter system that could dole out more work_mem to nodes that
> would really benefit from it and less to nodes where it isn't likely
> to make much difference, that would be similar in spirit but even
> better.

That sounds more useful and probably not too hard to implement in a
crude form. Just have a shared counter in memory representing GB. If a
node is about to spill, it could try to decrement the counter by N, and
if it succeeds, it gets to exceed work_mem by N more GB.

Regards,
Jeff Davis






Re: Weird failures on lorikeet

2020-06-25 Thread Alvaro Herrera
Is there something going on with lorikeet again?  I see this:

2020-06-25 01:55:13.380 EDT [5ef43c40.21e0:85] pg_regress/typed_table LOG:  
statement: SELECT c.oid::pg_catalog.regclass
FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i
WHERE c.oid = i.inhparent AND i.inhrelid = '21420'
  AND c.relkind != 'p' AND c.relkind != 'I'
ORDER BY inhseqno;
*** starting debugger for pid 4456, tid 2644
2020-06-25 01:55:13.393 EDT [5ef43c40.21e0:86] pg_regress/typed_table LOG:  
statement: SELECT c.oid::pg_catalog.regclass, c.relkind, 
pg_catalog.pg_get_expr(c.relpartbound, c.oid)
FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i
WHERE c.oid = i.inhrelid AND i.inhparent = '21420'
ORDER BY pg_catalog.pg_get_expr(c.relpartbound, c.oid) = 'DEFAULT', 
c.oid::pg_catalog.regclass::pg_catalog.text;
  1 [main] postgres 4456 try_to_debug: Failed to start debugger, Win32 
error 2
2020-06-25 01:55:13.455 EDT [5ef43c40.21e0:87] pg_regress/typed_table LOG:  
statement: CREATE TABLE persons3 OF person_type (
PRIMARY KEY (id),
name NOT NULL DEFAULT ''
);
*** continuing pid 4456 from debugger call (0)
*** starting debugger for pid 4456, tid 2644
  48849 [main] postgres 4456 try_to_debug: Failed to start debugger, Win32 
error 2
*** continuing pid 4456 from debugger call (0)
2020-06-25 01:55:18.181 EDT [5ef43bd6.2824:4] LOG:  server process (PID 4456) 
was terminated by signal 11: Segmentation fault
2020-06-25 01:55:18.181 EDT [5ef43bd6.2824:5] DETAIL:  Failed process was 
running: drop table some_tab cascade;


On 2018-Mar-06, Thomas Munro wrote:
> On Thu, Feb 22, 2018 at 7:06 AM, Andres Freund  wrote:

> > I noticed your animal lorikeet failed in the last two runs:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet=2018-02-21%2009%3A47%3A17
> > TRAP: FailedAssertion("!(((PageHeader) (page))->pd_special >= 
> > (__builtin_offsetof (PageHeaderData, pd_linp)))", File: 
> > "/home/andrew/bf64/root/HEAD/pgsql/src/include/storage/bufpage.h", Line: 
> > 313)
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet=2018-02-20%2012%3A46%3A17
> > TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: 
> > "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/pmsignal.c",
> >  Line: 229)
> > 2018-02-20 08:07:14.054 EST [5a8c1c3b.21d0:3] LOG:  select() failed in 
> > postmaster: Bad address
> > 2018-02-20 08:07:14.073 EST [5a8c1c3b.21d0:4] LOG:  database system is shut 
> > down

> Bad memory?  Assorted output from recent runs:
> 
> + ERROR:  index "pg_toast_28546_index" contains corrupted page at block 1
> 
> TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)",
> File: 
> "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/pmsignal.c",
> Line: 229)
> 
> TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File:
> "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/pgstat.c",
> Line: 871)


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: CUBE_MAX_DIM

2020-06-25 Thread Tom Lane
Alastair McKinley  writes:
> I know that Cube in its current form isn't suitable for nearest-neighbour 
> searching these vectors in their raw form (I have tried recompilation with 
> higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using Cubes 
> can be useful for these applications.  There are other pre-processing 
> techniques that can be used to improve the speed of the search, but it still 
> ends up with a kNN search in a high-ish dimensional space.

Is there a way to fix the numerical instability involved?  If we could do
that, then we'd definitely have a use-case justifying the work to make
cube toastable.

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Jeff Davis
On Wed, 2020-06-24 at 12:31 -0700, Andres Freund wrote:
> nodeAgg.c already treats those separately:
> 
> void
> hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
>                     Size *mem_limit, uint64 *ngroups_limit,
>                     int *num_partitions)
> {
>     int         npartitions;
>     Size        partition_mem;
>
>     /* if not expected to spill, use all of work_mem */
>     if (input_groups * hashentrysize < work_mem * 1024L)
>     {
>         if (num_partitions != NULL)
>             *num_partitions = 0;
>         *mem_limit = work_mem * 1024L;
>         *ngroups_limit = *mem_limit / hashentrysize;
>         return;
>     }

The reason this code exists is to decide how much of work_mem to set
aside for spilling (each spill partition needs an IO buffer).

The alternative would be to fix the number of partitions before
processing a batch, which didn't seem ideal. Or, we could just ignore
the memory required for IO buffers, like HashJoin.

Granted, this is an example where an underestimate can give an
advantage, but I don't think we want to extend the concept into other
areas.

Regards,
Jeff Davis






Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Andres Freund
On 2020-06-25 09:24:52 -0700, Jeff Davis wrote:
> On Wed, 2020-06-24 at 12:14 -0700, Andres Freund wrote:
> > E.g. if the plan isn't expected to spill,
> > only spill at 10 x work_mem or something like that.
> 
> Let's say you have work_mem=32MB and a query that's expected to use
> 16MB of memory. In reality, it uses 64MB of memory. So you are saying
> this query would get to use all 64MB of memory, right?
> 
> But then you run ANALYZE. Now the query is (correctly) expected to use
> 64MB of memory. Are you saying this query, executed again with better
> stats, would only get to use 32MB of memory, and therefore run slower?

Yes. I think that's ok, because it was taken into account from a costing
perspective in the second case.




Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

2020-06-25 Thread Andres Freund
Hi,

Thanks for picking this up again!

On 2020-06-25 13:50:37 +1200, David Rowley wrote:
> In the attached, I've come up with a way that works. Basically I've
> just added a function named errstart_cold() that is attributed with
> __attribute__((cold)), which will hint to the compiler to keep
> branches which call this function in a cold path.

I recall you trying this before? Has that gotten easier because we
evolved ereport()/elog(), or has gcc become smarter, or ...?


> To make the compiler properly mark just >= ERROR calls as cold, and
> only when the elevel is a constant, I modified the ereport_domain
> macro to do:
> 
> if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
> errstart_cold(elevel, domain) : \
> errstart(elevel, domain)) \

I think it'd be good to not just do this for ERROR, but also for <=
DEBUG1. I recall seeing quite a few debug elogs that made the code worse
just by "being there".

I suspect that doing this for DEBUG* could also improve the code for
clang, because we obviously don't have __builtin_unreachable after those.


> I see no reason why the compiler shouldn't always fold that constant
> expression at compile-time and correctly select the correct version of
> the function for the job.  (I also tried using __builtin_choose_expr()
> but GCC complained when the elevel was not constant, even with the
> __builtin_constant_p() test in the condition)

I think it has to be constant in all paths for that...


> Master:
> drowley@amd3990x:~$ pgbench -S -T 120 postgres
> tps = 25245.903255 (excluding connections establishing)
> tps = 26144.454208 (excluding connections establishing)
> tps = 25931.850518 (excluding connections establishing)
> 
> Master + elog_ereport_attribute_cold.patch
> drowley@amd3990x:~$ pgbench -S -T 120 postgres
> tps = 28351.480631 (excluding connections establishing)
> tps = 27763.656557 (excluding connections establishing)
> tps = 28896.427929 (excluding connections establishing)

That is pretty damn cool.


> diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
> index e976201030..8076e8af24 100644
> --- a/src/backend/utils/error/elog.c
> +++ b/src/backend/utils/error/elog.c
> @@ -219,6 +219,19 @@ err_gettext(const char *str)
>  #endif
>  }
>  
> +#if defined(HAVE_PG_ATTRIBUTE_HOT_AND_COLD) && 
> defined(HAVE__BUILTIN_CONSTANT_P)
> +/*
> + * errstart_cold
> + *   A simple wrapper around errstart, but hinted to be cold so that 
> the
> + *   compiler is more likely to put error code in a cold area away 
> from the
> + *   main function body.
> + */
> +bool
> +pg_attribute_cold errstart_cold(int elevel, const char *domain)
> +{
> + return errstart(elevel, domain);
> +}
> +#endif

Hm. Would it make sense to have this be a static inline?


>  /*
>   * errstart --- begin an error-reporting cycle
> diff --git a/src/include/c.h b/src/include/c.h
> index d72b23afe4..087b8af6cb 100644
> --- a/src/include/c.h
> +++ b/src/include/c.h
> @@ -178,6 +178,21 @@
>  #define pg_noinline
>  #endif
>  
> +/*
> + * Marking certain functions as "hot" or "cold" can be useful to assist the
> + * compiler in arranging the assembly code in a more efficient way.
> + * These are supported from GCC >= 4.3 and clang >= 3.2
> + */
> +#if (defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ 
> >= 3))) || \
> + (defined(__clang__) && (__clang_major__ > 3 || (__clang_major__ == 3 && 
> __clang_minor__ >= 2)))
> +#define HAVE_PG_ATTRIBUTE_HOT_AND_COLD 1
> +#define pg_attribute_hot __attribute__((hot))
> +#define pg_attribute_cold __attribute__((cold))
> +#else
> +#define pg_attribute_hot
> +#define pg_attribute_cold
> +#endif

Wonder if we should start using __has_attribute() for things like this.

https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fattribute.html

I.e. we could do something like
#ifndef __has_attribute
#define __has_attribute(attribute) 0
#endif

#if __has_attribute(hot)
#define pg_attribute_hot __attribute__((hot))
#else
#define pg_attribute_hot
#endif

clang added __has_attribute in 2.9 (2010), gcc added it in 5 (2014), so
I don't think we'd lose too much.




>  #ifdef HAVE__BUILTIN_CONSTANT_P
> +#ifdef HAVE_PG_ATTRIBUTE_HOT_AND_COLD
> +#define ereport_domain(elevel, domain, ...)  \
> + do { \
> + pg_prevent_errno_in_scope(); \
> + if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
> +  errstart_cold(elevel, domain) : \
> +  errstart(elevel, domain)) \
> + __VA_ARGS__, errfinish(__FILE__, __LINE__, 
> PG_FUNCNAME_MACRO); \
> + if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
> + pg_unreachable(); \
> + } while(0)
> +#else/* 
> !HAVE_PG_ATTRIBUTE_HOT_AND_COLD */
>  #define ereport_domain(elevel, domain, ...)  \
>   do { \
>   pg_prevent_errno_in_scope(); \
> @@ -129,6 +141,7 @@
>   

Re: Why forbid "INSERT INTO t () VALUES ();"

2020-06-25 Thread Isaac Morland
On Wed, 24 Jun 2020 at 08:18, Fabien COELHO  wrote:

> I would like to create an "all defaults" row, i.e. a row composed of the
> default values for all attributes, so I wrote:
>
>INSERT INTO t() VALUES ();
>
> This is forbidden by postgres, and also sqlite.
>

This is not the only area where empty tuples are not supported. Consider:

PRIMARY KEY ()

This should mean the table may only contain a single row, but is not
supported.

Also, GROUP BY supports grouping by no columns, but not in a systematic
way: Using aggregate functions with no explicit GROUP BY clause will result
in grouping by no columns (i.e., entire result set is one group); I also
found that I could GROUP BY NULL::integer, abusing the column number
syntax. But things like GROUP BY ROLLUP () are not supported.

On the plus side, empty rows are supported, although the explicit ROW
keyword is required.
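
For the record, the behaviours mentioned above look roughly like this:

-- aggregate with no GROUP BY: the whole result set is a single group
SELECT count(*) FROM generate_series(1, 10) g;
-- grouping by a constant (non-integer-literal) expression: also one group
SELECT count(*) FROM generate_series(1, 10) g GROUP BY NULL::integer;
-- an empty row is accepted, but only with the explicit ROW keyword
SELECT ROW();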


Re: Open Item: Should non-text EXPLAIN always show properties?

2020-06-25 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jun 25, 2020 at 8:42 AM James Coleman  wrote:
>> Yesterday I'd replied [1] to Justin's proposal for this WRT
>> incremental sort and expressed my opinion that including both
>> unnecessarily (i.e., including disk when an in-memory sort was used)
>> is undesirable and confusing and leads to shortcuts I believe to be
>> bad habits when using the data programmatically.

> +1.

I think the policy about non-text output formats is "all applicable
fields should be included automatically".  But the key word there is
"applicable".  Are disk-sort numbers applicable when no disk sort
happened?

I think the right way to think about this is that we are building
an output data structure according to a schema that should be fixed
for any particular plan shape.  If event X happened zero times in
a given execution, but it could have happened in a different execution
of the same plan, then we should print X with a zero count.  If X
could not happen period in this plan, we should omit X's entry.

So the real question here is whether the disk vs memory decision is
plan time vs run time.  AFAIK it's run time, which leads me to think
we ought to print the zeroes.

regards, tom lane




Re: CUBE_MAX_DIM

2020-06-25 Thread Alastair McKinley
> 
> Devrim Gündüz  writes:
> > Someone contacted me about increasing CUBE_MAX_DIM
> > in contrib/cube/cubedata.h (in the community RPMs). The current value
> > is 100 with the following comment:
> 
> > * This limit is pretty arbitrary, but don't make it so large that you
> > * risk overflow in sizing calculations.
> 
> > They said they use 500, and never had a problem.
> 
> I guess I'm wondering what's the use-case.  100 already seems an order of
> magnitude more than anyone could want.  Or, if it's not enough, why does
> raising the limit just 5x enable any large set of new applications?

The dimensionality of embeddings generated by deep neural networks can be high. 
Google BERT has 768 dimensions for example.

I know that Cube in its current form isn't suitable for nearest-neighbour 
searching these vectors in their raw form (I have tried recompilation with 
higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using Cubes can 
be useful for these applications.  There are other pre-processing techniques 
that can be used to improved the speed of the search, but it still ends up with 
a kNN search in a high-ish dimensional space.

> The practical issue here is that, since the data requires 16 bytes per
> dimension (plus a little bit of overhead), we'd be talking about
> increasing the maximum size of a cube field from ~ 1600 bytes to ~ 8000
> bytes.  And cube is not toastable, so that couldn't be compressed or
> shoved out-of-line.  Maybe your OP never had a problem with it, but
> plenty of use-cases would have "tuple too large" failures due to not
> having room on a heap page for whatever other data they want in the row.
> 
> Even a non-toastable 2KB field is going to give the tuple toaster
> algorithm problems, as it'll end up shoving every other toastable field
> out-of-line in an ultimately vain attempt to bring the tuple size below
> 2KB.  So I'm really quite hesitant to raise CUBE_MAX_DIM much past where
> it is now without any other changes.
> 
> A more credible proposal would be to make cube toast-aware and then
> raise the limit to ~1GB ... but that would take a significant amount
> of work, and we still haven't got a use-case justifying it.
> 
> I think I'd counsel storing such data as plain float8 arrays, which
> do have the necessary storage infrastructure.  Is there something
> about the cube operators that's particularly missing?
> 

The indexable nearest-neighbour searches are one of the great cube features not 
available with float8 arrays.
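
To be concrete, the kind of query meant here is an index-ordered k-NN
search.  A minimal sketch, assuming the cube extension and a made-up
"items" table (real embeddings would of course have hundreds of
dimensions, which is where CUBE_MAX_DIM gets in the way):

=# CREATE TABLE items (id int, embedding cube);
=# CREATE INDEX ON items USING gist (embedding);
=# SELECT id FROM items
   ORDER BY embedding <-> cube(array[0.11, 0.52, 0.97])  -- Euclidean distance
   LIMIT 10;

The GiST index can return rows in distance order, so the LIMIT avoids
scanning everything; there is no equivalent ordering operator for plain
float8 arrays out of the box.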

> regards, tom lane

Best regards, 
Alastair









Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Jeff Davis
On Wed, 2020-06-24 at 12:14 -0700, Andres Freund wrote:
> E.g. if the plan isn't expected to spill,
> only spill at 10 x work_mem or something like that.

Let's say you have work_mem=32MB and a query that's expected to use
16MB of memory. In reality, it uses 64MB of memory. So you are saying
this query would get to use all 64MB of memory, right?

But then you run ANALYZE. Now the query is (correctly) expected to use
64MB of memory. Are you saying this query, executed again with better
stats, would only get to use 32MB of memory, and therefore run slower?

Regards,
Jeff Davis






Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Jeff Davis
On Wed, 2020-06-24 at 15:28 -0400, Robert Haas wrote:
> On Wed, Jun 24, 2020 at 3:14 PM Andres Freund 
> wrote:
> > FWIW, my gut feeling is that we'll end up have to separate the
> > "execution time" spilling from using plain work mem, because it'll
> > trigger spilling too often. E.g. if the plan isn't expected to
> > spill,
> > only spill at 10 x work_mem or something like that.  Or we'll need
> > better management of temp file data when there's plenty memory
> > available.

...

> I think that's actually pretty appealing. Separating the memory we
> plan to use from the memory we're willing to use before spilling
> seems
> like a good idea in general, and I think we should probably also do
> it
> in other places - like sorts.

I'm trying to make sense of this. Let's say there are two GUCs:
planner_work_mem=16MB and executor_work_mem=32MB.

And let's say a query comes along and generates a HashAgg path, and the
planner (correctly) thinks if you put all the groups in memory at once,
it would be 24MB. Then the planner, using planner_work_mem, would think
spilling was necessary, and generate a cost that involves spilling.

Then it's going to generate a Sort+Group path, as well. And perhaps it
estimates that sorting all of the tuples in memory would also take
24MB, so it generates a cost that involves spilling to disk.

But it has to choose one of them. We've penalized plans at risk of
spilling to disk, but what's the point? The planner needs to choose one
of them, and both are at risk of spilling to disk.

Regards,
Jeff Davis






Re: Why forbid "INSERT INTO t () VALUES ();"

2020-06-25 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jun 25, 2020 at 12:56 AM Fabien COELHO  wrote:
>> It also means that if for some reason someone wants to insert several such
>> all-default rows, they have to repeat the insert, as "VALUES (), ();"
>> would not work, so it is also losing a corner-corner case capability
>> without obvious reason.

> That, and a desire to make things work in PostgreSQL that work in
> MySQL, seems like a good-enough reason to me, but YMMV.

Yeah, the multi-insert case is a plausible reason that hadn't been
mentioned before.  On the other hand, you can already do that pretty
painlessly:

regression=# create table foo(x float8 default random());
CREATE TABLE
regression=# insert into foo select from generate_series(1,10);
INSERT 0 10
regression=# table foo;
          x          
---------------------
 0.08414037203059621
  0.2921176461398325
  0.8760821189460586
  0.6266325419285828
  0.9946880079739273
  0.4547070342142696
 0.09683985675118834
  0.3172576600666268
  0.5122428845812195
  0.8823697407826394
(10 rows)

So I'm still not convinced we should do this.  "MySQL is incapable
of conforming to the standard" is a really lousy reason for us to do
something.
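
For completeness, a single all-default row is already expressible with
the standard DEFAULT VALUES clause; it's only the multi-row form that
has no direct spelling:

regression=# insert into foo default values;
INSERT 0 1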

Anyway, to answer Fabien's question about why things are like this:
the standard doesn't allow zero-column tables, so most of these
syntactic edge cases are forbidden on that ground.  We decided we
didn't like that restriction (because, for example, it creates a
painful special case for DROP COLUMN).  So we've adjusted a minimal
set of syntactic edge cases to go along with that semantic change.
There's room to argue that INSERT's edge case should be included,
but there's also room to argue that it doesn't need to be.

regards, tom lane




Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

2020-06-25 Thread Robert Haas
On Wed, Jun 24, 2020 at 9:51 PM David Rowley  wrote:
> $ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch
>
> Which is about a 1.42% increase in performance. That's not exactly
> groundbreaking, but pretty useful to have if that happens to apply
> across the board for execution performance.
>
> For pgbench -S:
>
> My results were a bit noisier than the TPCH test, but the results I
> obtained did show about a 10% increase in performance:

This is pretty cool, particularly because it affects single-client
performance. It seems like a lot of ideas people have had about
speeding up pgbench performance - including me - have improved
performance under concurrency at the cost of very slightly degrading
single-client performance. It would be nice to claw some of that back.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Default setting for enable_hashagg_disk

2020-06-25 Thread Robert Haas
On Wed, Jun 24, 2020 at 7:38 PM Bruce Momjian  wrote:
> I think my main point is that work_mem was not being honored for
> hash-agg before, but now that PG 13 can do it, we are again allowing
> work_mem not to apply in certain cases.  I am wondering if our hard
> limit for work_mem is the issue, and we should make that more flexible
> for all uses.

I mean, that's pretty much what we're talking about here, isn't it? It
seems like in your previous two replies you were opposed to separating
the plan-type limit from the execution-time limit, but that idea is
precisely a way of being more flexible (and extending it to other plan
nodes is a way of making it more flexible for more use cases).

As I think you know, if you have a system where the workload varies a
lot, you may sometimes be using 0 copies of work_mem and at other
times 1000 or more copies, so the value has to be chosen
conservatively as a percentage of system memory, else you start
swapping or the OOM killer gets involved. On the other hand, some plan
nodes get a lot less efficient when the amount of memory available
falls below some threshold, so you can't just set this to a tiny value
and forget about it. Because the first problem is so bad, most people
set the value relatively conservatively and just live with the
performance consequences. But this also means that they have memory
left over most of the time, so the idea of letting a node burst above
its work_mem allocation when something unexpected happens isn't crazy:
as long as only a few nodes do that here and there, rather than, say,
all the nodes doing it all at the same time, it's actually fine. If we
had a smarter system that could dole out more work_mem to nodes that
would really benefit from it and less to nodes where it isn't likely
to make much difference, that would be similar in spirit but even
better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PostgreSQL and big data - FDW

2020-06-25 Thread Bruce Momjian
On Thu, Jun 25, 2020 at 07:02:37AM +, ROS Didier wrote:
> Hi Stephen
> 
> My EDF company is very interested in this feature (KERBEROS authentication
> method and hdfs_fdw).
> Is it possible to know how many days of development this represents? Who
> can develop this implementation? At what cost?

Uh, the only thing I can suggest is to contact one of the larger
Postgres support companies (ones that have developers who understand the
server code, or at least the FDW code), and ask them for estimates.  The
community really can't supply any of that, unless you want to do the
work and want source code tips.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Open Item: Should non-text EXPLAIN always show properties?

2020-06-25 Thread Robert Haas
On Thu, Jun 25, 2020 at 8:42 AM James Coleman  wrote:
> Yesterday I'd replied [1] to Justin's proposal for this WRT
> incremental sort and expressed my opinion that including both
> unnecessarily (i.e., including disk when an in-memory sort was used)
> is undesirable and confusing and leads to shortcuts I believe to be
> bad habits when using the data programmatically.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel Seq Scan vs kernel read ahead

2020-06-25 Thread Robert Haas
On Tue, Jun 23, 2020 at 11:53 PM David Rowley  wrote:
> In summary, based on these tests, I don't think we're making anything
> worse in regards to synchronize_seqscans if we cap the maximum number
> of blocks to allocate to each worker at once to 8192. Perhaps there's
> some argument for using something smaller than that for servers with
> very little RAM, but I don't personally think so as it still depends
on the table size and it's hard to imagine tables in the hundreds of
> GBs on servers that struggle with chunk allocations of 16MB.  The
> table needs to be at least ~70GB to get a 8192 chunk size with the
> current v2 patch settings.

Nice research. That makes me happy. I had a feeling the maximum useful
chunk size ought to be more in this range than the larger values we
were discussing before, but I didn't even think about the effect on
synchronized scans.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: xid wraparound danger due to INDEX_CLEANUP false

2020-06-25 Thread Robert Haas
On Wed, Jun 24, 2020 at 4:02 PM Peter Geoghegan  wrote:
> > Apparently, what we're really doing here is that even when
> > INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples.
> > That seems sad, but if it's what we have to do then it at least needs
> > comments explaining it.
>
> +1. Though I think that AMs should technically have the right to
> consider it advisory.

I'm not really convinced. I agree that from a theoretical point of
view an index can have arbitrary needs and is the arbiter of its own
needs, but when I pull the emergency brake, I want the vehicle to
stop, not think about stopping. There's a fine argument for the idea
that depressing the regular brake pedal entitles the vehicle to
exercise some discretion, and on modern cars it does (think ABS, if
nothing else). But pulling the emergency brake is a statement that I
wish to override any contrary judgement about whether stopping is a
good idea. I think this option is rightly viewed as an emergency
brake, and giving AMs the right to decide that we'll instead pull off
at the next exit doesn't sit well with me. At the end of the day, the
human being should be in charge, not the program.

(Great, now Skynet will be gunning for me...)

> 1. How often is that likely to happen in The Real World™?
>
> 2. If we fail to do cleanup and leak already-deleted pages, how bad is
> that? ( Both in general, and in the worst case.)
>
> I'll hazard a guess for 1: I think that it might not come up that
> often. Page deletion is often something that we hardly ever need. And,
> unlike some DB systems, we only do it when pages are fully empty
> (which, as it turns out, isn't necessarily better than our simple
> approach [1]). I tend to think it's unlikely to happen in cases where
> INDEX_CLEANUP is used, because those are cases that also must not have
> that much index churn to begin with.

I don't think I believe this. All you need is one small range-deletion, right?

> Then there's question 2. The intuition behind the approach from
> Sawada-san's patch was that allowing wraparound here feels wrong -- it
> should be in the AM's hands. However, it's not like I can point to
> some ironclad guarantee about not leaking deleted pages that existed
> before the INDEX_CLEANUP feature. We know that the FSM is not crash
> safe, and that's that. Is this really all that different? Maybe it is,
> but it seems like a quantitative difference to me.

I don't think I believe this, either. In the real-world example to
which you linked, the user ran REINDEX afterward to recover from index
bloat, and we could advise other people who use this option that it
may leak space that a subsequent VACUUM may fail to recover, and
therefore they too should consider REINDEX. Bloat sucks and I hate it,
but in the vehicle analogy from up above, it's the equivalent of
getting lost while driving someplace. It is inconvenient and may cause
you many problems, but you will not be dead. Running out of XIDs is a
brick wall. Either the car stops or you hit the wall. Ideally you can
manage to both not get lost and also not hit a brick wall, but in an
emergency situation where you have to choose either to get lost or to
hit a brick wall, there's only one right answer. As bad as bloat is,
and it's really bad, there are users who manage to run incredibly
bloated databases for long periods of time just because the stuff that
gets slow is either stuff that they're not doing at all, or only doing
in batch jobs where it's OK if it runs super-slow and where it may
even be possible to disable the batch job altogether, at least for a
while. The set of users who can survive running out of XIDs is limited
to those who can get by with just read-only queries, and that's
practically nobody. I have yet to encounter a customer who didn't
consider running out of XIDs to be an emergency.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: min_safe_lsn column in pg_replication_slots view

2020-06-25 Thread Alvaro Herrera
On 2020-Jun-25, Amit Kapila wrote:

> +1.  I also think let's drop the idea of exposing a function for this
> value and revert the min_safe_lsn part of the work as proposed by
> Michael above [1] excluding the function pg_wal_oldest_lsn() in that
> patch.  Then, we can expose this as a new stat for PG14.  I feel it
> would be better to display this stat in a new view (something like
> pg_stat_replication_slots) as discussed in another thread [2].  Does
> that make sense?

I don't understand the proposal.  Michael posted a patch that adds
pg_wal_oldest_lsn(), and you say we should apply the patch except the
part that adds that function -- so what part would we be applying?

If the proposal is to apply just the hunk in pg_get_replication_slots
that removes min_safe_lsn, and do nothing else in pg13, then I don't like
it.  The feature exposes a way to monitor slots w.r.t. the maximum slot
size; I'm okay if you prefer to express that in a different way, but I
don't like the idea of shipping pg13 without any way to monitor it.
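
For reference, the monitoring in question is just a query against the
view as it stands in 13beta; something like the sketch below, where a
positive margin (pg_wal_lsn_diff() returns bytes) means the slot is
still safe:

=# SELECT slot_name, wal_status,
          pg_wal_lsn_diff(restart_lsn, min_safe_lsn) AS margin
   FROM pg_replication_slots;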

As reported by Masao-san, the current min_safe_lsn has a definitional
problem when used with pg_walfile_name(), but we've established that
that's because pg_walfile_name() has a special-case definition, not
because min_safe_lsn itself is bogus.  If we're looking for a minimal
change that can fix this problem, let's increment one byte, which should
fix that issue, no?

I also see that some people complain that all slots return the same
value and therefore this column is redundant.  To that argument I say
that it's not unreasonable that we'll add a slot-specific size limit;
and if we do, we'll be happy we had slot-specific min safe LSN; see e.g.
https://postgr.es/m/20170301160610.wc7ez3vihmial...@alap3.anarazel.de

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: CUBE_MAX_DIM

2020-06-25 Thread Tom Lane
Devrim Gündüz  writes:
> Someone contacted me about increasing CUBE_MAX_DIM
> in contrib/cube/cubedata.h (in the community RPMs). The current value
> is 100 with the following comment:

> * This limit is pretty arbitrary, but don't make it so large that you
> * risk overflow in sizing calculations.

> They said they use 500, and never had a problem.

I guess I'm wondering what's the use-case.  100 already seems an order of
magnitude more than anyone could want.  Or, if it's not enough, why does
raising the limit just 5x enable any large set of new applications?

The practical issue here is that, since the data requires 16 bytes per
dimension (plus a little bit of overhead), we'd be talking about
increasing the maximum size of a cube field from ~ 1600 bytes to ~ 8000
bytes.  And cube is not toastable, so that couldn't be compressed or
shoved out-of-line.  Maybe your OP never had a problem with it, but
plenty of use-cases would have "tuple too large" failures due to not
having room on a heap page for whatever other data they want in the row.

Even a non-toastable 2KB field is going to give the tuple toaster
algorithm problems, as it'll end up shoving every other toastable field
out-of-line in an ultimately vain attempt to bring the tuple size below
2KB.  So I'm really quite hesitant to raise CUBE_MAX_DIM much past where
it is now without any other changes.

A more credible proposal would be to make cube toast-aware and then
raise the limit to ~1GB ... but that would take a significant amount
of work, and we still haven't got a use-case justifying it.

I think I'd counsel storing such data as plain float8 arrays, which
do have the necessary storage infrastructure.  Is there something
about the cube operators that's particularly missing?

regards, tom lane




Re: improving basebackup.c's file-reading code

2020-06-25 Thread Robert Haas
On Thu, Jun 25, 2020 at 10:29 AM Daniel Gustafsson  wrote:
> As this went in, can we close the 2020-07 CF entry or is there anything left 
> in
> the patchseries?

Done. Thanks for the reminder.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Why forbid "INSERT INTO t () VALUES ();"

2020-06-25 Thread Robert Haas
On Thu, Jun 25, 2020 at 12:56 AM Fabien COELHO  wrote:
> It also means that if for some reason someone wants to insert several such
> all-default rows, they have to repeat the insert, as "VALUES (), ();"
> would not work, so it is also losing a corner-corner case capability
> without obvious reason.

That, and a desire to make things work in PostgreSQL that work in
MySQL, seems like a good-enough reason to me, but YMMV.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: improving basebackup.c's file-reading code

2020-06-25 Thread Daniel Gustafsson
> On 17 Jun 2020, at 17:45, Robert Haas  wrote:
> 
> On Wed, Apr 29, 2020 at 5:52 AM Hamid Akhtar  wrote:
>> The idea and the patch looks good to me.
>> 
>> It makes sense to change fread to basebackup_read_file which internally 
>> calls pg_pread which is a portable function as opposed to read. As you've 
>> mentioned, this is much better for error handling.
> 
> Thanks for the review. I have now committed the patch, after rebasing
> and adjusting one comment slightly.

As this went in, can we close the 2020-07 CF entry or is there anything left in
the patchseries?

cheers ./daniel



Re: More tzdb fun: POSIXRULES is being deprecated upstream

2020-06-25 Thread Tom Lane
Peter Eisentraut  writes:
> What you are saying is, instead of the OS dropping POSIXRULES support, 
> it would be better if we dropped it first and release-noted that. 
> However, I don't agree with the premise of that.  OSes with long-term 
> support aren't going to drop it.

You might be right, or you might not.  I think the tzdata distribution is
in a weird gray area so far as long-term-support platforms are concerned:
they have to keep updating it, no matter how far it diverges from what
they originally shipped with.  Maybe they will figure out that they're not
required to drop POSIXRULES just because upstream did.  Or maybe they will
go with the flow on that, figuring that it's not any worse than any
politically-driven time zone change.

I wouldn't be a bit surprised if it ends up depending on whether the
particular distro is using IANA's makefile more or less verbatim.
In Red Hat's case I found that they'd have to take positive action to
drop POSIXRULES, so I'd agree that it won't happen there for a long time,
and not in any existing RHEL release.  In some other distros, it might
take explicit addition of a patch to keep from dropping POSIXRULES, in
which case I think there'd be quite good odds that that won't happen
and the changeover occurs with the next IANA zone updates.

The nasty thing about that scenario from our perspective is that it
means that the same timezone spec means different things on different
platforms, even ones nominally using the same tzdata release.  Do we
want to deal with that, or take pre-emptive action to prevent it?

(You could argue that that hazard already exists for people who are
intentionally using nonstandard posixrules files.  But I think the
set of such people can be counted without running out of fingers.
If there's some evidence to the contrary I'd like to see it.)

I'm also worried about what the endgame looks like.  It seems clear
that at some point IANA is going to remove their code's support for
reading a posixrules file.  Eggert hasn't tipped his hand as to when
he thinks that might happen, but I wouldn't care to bet that it's
more than five years away.  I don't want to find ourselves in a
situation where we have to maintain code that upstream has nuked.
If they only do something comparable to the patch I posted, it
wouldn't be so bad; but if they then undertake any significant
follow-on cleanup we'd be in a very bad place for tracking them.

regards, tom lane




Re: xid wraparound danger due to INDEX_CLEANUP false

2020-06-25 Thread Masahiko Sawada
On Thu, 25 Jun 2020 at 05:02, Peter Geoghegan  wrote:
>
> On Wed, Jun 24, 2020 at 10:21 AM Robert Haas  wrote:
> > Sorry, I'm so far behind on my email. Argh.
>
> That's okay.
>
> > I think, especially on the blog post you linked, that we should aim to
> > have INDEX_CLEANUP OFF mode do the minimum possible amount of work
> > while still keeping us safe against transaction ID wraparound. So if,
> > for example, it's desirable but not imperative for BRIN to
> > resummarize, then it's OK normally but should be skipped with
> > INDEX_CLEANUP OFF.
>
> I agree that that's very important.
>
> > Apparently, what we're really doing here is that even when
> > INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples.
> > That seems sad, but if it's what we have to do then it at least needs
> > comments explaining it.
>
> +1. Though I think that AMs should technically have the right to
> consider it advisory.
>
> > As for the btree portion of the change, I expect you understand this
> > better than I do, so I'm reluctant to stick my neck out, but it seems
> > that what the patch does is force index cleanup to happen even when
> > INDEX_CLEANUP is OFF provided that the vacuum is for wraparound and
> > the btree index has at least 1 recyclable page. My first reaction is
> > to wonder whether that doesn't nerf this feature into oblivion. Isn't
> > it likely that an index that is being vacuumed for wraparound will
> > have a recyclable page someplace? If the presence of that 1 recyclable
> > page causes the "help, I'm about to run out of XIDs, please do the
> > least possible work" flag to become a no-op, I don't think users are
> > going to be too happy with that. Maybe I am misunderstanding.
>
> I have mixed feelings about it myself. These are valid concerns.
>
> This is a problem to the extent that the user has a table that they'd
> like to use INDEX_CLEANUP with, that has indexes that regularly
> require cleanup due to page deletion. ISTM, then, that the really
> relevant high level design questions for this patch are:
>
> 1. How often is that likely to happen in The Real World™?
>
> 2. If we fail to do cleanup and leak already-deleted pages, how bad is
> that? ( Both in general, and in the worst case.)
>
> I'll hazard a guess for 1: I think that it might not come up that
> often. Page deletion is often something that we hardly ever need. And,
> unlike some DB systems, we only do it when pages are fully empty
> (which, as it turns out, isn't necessarily better than our simple
> approach [1]). I tend to think it's unlikely to happen in cases where
> INDEX_CLEANUP is used, because those are cases that also must not have
> that much index churn to begin with.
>
> Then there's question 2. The intuition behind the approach from
> Sawada-san's patch was that allowing wraparound here feels wrong -- it
> should be in the AM's hands. However, it's not like I can point to
> some ironclad guarantee about not leaking deleted pages that existed
> before the INDEX_CLEANUP feature. We know that the FSM is not crash
> safe, and that's that. Is this really all that different? Maybe it is,
> but it seems like a quantitative difference to me.

I think that with the approach implemented in my patch, it could be a
problem that the user cannot easily know in advance whether vacuum with
INDEX_CLEANUP false will perform index cleanup, even if page deletion
doesn't happen in most cases. They can check whether the vacuum will be
a wraparound vacuum, but it's relatively hard for users to check in
advance whether there are recyclable pages in the btree index.
This will be a restriction for users, especially those who want to use
the INDEX_CLEANUP feature to speed up an impending XID wraparound vacuum
as described in the blog post that Peter shared.

I haven't come up with a good solution to keep us safe against XID
wraparound yet, but it seems to me that it's better to have an option
that forces index cleanup not to happen.

>
> I'm kind of arguing against myself even as I try to advance my
> original argument. If workloads that use INDEX_CLEANUP don't need to
> delete and recycle pages in any case, then why should we care that
> those same workloads might leak pages on account of the wraparound
> hazard?

I had the same impression at first.

> The real reason that I want to push the mechanism down into index
> access methods is because that design is clearly better overall; it
> just so happens that the specific way in which we currently defer
> recycling in nbtree makes very little sense, so it's harder to see the
> big picture. The xid-cleanup design that we have was approximately the
> easiest way to do it, so that's what we got. We should figure out a
> way to recycle the pages at something close to the earliest possible
> opportunity, without having to perform a full scan on the index
> relation within btvacuumscan().

+1

> Maybe we can use the autovacuum work
> item mechanism for that. For indexes that get VACUUMed once a week on
> 

Re: Why forbid "INSERT INTO t () VALUES ();"

2020-06-25 Thread Fabien COELHO



Bonjour Vik,


>>> It's forbidden because the SQL standard forbids it.

>> Ok, that is definitely a reason. I'm not sure it is a good reason, though.

> It's a very good reason.  It might not be good *enough*, but it is a
> good reason.

Ok for good, although paradoxically not "good enough":-)

>> Why would the standard forbid it? From the language design point of
>> view[...]

> Don't go there.  There is nothing but pain there.


Hmmm. I like to understand. Basically it is my job.

Otherwise, yes and no. Postgres could decide (has sometimes decided) to 
extend the syntax or semantics wrt the standard if it makes sense, so that 
when a syntax is allowed by the standard it does what the standard says, 
which I would call positive compliance and I would support that, but keep 
some freedom elsewhere.


--
Fabien.




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-25 Thread Dilip Kumar
On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila  wrote:
>
> On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar  wrote:
> >
> > Here is the POC patch to discuss the idea of a cleanup of shared
> > fileset on proc exit.  As discussed offlist,  here I am maintaining
> > the list of shared fileset.  First time when the list is NULL I am
> > registering the cleanup function with on_proc_exit routine.  After
> > that for subsequent fileset, I am just appending it to filesetlist.
> > There is also an interface to unregister the shared file set from the
> > cleanup list and that is done by the caller whenever we are deleting
> > the shared fileset manually.  While explaining it here, I think there
> > could be one issue if we delete all the element from the list will
> > become NULL and on next SharedFileSetInit we will again register the
> > function.  Maybe that is not a problem but we can avoid registering
> > multiple times by using some flag in the file
> >
>
> I don't understand what you mean by "using some flag in the file".
>
> Review comments on various patches.
>
> poc_shared_fileset_cleanup_on_procexit
> =
> 1.
> - ent->subxact_fileset =
> - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> + MemoryContext oldctx;
>
> + /* Shared fileset handle must be allocated in the persistent context */
> + oldctx = MemoryContextSwitchTo(ApplyContext);
> + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
>   SharedFileSetInit(ent->subxact_fileset, NULL);
> + MemoryContextSwitchTo(oldctx);
>   fd = BufFileCreateShared(ent->subxact_fileset, path);
>
> Why is this change required for this patch and why we only cover
> SharedFileSetInit in the Apply context and not BufFileCreateShared?
> The comment is also not very clear on this point.

Added the comments for the same.

> 2.
> +void
> +SharedFileSetUnregister(SharedFileSet *input_fileset)
> +{
> + bool found = false;
> + ListCell *l;
> +
> + Assert(filesetlist != NULL);
> +
> + /* Loop over all the pending shared fileset entry */
> + foreach (l, filesetlist)
> + {
> + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> +
> + /* remove the entry from the list and delete the underlying files */
> + if (input_fileset->number == fileset->number)
> + {
> + SharedFileSetDeleteAll(fileset);
> + filesetlist = list_delete_cell(filesetlist, l);
>
> Why are we calling SharedFileSetDeleteAll here when in the caller we
> have already deleted the fileset as per below code?
> BufFileDeleteShared(ent->stream_fileset, path);
> + SharedFileSetUnregister(ent->stream_fileset);

That's wrong I have removed this.


> I think it will be good if somehow we can remove the fileset from
> filesetlist during BufFileDeleteShared.  If that is possible, then we
> don't need a separate API for SharedFileSetUnregister.

I have done as discussed on later replies, basically called
SharedFileSetUnregister from BufFileDeleteShared.

> 3.
> +static List * filesetlist = NULL;
> +
>  static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
> +static void SharedFileSetOnProcExit(int status, Datum arg);
>  static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
> tablespace);
>  static void SharedFilePath(char *path, SharedFileSet *fileset, const
> char *name);
>  static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
> @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
>   /* Register our cleanup callback. */
>   if (seg)
>   on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
> + else
> + {
> + if (filesetlist == NULL)
> + on_proc_exit(SharedFileSetOnProcExit, 0);
>
> We use NIL for list initialization and comparison.  See lock_files usage.

Done

> 4.
> +SharedFileSetOnProcExit(int status, Datum arg)
> +{
> + ListCell *l;
> +
> + /* Loop over all the pending  shared fileset entry */
> + foreach (l, filesetlist)
> + {
> + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> + SharedFileSetDeleteAll(fileset);
> + }
>
> We can initialize filesetlist as NIL after the for loop as it will
> make the code look clean.

Right.

> Comments on other patches:
> =
> 5.
> > 3. On concurrent abort we are truncating all the changes including
> > some incomplete changes,  so later when we get the complete changes we
> > don't have the previous changes,  e.g, if we had specinsert in the
> > last stream and due to concurrent abort detection if we delete that
> > changes later we will get spec_confirm without spec insert.  We could
> > have simply avoided deleting all the changes, but I think the better
> > fix is once we detect the concurrent abort for any transaction, then
> > why do we need to collect the changes for that, we can simply avoid
> > that.  So I have put that fix. (0006)
> >
>
> On similar lines, I think we need to skip processing message, see else
> part of code in ReorderBufferQueueMessage.

Basically, ReorderBufferQueueMessage also calls the

Re: CUBE_MAX_DIM

2020-06-25 Thread Komяpa
Hello,

The problem with higher dimension cubes is that starting with
dimensionality of ~52 the "distance" metrics in 64-bit float have less than
a single bit per dimension in mantissa, making cubes indistinguishable.
Developers of facial recognition software had a chat about that on the Russian
Postgres Telegram group https://t.me/pgsql. Their problem was that they had
128-dimensional points; they recompiled Postgres, but the distances weren't
helpful and GiST KNN severely degraded to almost full scans. They had to
reduce the number of facial features to make KNN search work.

Floating point overflow isn't that much of a risk per se, worst
case scenario it becomes an Infinity or 0 which are usually acceptable in
those contexts.

While mathematically possible, there are implementation issues with higher
dimension cubes. I'm ok with raising the limit if such nuances get a
mention in docs.

On Thu, Jun 25, 2020 at 1:01 PM Devrim Gündüz  wrote:

>
> Hi,
>
> Someone contacted me about increasing CUBE_MAX_DIM
> in contrib/cube/cubedata.h (in the community RPMs). The current value
> is 100 with the following comment:
>
> * This limit is pretty arbitrary, but don't make it so large that you
> * risk overflow in sizing calculations.
>
>
> They said they use 500, and never had a problem. I never added such
> patches to the RPMS, and will not -- but wanted to ask if we can safely
> increase it in upstream?
>
> Regards,
>
> --
> Devrim Gündüz
> Open Source Solution Architect, Red Hat Certified Engineer
> Twitter: @DevrimGunduz , @DevrimGunduzTR
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-06-25 Thread Inoue, Hiroshi

Hi,

I seem to have invalidated the KEYSET-DRIVEN cursors used in the
positioned-update test.  That usage was introduced by commit 4a272fd but
was invalidated by commit 2be35a6.


I don't object to the removal of currtid(2) because keyset-driven 
cursors in psqlodbc are changed into static cursors in many cases and 
I've hardly ever heard a complaint about it.


regards,
Hiroshi Inoue

On 2020/06/24 11:11, Michael Paquier wrote:

> On Tue, Jun 23, 2020 at 02:02:33PM +0900, Michael Paquier wrote:
>> Actually, while reviewing the code, the only code path where we use
>> currtid2() involves positioned_load() and LATEST_TUPLE_LOAD.  And the
>> only location where this happens is in SC_pos_reload_with_key(), where
>> I don't actually see how it would be possible to not have a keyset and
>> still use a CTID, which would led to LATEST_TUPLE_LOAD being used.  So
>> could it be possible that the code paths of currtid2() are actually
>> just dead code?
>
> I have dug more into this one, and we actually stressed this code path
> quite a lot up to commit d9cb23f in the ODBC driver, with tests
> cursor-block-delete, positioned-update and bulkoperations particularly
> when calling SQLSetPos().  However, 86e2e7a has reworked the code in
> such a way that we visibly don't use anymore CTIDs if we don't have a
> keyset, and that combinations of various options like UseDeclareFetch
> or UpdatableCursors don't trigger this code path anymore.  In short,
> currtid2() does not get used.  Inoue-san, Saito-san, what do you
> think?  I am adding also Tsunakawa-san in CC who has some experience
> in this area.
> --
> Michael






Re: More tzdb fun: POSIXRULES is being deprecated upstream

2020-06-25 Thread Peter Eisentraut

On 2020-06-19 21:55, Tom Lane wrote:

> Yeah, we can do nothing in the back branches and hope that that doesn't
> happen for the remaining lifespan of v12.  But I wonder whether that
> doesn't amount to sticking our heads in the sand.
> 
> I suppose it'd be possible to have a release-note entry in the back
> branches that isn't tied to any actual code change on our part, but just
> warns that such a tzdata change might happen at some unpredictable future
> time.  That feels weird and squishy though; and people would likely have
> forgotten it by the time the change actually hits them.


In my mind, this isn't really that different from other external 
libraries making API changes.  But we are not going to forcibly remove 
Python 2 support in PostgreSQL 9.6 just because it's no longer supported 
upstream.  If Debian or RHEL $veryold want to keep maintaining Python 2, 
they are free to do so, and users thereof are free to continue using it. 
 Similarly, Debian or RHEL $veryold are surely not going to drop a 
whole class of time zone codes from their stable distribution just 
because upstream is phasing it out.


What you are saying is, instead of the OS dropping POSIXRULES support, 
it would be better if we dropped it first and release-noted that. 
However, I don't agree with the premise of that.  OSes with long-term 
support aren't going to drop it.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Open Item: Should non-text EXPLAIN always show properties?

2020-06-25 Thread James Coleman
On Thu, Jun 25, 2020 at 5:15 AM David Rowley  wrote:
>
> Over on [1] Justin mentions that the non-text EXPLAIN ANALYZE should
> always show the "Disk Usage" and "HashAgg Batches" properties.  I
> agree with this. show_wal_usage() is a good example of how we normally
> do things.  We try to keep the text format as humanly readable as
> possible but don't really expect humans to be commonly reading the
> other supported formats, so we care less about including additional
> details there.
>
> There's also an open item regarding this for Incremental Sort, so I've
> CC'd James and Tomas here. This seems like a good place to discuss
> both.

Yesterday I'd replied [1] to Justin's proposal for this WRT
incremental sort and expressed my opinion that including both
unnecessarily (i.e., including disk when an in-memory sort was used)
is undesirable and confusing and leads to shortcuts I believe to be
bad habits when using the data programmatically.

On a somewhat related note, memory can be 0 but that doesn't mean no
memory was used: it's a result of how tuplesort.c doesn't properly
track memory usage when it switches to disk sort. The same isn't true
in reverse (we don't have 0 disk when disk was used), but I do think
it does show the idea that showing "empty" data isn't an inherent
good.

If there's a clear established pattern and/or most others seem to
prefer Justin's proposed approach, then I'm not going to fight it
hard. I just don't think it's the best approach.

James

[1] 
https://www.postgresql.org/message-id/CAAaqYe-LswZFUL4k5Dr6%3DEN6MJG1HurggcH4QzUs6UFqBbnQzQ%40mail.gmail.com




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-25 Thread Robert Haas
On Wed, Jun 24, 2020 at 3:41 PM Alvaro Herrera  wrote:
> People (specifically the jdbc driver) *are* using this feature in this
> way, but they didn't realize they were doing it.  It was an accident and
> they didn't notice.

But you don't know that that's true of everyone using this feature,
and even if it were, so what? Breaking a feature that someone didn't
know they were using is just as much of a break as breaking a feature
someone DID know they were using.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-25 Thread Amit Kapila
On Tue, Jun 23, 2020 at 6:39 PM Amit Kapila  wrote:
>
> On Tue, Jun 23, 2020 at 3:48 PM Tomas Vondra
>  wrote:
> >
> > On Tue, Jun 23, 2020 at 10:58:18AM +0530, Amit Kapila wrote:
> > >On Tue, Jun 23, 2020 at 9:32 AM Masahiko Sawada
> > > wrote:
> > >>
> > >> On Sun, 21 Jun 2020 at 06:57, Tomas Vondra 
> > >>  wrote:
> > >> >
> > >> > >
> > >> > >What if the decoding has been performed by multiple backends using the
> > >> > >same slot?  In that case, it will be difficult to make the judgment
> > >> > >for the value of logical_decoding_work_mem based on stats.  It would
> > >> > >make sense if we provide a way to set logical_decoding_work_mem for a
> > >> > >slot but not sure if that is better than what we have now.
> > >> > >
> > >>
> > >> I thought that the stats are relevant to what
> > >> logical_decoding_work_mem value was but not with who performed logical
> > >> decoding. So even if multiple backends perform logical decoding using
> > >> the same slot, the user can directly use stats as long as
> > >> logical_decoding_work_mem value doesn’t change.
> > >>

Today, I thought about it again, and if we consider the point that
logical_decoding_work_mem value doesn’t change much then having the
stats at slot-level would also allow computing
logical_decoding_work_mem based on stats.  Do you think it is a
reasonable assumption that users won't change
logical_decoding_work_mem for different processes (WALSender, etc.)?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: min_safe_lsn column in pg_replication_slots view

2020-06-25 Thread Amit Kapila
On Wed, Jun 24, 2020 at 8:45 PM Alvaro Herrera  wrote:
>
> On 2020-Jun-24, Fujii Masao wrote:
>
> > On 2020/06/24 8:39, Alvaro Herrera wrote:
>
> > > I think we should publish the value from wal_keep_segments separately
> > > from max_slot_wal_keep_size.  ISTM that the user might decide to change
> > > or remove wal_keep_segments and be suddenly at risk of losing slots
> > > because of overlooking that it was wal_keep_segments, not
> > > max_slot_wal_keep_size, that was protecting them.
> >
> > You mean to have two functions that returns
> >
> > 1. "current WAL LSN - wal_keep_segments * 16MB"
> > 2. "current WAL LSN - max_slot_wal_keep_size"
>
> Hmm, but all the values there are easily findable.  What would be the
> point in repeating it?
>
> Maybe we should disregard this line of thinking and go back to
> Horiguchi-san's original proposal, to wit use the "distance to
> breakage", as also supported now by Amit Kapila[1] (unless I
> misunderstand him).
>

+1.  I also think let's drop the idea of exposing a function for this
value and revert the min_safe_lsn part of the work as proposed by
Michael above [1] excluding the function pg_wal_oldest_lsn() in that
patch.  Then, we can expose this as a new stat for PG14.  I feel it
would be better to display this stat in a new view (something like
pg_stat_replication_slots) as discussed in another thread [2].  Does
that make sense?

[1] - https://www.postgresql.org/message-id/20200622054950.GC50978%40paquier.xyz
[2] - 
https://www.postgresql.org/message-id/CA%2Bfd4k5_pPAYRTDrO2PbtTOe0eHQpBvuqmCr8ic39uTNmR49Eg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




CUBE_MAX_DIM

2020-06-25 Thread Devrim Gündüz

Hi,

Someone contacted me about increasing CUBE_MAX_DIM
in contrib/cube/cubedata.h (in the community RPMs). The current value
is 100 with the following comment:

* This limit is pretty arbitrary, but don't make it so large that you
* risk overflow in sizing calculations.


They said they use 500, and never had a problem. I never added such patches to 
the RPMS, and will not -- but wanted to ask if we can safely increase it in 
upstream?

Regards,

-- 
Devrim Gündüz
Open Source Solution Architect, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR


signature.asc
Description: This is a digitally signed message part


Re: Online checksums patch - once again

2020-06-25 Thread Daniel Gustafsson
> On 22 Jun 2020, at 18:29, Robert Haas  wrote:
> 
> On Mon, Jun 22, 2020 at 8:27 AM Daniel Gustafsson  wrote:
>> Restartability is implemented by keeping state in pg_class.  I opted for a 
>> bool
>> which is cleared as the first step of checksum enable, since it offers fewer
>> synchronization cornercases I think.
> 
> Unless you take AccessExclusiveLock on the table, this probably needs
> to be three-valued. Or maybe I am misunderstanding the design...

Sorry being a bit thick, can you elaborate which case you're thinking about?
CREATE TABLE sets the attribute according to the value of data_checksums, and
before enabling checksums (and before changing data_checksums to inprogress)
the bgworker will update all relhaschecksums from true (if any) to false.  Once
the state is set to inprogress all new relations will set relhaschecksums to
true.

The attached v19 fixes a few doc issues I had missed.

cheers ./daniel



online_checksums19.patch
Description: Binary data


Re: Why forbid "INSERT INTO t () VALUES ();"

2020-06-25 Thread David Rowley
On Thu, 25 Jun 2020 at 16:56, Fabien COELHO  wrote:
> It also means that if for some reason someone wants to insert several such
> all-default rows, they have to repeat the insert, as "VALUES (), ();"
> would not work, so it is also losing a corner-corner case capability
> without obvious reason.

This is not a vote in either direction but just wanted to say that
during 7e413a0f8 where multi-row inserts were added to pg_dump, a
special case had to be added to support tables with no columns.  We
cannot do multi-inserts for that so are forced to fall back on
one-row-per-INSERT.

However, even if we had this syntax I imagine it would be unlikely
we'd change pg_dump to use it since we want to be as standard
compliant as possible when dumping INSERTs since it appears the only
genuine use-case for that is for importing the data into some other
relational database.

David




Open Item: Should non-text EXPLAIN always show properties?

2020-06-25 Thread David Rowley
Over on [1] Justin mentions that the non-text EXPLAIN ANALYZE should
always show the "Disk Usage" and "HashAgg Batches" properties.  I
agree with this. show_wal_usage() is a good example of how we normally
do things.  We try to keep the text format as humanly readable as
possible but don't really expect humans to be commonly reading the
other supported formats, so we care less about including additional
details there.
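
For example (made-up query, but the property names are the ones
show_wal_usage() emits): with EXPLAIN (ANALYZE, WAL) the text format
only prints a WAL line when one of the counters is non-zero, while the
structured formats always include them:

=# EXPLAIN (ANALYZE, WAL, FORMAT YAML) SELECT 1;
...
    WAL Records: 0
    WAL FPI: 0
    WAL Bytes: 0
...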

There's also an open item regarding this for Incremental Sort, so I've
CC'd James and Tomas here. This seems like a good place to discuss
both.

I've attached a small patch that changes the Hash Aggregate behaviour
to always show these properties for non-text formats.

Does anyone object to this?

David

[1] https://www.postgresql.org/message-id/20200619040624.GA17995%40telsasoft.com


more_hash_agg_explain_fixes.patch
Description: Binary data


Re: Missing some ifndef FRONTEND at the top of logging.c and file_utils.c

2020-06-25 Thread Daniel Gustafsson
> On 25 Jun 2020, at 10:07, Michael Paquier  wrote:

> So, shouldn't that stuff be added as per the attached?

That makes sense, logging.c and file_utils.c are indeed only part of
libpgcommon.a and should only be compiled for frontend.

cheers ./daniel




Re: hashagg slowdown due to spill changes

2020-06-25 Thread Tomas Vondra

On Wed, Jun 24, 2020 at 05:26:07PM -0700, Melanie Plageman wrote:

>On Tue, Jun 23, 2020 at 10:06 AM Andres Freund  wrote:
>
>> Hi,
>>
>> On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
>> > On Mon, Jun 22, 2020 at 9:02 PM Andres Freund  wrote:
>> > > It's not this patch's fault, but none, really none, of this stuff
>> > > should be in the executor.
>> >
>> > Were you thinking it could be done in grouping_planner() and then the
>> > bitmaps could be saved in the PlannedStmt?
>> > Or would you have to wait until query_planner()? Or are you imagining
>> > somewhere else entirely?
>>
>> I haven't thought about it in too much detail, but I would say
>> create_agg_plan() et al. I guess there's some argument to be made to do
>> it in setrefs.c, because we already do convert_combining_aggrefs() there
>> (but I don't like that much).
>>
>> There's no reason to do it before we actually decided on one specific
>> path, so doing it earlier than create_plan() seems unnecessary. And
>> having it in agg specific code seems better than putting it into global
>> routines.
>>
>> There's probably an argument for having a bit more shared code between
>> create_agg_plan(), create_group_plan() and
>> create_groupingsets_plan(). But even just adding a new extract_*_cols()
>> call to each of those would probably be ok.
>
>So, my summary of this point in the context of the other discussion
>upthread is:
>
>Planner should extract the columns that hashagg will need later during
>planning. Planner should not have HashAgg/MixedAgg nodes request smaller
>targetlists from their children with CP_SMALL_TLIST to avoid unneeded
>projection overhead.
>Also, even this extraction should only be done when the number of groups
>is large enough to suspect a spill.

IMO we should extract the columns irrespective of the estimates,
otherwise we won't be able to handle underestimates efficiently.

>Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST
>in create_hashjoin_plan() for the inner side sub-tree and the outer side
>one if there are multiple batches. I wondered what was different about
>that vs hashagg (i.e. why it is okay to do that there).

Yeah. That means that if we have to start batching during execution, we
may need to spill much more data. I'd say that's a hashjoin issue that
we should fix too (in v14).

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Backpatch b61d161c14 (Introduce vacuum errcontext ...)

2020-06-25 Thread Amit Kapila
On Tue, Jun 23, 2020 at 11:49 PM Andres Freund  wrote:
>
> Hi,
>
> On 2020-06-23 00:14:40 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > I am only suggesting that where you save the old location, as currently
> > > done with LVRelStats olderrinfo, you instead use a more specific
> > > type. Not that you should pass that anywhere (except for
> > > update_vacuum_error_info).
> >
> > As things currently stand, I don't think we need another struct
> > type at all.  ISTM we should hard-wire the handling of indname
> > as I suggested above.  Then there are only two fields to be dealt
> > with, and we could just as well save them in simple local variables.
>
> That's fine with me too.
>

I have looked at both the patches (using separate variables (by
Justin) and using a struct (by Andres)) and found the second one a bit
better.

>
> > If there's a clear future path to needing to save/restore more
> > fields, then maybe another struct type would be useful ... but
> > right now the struct type declaration itself would take more
> > lines of code than it would save.
>
> FWIW, I started to be annoyed by this code when I was addding
> prefetching support to vacuum, and wanted to change what's tracked
> where. The number of places that needed to be touched was
> disproportional.
>
>
> Here's a *draft* for how I thought this roughly could look like. I think
> it's nicer to not specify the exact saved state in multiple places, and
> I think it's much clearer to use a separate function to restore the
> state than to set a "fresh" state.
>

I think this is a good idea and makes code look better.  I think it is
better to name new struct as LVSavedErrInfo instead of LVSavedPos.

> I've only applied a hacky fix for the way the indname is tracked, I
> thought that'd best be discussed separately.
>

I think it is better to use Tom's idea here to save and restore index
information in-place.  I have used Justin's patch with some
improvements like adding Asserts and initializing with NULL for
indname while restoring to make things unambiguous.

I have improved some comments in the code and for now, kept as two
patches (a) one for improving the error info for index (mostly
Justin's patch based on Tom's idea) and (b) the other to generally
improve the code in this area (mostly Andres's patch).

I have done some testing with both the patches and would like to do
more unless there are objections with these.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


improve_vac_err_cb_v1.patch
Description: Binary data


improve_vac_err_cb_v2.patch
Description: Binary data


Re: Why forbid "INSERT INTO t () VALUES ();"

2020-06-25 Thread Vik Fearing
On 6/25/20 6:56 AM, Fabien COELHO wrote:
> 
> Hello Tom,
> 
>   INSERT INTO t() VALUES ();
>>
>>> I'm still unclear why it would be forbidden though, it seems logical to
>>> try that, whereas the working one is quite away from the usual syntax.
>>
>> It's forbidden because the SQL standard forbids it.
> 
> Ok, that is definitely a reason. I'm not sure it is a good reason, though.


It's a very good reason.  It might not be good *enough*, but it is a
good reason.


> Why would the standard forbid it? From the language design point of
> view[...]


Don't go there.  There is nothing but pain there.

-- 
Vik Fearing




Re: Review for GetWALAvailability()

2020-06-25 Thread Kyotaro Horiguchi
At Thu, 25 Jun 2020 14:35:34 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/06/25 12:57, Alvaro Herrera wrote:
> > On 2020-Jun-25, Fujii Masao wrote:
> > 
> >>/*
> >> * Find the oldest extant segment file. We get 1 until checkpoint 
> >> removes
> >> * the first WAL segment file since startup, which causes the status 
> >> being
> >> * wrong under certain abnormal conditions but that doesn't actually 
> >> harm.
> >> */
> >>oldestSeg = XLogGetLastRemovedSegno() + 1;
> >>
> >> I see the point of the above comment, but this can cause wal_status to
> >> be
> >> changed from "lost" to "unreserved" after the server restart. Isn't
> >> this
> >> really confusing? At least it seems better to document that behavior.
> > Hmm.
> >
> >> Or if we *can ensure* that the slot with invalidated_at set always
> >> means
> >> "lost" slot, we can judge that wal_status is "lost" without using
> >> fragile
> >> XLogGetLastRemovedSegno(). Thought?
> > Hmm, this sounds compelling -- I think it just means we need to ensure
> > we reset invalidated_at to zero if the slot's restart_lsn is set to a
> > correct position afterwards.
> 
> Yes.

It is an error-prone restriction, as discussed before.

Without changing the updater side, invalid restart_lsn AND valid
invalidated_at can be regarded as the lost state. With the following
change suggested by Fujii-san we can avoid the confusing status.

With attached first patch on top of the slot-dirtify fix below, we get
"lost" for invalidated slots after restart.

> > I don't think we have any operation that
> > does that, so it should be safe -- hopefully I didn't overlook
> > anything?
> 
> We need to call ReplicationSlotMarkDirty() and ReplicationSlotSave()
> just after setting invalidated_at and restart_lsn in
> InvalidateObsoleteReplicationSlots()?
> Otherwise, restart_lsn can go back to the previous value after the
> restart.
> 
> diff --git a/src/backend/replication/slot.c
> b/src/backend/replication/slot.c
> index e8761f3a18..5584e5dd2c 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -1229,6 +1229,13 @@ restart:
> s->data.invalidated_at = s->data.restart_lsn;
> s->data.restart_lsn = InvalidXLogRecPtr;
> SpinLockRelease(&s->mutex);
> +
> +   /*
> +    * Save this invalidated slot to disk, to ensure that the slot
> +    * is still invalid even after the server restart.
> +    */
> +   ReplicationSlotMarkDirty();
> +   ReplicationSlotSave();
> ReplicationSlotRelease();
>   /* if we did anything, start from scratch */
> 
> Maybe we don't need to do this if the slot is temporary?

The only difference between temporary slots and persistent ones seems to
be the "persistency" attribute. Actually,
create_physical_replication_slot() does the above for temporary slots.

> > Neither copy nor advance seem to work with a slot that has invalid
> > restart_lsn.
> > 
> >> Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
> >> value even after the restart?
> > This seems more work to implement.
> 
> Yes.

The confusing status can be avoided without fixing it, but I prefer to
fix it.  As Fujii-san suggested upthread, couldn't we remember
lastRemovedSegNo in the control file? (Yeah, it causes a bump of
PG_CONTROL_VERSION and CATALOG_VERSION_NO?).

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 7c8803f2bd7267d166f8a6f1a4ca5b129aa5ae96 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 25 Jun 2020 17:03:24 +0900
Subject: [PATCH 1/2] Make slot invalidation persistent

---
 src/backend/replication/slot.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e8761f3a18..26f874e3cb 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1229,7 +1229,15 @@ restart:
 		s->data.invalidated_at = s->data.restart_lsn;
 		s->data.restart_lsn = InvalidXLogRecPtr;
 		SpinLockRelease(&s->mutex);
+
+		/*
+		 * Save this invalidated slot to disk, to ensure that the slot
+		 * is still invalid even after the server restart.
+		 */
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
 		ReplicationSlotRelease();
+		
 
 		/* if we did anything, start from scratch */
 		CHECK_FOR_INTERRUPTS();
-- 
2.18.4

>From 0792c41c915b87474958689a2acd5c214b393015 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 25 Jun 2020 17:04:01 +0900
Subject: [PATCH 2/2] Show correct value in pg_replication_slots.wal_status
 after restart

The column may show a bogus status until a checkpoint removes at least one
WAL segment. This patch changes the logic used to detect the state so that
the column shows the correct status after restart.
---
 src/backend/replication/slotfuncs.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/slotfuncs.c 

Missing some ifndef FRONTEND at the top of logging.c and file_utils.c

2020-06-25 Thread Michael Paquier
Hi all,

As the subject says, we have in src/common/ four files that are only
compiled as part of the frontend: fe_memutils.c, file_utils.c,
logging.c and restricted_token.c.  Two of them are missing the
following guard, which makes sure that we never try to compile them
into the backend:
+#ifndef FRONTEND
+#error "This file is not expected to be compiled for backend code"
+#endif

So, shouldn't that stuff be added as per the attached?

Thanks.
--
Michael
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 7584c1f2fb..a2faafdf13 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -12,6 +12,11 @@
  *
  *-------------------------------------------------------------------------
  */
+
+#ifndef FRONTEND
+#error "This file is not expected to be compiled for backend code"
+#endif
+
 #include "postgres_fe.h"
 
 #include 
diff --git a/src/common/logging.c b/src/common/logging.c
index f3fc0b8262..6a3a437a34 100644
--- a/src/common/logging.c
+++ b/src/common/logging.c
@@ -7,6 +7,11 @@
  *
  *-------------------------------------------------------------------------
  */
+
+#ifndef FRONTEND
+#error "This file is not expected to be compiled for backend code"
+#endif
+
 #include "postgres_fe.h"
 
 #include 


signature.asc
Description: PGP signature


Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?

2020-06-25 Thread Daniel Gustafsson
> On 4 Mar 2020, at 23:16, Daniel Gustafsson  wrote:

> The v9 patch attached addresses, I believe, all comments made to date.  I 
> tried
> to read through earlier reviews too to ensure I hadn't missed anything so I
> hope I've covered all findings so far.

Attached is a rebased version which was updated to handle the changes for op
class parameters introduced in 911e70207703799605.

cheers ./daniel



catalog_multi_insert-v10.patch
Description: Binary data


RE: PostgreSQL and big data - FDW

2020-06-25 Thread ROS Didier
Hi Stephen

My company, EDF, is very interested in this feature (Kerberos authentication
method and hdfs_fdw).
Is it possible to know how many days of development this represents? Who
could develop this implementation, and at what cost?

Best Regards
Didier ROS
EDF
-----Original Message-----
From: sfr...@snowman.net [mailto:sfr...@snowman.net]
Sent: Wednesday, 24 June 2020 18:53
To: Bruce Momjian 
Cc: ROS Didier ; pgsql-hackers@lists.postgresql.org
Subject: Re: PostgreSQL and big data - FDW

Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Wed, Jun 24, 2020 at 09:05:30AM +, ROS Didier wrote:
> > I would like to use a Foreign Data Wrapper (FDW) to connect to a 
> > HADOOP cluster which uses KERBEROS authentication.

Sadly, not really.

> > is it possible to achieve this ? which FDW should be used ?
> 
> Well, I would use the Hadoop FDW:
> 
>   https://github.com/EnterpriseDB/hdfs_fdw
> 
> and it only supports these authentication methods:
> 
>   Authentication Support
> 
>   The FDW supports NOSASL and LDAP authentication modes. In order to use
>   NOSASL do not specify any OPTIONS while creating user mapping. For LDAP
>   username and password must be specified in OPTIONS while creating user 
> mapping.
> 
> Not every FDW supports every Postgres server authentication method.

That isn't really the issue here; the problem is that the GSSAPI support in PG
today doesn't support credential delegation. If it did, then the HDFS FDW (and
the postgres FDW) could easily be extended to leverage those delegated
credentials to connect.

That's been on my personal todo list of things to work on, but unfortunately
I've not yet had time to implement it.  I don't actually think it would be very
hard; if someone writes it, I'd definitely review it.

Thanks,

Stephen




This message and any attachments (the 'Message') are intended solely for the 
addressees. The information contained in this Message is confidential. Any use 
of information contained in this Message not in accord with its purpose, any 
dissemination or disclosure, either whole or partial, is prohibited except 
formal approval.

If you are not the addressee, you may not copy, forward, disclose or use any 
part of it. If you have received this message in error, please delete it and 
all copies from your system and notify the sender immediately by return message.

E-mail communication cannot be guaranteed to be timely secure, error or 
virus-free.





Re: some more pg_dump refactoring

2020-06-25 Thread Fabien COELHO


Hallo Peter,

My 0.02 €:

The patch applies cleanly, compiles, and make check and the pg_dump TAP tests
pass. The refactoring is a definite improvement.


You changed the query strings to use "\n" instead of " ". I would not have 
changed that, because it departs from the surrounding style, and I do not think 
it improves readability at the C code level.


I tried to check manually, on a random sample, that the same query is built 
for each version, although my check may have been partial, especially for the 
aggregate query, which has no comments about what changes between versions, 
and my eyes are not very good at diffing.


I've noticed that some attributes are given systematic replacements (e.g. 
proparallel), removing the need to test for their presence afterwards. This 
looks fine to me.


However, on versions < 8.4, ISTM that funcargs and funciargs are always 
added: is this intentional?


Would it make sense to accumulate in the other direction, older to newer, 
so that new attributes are added at the end of the select?


Should array_to_string be pg_catalog.array_to_string? All other calls seem 
to have an explicit schema.


I'm fine with inlining most PQfnumber calls.

I do not have old versions at hand for testing.

Here is a patch to reorganize dumpFunc() and dumpAgg() in pg_dump, similar to 
daa9fe8a5264a3f192efa5ddee8fb011ad9da365.  Instead of repeating the almost 
same large query in each version branch, use one query and add a few
columns to the SELECT list depending on the version.  This saves a lot
of duplication.

I have tested this with various old versions of PostgreSQL I had available, 
but a bit more random testing with old versions would be welcome.
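
For readers following the review: the single-query approach amounts to
appending version-dependent columns to one query, roughly as in this sketch
in pg_dump style; the column list and version cutoffs here are illustrative,
not taken from the patch.

	appendPQExpBufferStr(query,
						 "SELECT p.oid, p.proname, p.prorettype, p.proacl");

	if (fout->remoteVersion >= 80400)
		appendPQExpBufferStr(query,
							 ", pg_catalog.pg_get_function_arguments(p.oid) AS funcargs"
							 ", pg_catalog.pg_get_function_identity_arguments(p.oid) AS funciargs");

	if (fout->remoteVersion >= 90600)
		appendPQExpBufferStr(query, ", p.proparallel");
	else
		appendPQExpBufferStr(query, ", 'u' AS proparallel");	/* systematic replacement */

	appendPQExpBufferStr(query, " FROM pg_catalog.pg_proc p WHERE ...");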


--
Fabien.

Re: archive status ".ready" files may be created too early

2020-06-25 Thread Kyotaro Horiguchi
Hello, Matsumura-san.

I agree that the WAL writer is not the place to notify about segments, and
the direction you suggested would work.

At Fri, 19 Jun 2020 10:18:34 +, "matsumura@fujitsu.com" 
 wrote in 
> 1. Description in primary side
> 
> [Basic problem]
>   A process flushing a WAL record doesn't know whether the flushed RecPtr is 
>   the EndRecPtr of a cross-segment-boundary WAL record or not, because only the 
>   process inserting the WAL record knows that, and it never memorizes the 
>   information anywhere.
> 
> [Basic concept of the patch in primary]
>   A process inserting a cross-segment-boundary WAL record memorizes its 
>   EndRecPtr (I call it CrossBoundaryEndRecPtr) in a new structure in XLogCtl.
>   A flushing process creates .ready (later I just call this 'notify') for the 
>   segment just before the one that includes CrossBoundaryEndRecPtr, and only 
>   when its flushed RecPtr is equal to or greater than the CrossBoundaryEndRecPtr.
...
>   See also the definition of XLogCtl, XLOGShmemSize(), and XLOGShmemInit() in 
> my patch.

I think we don't need most of that shmem stuff.  XLogWrite is called
after the WAL buffer is filled up to the requested position. So when it
crosses a segment boundary we know that all past cross-segment-boundary
records are stable. That means all we need to remember is the position
of the latest cross-boundary record.
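
For illustration, that could be as small as a single shared position; the
field and function names below are invented for this sketch, not taken from
the patch:

/* In XLogCtlData (sketch): end LSN of the latest segment-crossing record */
XLogRecPtr	lastCrossBoundaryEndPtr;	/* protected by info_lck */

/*
 * Called from XLogInsertRecord(), but only when the record ends in a later
 * segment than it started in (sketch; the name is invented).
 */
static void
RememberCrossBoundaryRecord(XLogRecPtr endPtr)
{
	SpinLockAcquire(&XLogCtl->info_lck);
	if (endPtr > XLogCtl->lastCrossBoundaryEndPtr)
		XLogCtl->lastCrossBoundaryEndPtr = endPtr;
	SpinLockRelease(&XLogCtl->info_lck);
}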

> * Action of inserting process
>   An inserting process memorizes its CrossBoundaryEndRecPtr into the 
>   CrossBoundaryEndRecPtr array element calculated by XLogRecPtrToBufIdx with 
>   its CrossBoundaryEndRecPtr.
>   If the WAL record crosses many segments, only the element for the last 
>   segment, the one including the EndRecPtr, is set; the others are not.
>   See also CopyXLogRecordToWAL() in my patch.

If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
every time a record is written, and most of those calls are wasteful.
XLogInsertRecord already has a code block that is executed only at every
page boundary.

> * Action of flushing process
>   The overview has already been written, as follows.
> A flushing process creates .ready (later I just call this 'notify') for 
> the segment just before the one that includes CrossBoundaryEndRecPtr, and 
> only when its flushed RecPtr is equal to or greater than the 
> CrossBoundaryEndRecPtr.
> 
>   An additional detail is as follows.  The flushing process may notify
>   many segments if the record crosses many segments, so the process memorizes
>   the latest notified segment number in latestArchiveNotifiedSegNo in XLogCtl.
>   The process notifies from latestArchiveNotifiedSegNo + 1 to the
>   flushing segment number - 1.
>
>   And latestArchiveNotifiedSegNo is set to EndOfLog after the Startup process 
>   exits the replay loop.  Standby also sets it at the same timing (= before 
>   promoting).
> 
>   Mutual exclusion for latestArchiveNotifiedSegNo is not required because
>   the timing of access is already covered by the WALWriteLock critical 
>   section.

Looks reasonable.
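
For what it's worth, the flush-side loop described above could boil down to
something like this sketch. XLByteToSeg() and XLogArchiveNotifySeg() are
existing routines; latestArchiveNotifiedSegNo is the field proposed in the
quoted mail, flushEndPtr is an assumed parameter, and the cross-boundary
condition is omitted:

	XLogSegNo	flushSegNo;
	XLogSegNo	seg;

	/* Caller holds WALWriteLock, so latestArchiveNotifiedSegNo is stable. */
	XLByteToSeg(flushEndPtr, flushSegNo, wal_segment_size);

	if (flushSegNo > XLogCtl->latestArchiveNotifiedSegNo + 1)
	{
		/* Notify every fully flushed segment we have not notified yet. */
		for (seg = XLogCtl->latestArchiveNotifiedSegNo + 1; seg < flushSegNo; seg++)
			XLogArchiveNotifySeg(seg);

		XLogCtl->latestArchiveNotifiedSegNo = flushSegNo - 1;
	}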

> 2. Description in standby side
> 
> * Who notifies?
>   walreceiver also doesn't know whether the flushed RecPtr is the EndRecPtr of
>   a cross-segment-boundary WAL record or not.  In standby, only the Startup
>   process knows that, because the information is hidden in the WAL record itself
>   and only the Startup process reads and assembles WAL records.

Standby doesn't write its own WAL records.  Even if the primary sent an
immature record at a segment boundary, the standby would just promote to a new
TLI and start its own history. Nothing breaks.  It could be a problem if a
standby that crashed in the problematic way were started as-is as a primary,
but such a scenario is out of our scope.

Now we can identify the stable portion of the WAL stream. That is enough to
prevent walsender from sending data that could be overwritten
afterwards. GetReplicationTargetRecPtr() in the attached does that.

> * Action of Startup process
>   Therefore, I implemented it so that walreceiver never notifies and the 
>   Startup process does it.
>   In detail, when the Startup process reads one full-length WAL record, it 
>   notifies from the segment that includes the head (ReadRecPtr) of the record 
>   up to the segment just before the one that includes the EndRecPtr of the 
>   record.

I don't like that archiving on the standby relies on replay progress.  We
should avoid that, and fortunately I think we don't need it.

>   Now, we must pay attention to timeline switches.
>   The last segment of the previous TimeLineID must be notified before switching.
>   This case is handled when RM_XLOG_ID is detected.

That segment is archived later, after being renamed with a ".partial" suffix.
We don't archive the last, incomplete segment of the previous timeline as-is.

> 3. About other notifying for segment
> Two segment notifications remain.  They do not need to be fixed.
> 
> (1) Notifying for a partial segment
> It does not need to be complete, so it's OK to notify it without special 
> consideration.
> 
> (2) Re-notifying
> Currently, Checkpointer has notified through XLogArchiveCheckDone().
> It is a