Re: User Interface for WAL usage data

2020-04-01 Thread Kyotaro Horiguchi
At Thu, 2 Apr 2020 00:41:20 -0500, Justin Pryzby wrote:
> On Thu, Apr 02, 2020 at 10:13:18AM +0530, Amit Kapila wrote:
> > In thread [1], we are discussing exposing WAL usage data for each
> > statement in a way quite similar to how we expose BufferUsage data.
> > The way it is exposed seems reasonable to me, and no one else has raised
> > any objection.  It could be that it appears fine to others who have
> > reviewed the patch, but I thought it would be a good idea to write a
> > separate email just for its UI and see if anybody has an objection.
> 
> +1
> 
> Regarding v10-0004-Add-option-to-report-WAL-usage-in-EXPLAIN-and-au.patch:
> I think there should be additional spaces before "full" and before "bytes":
> 
> >WAL: records=2359 full page records=42 bytes=447788
> 
> Compare with these:
> 
>"Sort Method: %s  %s: %ldkB\n",
>"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory 
> Usage: %ldkB\n",
>"Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
> 
> Otherwise "records=2359 full page records=42" is hard to parse.

I got the same feeling seeing the line.

"full page records" seems to be showing the number of full page
images, not the record having full page images.

> > Exposed via auto_explain
> >   WAL: records=200 full page records=2 bytes=37387
> 
> Same
> 
> In v10-0002:
> +  * BufferUsage and WalUsage during executing maintenance command can be
> should say "during execution of a maintenance command".
> I'm afraid that'll cause merge conflicts for you :(
> 
> In 0003:
> + /* Provide WAL update data to the instrumentation */
> Remove "data" ??

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Allow continuations in "pg_hba.conf" files

2020-04-01 Thread Justin Pryzby
On Thu, Apr 02, 2020 at 07:25:36AM +0200, Fabien COELHO wrote:
> 
> Hello,
> 
> > FWIW, I don't think so. Generally a trailing backslash is an escape
> > character for the following newline.  And '\ ' is an escaped space,
> > which usually means a space itself.
> > 
> > In this case the escape character doesn't work generally, and I think it
> > is natural that a backslash in the middle of a line is just a backslash
> > character.
> 
> I concur: the backslash char is only a continuation when it is the very
> last character of the line, before the cr/nl line ending markers.
> 
> There is no assumption about backslash escaping, quotes and such, which
> seems reasonable given the lexing structure of the files, i.e. records of
> space-separated words, and # line comments.

Quoting does allow words containing spaces:

https://www.postgresql.org/docs/current/auth-pg-hba-conf.html
|A record is made up of a number of fields which are separated by spaces and/or
|tabs. Fields can contain white space if the field value is double-quoted.
|Quoting one of the keywords in a database, user, or address field (e.g., all or
|replication) makes the word lose its special meaning, and just match a
|database, user, or host with that name.
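
For example, a hypothetical line (not from the thread) where the quoted
"all" matches only a database literally named all, and the quoted role
name contains a space:

    host    "all"    "report user"    127.0.0.1/32    md5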

-- 
Justin




Re: WAL usage calculation patch

2020-04-01 Thread Amit Kapila
On Thu, Apr 2, 2020 at 11:07 AM Amit Kapila  wrote:
>
> On Wed, Apr 1, 2020 at 8:00 PM Julien Rouhaud  wrote:
>

Also, I forgot to mention: let's not base this on the buffer usage
patch for create index
(v10-0002-Allow-parallel-index-creation-to-accumulate-buff) because, as
per the recent discussion, I am not sure about its usefulness.  I think
we can proceed with this patch without
v10-0002-Allow-parallel-index-creation-to-accumulate-buff as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: User Interface for WAL usage data

2020-04-01 Thread Justin Pryzby
On Thu, Apr 02, 2020 at 10:13:18AM +0530, Amit Kapila wrote:
> In thread [1], we are discussing exposing WAL usage data for each
> statement in a way quite similar to how we expose BufferUsage data.
> The way it is exposed seems reasonable to me, and no one else has raised
> any objection.  It could be that it appears fine to others who have
> reviewed the patch, but I thought it would be a good idea to write a
> separate email just for its UI and see if anybody has an objection.

+1

Regarding v10-0004-Add-option-to-report-WAL-usage-in-EXPLAIN-and-au.patch:
I think there should be additional spaces before "full" and before "bytes":

>WAL: records=2359 full page records=42 bytes=447788

Compare with these:

 "Sort Method: %s  %s: %ldkB\n",
 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory 
Usage: %ldkB\n",
 "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",

Otherwise "records=2359 full page records=42" is hard to parse.

> Exposed via auto_explain
>   WAL: records=200 full page records=2 bytes=37387

Same

In v10-0002:
+* BufferUsage and WalUsage during executing maintenance command can be
should say "during execution of a maintenance command".
I'm afraid that'll cause merge conflicts for you :(

In 0003:
+   /* Provide WAL update data to the instrumentation */
Remove "data" ??

-- 
Justin




Re: WAL usage calculation patch

2020-04-01 Thread Amit Kapila
On Wed, Apr 1, 2020 at 8:00 PM Julien Rouhaud  wrote:
>
> On Wed, Apr 01, 2020 at 04:29:16PM +0530, Amit Kapila wrote:
> > 3. Doing some testing with and without parallelism to ensure WAL usage
> > data is correct would be great and if possible, share the results?
>
>
> I just saw that Dilip did some testing, but just in case here is some
> additional testing
>
> - vacuum, after a truncate, loading 1M rows, and an "UPDATE t1 SET id = id"
>
> =# select query, calls, wal_bytes, wal_records, wal_num_fpw from
> pg_stat_statements where query ilike '%vacuum%';
>           query          | calls | wal_bytes | wal_records | wal_num_fpw
> -------------------------+-------+-----------+-------------+-------------
>  vacuum (parallel 3) t1  |     1 |  20098962 |       34104 |           2
>  vacuum (parallel 0) t1  |     1 |  20098962 |       34104 |           2
> (2 rows)
>
> - create index, overriding t1's parallel_workers, using the 1M rows just
>   vacuumed:
>
> =# alter table t1 set (parallel_workers = 2);
> ALTER TABLE
>
> =# create index t1_parallel_2 on t1(id);
> CREATE INDEX
>
> =# alter table t1 set (parallel_workers = 0);
> ALTER TABLE
>
> =# create index t1_parallel_0 on t1(id);
> CREATE INDEX
>
> =# select query, calls, wal_bytes, wal_records, wal_num_fpw from
> pg_stat_statements where query ilike '%create index%';
>                 query                 | calls | wal_bytes | wal_records | wal_num_fpw
> --------------------------------------+-------+-----------+-------------+-------------
>  create index t1_parallel_0 on t1(id) |     1 |  20355540 |        2762 |        2745
>  create index t1_parallel_2 on t1(id) |     1 |  20406811 |        2762 |        2758
> (2 rows)
>
> It all looks good to me.
>

Here the wal_num_fpw and wal_bytes are different between the parallel and
non-parallel versions.  Is it due to a checkpoint or something else?  We
can probably rule out checkpoints by increasing checkpoint_timeout and
other checkpoint-related parameters.
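
For instance, something along these lines before the test run (a sketch;
the values are arbitrary):

    ALTER SYSTEM SET checkpoint_timeout = '1h';
    ALTER SYSTEM SET max_wal_size = '20GB';
    SELECT pg_reload_conf();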

>
> > 5.
> > -SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
> > -               query               | calls | rows
> > ------------------------------------+-------+------
> > - SELECT $1::TEXT                   |     1 |    1
> > - SELECT PLUS_ONE($1)               |     2 |    2
> > - SELECT PLUS_TWO($1)               |     2 |    2
> > - SELECT pg_stat_statements_reset() |     1 |    1
> > +SELECT query, calls, rows, wal_bytes, wal_records FROM
> > pg_stat_statements ORDER BY query COLLATE "C";
> > +               query               | calls | rows | wal_bytes | wal_records
> > ------------------------------------+-------+------+-----------+-------------
> > + SELECT $1::TEXT                   |     1 |    1 |         0 |           0
> > + SELECT PLUS_ONE($1)               |     2 |    2 |         0 |           0
> > + SELECT PLUS_TWO($1)               |     2 |    2 |         0 |           0
> > + SELECT pg_stat_statements_reset() |     1 |    1 |         0 |           0
> >  (4 rows)
> >
> > Again, I am not sure if these modifications make much sense?
>
>
> Those are queries that were previously executed.  As those are read-only
> queries, which are pretty much guaranteed not to cause any WAL activity,
> I don't see how it hurts to test at the same time that that's what we
> indeed record with pg_stat_statements, just to be safe.
>

On a similar theory, one could have checked the bufferusage stats as well.
The statements are using some expressions, so I don't see any value in
checking all usage data for such statements.

>  Once again, feel free to drop the extra
> wal_* columns from the output if you disagree.
>

Right now, that particular patch is not getting applied (probably due
to recent commit 17e0328224).  Can you rebase it?

>
>
> > v9-0004-Add-option-to-report-WAL-usage-in-EXPLAIN-and-aut
> >
> > 3.
> > + if (usage->wal_num_fpw > 0)
> > + appendStringInfo(es->str, " full page records=%ld",
> > +usage->wal_num_fpw);
> > + if (usage->wal_bytes > 0)
> > + appendStringInfo(es->str, " bytes=" UINT64_FORMAT,
> > +usage->wal_bytes);
> >
> > Shall we change to 'full page writes' or 'full page image' instead of
> > full page records?
>
>
> Indeed, I changed it in the (auto)vacuum output but missed this one.  Fixed.
>

I don't see this change in the patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Allow continuations in "pg_hba.conf" files

2020-04-01 Thread Fabien COELHO



Hello,


FWIW, I don't think so. Generally a trailing backslash is an escape
character for the following newline.  And '\ ' is an escaped space,
which usually means a space itself.

In this case the escape character doesn't work generally, and I think it
is natural that a backslash in the middle of a line is just a backslash
character.


I concur: the backslash char is only a continuation when it is the very
last character of the line, before the cr/nl line ending markers.


There is no assumption about backslash escaping, quotes and such, which
seems reasonable given the lexing structure of the files, i.e. records of
space-separated words, and # line comments.


--
Fabien.




Re: Some problems of recovery conflict wait events

2020-04-01 Thread Masahiko Sawada
On Wed, 1 Apr 2020 at 22:32, Fujii Masao  wrote:
>
>
>
> On 2020/03/30 20:10, Masahiko Sawada wrote:
> > On Fri, 27 Mar 2020 at 17:54, Fujii Masao wrote:
> >>
> >>
> >>
> >> On 2020/03/04 14:31, Masahiko Sawada wrote:
> >>> On Wed, 4 Mar 2020 at 13:48, Fujii Masao wrote:
> 
> 
> 
>  On 2020/03/04 13:27, Michael Paquier wrote:
> > On Wed, Mar 04, 2020 at 01:13:19PM +0900, Masahiko Sawada wrote:
> >> Yeah, so 0001 patch sets existing wait events to recovery conflict
> >> resolution. For instance, it sets (PG_WAIT_LOCK | LOCKTAG_TRANSACTION)
> >> to the recovery conflict on a snapshot. 0003 patch improves these wait
> >> events by adding the new type of wait event such as
> >> WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT. Therefore 0001 (and 0002) patch
> >> is the fix for existing versions and 0003 patch is an improvement for
> >> only PG13. Did you mean even 0001 patch doesn't fit for back-patching?
> 
>  Yes, it looks like an improvement rather than a bug fix.
> 
> >>>
> >>> Okay, understand.
> >>>
> > I got my eyes on this patch set.  The full patch set is in my opinion
> > just a set of improvements, and not bug fixes, so I would refrain from
> > back-patching.
> 
>  I think that the issue (i.e., "waiting" is reported twice needlessly
>  in PS display) that 0002 patch tries to fix is a bug. So it should be
>  fixed even in the back branches.
> >>>
> >>> So we need only two patches: one fixes the process title issue and the
> >>> other improves the wait events. I've attached updated patches.
> >>
> >> I started reading 
> >> v2-0002-Improve-wait-events-for-recovery-conflict-resolut.patch.
> >>
> >> -   ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
> >> +   ProcWaitForSignal(WAIT_EVENT_RECOVERY_CONFLICT_BUFFER_PIN);
> >>
> >> Currently the wait event indicating the wait for buffer pin has already
> >> been reported. But the above change in the patch changes the name of
> >> wait event for buffer pin only in the startup process. Is this really 
> >> useful?
> >> Isn't the existing wait event for buffer pin enough?
> >>
> >> -   /* Wait to be signaled by the release of the Relation Lock */
> >> -   ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
> >> +   /* Wait to be signaled by the release of the Relation Lock 
> >> */
> >> +   ProcWaitForSignal(WAIT_EVENT_RECOVERY_CONFLICT_LOCK);
> >>
> >> Same as above. Isn't the existing wait event enough?
> >
> > Yeah, we can use the existing wait events for buffer pin and lock.
> >
> >>
> >> -   /*
> >> -* Progressively increase the sleep times, but not to more than 
> >> 1s, since
> >> -* pg_usleep isn't interruptible on some platforms.
> >> -*/
> >> -   standbyWait_us *= 2;
> >> -   if (standbyWait_us > 100)
> >> -   standbyWait_us = 100;
> >> +   WaitLatch(MyLatch,
> >> + WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT,
> >> + STANDBY_WAIT_MS,
> >> + wait_event_info);
> >> +   ResetLatch(MyLatch);
> >>
> >> ResetLatch() should be called before WaitLatch()?
> >
> > Fixed.
> >
> >>
> >> Could you tell me why you dropped the "increase-sleep-times" logic?
> >
> > I thought we could remove it because WaitLatch is interruptible, but my
> > observation was not correct. The waiting startup process is not
> > necessarily woken up by a signal. I think it's still better not to wait
> > more than 1 sec even if it's an interruptible wait.
>
> So we don't need to use WaitLatch() there, i.e., it's ok to keep using
> pg_usleep()?
>
> > Attached patch fixes the above and introduces only two wait events of
> > conflict resolution: snapshot and tablespace.
>
> Many thanks for updating the patch!
>
> -   /* Wait to be signaled by the release of the Relation Lock */
> -   ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
> +   /* Wait to be signaled by the release of the Relation Lock */
> +   ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
> +   }
>
> Is this change really valid? What happens if the latch is set during
> ResolveRecoveryConflictWithVirtualXIDs()?
> ResolveRecoveryConflictWithVirtualXIDs() can return after the latch
> is set but before WaitLatch() in WaitExceedsMaxStandbyDelay() is reached.

Thank you for reviewing the patch!

You're right. It's better to keep using pg_usleep() and set the wait
event by pgstat_report_wait_start().
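
Something along these lines, as a sketch (the wait event name is the one
proposed by the patch; pgstat_report_wait_start/end are existing APIs):

    pgstat_report_wait_start(WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT);
    pg_usleep(standbyWait_us);
    pgstat_report_wait_end();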

>
> +   default:
> +   event_name = "unknown wait event";
> +   break;
>
> Seems this default case should be removed. Please see the other
> pgstat_get_wait_xxx() functions; there is no such code there.
>
> > I also removed the wait
> > event for conflict resolution on a database, since it's unlikely to be
> > user-visible or a long sleep, as we discussed before.
>
> Is it worth defining new wait 

Re: Autovacuum vs vac_update_datfrozenxid() vs ?

2020-04-01 Thread Noah Misch
On Mon, Mar 23, 2020 at 04:50:36PM -0700, Andres Freund wrote:
> I think there's also another (even larger?) race in
> vac_update_datfrozenxid(): Unless I miss something, two backends can
> concurrently run through the scan in vac_update_datfrozenxid() for two
> different tables in the same database, both can check that they need to
> update the pg_database row, and then one of them can overwrite the
> results of the other. And the one that updates last might actually be
> the one with an older horizon.  This is possible since there is no 'per
> database' locking in heap_vacuum_rel().

Right.  This thread has a fix:
https://www.postgresql.org/message-id/flat/20190218073103.GA1434723%40rfd.leadboat.com

The CF entry blocking it is starting to see some activity.  (Your discovery
involving lastSaneFrozenXid would need a separate fix.)




Re: NOT IN subquery optimization

2020-04-01 Thread Andrey Lepikhov
You should do a small rebase (conflict with 911e7020770) and run pgindent
on the patch to repair problems with long lines and whitespace.


I am reviewing your patch in small steps. Questions:
1. In the find_innerjoined_rels() routine you stop descending on the
JOIN_FULL node type. I think that is wrong, because if a var has a NOT
NULL constraint, a full join can't change it to NULL.
2. The convert_NOT_IN_to_join() routine is OK, but its name is
misleading. Maybe you can use something like make_NOT_IN_to_join_quals()?

3. pull_up_sublinks_qual_recurse(). Comment:
"Return pullout predicate (x is NOT NULL)..."
maybe change to
"Return pullout predicate (x is NOT NULL or NOT EXISTS...)"?
4. is_node_nonnullable():
I think one more case of a non-nullable var may be a foreign key constraint.
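
For context, a sketch of the transformation under review (assuming both
t1.id and t2.id are declared NOT NULL, so the rewrite is semantically
safe):

    -- the query
    SELECT * FROM t1 WHERE t1.id NOT IN (SELECT id FROM t2);
    -- can then be planned as an anti-join, equivalent to
    SELECT * FROM t1 WHERE NOT EXISTS
        (SELECT 1 FROM t2 WHERE t2.id = t1.id);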

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company





User Interface for WAL usage data

2020-04-01 Thread Amit Kapila
Hi,

In thread [1], we are discussing exposing WAL usage data for each
statement in a way quite similar to how we expose BufferUsage data.
The way it is exposed seems reasonable to me, and no one else has raised
any objection.  It could be that it appears fine to others who have
reviewed the patch, but I thought it would be a good idea to write a
separate email just for its UI and see if anybody has an objection.

It exposes three variables: (a) wal_records (number of WAL records
produced), (b) wal_num_fpw (number of WAL full page image records),
and (c) wal_bytes (size of WAL records produced).

The patch has exposed these three variables via EXPLAIN (ANALYZE, WAL),
auto_explain, and pg_stat_statements.

Exposed via Explain

Note the line displaying WAL usage.  This parameter may only be
used when ANALYZE is also enabled.

postgres=# explain (analyze, buffers, wal) update t1 set c2='';
QUERY PLAN
---
 Update on t1  (cost=0.00..53.99 rows=1199 width=414) (actual
time=6.030..6.030 rows=0 loops=1)
   Buffers: shared hit=2484 dirtied=44
   WAL: records=2359 full page records=42 bytes=447788
   ->  Seq Scan on t1  (cost=0.00..53.99 rows=1199 width=414) (actual
time=0.040..0.540 rows=1199 loops=1)
 Buffers: shared hit=42
 Planning Time: 0.179 ms
 Execution Time: 6.119 ms
(7 rows)

Exposed via auto_explain
--
Users need to set auto_explain.log_wal to print WAL usage statistics.
This parameter has no effect unless auto_explain.log_analyze is
enabled.  Note the line displaying WAL usage.
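
For instance, a session-level setup along these lines (a sketch; log_wal
is the parameter added by the patch, the others are existing auto_explain
parameters) produces output like the log below:

    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = 0;
    SET auto_explain.log_analyze = on;
    SET auto_explain.log_wal = on;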

LOG:  duration: 0.632 ms  plan:
Query Text: update t1 set c2='';
Update on t1  (cost=0.00..16.10 rows=610 width=414) (actual
time=0.629..0.629 rows=0 loops=1)
  Buffers: shared hit=206 dirtied=5 written=2
  WAL: records=200 full page records=2 bytes=37387
  ->  Seq Scan on t1  (cost=0.00..16.10 rows=610 width=414) (actual
time=0.022..0.069 rows=100 loops=1)
Buffers: shared hit=2 dirtied=1

Exposed via pg_stat_statements

Three new parameters are added to the pg_stat_statements function.

select query, wal_bytes, wal_records, wal_num_fpw from
pg_stat_statements where query like 'VACUUM%';
    query    | wal_bytes | wal_records | wal_num_fpw
-------------+-----------+-------------+-------------
 VACUUM test |  72814331 |        8857 |        8855

Any objections/suggestions?

[1] - 
https://www.postgresql.org/message-id/CAB-hujrP8ZfUkvL5OYETipQwA%3De3n7oqHFU%3D4ZLxWS_Cza3kQQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-04-01 Thread Dilip Kumar
On Thu, Apr 2, 2020 at 9:13 AM Dilip Kumar  wrote:
>
> On Thu, Apr 2, 2020 at 8:34 AM Peter Geoghegan  wrote:
> >
> > On Wed, Apr 1, 2020 at 7:52 PM Amit Kapila  wrote:
> > > Peter, Is this behavior expected?
> > >
> > > Let me summarize the situation so that it would be easier for Peter to
> > > comment.  Julien has noticed that parallel vacuum and parallel create
> > > index don't seem to report correct values for buffer usage stats.
> > > Sawada-San wrote a patch to fix the problem for both cases.  We
> > > expect that 'total_read_blks' as reported in pg_stat_statements should
> > > give the same value for parallel and non-parallel operations.  We see
> > > that is true for parallel vacuum, and previously we had the same
> > > observation for parallel query.  Now, for parallel create index
> > > this doesn't seem to be true, as the test results by Dilip show.  We
> > > have two possibilities here: (a) there is some bug in Sawada-San's
> > > patch, or (b) this is expected behavior for parallel create index.
> > > What do you think?
> >
> > nbtree CREATE INDEX doesn't even go through the buffer manager.
>
> Thanks for clarifying.  So IIUC, it will not go through the buffer
> manager for the index pages,  but for the heap pages, it will still go
> through the buffer manager.
>
> > The
> > difference that Dilip showed is probably due to extra catalog accesses
> > in the two parallel workers -- pg_amproc lookups, and the like. Those
> > are rather small differences, overall.
>
> > Can Dilip demonstrate that the "extra" buffer accesses are
> > proportionate to the number of workers launched in some constant,
> > predictable way?
>
> Okay, I will test this.

0-worker
            query             | total_time  | shared_blks_hit | shared_blks_read | total_read_blks | shared_blks_dirtied | shared_blks_written
------------------------------+-------------+-----------------+------------------+-----------------+---------------------+---------------------
 CREATE INDEX idx1 on test(a) | 1228.895057 |            8947 |               11 |            8971 |                   5 |                   0

1-worker
            query             | total_time  | shared_blks_hit | shared_blks_read | total_read_blks | shared_blks_dirtied | shared_blks_written
------------------------------+-------------+-----------------+------------------+-----------------+---------------------+---------------------
 CREATE INDEX idx1 on test(a) | 1006.157231 |            8962 |               12 |            8974 |                   5 |                   0

2-workers
            query             | total_time  | shared_blks_hit | shared_blks_read | total_read_blks | shared_blks_dirtied | shared_blks_written
------------------------------+-------------+-----------------+------------------+-----------------+---------------------+---------------------
 CREATE INDEX idx1 on test(a) |   949.44663 |            8965 |               12 |            8977 |                   5 |                   0

3-workers
            query             | total_time  | shared_blks_hit | shared_blks_read | total_read_blks | shared_blks_dirtied | shared_blks_written
------------------------------+-------------+-----------------+------------------+-----------------+---------------------+---------------------
 CREATE INDEX idx1 on test(a) | 1037.297196 |            8968 |               12 |            8980 |                   5 |                   0

4-workers
            query             | total_time  | shared_blks_hit | shared_blks_read | total_read_blks | shared_blks_dirtied | shared_blks_written
------------------------------+-------------+-----------------+------------------+-----------------+---------------------+---------------------
 CREATE INDEX idx1 on test(a) |  889.332782 |            8971 |               12 |            8983 |                   6 |                   0

You are right, it is increasing with some constant factor.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Allow continuations in "pg_hba.conf" files

2020-04-01 Thread Kyotaro Horiguchi
At Thu, 02 Apr 2020 00:20:12 +0000, David Zhang wrote:
> Hi Fabien,
> Should we consider the case "\ ", i.e. one or more spaces after the backslash?
> For example, if I replace a user map
> "mymap   /^(.*)@mydomain\.com$  \1" with
> "mymap   /^(.*)@mydomain\.com$  \ "
> "\1"
> by adding one extra space after the backslash, then I get pg_role="\\",
> but I think what we expect is pg_role="\\1".

FWIW, I don't think so. Generally a trailing backslash is an escape
character for the following newline.  And '\ ' is an escaped space,
which usually means a space itself.

In this case the escape character doesn't work generally, and I think it
is natural that a backslash in the middle of a line is just a backslash
character.
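
In other words, with the proposed rule only a backslash immediately
before the newline continues the line; a sketch of the pg_ident.conf map
from David's example, written with a valid continuation:

    mymap   /^(.*)@mydomain\.com$  \
            \1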

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Proposal: Expose oldest xmin as SQL function for monitoring

2020-04-01 Thread Craig Ringer
On Thu, 2 Apr 2020 at 07:57, Tom Lane  wrote:

> Alvaro Herrera  writes:
> > On 2020-Apr-01, Tom Lane wrote:
> >> The fact that I had to use max(age(...)) in that sample query
> >> hints at one reason: it's really hard to do arithmetic correctly
> >> on raw XIDs.  Dealing with wraparound is a problem, and knowing
> >> what's past or future is even harder.  What use-case do you
> >> foresee exactly?
>
> > Maybe it would make sense to start exposing fullXids in these views and
> > functions, for this reason.  There's no good reason to continue to
> > expose bare Xids to userspace, we should use them only for storage.
>
> +1, that would help a lot.
>
> > But I think James' point is precisely that it's not easy to know where
> > to look for things that keep Xmin from advancing.  Currently it's
> > backends, replication slots, prepared transactions, and replicas with
> > hot_standby_feedback.  If you forget to monitor just one of these, your
> > vacuums might be useless and you won't notice until disaster strikes.
>
> Agreed, but just knowing what the oldest xmin is doesn't help you
> find *where* it is.  Maybe what we need is a view showing all of
> these potential sources of an old xmin.


 Strongly agree.

https://www.postgresql.org/message-id/camsr+ygss6jbhmehbxqmdc1xj7sobdsq62ywaekohn-kbqy...@mail.gmail.com

I was aiming to write such a view, but folks seemed opposed. I still think
it'd be a very good thing to have built in as Pg grows more complex.
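
A minimal sketch of such a view (the view name is hypothetical), unioning
the xmin holders listed above:

    CREATE VIEW pg_xmin_sources AS
    SELECT 'backend' AS source, pid::text AS name, backend_xmin AS xmin
      FROM pg_stat_activity WHERE backend_xmin IS NOT NULL
    UNION ALL
    SELECT 'replication slot', slot_name, xmin
      FROM pg_replication_slots WHERE xmin IS NOT NULL
    UNION ALL
    SELECT 'prepared xact', gid, transaction
      FROM pg_prepared_xacts
    UNION ALL
    SELECT 'standby feedback', application_name, backend_xmin
      FROM pg_stat_replication WHERE backend_xmin IS NOT NULL;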


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: BUG #16109: Postgres planning time is high across version (Expose buffer usage during planning in EXPLAIN)

2020-04-01 Thread Fujii Masao



On 2020/04/02 3:47, Julien Rouhaud wrote:

On Wed, Apr 1, 2020 at 7:51 PM Fujii Masao  wrote:



On 2020/03/31 10:31, Justin Pryzby wrote:

On Wed, Jan 29, 2020 at 12:15:59PM +0100, Julien Rouhaud wrote:

Rebase due to conflict with 3ec20c7091e97.


This is failing to apply, probably since
4a539a25ebfc48329fd656a95f3c1eb2cda38af3.
Could you rebase?   (Also, not sure if this can be set as RFC?)


I updated the patch. Attached.


Thanks a lot!  I'm sorry I missed Justin's ping, and I just
realized that my cron job that used to warn me about cfbot failures was
broken :(


+/* Compute the difference between two BufferUsage */
+BufferUsage
+ComputeBufferCounters(BufferUsage *start, BufferUsage *stop)

Since BufferUsageAccumDiff() was exported, ComputeBufferCounters() is
no longer necessary. In the patched version, BufferUsageAccumDiff() is
used to calculate the difference of buffer usage.


Indeed, exposing BufferUsageAccumDiff was definitely a good thing!


+   if (es->summary && (planduration || es->buffers))
+   ExplainOpenGroup("Planning", "Planning", true, es);

Isn't it more appropriate to check "bufusage" instead of "es->buffers" here?
The patch changes the code so that "bufusage" is checked.


AFAICS not, unless ExplainOneQuery is also changed to pass a NULL
pointer if the BUFFERS option wasn't provided (and maybe also
optionally skip the planning buffer computation).  With this version
you now get:

=# explain (analyze, buffers off) update t1 set id = id;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Update on t1  (cost=0.00..22.70 rows=1270 width=42) (actual time=0.170..0.170 rows=0 loops=1)
   ->  Seq Scan on t1  (cost=0.00..22.70 rows=1270 width=42) (actual time=0.050..0.054 rows=1 loops=1)
 Planning Time: 1.461 ms
   Buffers: shared hit=25
 Execution Time: 1.071 ms
(5 rows)

which seems wrong to me.

I reused es->buffers to avoid needing something like:


Yes, you're right! So I updated the patch as you suggested.
Attached is the updated version of the patch.
Thanks for the review!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ee0e638f33..81a18abbeb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -372,7 +372,11 @@ ExplainOneQuery(Query *query, int cursorOptions,
 		PlannedStmt *plan;
 		instr_time	planstart,
 					planduration;
+		BufferUsage bufusage_start,
+					bufusage;
 
+		if (es->buffers)
+			bufusage_start = pgBufferUsage;
 		INSTR_TIME_SET_CURRENT(planstart);
 
 		/* plan the query */
@@ -381,9 +385,16 @@ ExplainOneQuery(Query *query, int cursorOptions,
 		INSTR_TIME_SET_CURRENT(planduration);
 		INSTR_TIME_SUBTRACT(planduration, planstart);
 
+		/* calc differences of buffer counters. */
+		if (es->buffers)
+		{
+			memset(&bufusage, 0, sizeof(BufferUsage));
+			BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+		}
+
 		/* run it (if needed) and produce output */
 		ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
-					   &planduration);
+					   &planduration, (es->buffers ? &bufusage : NULL));
 	}
 }
 
@@ -476,7 +487,8 @@ ExplainOneUtility(Node *utilityStmt, IntoClause *into, ExplainState *es,
 void
 ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 			   const char *queryString, ParamListInfo params,
-			   QueryEnvironment *queryEnv, const instr_time *planduration)
+			   QueryEnvironment *queryEnv, const instr_time *planduration,
+			   const BufferUsage *bufusage)
 {
 	DestReceiver *dest;
 	QueryDesc  *queryDesc;
@@ -560,6 +572,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 	/* Create textual dump of plan tree */
 	ExplainPrintPlan(es, queryDesc);
 
+	if (es->summary && (planduration || bufusage))
+		ExplainOpenGroup("Planning", "Planning", true, es);
+
 	if (es->summary && planduration)
 	{
 		double		plantime = INSTR_TIME_GET_DOUBLE(*planduration);
@@ -567,6 +582,19 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		ExplainPropertyFloat("Planning Time", "ms", 1000.0 * plantime, 3, es);
 	}
 
+	/* Show buffer usage */
+	if (es->summary && bufusage)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+

Re: [HACKERS] WAL logging problem in 9.4.3?

2020-04-01 Thread Noah Misch
On Mon, Mar 30, 2020 at 11:37:57PM -0700, Andres Freund wrote:
> On 2020-03-30 23:28:54 -0700, Noah Misch wrote:
> > On Mon, Mar 30, 2020 at 04:43:00PM +0900, Michael Paquier wrote:
> > > On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> > > > I think attached v41nm is ready for commit.  Would anyone like to vote 
> > > > against
> > > > back-patching this?  It's hard to justify lack of back-patch for a 
> > > > data-loss
> > > > bug, but this is atypically invasive.  (I'm repeating the question, 
> > > > since some
> > > > folks missed my 2020-02-18 question.)  Otherwise, I'll push this on 
> > > > Saturday.
> > > 
> > > The invasiveness of the patch is a concern.  Have you considered a
> > > different strategy?  For example, we are soon going to be in beta for
> > > 13, so you could consider committing the patch only on HEAD first.
> > > If there are issues to take care of, you can then leverage the beta
> > > testing to address any issues found.  Finally, once some dust has
> > > settled on the concept and we have gained enough confidence, we could
> > > consider a back-patch.
> > 
> > No.  Does anyone favor this proposal more than back-patching normally?
> 
> I have not reviewed the patch, so I don't have a good feeling for its
> riskiness. But it does sound fairly invasive. Given that we've lived
> with this issue for many years by now, and that the rate of incidents
> seems to have been fairly low, I think living with the issue for a bit
> longer to gain confidence might be a good choice.  But I'd not push back
> if you, being much more informed, think the risk/reward balance favors
> immediate backpatching.

I've translated the non-vote comments into estimated votes of -0.3, -0.6,
-0.4, +0.5, and -0.3.  Hence, I revoke the plan to back-patch.




Re: Add A Glossary

2020-04-01 Thread Corey Huinker
>
> I propose we define "planner" and make "optimizer" a glosssee entry.
>

I have no objection to more entries, or edits to entries, but am concerned
that the process leads to someone having to manually merge several
start-from-scratch patches, with no clear sense of when we'll be done. It
may make sense to appoint an edit-collector.


> I further propose not to define the term "normalized", at least not for
> now.  That seems a very deep rabbit hole.
>

+1 I think we appointed a guy named Xeno to work on that definition. He
says he's getting close...


Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-04-01 Thread Dilip Kumar
On Thu, Apr 2, 2020 at 8:34 AM Peter Geoghegan  wrote:
>
> On Wed, Apr 1, 2020 at 7:52 PM Amit Kapila  wrote:
> > Peter, Is this behavior expected?
> >
> > Let me summarize the situation so that it would be easier for Peter to
> > comment.  Julien has noticed that parallel vacuum and parallel create
> > index don't seem to report correct values for buffer usage stats.
> > Sawada-San wrote a patch to fix the problem for both cases.  We
> > expect that 'total_read_blks' as reported in pg_stat_statements should
> > give the same value for parallel and non-parallel operations.  We see
> > that is true for parallel vacuum, and previously we had the same
> > observation for parallel query.  Now, for parallel create index
> > this doesn't seem to be true, as the test results by Dilip show.  We
> > have two possibilities here: (a) there is some bug in Sawada-San's
> > patch, or (b) this is expected behavior for parallel create index.
> > What do you think?
>
> nbtree CREATE INDEX doesn't even go through the buffer manager.

Thanks for clarifying.  So IIUC, it will not go through the buffer
manager for the index pages,  but for the heap pages, it will still go
through the buffer manager.

> The
> difference that Dilip showed is probably due to extra catalog accesses
> in the two parallel workers -- pg_amproc lookups, and the like. Those
> are rather small differences, overall.

> Can Dilip demonstrate that the "extra" buffer accesses are
> proportionate to the number of workers launched in some constant,
> predictable way?

Okay, I will test this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Add A Glossary

2020-04-01 Thread Corey Huinker
>
> 2. I found out that "see xyz" and "see also" have bespoke markup in
> Docbook -- glosssee and glossseealso.  I changed some glossentries
> to use those, removing some glossdefs and changing a couple of paras to
> glossseealsos.  I also removed all "id" properties from glossentries
> that are just glosssee entries, because I think it's a mistake to have
> references to entries that will make the reader look up a different
> term; for me as a reader that's annoying, and I don't like to annoy
> people.
>

+1 These structural enhancements are great. I'm fine with removing the id
from just-glosssee entries, and glad that we're keeping the entry to aid
discovery.


> I rewrote the definition for "atomic" once again.  Made it two
> glossdefs, because I can.  If you don't like this, I can undo.
>

+1 Splitting this into two definitions, one for each context, is the most
sensible thing and I don't know why I didn't do that in the first place.


Re: Should we add xid_current() or a int8->xid cast?

2020-04-01 Thread Thomas Munro
On Sat, Mar 21, 2020 at 11:14 AM Thomas Munro  wrote:
> * updated OIDs to avoid collisions
> * added btequalimage to btree/xid8_ops

Here's the version I'm planning to commit tomorrow, if no one objects.  Changes:

* txid.c renamed to xid8funcs.c
* remaining traces of "txid" replaced in various internal identifiers
* s/backwards compatible/backward compatible/ in funcs.sgml (en_GB -> en_US)
From 308b094306f587a6736963983162290ac2808ac7 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 2 Apr 2020 15:00:23 +1300
Subject: [PATCH v8 1/2] Add SQL type xid8 to expose FullTransactionId to
 users.

Similar to xid, but 64 bits wide.  This new type is suitable for use in
various system views and administration functions.

Reviewed-by: Fujii Masao 
Reviewed-by: Takao Fujii 
Reviewed-by: Yoshikazu Imai 
Reviewed-by: Mark Dilger 
Discussion: https://postgr.es/m/20190725000636.666m5mad25wfbrri%40alap3.anarazel.de
---
 doc/src/sgml/datatype.sgml   |   7 ++
 src/backend/access/hash/hashvalidate.c   |   3 +
 src/backend/utils/adt/xid.c  | 116 +++
 src/fe_utils/print.c |   1 +
 src/include/access/transam.h |  14 +++
 src/include/catalog/pg_amop.dat  |  22 
 src/include/catalog/pg_amproc.dat|   8 ++
 src/include/catalog/pg_cast.dat  |   4 +
 src/include/catalog/pg_opclass.dat   |   4 +
 src/include/catalog/pg_operator.dat  |  25 +
 src/include/catalog/pg_opfamily.dat  |   4 +
 src/include/catalog/pg_proc.dat  |  36 ++
 src/include/catalog/pg_type.dat  |   4 +
 src/include/utils/xid8.h |  22 
 src/test/regress/expected/opr_sanity.out |   7 ++
 src/test/regress/expected/xid.out| 136 +++
 src/test/regress/parallel_schedule   |   2 +-
 src/test/regress/serial_schedule |   1 +
 src/test/regress/sql/xid.sql |  48 
 19 files changed, 463 insertions(+), 1 deletion(-)
 create mode 100644 src/include/utils/xid8.h
 create mode 100644 src/test/regress/expected/xid.out
 create mode 100644 src/test/regress/sql/xid.sql

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 03971822c2..89f3a7c119 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -4516,6 +4516,10 @@ INSERT INTO mytable VALUES(-1);  -- fails
 regtype

 
+   
+xid8
+   
+

 cid

@@ -4719,6 +4723,9 @@ SELECT * FROM pg_attribute
 Another identifier type used by the system is xid, or transaction
 (abbreviated xact) identifier.  This is the data type of the system columns
 xmin and xmax.  Transaction identifiers are 32-bit quantities.
+In some contexts, a 64-bit variant xid8 is used.  Unlike
+xid values, xid8 values increase strictly
+monotonically and cannot be reused in the lifetime of a database cluster.

 

diff --git a/src/backend/access/hash/hashvalidate.c b/src/backend/access/hash/hashvalidate.c
index 7b08ed5354..b3d1367fec 100644
--- a/src/backend/access/hash/hashvalidate.c
+++ b/src/backend/access/hash/hashvalidate.c
@@ -317,6 +317,9 @@ check_hash_func_signature(Oid funcid, int16 amprocnum, Oid argtype)
 			(argtype == DATEOID ||
 			 argtype == XIDOID || argtype == CIDOID))
 			 /* okay, allowed use of hashint4() */ ;
+		else if ((funcid == F_HASHINT8 || funcid == F_HASHINT8EXTENDED) &&
+			(argtype == XID8OID))
+			 /* okay, allowed use of hashint8() */ ;
 		else if ((funcid == F_TIMESTAMP_HASH ||
   funcid == F_TIMESTAMP_HASH_EXTENDED) &&
  argtype == TIMESTAMPTZOID)
diff --git a/src/backend/utils/adt/xid.c b/src/backend/utils/adt/xid.c
index db6fc9dd6b..20389aff1d 100644
--- a/src/backend/utils/adt/xid.c
+++ b/src/backend/utils/adt/xid.c
@@ -21,6 +21,7 @@
 #include "access/xact.h"
 #include "libpq/pqformat.h"
 #include "utils/builtins.h"
+#include "utils/xid8.h"
 
 #define PG_GETARG_TRANSACTIONID(n)	DatumGetTransactionId(PG_GETARG_DATUM(n))
 #define PG_RETURN_TRANSACTIONID(x)	return TransactionIdGetDatum(x)
@@ -147,6 +148,121 @@ xidComparator(const void *arg1, const void *arg2)
 	return 0;
 }
 
+Datum
+xid8toxid(PG_FUNCTION_ARGS)
+{
+	FullTransactionId fxid = PG_GETARG_FULLTRANSACTIONID(0);
+
+	PG_RETURN_TRANSACTIONID(XidFromFullTransactionId(fxid));
+}
+
+Datum
+xid8in(PG_FUNCTION_ARGS)
+{
+	char	   *str = PG_GETARG_CSTRING(0);
+
+	PG_RETURN_FULLTRANSACTIONID(FullTransactionIdFromU64(pg_strtouint64(str, NULL, 0)));
+}
+
+Datum
+xid8out(PG_FUNCTION_ARGS)
+{
+	FullTransactionId fxid = PG_GETARG_FULLTRANSACTIONID(0);
+	char	   *result = (char *) palloc(21);
+
+	snprintf(result, 21, UINT64_FORMAT, U64FromFullTransactionId(fxid));
+	PG_RETURN_CSTRING(result);
+}
+
+Datum
+xid8recv(PG_FUNCTION_ARGS)
+{
+	StringInfo	buf = (StringInfo) PG_GETARG_POINTER(0);
+	uint64		value;
+
+	value = (uint64) pq_getmsgint64(buf);
+	PG_RETURN_FULLTRANSACTIONID(FullTransactionIdFromU64(value));
+}
+
+Datum
+xid8send(PG_FUNCTION_ARGS)
+{
+	FullTransactionId arg1 = 

Re: Berserk Autovacuum (let's save next Mandrill)

2020-04-01 Thread Tom Lane
Dean Rasheed  writes:
> Yeah, that makes sense. I still can't see what might be causing those
> failures. The tests that were doing an ALTER COLUMN and then expecting
> to see the results of a non-analysed table ought to be fixed by
> 0936d1b6f, but that doesn't match the buildfarm failures. Possibly
> 0936d1b6f will help with those anyway, but if so, it'll be annoying
> not understanding why.

Quite :-(.  While it's too early to declare victory, we've seen no
more failures of this ilk since 0936d1b6f, so it's sure looking like
autovacuum did have something to do with it.

Just to save people repeating the search I did, these are the buildfarm
failures of interest so far:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-03-28%2006%3A33%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-28%2009%3A20%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-28%2013%3A20%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2020-03-28%2020%3A03%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2020-03-28%2022%3A00%3A19
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-03-29%2006%3A45%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2020-03-30%2002%3A20%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2020-03-30%2006%3A00%3A06
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2020-03-30%2006%3A10%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2020-03-30%2023%3A10%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2020-03-31%2005%3A00%3A35

The first of those is unlike the rest, and I'm not trying to account for
it here.  In the rest, what we see is that sometimes the estimates are off
by a little bit from what's expected, up or down just a percent or two.
And those deltas kick at inconsistent spots partway through a series of
similar tests, so it's hard to deny that *something* asynchronous to the
test script is causing it.

After contemplating the failures for awhile, I have a theory that
at least partially matches the data.  What I think is happening is
that autovacuum (NOT auto-analyze) launches on the table, and since
it is running concurrently with the foreground test script, it fails
to immediately acquire buffer lock on one or more of the table pages.
Since this isn't an aggressive vacuum scan, it just politely backs
off and doesn't scan those pages.  And that translates to not getting
a perfectly accurate reltuples estimate at the end of the vacuum.
On my x86_64 machine, which matches the buildfarm critters having
trouble, the actual contents of both of the troublesome tables will
be 5000 tuples in 31 pages --- which comes out to be 30 pages with
162 tuples each and then 140 tuples in the last page.  Working through
the math in vac_estimate_reltuples (and assuming that the "old" values
were accurate numbers from the test script's own ANALYZE), what I find
is that autovacuum will conclude there are 4999 tuples if it misses
scanning one of the first 30 pages, or 5021 tuples if it misses scanning
the last page, because its interpolation from the old tuple density
figure will underestimate or overestimate the number of missed tuples
accordingly.  Once that slightly-off number gets pushed into pg_class,
we start to get slightly-off rowcount estimates in the test cases.
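
As a quick check of those numbers (assuming vac_estimate_reltuples simply
interpolates the missed page at the old density of 5000/31 tuples/page):

    SELECT 4838 + (5000.0/31) * 1;  -- missed one 162-tuple page: ~4999
    SELECT 4860 + (5000.0/31) * 1;  -- missed the 140-tuple last page: ~5021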

So what I'm hypothesizing is that the pg_statistic data is perfectly
fine but pg_class.reltuples goes off a little bit after autovacuum.
The percentage changes in reltuples that I predict this way don't
quite square with the percentage changes we see in the overall
rowcount estimates, which is a problem for this theory.  But the test
cases are exercising some fairly complex estimation logic, and it
wouldn't surprise me much if the estimates aren't linearly affected by
reltuples.  (Tomas, do you want to comment further on that point?)

regards, tom lane




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-01 Thread Etsuro Fujita
Hi Tomas and Ashutosh,

On Thu, Apr 2, 2020 at 1:51 AM Ashutosh Bapat wrote:
> On Thu, 26 Mar 2020 at 05:47, Tomas Vondra wrote:

>> three more comments after eye-balling the code for a bit longer.
>>
>> 1) The patch probably needs to tweak config.sgml which says this about
>> the enable_partitionwise_join GUC:
>>
>>.. Partitionwise join currently applies only when the join conditions
>>include all the partition keys, which must be of the same data type
>>and have exactly matching sets of child partitions. ..
>
>
> Done. Actually this wasn't updated when partition pruning was introduced,
> which could cause a partitionwise join not to be used even when those
> conditions were met, and similarly when a query involved a whole-row
> reference. It's hard to explain all the conditions under which the
> partitionwise join technique will be used, but I have tried to keep it
> easy to understand.
>
>>
>>
>> Which is probably incorrect, because the point of this patch is not to
>> require exact match of the partitions, right?
>>
>> 2) Do we really need the 'merged' flag in try_partitionwise_join? Can't
>> we simply use the joinrel->merged flag directly? ISTM that we always
>> start with joinrel->merged=false, and then only ever set it to true in
>> some cases. I've tried doing that, per the attached 0002 patch. The
>> regression tests seem to work fine.
>
>
> Thanks. I just added a small prologue to compute_partition_bounds().
>
>>
>>
>> I noticed this because I've tried moving part of the function into a
>> separate function, and removing the variable makes that simpler.
>>
>> The patch also does a couple additional minor tweaks.
>>
>> 3) I think the for nparts comment is somewhat misleading:
>>
>>int nparts;  /* number of partitions; 0 = not partitioned;
>>  * -1 = not yet set */
>>
>> which says that nparts=0 means not partitioned. But then there are
>> conditions like this:
>>
>>  /* Nothing to do, if the join relation is not partitioned. */
>>  if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
>>  return;
>>
>> which (ignoring the obsolete comment) seems to say we can have nparts==0
>> even for partitioned tables, no?
>
>
> See my previous mail.
>
>>
>>
>> Anyway, attached is the original patch (0001) and two patches with
>> proposed changes. 0002 removes the "merged" flag as explained in (2),
>> 0003 splits the try_partitionwise_join() function into two parts.
>>
>> I'm not saying these changes have to happen; it's a bit crude (and it
>> might be a bit of bikeshedding).
>
>
> I have added 0005 with the changes I described in this as well as the
> previous mail. 0004 is just some whitespace fixes.

Thanks for the comments, Tomas!  Thanks for the patch, Ashutosh!  I
will look at the patch.

Best regards,
Etsuro Fujita




Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-04-01 Thread Peter Geoghegan
On Wed, Apr 1, 2020 at 7:52 PM Amit Kapila  wrote:
> Peter, Is this behavior expected?
>
> Let me summarize the situation so that it would be easier for Peter to
> comment.  Julien has noticed that parallel vacuum and parallel create
> index don't seem to report correct values for buffer usage stats.
> Sawada-San wrote a patch to fix the problem for both cases.  We
> expect that 'total_read_blks' as reported in pg_stat_statements should
> give the same value for parallel and non-parallel operations.  We see
> that is true for parallel vacuum, and previously we had the same
> observation for parallel query.  Now, for parallel create index
> this doesn't seem to be true, as the test results by Dilip show.  We
> have two possibilities here: (a) there is some bug in Sawada-San's
> patch, or (b) this is expected behavior for parallel create index.
> What do you think?

nbtree CREATE INDEX doesn't even go through the buffer manager. The
difference that Dilip showed is probably due to extra catalog accesses
in the two parallel workers -- pg_amproc lookups, and the like. Those
are rather small differences, overall.

Can Dilip demonstrate the the "extra" buffer accesses are
proportionate to the number of workers launched in some constant,
predictable way?

-- 
Peter Geoghegan




Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-04-01 Thread Amit Kapila
Adding Peter G.

On Wed, Apr 1, 2020 at 12:41 PM Dilip Kumar  wrote:
>
> I have done some testing for the parallel "create index".
>
> postgres[99536]=# show maintenance_work_mem ;
>  maintenance_work_mem
> --
>  1MB
> (1 row)
>
> CREATE TABLE test (a int, b int);
> INSERT INTO test SELECT i, i FROM GENERATE_SERIES(1,200) as i;
> CREATE INDEX idx1 on test(a);
> select query, total_time, shared_blks_hit, shared_blks_read,
> shared_blks_hit + shared_blks_read as total_read_blks,
> shared_blks_dirtied, shared_blks_written from pg_stat_statements where
> query like 'CREATE INDEX%';
>
>
> SET max_parallel_maintenance_workers TO 0;
>             query             |   total_time   | shared_blks_hit | shared_blks_read | total_read_blks | shared_blks_dirtied | shared_blks_written
> ------------------------------+----------------+-----------------+------------------+-----------------+---------------------+---------------------
>  CREATE INDEX idx1 on test(a) | 1947.495997999 |            8947 |               11 |            8958 |                   5 |                   0
>
> SET max_parallel_maintenance_workers TO 2;
>
>             query             |   total_time   | shared_blks_hit | shared_blks_read | total_read_blks | shared_blks_dirtied | shared_blks_written
> ------------------------------+----------------+-----------------+------------------+-----------------+---------------------+---------------------
>  CREATE INDEX idx1 on test(a) | 1942.142604001 |            8960 |               14 |            8974 |                   5 |                   0
> (1 row)
>
> I have noticed that total_read_blks with the parallel create
> index is higher compared to the non-parallel one.  I created a fresh
> database before each run.  I am not much aware of the internals of
> parallel create index, so I am not sure whether it is expected to
> read extra blocks with a parallel create index.  I guess maybe
> because multiple workers are inserting into the btree they might need
> to visit some btree nodes multiple times while traversing down the
> tree.  But it's better if someone who has more knowledge of this code
> can confirm it.
>

Peter, Is this behavior expected?

Let me summarize the situation so that it would be easier for Peter to
comment.  Julien has noticed that parallel vacuum and parallel create
index don't seem to report correct values for buffer usage stats.
Sawada-San wrote a patch to fix the problem for both cases.  We
expect that 'total_read_blks' as reported in pg_stat_statements should
give the same value for parallel and non-parallel operations.  We see
that is true for parallel vacuum, and previously we had the same
observation for parallel query.  Now, for parallel create index
this doesn't seem to be true, as the test results by Dilip show.  We
have two possibilities here: (a) there is some bug in Sawada-San's
patch, or (b) this is expected behavior for parallel create index.
What do you think?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-04-01 Thread Tomas Vondra

On Wed, Apr 01, 2020 at 10:09:20PM -0400, James Coleman wrote:

On Wed, Apr 1, 2020 at 5:42 PM Tomas Vondra
 wrote:

...
I've realized the way get_useful_pathkeys_for_relation() is coded kinda
works against the fastpath we added when comparing pathkeys. That
depends on comparing pointers to the list, but we've been building new
lists (and then returning those), which defeats the optimization. Attached
is a patch that returns the original list in most cases (and only
creates a copy when really necessary). This might also save a few cycles
on building the new list, of course.

I've done a bunch of read-only pgbench tests with fairly small scales (1
and 10). First with the built-in read-only transaction, and also with a
simple custom query doing an order-by. And I did this both on the
default schema and with a bunch of extra indexes. The script I used to
run this is attached, along with a summary of results.

There are results for master and v40 and v50 patches (the v50 also
includes the extra patch fixing get_useful_pathkeys_for_relation).

Overall, I'm happy with those results - the v50 seems to be within 1% of
master, in both directions. This very much seems like noise.

I still want to do a bit more review of the costing / tuplesort changes,
which I plan to do tomorrow. If that goes well, I plan to start
committing this. So please if you think this is not ready or wants more


I think we need to either implement this or remove the comment:
* XXX I wonder if we need to consider adding a projection here, as
* create_ordered_paths does.
in generate_useful_gather_paths().



Yeah. I think we don't need the projection here. My reasoning is that if
we don't need it in generate_gather_paths(), we don't need it here.


In the same function we have the following code:
/*
* When the partial path is already sorted, we can just add a gather
* merge on top, and we're done - no point in adding explicit sort.
*
* XXX Can't we skip this (maybe only for the cheapest partial path)
* when the path is already sorted? Then it's likely duplicate with
* the path created by generate_gather_paths.
*/
if (is_sorted)
{
   path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
                                   subpath->pathkeys, NULL, rowsp);
   add_path(rel, &path->path);
   continue;
}

looking at the relevant loop in generate_gather_paths:
/*
* For each useful ordering, we can consider an order-preserving Gather
* Merge.
*/
foreach(lc, rel->partial_pathlist)
{
   Path   *subpath = (Path *) lfirst(lc);
   GatherMergePath *path;

   if (subpath->pathkeys == NIL)
       continue;

   rows = subpath->rows * subpath->parallel_workers;
   path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
                                   subpath->pathkeys, NULL, rowsp);
   add_path(rel, &path->path);
}

I believe we can eliminate the block entirely in
generate_useful_gather_paths(). Here's my reasoning: all paths for
which is_sorted is true must necessarily have pathkeys, and since we
already add a gather merge for every subpath with pathkeys, we've
already added gather merge paths for all of these.

I've included a patch to change this, but let me know if the reasoning
isn't sound.



Good catch! I think you're correct - we don't need to generate this
path, and we can just skip that partial path entirely.


We can also remove the XXX on this comment (in the same function):
* XXX This is not redundant with the gather merge path created in
* generate_gather_paths, because that merely preserves ordering of
* the cheapest partial path, while here we add an explicit sort to
* get match the useful ordering.

because of this code in generate_gather_paths():
cheapest_partial_path = linitial(rel->partial_pathlist);
rows =
   cheapest_partial_path->rows * cheapest_partial_path->parallel_workers;
simple_gather_path = (Path *)
   create_gather_path(root, rel, cheapest_partial_path, rel->reltarget,
  NULL, rowsp);
add_path(rel, simple_gather_path);

but we can clean up the comment a bit: fix the grammar issue in the
last line and fix the reference to the gather merge path (it's a gather
path).

I've included that in the same patch.



OK, makes sense.


I also noticed that in create_incremental_sort_path we have this:
/* XXX comparison_cost shouldn't be 0? */
but I guess that's part of what you're reviewing tomorrow.



Right, one of the bits.


time for a review, let me know. I'm not yet sure if I'll commit this as
a single change, or in three separate commits.


I don't love the idea of committing it as a single patch, but at least
the first two I think probably go together. Otherwise we're
introducing a "fix" with no proven impact that will slow down planning
(even if only in a small way) only to intend to condition that on a
GUC in the next commit.

But I think you could potentially make an argument for keeping the
additional paths separate...but it's not absolutely necessary IMO.



OK. I've been actually 

Re: Planning counters in pg_stat_statements (using pgss_store)

2020-04-01 Thread Fujii Masao




On 2020/04/01 18:19, Fujii Masao wrote:



On 2020/04/01 3:42, Julien Rouhaud wrote:

On Wed, Apr 01, 2020 at 02:43:10AM +0900, Fujii Masao wrote:



On 2020/03/31 16:33, Julien Rouhaud wrote:


v12 attached!


Thanks for updating the patch! The patch looks good to me.

I applied minor and cosmetic changes to the patch. Attached is
the updated version of the patch. Barring any objection, I'd like to
commit this version.

BTW, the minor and cosmetic changes that I applied are, for example,

- Rename pgss_planner_hook to pgss_planner for the sake of consistency.
    Other functions using hooks in pgss don't have "_hook" in their names either.
- Make pgss_planner use PG_FINALLY() instead of PG_CATCH().
- Make PGSS_NUMKIND as the last value in enum pgssStoreKind.



+1, and the PGSS_INVALID is also way better.



- Update the sample output in the document.
etc



Thanks a lot.  It all looks good to me!


Finally I pushed the patch!
Many thanks to all involved in this patch!

As a remaining TODO item, I'm thinking that the documentation needs to
be improved. For example, previously a query was not stored in pgss
when it failed. But, in v13, if pgss track_planning is enabled, such a
query is stored because the planning succeeds. Without an explanation of
that behavior in the document, I'm afraid that users will get confused.
Thoughts?
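
A sketch of the behavior to document (assuming
pg_stat_statements.track_planning = on):

    CREATE TABLE t (x int CHECK (x > 0));
    INSERT INTO t VALUES (-1);          -- planning succeeds, execution fails
    SELECT query, plans, calls FROM pg_stat_statements
     WHERE query LIKE 'INSERT INTO t%'; -- expect plans = 1, calls = 0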

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




RE: SLRU statistics

2020-04-01 Thread Shinoda, Noriyoshi (PN Japan A Delivery)
Hi, 

Thank you for developing great features.
The attached patch is a small fix to the committed documentation, for the
data type name of the blks_hit column.

Best regards,
Noriyoshi Shinoda

-Original Message-
From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com] 
Sent: Thursday, April 2, 2020 9:42 AM
To: Alvaro Herrera 
Cc: Masahiko Sawada ; 
tsunakawa.ta...@fujitsu.com; pgsql-hack...@postgresql.org
Subject: Re: SLRU statistics

Hi,

I've pushed this after some minor cleanup and improvements.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com 
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




monitoring_pg_stat_slru.diff


Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 17:54:06 -0700, Andres Freund wrote:
>  * Check whether the given snapshot is too old to have safely read the given
>  * page from the given table.  If so, throw a "snapshot too old" error.
>  *
>  * This test generally needs to be performed after every BufferGetPage() call
>  * that is executed as part of a scan.  It is not needed for calls made for
>  * modifying the page (for example, to position to the right place to insert a
>  * new index tuple or for vacuuming).  It may also be omitted where calls to
>  * lower-level functions will have already performed the test.

To me this sounds like we'd not need to check for an old snapshot in
heap_delete/update/lock_tuple. And they were explicitly not testing for
old snapshots.  But I don't understand why that'd be correct?

In a lot of UPDATE/DELETE queries there's no danger that the target
tuple will be pruned away, because the underlying scan node will hold a
pin. But I don't think that's guaranteed. E.g. if a tidscan is below the
ModifyTable node, it will not hold a pin by the time we heap_update,
because there's no scan holding a pin, and the slot will have been
materialized before updating.

There are a number of other ways, I think.


So it's possible to get to heap_update/delete (and probably lock_tuple
as well) with a tid that's already been pruned away. Neither contains a
non-assert check ensuring the tid still is normal.

With assertions we'd fail with an assertion in PageGetItem(). But
without it looks like we'll interpret the page header as a tuple. Which
can't be good.

Greetings,

Andres Freund




Re: recovery_target_action=pause with confusing hint

2020-04-01 Thread movead...@highgo.ca

>So I'd like to propose the attached patch. The patch changes the message
>logged when a promotion is requested, based on whether the recovery is
>in paused state or not.
It is a compromise; I think we should note it in the documentation.




Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: Add A Glossary

2020-04-01 Thread Alvaro Herrera
On 2020-Apr-01, Justin Pryzby wrote:

> planner/optimizer: ...

I propose we define "planner" and make "optimizer" a <glosssee> entry.

I further propose not to define the term "normalized", at least not for
now.  That seems a very deep rabbit hole.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: control max length of parameter values logged

2020-04-01 Thread Justin Pryzby
Thanks for updating the patch.

On Thu, Apr 02, 2020 at 01:29:04AM +0100, Alexey Bashtanov wrote:
> +If greater than zero, bind parameter values reported in non-error
> +statement-logging messages are trimmed to no more than this many 
> bytes.
> +If this value is specified without units, it is taken as bytes.
> +Zero disables logging parameters with statements.
> +-1 (the default) makes parameters logged in full.

say: "..causes parameters to be logged in full".

Also..I just realized that this truncates *each* parameter, rather than
truncating the parameter list.

So say: "
|If greater than zero, each bind parameter value reported in a non-error
|statement-logging message is trimmed to no more than this number of bytes.

> +Non-zero values add some overhead, as
> +PostgreSQL will compute and store textual
> +representations of parameter values in memory for all statements,
> +even if they eventually do not get logged.

say: "even if they are ultimately not logged"

> +++ b/src/backend/nodes/params.c
> @@ -280,6 +280,7 @@ BuildParamLogString(ParamListInfo params, char 
> **knownTextValues, int maxlen)
>   oldCxt;
>   StringInfoData buf;
>  
> + Assert(maxlen == -1 || maxlen > 0);

You can write: >= -1

> - if (log_parameters_on_error)
> + if (log_parameter_max_length_on_error 
> != 0)

I would write this as >= 0

> + if 
> (log_parameter_max_length_on_error > 0)
> + {
> +   /*
> +* We can trim the 
> saved string, knowing that we
> +* won't print all of 
> it.  But we must copy at
> +* least two more 
> full characters than
> +* 
> BuildParamLogString wants to use; otherwise it
> +* might fail to 
> include the trailing ellipsis.
> +*/
> +   
> knownTextValues[paramno] =
> +   
> pnstrdup(pstring,
> + 
>log_parameter_max_length_on_error
> + 
>+ 2 * MAX_MULTIBYTE_CHAR_LEN);

The comment says we need at least 2 chars, but
log_parameter_max_length_on_error might be 1, so I think you can either add a
64-byte fudge factor, like before, or do Max(log_parameter_max_length_on_error, 2).

> + }
> + else
> + 
> knownTextValues[paramno] = pstrdup(pstring);

I suggest to handle the "== -1" case first and the > 0 case as "else".

Thanks,
-- 
Justin




Re: Add A Glossary

2020-04-01 Thread Justin Pryzby
On Tue, Mar 31, 2020 at 03:26:02PM -0400, Corey Huinker wrote:
> Just so I can prioritize my work, which of these things, along with your
> suggestions in previous emails, would you say is a barrier to considering
> this ready for a committer?

To answer your off-list inquiry, I'm not likely to mark it "ready" myself.
I don't know if any of these would be a "blocker" for someone else.

> > Here's some ideas; I'm *not* suggesting to include all of everything, but
> > hopefully start with a coherent, self-contained list.
> 
> > grep -roh '[^<]*' doc/src/ |sed 's/.*/\L&/' |sort |uniq -c
> > |sort -nr |less

I looked through that list and found these that might be good to include now or
in the future.  Probably all of these need language polishing; I'm not
asking you to copy them in just to say they're there.

join: concept of combining columns from two tables or other relations.  The
result of joining a table with N rows to another table with M rows might have
up to N*M rows (if every row from the first table "joins to" every row on the
second table).
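
A short example might help (table and column names here are made up):

    SELECT o.id, c.name
      FROM orders o
      JOIN customers c ON c.id = o.customer_id;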

normalized: A database schema is said to be "normalized" if its redundancy has
been removed.  Typically a "normalized" schema has a larger number of tables,
which include ID columns, and queries typically involve joining together
multiple tables.

query: a request sent by a client to a server, usually to return results or to
modify data on the server;

query plan: the particular procedure by which the database server executes a
query.  A simple query involving a single table might be planned using a
sequential scan or an index scan.  For a complex query involving multiple
tables joined together, the optimizer attempts to determine the
cheapest/fastest/best way to execute the query, by joining tables in the
optimal order, and with the optimal join strategy.
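
Possibly worth cross-referencing EXPLAIN as the way to see the chosen plan,
e.g. (hypothetical table):

    EXPLAIN SELECT * FROM tbl WHERE col = 42;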

planner/optimizer: ...

transaction isolation:
psql: ...

synchronous: An action is said to be "synchronous" if it does not return to its
requestor until its completion;

bind parameters: arguments to a SQL query that are sent separately from the
query text.  For example, the query text "SELECT * FROM tbl WHERE col=$1" might
be executed for some certain value of the $1 parameter.  If parameters are sent
"in-line" as a part of the query text, they need to be properly
quoted/escaped/sanitized, to avoid accidental or malicious misbehavior if the
input contains special characters like semicolons or quotes.
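
At the SQL level this could be illustrated with PREPARE/EXECUTE, reusing the
hypothetical table from above:

    PREPARE q(int) AS SELECT * FROM tbl WHERE col = $1;
    EXECUTE q(42);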

> > Maybe also:
> > object identifier
> > operator classes
> > operator family
> > visibility map

-- 
Justin




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Peter Geoghegan
On Wed, Apr 1, 2020 at 5:54 PM Andres Freund  wrote:
> As far as I can tell there's not sufficient in-tree explanation of when
> code needs to test for an old snapshot. There's just the following
> comment above TestForOldSnapshot():
>  * Check whether the given snapshot is too old to have safely read the given
>  * page from the given table.  If so, throw a "snapshot too old" error.
>  *
>  * This test generally needs to be performed after every BufferGetPage() call
>  * that is executed as part of a scan.  It is not needed for calls made for
>  * modifying the page (for example, to position to the right place to insert a
>  * new index tuple or for vacuuming).  It may also be omitted where calls to
>  * lower-level functions will have already performed the test.
>
> But I don't find "as part of a scan" very informative.

I also find it strange that _bt_search() calls TestForOldSnapshot() on
every level of the tree (actually, it calls _bt_moveright() which
calls it on every level of the tree). At least with reads (see the
comments at the top of _bt_moveright()).

Why do we need to do the test on internal pages? We only ever call
PredicateLockPage() on a leaf nbtree page. Why the inconsistency
between the two similar-seeming cases?

-- 
Peter Geoghegan




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 16:59:51 -0700, Andres Freund wrote:
> The primary issue here is that there is no TestForOldSnapshot() in
> heap_hot_search_buffer(). Therefore index fetches will never even try to
> detect that tuples it needs actually have already been pruned away.

bitmap heap scan doesn't have the necessary checks either. In the
non-lossy case it uses heap_hot_search_buffer; for the lossy case it has
an open-coded access without the check (that's bitgetpage() before v12,
and heapam_scan_bitmap_next_block() after that).

Nor do sample scans, but that was "at least" introduced later.

As far as I can tell there's not sufficient in-tree explanation of when
code needs to test for an old snapshot. There's just the following
comment above TestForOldSnapshot():
 * Check whether the given snapshot is too old to have safely read the given
 * page from the given table.  If so, throw a "snapshot too old" error.
 *
 * This test generally needs to be performed after every BufferGetPage() call
 * that is executed as part of a scan.  It is not needed for calls made for
 * modifying the page (for example, to position to the right place to insert a
 * new index tuple or for vacuuming).  It may also be omitted where calls to
 * lower-level functions will have already performed the test.

But I don't find "as part of a scan" very informative. I mean, it
was explicitly not called from within the executor back then (for a while
the check was embedded in BufferGetPage()):

static void
bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
...
Pagedp = BufferGetPage(buffer, NULL, NULL, 
BGP_NO_SNAPSHOT_TEST);


I am more than a bit dumbfounded here.

Greetings,

Andres Freund




Re: SLRU statistics

2020-04-01 Thread Tomas Vondra

Hi,

I've pushed this after some minor cleanup and improvements.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: proposal \gcsv

2020-04-01 Thread Vik Fearing
On 4/1/20 6:29 PM, Tom Lane wrote:
> I'd modify my first proposal so far as to make it
> 
>   \g ( pset-option pset-value ... ) filename-or-pipe
> 
> That is, require spaces around the parens

I think requiring spaces inside the parentheses is a severe POLA
violation and I vote strongly against it.
-- 
Vik Fearing




Re: control max length of parameter values logged

2020-04-01 Thread Alexey Bashtanov

Hi,

The privilege argument seems irrelevant to me.  We already decided
that the plan is (a) SUSET for non-error statement logging purposes and
(b) USERSET for logging caused by errors, and that would have to apply
to length limits as well as enable/disable ability.  Otherwise a user
could pretty effectively disable logging by setting the length to 1.
The only privilege a user gains if we drop the boolean is the ability to
*enable* logging parameters on error.
That gives the user a slightly easier way to fill up the disk with logs,
but they can do that anyway if they want to.
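
To make that concrete, under the proposed semantics (the values are
arbitrary; see the patch below for the exact behavior):

    -- any user can opt in to trimmed parameter logging on error:
    SET log_parameter_max_length_on_error = 64;  -- at most 64 bytes per value
    -- the non-error variant remains superuser-only:
    SET log_parameter_max_length = -1;           -- -1 logs values in full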

If that's okay with everyone, please see the new version attached.

Best, Alex
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..3d7db57d74 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6690,29 +6690,48 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
   
  
 
- 
-  log_parameters_on_error (boolean)
+ 
+  log_parameter_max_length (integer)
   
-   log_parameters_on_error configuration parameter
+   log_parameter_max_length configuration parameter
   
   
   

-Controls whether bind parameters are logged when a statement is logged
-as a result of .
-It adds some overhead, as PostgreSQL will
-compute and store textual representations of parameter values in
-memory for all statements, even if they eventually do not get logged.
-This setting has no effect on statements logged due to
- or
- settings, as they are always logged
-with parameters.
-The default is off.
+If greater than zero, bind parameter values reported in non-error
+statement-logging messages are trimmed to no more than this many bytes.
+If this value is specified without units, it is taken as bytes.
+Zero disables logging parameters with statements.
+-1 (the default) makes parameters logged in full.
 Only superusers can change this setting.

   
  
 
+ 
+  log_parameter_max_length_on_error (integer)
+  
+   log_parameter_max_length_on_error configuration parameter
+  
+  
+  
+   
+If greater than zero, bind parameter values reported in error messages
+are trimmed to no more than this many bytes.
+If this value is specified without units, it is taken as bytes.
+Zero (the default) disables logging parameters on error.
+-1 makes parameters logged in full.
+
+This setting only affects statements logged as a result of
+.
+Non-zero values add some overhead, as
+PostgreSQL will compute and store textual
+representations of parameter values in memory for all statements,
+even if they eventually do not get logged.
+   
+  
+ 
+
  
   log_statement (enum)
   
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index a57f9eea76..77328e17ff 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -269,7 +269,7 @@ RestoreParamList(char **start_address)
  * can contain NULLs for any unknown individual values.  NULL can be given if
  * no parameters are known.
  *
- * If maxlen is not zero, that's the maximum number of bytes of any one
+ * If maxlen is not -1, that's the maximum number of bytes of any one
  * parameter value to be printed; an ellipsis is added if the string is
  * longer.  (Added quotes are not considered in this calculation.)
  */
@@ -280,6 +280,7 @@ BuildParamLogString(ParamListInfo params, char **knownTextValues, int maxlen)
 oldCxt;
 	StringInfoData buf;
 
+	Assert(maxlen == -1 || maxlen > 0);
 	/*
 	 * NB: think not of returning params->paramValuesStr!  It may have been
 	 * generated with a different maxlen, and so be unsuitable.  Besides that,
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 5b677863b9..cbe4907e61 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -1838,7 +1838,7 @@ exec_bind_message(StringInfo input_message)
  */
 if (pstring)
 {
-	if (log_parameters_on_error)
+	if (log_parameter_max_length_on_error != 0)
 	{
 		MemoryContext	oldcxt;
 
@@ -1846,13 +1846,24 @@ exec_bind_message(StringInfo input_message)
 		if (knownTextValues == NULL)
 			knownTextValues =
 palloc0(numParams * sizeof(char *));
-		/*
-		 * Note: must copy at least two more full characters
-		 * than BuildParamLogString wants to see; otherwise it
-		 * might fail to include the ellipsis.
-		 */
-		knownTextValues[paramno] =
-			pnstrdup(pstring, 64 + 2 * MAX_MULTIBYTE_CHAR_LEN);
+
+		if (log_parameter_max_length_on_error > 0)
+		{
+			/*
+			 * We can trim the saved string, knowing that we
+			 * won't print all of it.  But we must copy at
+			 * least two more full characters than
+		

Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Peter Geoghegan
On Wed, Apr 1, 2020 at 4:59 PM Andres Freund  wrote:
> Thanks, that's super helpful.

Glad I could help.

> I got a bit confused here - you seemed to have switched session 1 and 2
> around? Doesn't seem to matter much though, I was able to reproduce this.

Yeah, I switched the session numbers because I was in a hurry. Sorry about that.

As you have already worked out, one session does all the DDL and
initial loading of data, while the other session queries the data
repeatedly in a serializable (or RR) xact. The latter session exhibits
the bug.

> This indeed seems a separate bug.

> The primary issue here is that there is no TestForOldSnapshot() in
> heap_hot_search_buffer(). Therefore index fetches will never even try to
> detect that tuples it needs actually have already been pruned away.

I suspected that heap_hot_search_buffer() was missing something.

> The wrong queries I saw took longer to reproduce, so I've not been able
> to debug the precise reasons.

How hard would it be to write a debug patch that reduces the quantum
used in places like TransactionIdLimitedForOldSnapshots() to something
much less than the current 1 minute quantum? That made reproducing the
bug *very* tedious.

-- 
Peter Geoghegan




Re: Allow continuations in "pg_hba.conf" files

2020-04-01 Thread David Zhang
Hi Fabien,
Should we consider the case "\ ", i.e. one or more spaces after the backslash?
For example, if I replace a user map 
"mymap   /^(.*)@mydomain\.com$  \1" with 
"mymap   /^(.*)@mydomain\.com$  \ "
"\1"
by adding one extra space after the backslash, then I get pg_role="\\",
but I think what we expect is pg_role="\\1".

Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 16:59:51 -0700, Andres Freund wrote:
> The primary issue here is that there is no TestForOldSnapshot() in
> heap_hot_search_buffer(). Therefore index fetches will never even try to
> detect that tuples it needs actually have already been pruned away.

FWIW, with autovacuum=off the query does not get killed until a manual
vacuum, nor if fewer rows are deleted and the table has previously been
vacuumed.

The vacuum in the second session isn't required. There just needs to be
something consuming an xid, so that oldSnapshotControl->latest_xmin is
increased. A single SELECT txid_current(); or such in a separate session
is sufficient.

Greetings,

Andres Freund




Re: Add A Glossary

2020-04-01 Thread Alvaro Herrera
On 2020-Apr-01, Jürgen Purtz wrote:

> 
> On 31.03.20 19:58, Justin Pryzby wrote:
> > On Tue, Mar 31, 2020 at 04:13:00PM +0200, Jürgen Purtz wrote:
> > > Please find some minor suggestions in the attachment. They are based on
> > > Corey's last patch 0001-glossary-v4.patch.
> > > @@ -220,7 +220,7 @@
> > > Records to the file system and creates a special
> > > checkpoint record. This process is initiated when predefined
> > > conditions are met, such as a specified amount of time has 
> > > passed, or
> > > -  a certain volume of records have been collected.
> > > +  a certain volume of records has been collected.
> > I think you're correct in that "volume" is singular.  But I think 
> > "collected"
> > is the wrong word.  I suggested "written".
> > 
> "collected" is not optimal. I suggest "created". Please avoid "written", the
> WAL records will be written when the Checkpointer is running, not before.

Actually, you're mistaken; the checkpointer hardly writes any WAL
records.  In fact, it only writes *one* wal record, which is the
checkpoint record itself.  All the other wal records are written either
by the backends that produce it, or by the wal writer process.  By the
time the checkpoint runs, the wal records are long expected to be written.

Anyway I changed a lot of terms again, as well as changing the way the
terms are marked up -- for two reasons:

1. I didn't like the way the WAL-related entries were structured.  I
created a new entry called "Write-Ahead Log", which explains what WAL
is; this replaces the term "WAL Log", which is redundant (since the L in
WAL stands for "log" already). I kept the id as glossary-wal, though,
because it's shorter and *shrug*.  The definition uses the terms "wal
record" and "wal file", which I also rewrote.

2. I found out that "see xyz" and "see also" have bespoke markup in
Docbook -- <glosssee> and <glossseealso>.  I changed some glossentries
to use those, removing some glossdefs and changing a couple of paras to
glossseealsos.  I also removed all "id" properties from glossentries
that are just <glosssee>, because I think it's a mistake to have
references to entries that will make the reader look up a different
term; for me as a reader that's annoying, and I don't like to annoy
people.


While at it, I again came across "analytic", which is a term we don't
use much, so I made it a glosssee for "window function"; and while at it
I realized we didn't clearly explain what a window was. So I added
"window frame" for that.  I considered adding the term "partition" which
is used in this context, but decided it wasn't necessary.

I also added "(process)" to terms that define processes.  So
now we have "checkpointer (process)" and so on.

I rewrote the definition for "atomic" once again.  Made it two
glossdefs, because I can.  If you don't like this, I can undo.

I added "recycling".

I still have to go through some other defs.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 1043d0f7ab..cf21ef857e 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -170,6 +170,7 @@
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
new file mode 100644
index 00..3c417f2fd3
--- /dev/null
+++ b/doc/src/sgml/glossary.sgml
@@ -0,0 +1,1526 @@
+
+ Glossary
+ 
+  This is a list of terms and their meaning in the context of
+  PostgreSQL and relational database
+  systems in general.
+ 
+
+ 
+  
+   Aggregating
+   
+
+ The act of combining a collection of data (input) values into
+ a single output value, which may not be of the same type as the
+ input values.
+
+   
+  
+
+  
+   Aggregate Function
+   
+
+ A function that
+ combines multiple input values,
+ for example by counting, averaging or adding them all together,
+ yielding a single output value.
+
+
+ For more information, see
+ .
+
+
+   
+  
+
+  
+   Analytic Function
+   
+  
+
+  
+   Atomic
+   
+
+ In reference to a datum:
+ the fact that its value cannot be broken down into smaller
+ components.
+
+   
+   
+
+ In reference to a
+ database transaction:
+ the fact that all the operations in the transaction either complete as
+ a whole, or none of them become visible.
+
+   
+  
+
+  
+   Atomicity
+   
+
+ One of the ACID properties. This is the state of
+ being Atomic in the operational/transactional sense.
+
+   
+  
+
+  
+   Attribute
+   
+
+ An element with a certain name and data type found within a
+ Tuple or Table.
+
+   
+  
+
+  
+   Autovacuum
+   
+
+ Background processes that routinely perform
+ Vacuum and Analyze
+ operations.
+
+
+ For more information, see
+ .
+
+   
+  
+
+  
+   Backend Process
+   
+
+ 

Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 15:30:39 -0700, Peter Geoghegan wrote:
> On Wed, Apr 1, 2020 at 3:00 PM Peter Geoghegan  wrote:
> > I like that idea. I think that I've spotted what may be an independent
> > bug, but I have to wait around for a minute or two to reproduce it
> > each time. Makes it hard to get to a minimal test case.
>
> I now have simple steps to reproduce a bug when I start Postgres
> master with "--old_snapshot_threshold=1"  (1 minute).

Thanks, that's super helpful.


> This example shows wrong answers to queries in session 2:
>
> Session 1:
>
> pg@regression:5432 [1444078]=# create table snapshot_bug (col int4);
> CREATE TABLE
> pg@regression:5432 [1444078]=# create index on snapshot_bug (col );
> CREATE INDEX
> pg@regression:5432 [1444078]=# insert into snapshot_bug select i from
> generate_series(1, 500) i;
> INSERT 0 500
>
> Session 2 starts, and views the data in a serializable transaction:
>
> pg@regression:5432 [1444124]=# begin isolation level serializable ;
> BEGIN
> pg@regression:5432 [1444124]=*# select col from snapshot_bug where col
> >= 0 order by col limit 14;
> ┌─┐
> │ col │
> ├─┤
> │   1 │
> │   2 │
> │   3 │
> │   4 │
> │   5 │
> │   6 │
> │   7 │
> │   8 │
> │   9 │
> │  10 │
> │  11 │
> │  12 │
> │  13 │
> │  14 │
> └─┘
> (14 rows)
>
> So far so good. Now session 2 continues:

> pg@regression:5432 [1444078]=# delete from snapshot_bug where col < 15;
> DELETE 14

I got a bit confused here - you seemed to have switched session 1 and 2
around? Doesn't seem to matter much though, I was able to reproduce this.

This indeed seems a separate bug.

The backtrace to the point where the xmin horizon is affected by
TransactionIdLimitedForOldSnapshots() is:

#0  TransactionIdLimitedForOldSnapshots (recentXmin=2082816071, 
relation=0x7f52ff3b56f8) at 
/home/andres/src/postgresql/src/backend/utils/time/snapmgr.c:1870
#1  0x5567f4cd1a55 in heap_page_prune_opt (relation=0x7f52ff3b56f8, 
buffer=175) at 
/home/andres/src/postgresql/src/backend/access/heap/pruneheap.c:106
#2  0x5567f4cc70e2 in heapam_index_fetch_tuple (scan=0x5567f6db3028, 
tid=0x5567f6db2e40, snapshot=0x5567f6d67d68, slot=0x5567f6db1b60,
call_again=0x5567f6db2e46, all_dead=0x7ffce13d78de) at 
/home/andres/src/postgresql/src/backend/access/heap/heapam_handler.c:137
#3  0x5567f4cdf5e6 in table_index_fetch_tuple (scan=0x5567f6db3028, 
tid=0x5567f6db2e40, snapshot=0x5567f6d67d68, slot=0x5567f6db1b60,
call_again=0x5567f6db2e46, all_dead=0x7ffce13d78de) at 
/home/andres/src/postgresql/src/include/access/tableam.h:1020
#4  0x5567f4ce0767 in index_fetch_heap (scan=0x5567f6db2de0, 
slot=0x5567f6db1b60) at 
/home/andres/src/postgresql/src/backend/access/index/indexam.c:577
#5  0x5567f4f19191 in IndexOnlyNext (node=0x5567f6db16a0) at 
/home/andres/src/postgresql/src/backend/executor/nodeIndexonlyscan.c:169
#6  0x5567f4ef4bc4 in ExecScanFetch (node=0x5567f6db16a0, 
accessMtd=0x5567f4f18f20 , recheckMtd=0x5567f4f1951c 
)
at /home/andres/src/postgresql/src/backend/executor/execScan.c:133
#7  0x5567f4ef4c39 in ExecScan (node=0x5567f6db16a0, 
accessMtd=0x5567f4f18f20 , recheckMtd=0x5567f4f1951c 
)
at /home/andres/src/postgresql/src/backend/executor/execScan.c:182
#8  0x5567f4f195d4 in ExecIndexOnlyScan (pstate=0x5567f6db16a0) at 
/home/andres/src/postgresql/src/backend/executor/nodeIndexonlyscan.c:317
#9  0x5567f4ef0f71 in ExecProcNodeFirst (node=0x5567f6db16a0) at 
/home/andres/src/postgresql/src/backend/executor/execProcnode.c:444
#10 0x5567f4f1d694 in ExecProcNode (node=0x5567f6db16a0) at 
/home/andres/src/postgresql/src/include/executor/executor.h:245
#11 0x5567f4f1d7d2 in ExecLimit (pstate=0x5567f6db14b8) at 
/home/andres/src/postgresql/src/backend/executor/nodeLimit.c:95
#12 0x5567f4ef0f71 in ExecProcNodeFirst (node=0x5567f6db14b8) at 
/home/andres/src/postgresql/src/backend/executor/execProcnode.c:444
#13 0x5567f4ee57c3 in ExecProcNode (node=0x5567f6db14b8) at 
/home/andres/src/postgresql/src/include/executor/executor.h:245
#14 0x5567f4ee83dd in ExecutePlan (estate=0x5567f6db1280, 
planstate=0x5567f6db14b8, use_parallel_mode=false, operation=CMD_SELECT, 
sendTuples=true,
numberTuples=0, direction=ForwardScanDirection, dest=0x5567f6db3c78, 
execute_once=true)
at /home/andres/src/postgresql/src/backend/executor/execMain.c:1646
#15 0x5567f4ee5e23 in standard_ExecutorRun (queryDesc=0x5567f6d0c490, 
direction=ForwardScanDirection, count=0, execute_once=true)
at /home/andres/src/postgresql/src/backend/executor/execMain.c:364
#16 0x5567f4ee5c35 in ExecutorRun (queryDesc=0x5567f6d0c490, 
direction=ForwardScanDirection, count=0, execute_once=true)
at /home/andres/src/postgresql/src/backend/executor/execMain.c:308
#17 0x5567f510c4de in PortalRunSelect (portal=0x5567f6d49260, forward=true, 
count=0, dest=0x5567f6db3c78)
at /home/andres/src/postgresql/src/backend/tcop/pquery.c:912
#18 0x5567f510c191 in PortalRun 

Re: Proposal: Expose oldest xmin as SQL function for monitoring

2020-04-01 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Apr-01, Tom Lane wrote:
>> The fact that I had to use max(age(...)) in that sample query
>> hints at one reason: it's really hard to do arithmetic correctly
>> on raw XIDs.  Dealing with wraparound is a problem, and knowing
>> what's past or future is even harder.  What use-case do you
>> foresee exactly?

> Maybe it would make sense to start exposing fullXids in these views and
> functions, for this reason.  There's no good reason to continue to
> expose bare Xids to userspace, we should use them only for storage.

+1, that would help a lot.

> But I think James' point is precisely that it's not easy to know where
> to look for things that keep Xmin from advancing.  Currently it's
> backends, replication slots, prepared transactions, and replicas with
> hot_standby_feedback.  If you forget to monitor just one of these, your
> vacuums might be useless and you won't notice until disaster strikes.

Agreed, but just knowing what the oldest xmin is doesn't help you
find *where* it is.  Maybe what we need is a view showing all of
these potential sources of an old xmin.
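
Even without a new view, a first approximation could combine the known
sources in one query (a rough sketch; whether it truly covers every source
is exactly the problem at hand, e.g. standby feedback only shows up here
via walsender entries in pg_stat_activity):

    SELECT greatest(
             (SELECT max(age(backend_xmin)) FROM pg_stat_activity),
             (SELECT max(greatest(age(xmin), age(catalog_xmin)))
                FROM pg_replication_slots),
             (SELECT max(age(transaction)) FROM pg_prepared_xacts)
           ) AS oldest_xmin_age;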

regards, tom lane




Re: Ltree syntax improvement

2020-04-01 Thread Tom Lane
Nikita Glukhov  writes:
> [ latest version of ltree syntax extension ]

This is going to need another rebase after all the other ltree hacking
that just got done.  However, I did include 0001 (use a switch) in
the commit I just pushed, so you don't need to worry about that.

regards, tom lane




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 14:11:11 -0700, Andres Freund wrote:
> As far as I can tell, with a large old_snapshot_threshold, it can take a
> very long time to get to a head_timestamp that's old enough for
> TransactionIdLimitedForOldSnapshots() to do anything.  Look at this
> trace of a pgbench run with old_snapshot_threshold enabled, showing some of
> the debugging output added in the patch upthread.
>
> This is with a threshold of 10min, in a freshly started database:
> [...]

It took a lot longer till the queries started to be cancelled. The last
mapping update before that was:

> 2020-04-01 14:28:00.000 PDT [1268503][3/1894126:2078853871] WARNING:  old 
> snapshot mapping at "before update" with head ts: 31, current entries: 20 max 
> entries: 20, offset: 12
> entry 0 (ring 12): min 31: xid 2078468128
> entry 1 (ring 13): min 32: xid 2078642746
> entry 2 (ring 14): min 33: xid 2078672303
> entry 3 (ring 15): min 34: xid 2078700941
> entry 4 (ring 16): min 35: xid 2078728729
> entry 5 (ring 17): min 36: xid 2078755425
> entry 6 (ring 18): min 37: xid 2078781089
> entry 7 (ring 19): min 38: xid 2078805567
> entry 8 (ring 0): min 39: xid 2078830065
> entry 9 (ring 1): min 40: xid 2078611853
> entry 10 (ring 2): min 41: xid 2078611853
> entry 11 (ring 3): min 42: xid 2078611853
> entry 12 (ring 4): min 43: xid 2078611853
> entry 13 (ring 5): min 44: xid 2078611853
> entry 14 (ring 6): min 45: xid 2078611853
> entry 15 (ring 7): min 46: xid 2078611853
> entry 16 (ring 8): min 47: xid 2078611853
> entry 17 (ring 9): min 48: xid 2078611853
> entry 18 (ring 10): min 49: xid 2078611853
> entry 19 (ring 11): min 50: xid 2078611853
>
> 2020-04-01 14:28:00.000 PDT [1268503][3/1894126:2078853871] WARNING:  head 31 
> min: updating existing bucket 1 for whenTaken 40 min, with xmin 2078853870
> 2020-04-01 14:28:00.000 PDT [1268503][3/1894126:2078853871] WARNING:  old 
> snapshot mapping at "after update" with head ts: 31, current entries: 20 max 
> entries: 20, offset: 12
> entry 0  (ring 12): min 31: xid 2078468128
> entry 1  (ring 13): min 32: xid 2078642746
> entry 2  (ring 14): min 33: xid 2078672303
> entry 3  (ring 15): min 34: xid 2078700941
> entry 4  (ring 16): min 35: xid 2078728729
> entry 5  (ring 17): min 36: xid 2078755425
> entry 6  (ring 18): min 37: xid 2078781089
> entry 7  (ring 19): min 38: xid 2078805567
> entry 8  (ring 0 ): min 39: xid 2078830065
> entry 9  (ring 1 ): min 40: xid 2078853870
> entry 10 (ring 2 ): min 41: xid 2078611853
> entry 11 (ring 3 ): min 42: xid 2078611853
> entry 12 (ring 4 ): min 43: xid 2078611853
> entry 13 (ring 5 ): min 44: xid 2078611853
> entry 14 (ring 6 ): min 45: xid 2078611853
> entry 15 (ring 7 ): min 46: xid 2078611853
> entry 16 (ring 8 ): min 47: xid 2078611853
> entry 17 (ring 9 ): min 48: xid 2078611853
> entry 18 (ring 10): min 49: xid 2078611853
> entry 19 (ring 11): min 50: xid 2078611853


A query ran for forty minutes during this, without getting cancelled.



A good while later this happens:
> 2020-04-01 15:30:00.000 PDT [1268503][3/2518699:2081262046] WARNING:  old 
> snapshot mapping at "before update" with head ts: 82, current entries: 20 max 
> entries: 20, offset: 12
> entry 0 (ring 12): min 82: xid 2080333207
> entry 1 (ring 13): min 83: xid 2080527298
> entry 2 (ring 14): min 84: xid 2080566990
> entry 3 (ring 15): min 85: xid 2080605960
> entry 4 (ring 16): min 86: xid 2080644554
> entry 5 (ring 17): min 87: xid 2080682957
> entry 6 (ring 18): min 88: xid 2080721936
> entry 7 (ring 19): min 89: xid 2080760947
> entry 8 (ring 0): min 90: xid 2080799843
> entry 9 (ring 1): min 91: xid 2080838696
> entry 10 (ring 2): min 92: xid 2080877550
> entry 11 (ring 3): min 93: xid 2080915870
> entry 12 (ring 4): min 94: xid 2080954151
> entry 13 (ring 5): min 95: xid 2080992556
> entry 14 (ring 6): min 96: xid 2081030980
> entry 15 (ring 7): min 97: xid 2081069403
> entry 16 (ring 8): min 98: xid 2081107811
> entry 17 (ring 9): min 99: xid 2081146322
> entry 18 (ring 10): min 100: xid 2081185023
> entry 19 (ring 11): min 101: xid 2081223632
>
> 2020-04-01 15:30:00.000 PDT [1268503][3/2518699:2081262046] WARNING:  head 82 
> min: filling 20 buckets starting at 12 for whenTaken 102 min, with xmin 
> 2081262046
> 2020-04-01 15:30:00.000 PDT [1268503][3/2518699:2081262046] WARNING:  old 
> snapshot mapping at "after update" with head ts: 102, current entries: 1 max 
> entries: 20, offset: 0
> entry 0 (ring 0): min 102: xid 2081262046

The entire mapping reset, i.e. it'll take another 

Re: Proposal: Expose oldest xmin as SQL function for monitoring

2020-04-01 Thread James Coleman
On Wed, Apr 1, 2020 at 5:58 PM Alvaro Herrera  wrote:
>
> On 2020-Apr-01, Tom Lane wrote:
>
> > James Coleman  writes:
> > > To my knowledge the current oldest xmin (GetOldestXmin() if I'm not
> > > mistaken) isn't exposed directly in any view or function by Postgres.
> >
> > You could do something like
> >
> > select max(age(backend_xmin)) from pg_stat_activity;
> >
> > though I'm not sure whether that accounts for absolutely every process.

That doesn't account for replication slots that aren't active, right?

> > > Am I missing anything in the above description? And if not, would
> > > there be any reason why we would want to avoid exposing that
> > > information? And if not, then would exposing it as a function be
> > > acceptable?
> >
> > The fact that I had to use max(age(...)) in that sample query
> > hints at one reason: it's really hard to do arithmetic correctly
> > on raw XIDs.  Dealing with wraparound is a problem, and knowing
> > what's past or future is even harder.  What use-case do you
> > foresee exactly?

So the use case we've encountered multiple times is some (at that
point unknown) process or object that's preventing the xmin from
advancing, and thus blocking vacuum. That kind of situation can pretty
quickly lead to degraded query plans that can result in significant
business impact.

As such, it'd be helpful to be able to monitor something like "how old
is the current xmin on the cluster". Ideally it would also tell you
what process or object is holding that xmin, but starting with the
xmin itself would at least alert to the problem so you can
investigate.

On that note, for this particular use case it would be sufficient to
have something like pg_timestamp_of_oldest_xmin() (given we have
commit timestamps tracking enabled) or even a function returning the
number of xids consumed since the oldest xmin, but it seems more
broadly useful to provide a function that gives the oldest xmin and
allow users to build on top of that.
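
For instance (a sketch only; oldest_xmin() here is the hypothetical
function being proposed):

    SELECT age(oldest_xmin()) AS xids_since_oldest_xmin;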

> Maybe it would make sense to start exposing fullXids in these views and
> functions, for this reason.  There's no good reason to continue to
> expose bare Xids to userspace, we should use them only for storage.

This would be useful too (and for more reasons than the above).

> But I think James' point is precisely that it's not easy to know where
> to look for things that keep Xmin from advancing.  Currently it's
> backends, replication slots, prepared transactions, and replicas with
> hot_standby_feedback.  If you forget to monitor just one of these, your
> vacuums might be useless and you won't notice until disaster strikes.
>
>
> Maybe a useful value to publish in some monitoring view is
> RecentGlobalXmin -- which has a valid value when reading a view, since
> you had to acquire a snapshot to read the view in the first place.

If we went down that path what view do you think would be best -- an
existing one or a new one?

I go back and forth on whether this is best exposed as a monitoring
oriented view or as part of the suite of txid functions we already
have that seem to have broader applicability.

James




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Peter Geoghegan
On Wed, Apr 1, 2020 at 3:00 PM Peter Geoghegan  wrote:
> I like that idea. I think that I've spotted what may be an independent
> bug, but I have to wait around for a minute or two to reproduce it
> each time. Makes it hard to get to a minimal test case.

I now have simple steps to reproduce a bug when I start Postgres
master with "--old_snapshot_threshold=1"  (1 minute).

This example shows wrong answers to queries in session 2:

Session 1:

pg@regression:5432 [1444078]=# create table snapshot_bug (col int4);
CREATE TABLE
pg@regression:5432 [1444078]=# create index on snapshot_bug (col );
CREATE INDEX
pg@regression:5432 [1444078]=# insert into snapshot_bug select i from
generate_series(1, 500) i;
INSERT 0 500

Session 2 starts, and views the data in a serializable transaction:

pg@regression:5432 [1444124]=# begin isolation level serializable ;
BEGIN
pg@regression:5432 [1444124]=*# select col from snapshot_bug where col
>= 0 order by col limit 14;
┌─┐
│ col │
├─┤
│   1 │
│   2 │
│   3 │
│   4 │
│   5 │
│   6 │
│   7 │
│   8 │
│   9 │
│  10 │
│  11 │
│  12 │
│  13 │
│  14 │
└─┘
(14 rows)

So far so good. Now session 2 continues:

pg@regression:5432 [1444078]=# delete from snapshot_bug where col < 15;
DELETE 14

Session 1:

(repeats the same "select col from snapshot_bug where col >= 0 order
by col limit 14" query every 100 ms using psql's \watch 0.1)

Session 2:

pg@regression:5432 [1444078]=# vacuum snapshot_bug ;
VACUUM

Before too long, we see the following over in session 2 -- the answer
the query gives changes, even though this is a serializable
transaction:

Wed 01 Apr 2020 03:12:59 PM PDT (every 0.1s)

┌─┐
│ col │
├─┤
│   1 │
│   2 │
│   3 │
│   4 │
│   5 │
│   6 │
│   7 │
│   8 │
│   9 │
│  10 │
│  11 │
│  12 │
│  13 │
│  14 │
└─┘
(14 rows)

Wed 01 Apr 2020 03:13:00 PM PDT (every 0.1s)

┌─┐
│ col │
├─┤
│  15 │
│  16 │
│  17 │
│  18 │
│  19 │
│  20 │
│  21 │
│  22 │
│  23 │
│  24 │
│  25 │
│  26 │
│  27 │
│  28 │
└─┘
(14 rows)

Wed 01 Apr 2020 03:13:00 PM PDT (every 0.1s)

┌─┐
│ col │
├─┤
│  15 │
│  16 │
│  17 │
│  18 │
│  19 │
│  20 │
│  21 │
│  22 │
│  23 │
│  24 │
│  25 │
│  26 │
│  27 │
│  28 │
└─┘
(14 rows)

We continue to get this wrong answer for almost another minute (at
least on this occasion). Eventually we get "snapshot too old". Note
that the answer changes when we cross the "minute threshold".

Andres didn't explain anything to me that contributed to finding the
bug (though it could be a known bug, I don't think that it is). It
took me a surprisingly short amount of time to stumble upon this bug
-- I didn't find it because I have good intuitions about how to break
the feature.

-- 
Peter Geoghegan


Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Peter Geoghegan
On Wed, Apr 1, 2020 at 1:25 PM Robert Haas  wrote:
> Maybe that contrib module could even have some functions to simulate
> aging without the passage of any real time. Like, say you have a
> function or procedure old_snapshot_pretend_time_has_passed(integer),
> and it moves oldSnapshotControl->head_timestamp backwards by that
> amount. Maybe that would require updating some other fields in
> oldSnapshotControl too but it doesn't seem like we'd need to do a
> whole lot.

I like that idea. I think that I've spotted what may be an independent
bug, but I have to wait around for a minute or two to reproduce it
each time. Makes it hard to get to a minimal test case.

-- 
Peter Geoghegan




Re: Proposal: Expose oldest xmin as SQL function for monitoring

2020-04-01 Thread Alvaro Herrera
On 2020-Apr-01, Tom Lane wrote:

> James Coleman  writes:
> > To my knowledge the current oldest xmin (GetOldestXmin() if I'm not
> > mistaken) isn't exposed directly in any view or function by Postgres.
> 
> You could do something like
> 
> select max(age(backend_xmin)) from pg_stat_activity;
> 
> though I'm not sure whether that accounts for absolutely every process.
>
> > Am I missing anything in the above description? And if not, would
> > there be any reason why we would want to avoid exposing that
> > information? And if not, then would exposing it as a function be
> > acceptable?
> 
> The fact that I had to use max(age(...)) in that sample query
> hints at one reason: it's really hard to do arithmetic correctly
> on raw XIDs.  Dealing with wraparound is a problem, and knowing
> what's past or future is even harder.  What use-case do you
> foresee exactly?

Maybe it would make sense to start exposing fullXids in these views and
functions, for this reason.  There's no good reason to continue to
expose bare Xids to userspace, we should use them only for storage.

But I think James' point is precisely that it's not easy to know where
to look for things that keep Xmin from advancing.  Currently it's
backends, replication slots, prepared transactions, and replicas with
hot_standby_feedback.  If you forget to monitor just one of these, your
vacuums might be useless and you won't notice until disaster strikes.


Maybe a useful value to publish in some monitoring view is
RecentGlobalXmin -- which has a valid value when reading a view, since
you had to acquire a snapshot to read the view in the first place.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] ltree, lquery, and ltxtquery binary protocol support

2020-04-01 Thread Nino Floris
Hi Tom,

Thanks a lot for pushing this through.

In complete agreement on fixing mbstrlen; it would clearly have led to
cut-off string sends, or worse (does the binary protocol use null-terminated
strings, or are they length-prefixed?). Apologies anyway; it's been a
while, so I don't know how it may have ended up there. Thanks for catching
it.

Even though it was a bit bumpy, and vastly different from my usual github
contribution flow, I've had a good time contributing my first patch!

Best regards,
Nino Floris


On Apr 1, 2020 at 23:37, Tom Lane  wrote:

I wrote:
> Fortunately for the odds of getting this patch accepted, we just
> pushed an ALTER TYPE improvement that will solve your problem [1].
> Please replace ltree--1.2.sql with an upgrade script that uses
> that, and resubmit.

I decided it would be a shame for this to miss v13, seeing that
(a) it'd provide real-world use of that ALTER TYPE code, and
(b) we already bumped ltree's extension version for v13.

So I went ahead and rebased this, reviewed and pushed it.  The
main non-cosmetic thing I changed, other than using ALTER TYPE,
was that you had the output functions looking like this:

pq_sendtext(, res, pg_mbstrlen(res));

That's just wrong; it should be plain strlen().  Did you copy
that coding from someplace?  I could not find any similar
occurrences in our code tree.

regards, tom lane


Re: Proposal: Expose oldest xmin as SQL function for monitoring

2020-04-01 Thread Tom Lane
James Coleman  writes:
> To my knowledge the current oldest xmin (GetOldestXmin() if I'm not
> mistaken) isn't exposed directly in any view or function by Postgres.

You could do something like

select max(age(backend_xmin)) from pg_stat_activity;

though I'm not sure whether that accounts for absolutely every process.

> Am I missing anything in the above description? And if not, would
> there be any reason why we would want to avoid exposing that
> information? And if not, then would exposing it as a function be
> acceptable?

The fact that I had to use max(age(...)) in that sample query
hints at one reason: it's really hard to do arithmetic correctly
on raw XIDs.  Dealing with wraparound is a problem, and knowing
what's past or future is even harder.  What use-case do you
foresee exactly?

regards, tom lane




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-04-01 Thread Tomas Vondra

On Wed, Apr 01, 2020 at 09:05:27AM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 11:07 PM James Coleman  wrote:


On Tue, Mar 31, 2020 at 10:44 PM Tomas Vondra
 wrote:
>
> On Tue, Mar 31, 2020 at 10:12:29PM -0400, James Coleman wrote:
> >On Tue, Mar 31, 2020 at 9:59 PM Tomas Vondra
> > wrote:
> >>
> >> On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:
> >> >On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
> >> > wrote:
> >> >>
> >> >> On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
> >> >> >On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
> >> >> > wrote:

...

> >> >> >One small idea, but I'm not yet sure it helps us a whole lot: if the
> >> >> >query pathkeys is only length 1, then we could skip the additional
> >> >> >path creation.
> >> >> >
> >> >>
> >> >> I don't follow. Why would we create incremental sort in this case at
> >> >> all? With single-element query_pathkeys the path is either unsorted or
> >> >> fully sorted - there's no room for incremental sort. No?
> >> >
> >> >Well, we shouldn't, that's what I'm getting. But I didn't see anything
> >> >in the code now that explicitly excludes that case when decided
> >> >whether or not to create an incremental sort path, unless I'm missing
> >> >something obvious.
> >>
> >> Well, my point is that create_ordered_paths() looks like this:
> >>
> >>  is_sorted = pathkeys_common_contained_in(root->sort_patkeys, ...);
> >>
> >>  if (is_sorted)
> >>  {
> >>  ... old code
> >>  }
> >>  else
> >>  {
> >>  if (input_path == cheapest_input_path)
> >>  {
> >>  ... old code
> >>  }
> >>
> >>  /* With incremental sort disabled, don't build those paths. */
> >>  if (!enable_incrementalsort)
> >>  continue;
> >>
> >>  /* Likewise, if the path can't be used for incremental sort. */
> >>  if (!presorted_keys)
> >>  continue;
> >>
> >>  ... incremental sort path
> >>  }
> >>
> >> Now, with single-item sort_pathkeys, the input path can't be partially
> >> sorted. It's either fully sorted - in which case it's handled by the
> >> first branch. Or it's not sorted at all, so presorted_keys==0 and we
> >> never get to the incremental path.
> >>
> >> Or did you mean to use the optimization somewhere else?
> >
> >Hmm, yes, I didn't think through that properly. I'll have to look at
> >the other cases to confirm the same logic applies there.


I looked through this more carefully, and I did end up finding a few
places where we can skip iterating through a list of paths entirely
with this check, so I added it there. I also cleaned up some comments,
added comments and asserts to the other places where
list_length(pathkeys) should be guaranteed to be > 1, removed a few
asserts I found unnecessary, and merged duplicative
pathkeys_[count_]_contained_in calls.



OK


> >One other thing:in the code above we create the regular sort path
> >inside of `if (input_path == cheapest_input_path)`, but incremental
> >sort is outside of that condition. I'm not sure I'm remembering why
> >that was, and it's not obvious to me reading it right now (though it's
> >getting late here, so maybe I'm just not thinking clearly). Do you
> >happen to remember why that is?
> >
>
> It's because for the regular sort, the path is either already sorted or
> it requires a full sort. But full sort only makes sense on the cheapest
> path, because we assume the additional sort cost is independent of the
> input cost, essentially
>
> cost(path + Sort) = cost(path) + cost(Sort)
>
> and it's always
>
>  cost(path) + cost(Sort) >= cost(cheapest path) + cost(Sort)
>
> and by checking for cheapest path we simply skip building all the paths
> that we'd end up discarding anyway.
>
> With incremental sort we can't do this, the cost of the incremental sort
> depends on how well presorted is the input path.


Thanks for the explanation. I've added a comment to that effect.



Thanks.

I've realized the way get_useful_pathkeys_for_relation() is coded kinda
works against the fastpath we added when comparing pathkeys. That
depends on comparing pointers to the list, but we've been building new
lists (and then returning those), which defeats the optimization. Attached
is a patch that returns the original list in most cases (and only
creates a copy when really necessary). This might also save a few cycles
on building the new list, of course.

I've done a bunch of read-only pgbench tests with fairly small scales (1
and 10). First with the built-in read-only transaction, and also with a
simple custom query doing an order-by. And I did this both on the
default schema and with a bunch of extra indexes. The script I used to
run this is attached, along with a summary of results.

There are results for master and v40 and v50 patches (the v50 also
includes the extra patch fixing get_useful_pathkeys_for_relation).

Overall, I'm happy with those results - the v50 seems to be 

Re: [PATCH] ltree, lquery, and ltxtquery binary protocol support

2020-04-01 Thread Tom Lane
I wrote:
> Fortunately for the odds of getting this patch accepted, we just
> pushed an ALTER TYPE improvement that will solve your problem [1].
> Please replace ltree--1.2.sql with an upgrade script that uses
> that, and resubmit.

I decided it would be a shame for this to miss v13, seeing that
(a) it'd provide real-world use of that ALTER TYPE code, and
(b) we already bumped ltree's extension version for v13.

So I went ahead and rebased this, reviewed and pushed it.  The
main non-cosmetic thing I changed, other than using ALTER TYPE,
was that you had the output functions looking like this:

pq_sendtext(, res, pg_mbstrlen(res));

That's just wrong; it should be plain strlen().  Did you copy
that coding from someplace?  I could not find any similar
occurrences in our code tree.

regards, tom lane




Re: backup manifests

2020-04-01 Thread David Steele

On 3/31/20 7:57 AM, Robert Haas wrote:

On Mon, Mar 30, 2020 at 7:24 PM David Steele  wrote:

I'm confused as to why you're not seeing that. What's the exact
sequence of steps?


$ pg_basebackup -D test/backup5 --manifest-checksums=SHA256

$ vi test/backup5/backup_manifest
  * Add 'X' to the checksum of backup_label

$ pg_validatebackup test/backup5
pg_validatebackup: fatal: invalid checksum for file "backup_label":
"a98e9164fd59d498d14cfdf19c67d1c2208a30e7b939d1b4a09f524c7adfc11fX"

No mention of the manifest checksum being invalid.  But if I remove the
backup label file from the manifest:

pg_validatebackup: fatal: manifest checksum mismatch


Oh, I see what's happening now. If the checksum is not an even-length
string of hexademical characters, it's treated as a syntax error, so
it bails out at that point. Generally, a syntax error in the manifest
file is treated as a fatal error, and you just die right there. You'd
get the same behavior if you had malformed JSON, like a stray { or }
or [ or ] someplace that it doesn't belong according to the rules of
JSON. On the other hand, if you corrupt the checksum by adding AA or
EE or 54 or some other even-length string of hex characters, then you
have (in this code's view) a semantic error rather than a syntax
error, so it will finish loading all the manifest data and then bail
because the checksum doesn't match.

We really can't avoid bailing out early sometimes, because if the file
is totally malformed at the JSON level, there's just no way to
continue. We could cause this particular error to get treated as a
semantic error rather than a syntax error, but I don't really see much
advantage in so doing. This way was easier to code, and I don't think
it really matters which error we find first.


I think it would be good to know that the manifest checksum is bad in 
all cases because that may well inform other errors.


That said, I know you have a lot on your plate with this patch so I'm 
not going to make a fuss about such a minor gripe. Perhaps this can be 
considered for future improvement.


Regards,
--
-David
da...@pgmasters.net




Proposal: Expose oldest xmin as SQL function for monitoring

2020-04-01 Thread James Coleman
Currently there's no good way that I'm aware of for monitoring
software to check what the xmin horizon is being blocked at. You can
check pg_stat_replication and pg_replication_slots and
txid_snapshot_xmin(txid_current_snapshot()) and so on, but that list
can grow, and it means monitoring setups need to update any time any
new feature might hold another snapshot and expose it in a different
way.

To my knowledge the current oldest xmin (GetOldestXmin() if I'm not
mistaken) isn't exposed directly in any view or function by Postgres.

Am I missing anything in the above description? And if not, would
there be any reason why we would want to avoid exposing that
information? And if not, then would exposing it as a function be
acceptable?

Thanks,
James




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 15:11:52 -0500, Kevin Grittner wrote:
> On Wed, Apr 1, 2020 at 2:43 PM Andres Freund  wrote:
>
> > The thing that makes me really worried is that the contents of the time
> > mapping seem very wrong. I've reproduced query results in a REPEATABLE
> > READ transaction changing (pruned without triggering an error).
>
>
> That is a very big problem.  On the sort-of bright side (ironic in light of
> the fact that I'm a big proponent of using serializable transactions), none
> of the uses that I have personally seen of this feature use anything other
> than the default READ COMMITTED isolation level.  That might help explain
> the lack of complaints for those using the feature.  But yeah, I REALLY
> want to see a solid fix for that!

I don't think it's dependent on RR - it's just a bit easier to verify
that the query results are wrong that way.


> > And I've
> > reproduced rows not getting removed for much longer than than they
> > should, according to old_snapshot_threshold.
> >
> > I suspect one reason for users not noticing either is that
> >
> > a) it's plausible that users of the feature would mostly have
> >   long-running queries/transactions querying immutable or insert only
> >   data. Those would not notice that, on other tables, rows are getting
> >   removed, where access would not trigger the required error.
> >
> > b) I observe long-ish phases were no cleanup is happening (due to
> >   oldSnapshotControl->head_timestamp getting updated more often than
> >   correct). But if old_snapshot_threshold is small enough in relation to
> >   the time the generated bloat becomes problematic, there will still be
> >   occasions to actually perform cleanup.
> >
>
> Keep in mind that the real goal of this feature is not to eagerly _see_
> "snapshot too old" errors, but to prevent accidental debilitating bloat due
> to one misbehaving user connection.

I don't think it's an "intentional" inaccuracy issue leading to
this. The map contents are just wrong; in particular, the head_timestamp
most of the time is so new that
TransactionIdLimitedForOldSnapshots() does nothing. When filling a new bucket,
MaintainOldSnapshotThreshold() unconditionally updates
oldSnapshotControl->head_timestamp to be the current minute, which means
it'll take old_snapshot_threshold minutes till
TransactionIdLimitedForOldSnapshots() even looks at the mapping again.

As far as I can tell, with a large old_snapshot_threshold, it can take a
very long time to get to a head_timestamp that's old enough for
TransactionIdLimitedForOldSnapshots() to do anything.  Look at this
trace of a pgbench run with old_snapshot_threshold enabled, showing some of
the debugging output added in the patch upthread.

This is with a threshold of 10min, in a freshly started database:
> 2020-04-01 13:49:00.000 PDT [1268502][2/43571:2068881994] WARNING:  head 0 
> min: filling 1 buckets starting at 0 for whenTaken 1 min, with xmin 2068881994
> 2020-04-01 13:49:00.000 PDT [1268502][2/43571:2068881994] WARNING:  old 
> snapshot mapping at "after update" with head ts: 1, current entries: 2 max 
> entries: 20, offset: 0
> entry 0 (ring 0): min 1: xid 2068447214
> entry 1 (ring 1): min 2: xid 2068881994
>
> 2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING:  old snapshot 
> mapping at "before update" with head ts: 1, current entries: 2 max entries: 
> 20, offset: 0
> entry 0 (ring 0): min 1: xid 2068447214
> entry 1 (ring 1): min 2: xid 2068881994
>
> 2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING:  head 1 min: 
> updating existing bucket 1 for whenTaken 2 min, with xmin 2069199511
> 2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING:  old snapshot 
> mapping at "after update" with head ts: 1, current entries: 2 max entries: 
> 20, offset: 0
> entry 0 (ring 0): min 1: xid 2068447214
> entry 1 (ring 1): min 2: xid 2069199511
>
> 2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING:  old 
> snapshot mapping at "before update" with head ts: 1, current entries: 2 max 
> entries: 20, offset: 0
> entry 0 (ring 0): min 1: xid 2068447214
> entry 1 (ring 1): min 2: xid 2069199511
>
> 2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING:  head 1 
> min: filling 2 buckets starting at 0 for whenTaken 3 min, with xmin 2069516499
> 2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING:  old 
> snapshot mapping at "after update" with head ts: 3, current entries: 4 max 
> entries: 20, offset: 0
> entry 0 (ring 0): min 3: xid 2068447214
> entry 1 (ring 1): min 4: xid 2069199511
> entry 2 (ring 2): min 5: xid 2069516499
> entry 3 (ring 3): min 6: xid 2069516499
> ...
> 2020-04-01 14:03:00.000 PDT [1268504][4/1158832:2075918094] WARNING:  old 
> snapshot mapping at "before update" with head ts: 7, current entries: 8 max 
> entries: 20, offset: 0
> entry 0 (ring 0): min 7: xid 2068447214
> entry 1 (ring 1): min 

Re: backup manifests

2020-04-01 Thread Andres Freund
Hi,

On 2020-03-31 14:56:07 +0530, Amit Kapila wrote:
> On Tue, Mar 31, 2020 at 11:10 AM Noah Misch  wrote:
> > On Mon, Mar 30, 2020 at 12:16:31PM -0700, Andres Freund wrote:
> > > On 2020-03-30 15:04:55 -0400, Robert Haas wrote:
> > > I'm mildly inclined to name it pg_validate, pg_validate_dbdir or
> > > such. And eventually (definitely not this release) subsume pg_checksums
> > > in it. That way we can add other checkers too.
> >
> > Works for me; of those two, I prefer pg_validate.
> >
> 
> pg_validate sounds like a tool with a much bigger purpose.  I think
> even things like amcheck could also fall under it.

Intentionally so. We don't serve our users by collecting a lot of
differently named commands to work with data directories. As I wrote
above, the point would be to eventually have that tool also perform
checksum validation etc.  Potentially even in a single pass over the
data directory.


> This patch has two parts (a) Generate backup manifests for base
> backups, and (b) Validate backup (manifest).  It seems to me that
> there are not many things pending for (a), can't we commit that first
> or is it the case that (a) depends on (b)?  This is *not* a suggestion
> to leave pg_validatebackup from this release rather just to commit if
> something is ready and meaningful on its own.

IDK, it seems easier to be able to modify both at the same time.

Greetings,

Andres Freund




Re: backup manifests

2020-04-01 Thread Andres Freund
Hi,

On 2020-03-31 22:15:04 -0700, Noah Misch wrote:
> On Tue, Mar 31, 2020 at 03:50:34PM -0700, Andres Freund wrote:
> > On 2020-03-31 14:10:34 -0400, Robert Haas wrote:
> > > +/*
> > > + * Attempt to parse the WAL files required to restore from backup using
> > > + * pg_waldump.
> > > + */
> > > +static void
> > > +parse_required_wal(validator_context *context, char *pg_waldump_path,
> > > +char *wal_directory, manifest_wal_range 
> > > *first_wal_range)
> > > +{
> > > + manifest_wal_range *this_wal_range = first_wal_range;
> > > +
> > > + while (this_wal_range != NULL)
> > > + {
> > > + char *pg_waldump_cmd;
> > > +
> > > + pg_waldump_cmd = psprintf("\"%s\" --quiet --path=\"%s\" --timeline=%u --start=%X/%X --end=%X/%X\n",
> > > +pg_waldump_path, wal_directory, this_wal_range->tli,
> > > +(uint32) (this_wal_range->start_lsn >> 32),
> > > +(uint32) this_wal_range->start_lsn,
> > > +(uint32) (this_wal_range->end_lsn >> 32),
> > > +(uint32) this_wal_range->end_lsn);
> > > + if (system(pg_waldump_cmd) != 0)
> > > + report_backup_error(context,
> > > + "WAL parsing failed for timeline %u",
> > > + this_wal_range->tli);
> > > +
> > > + this_wal_range = this_wal_range->next;
> > > + }
> > > +}
> > 
> > Should we have a function to properly escape paths in cases like this?
> > Not that it's likely or really problematic, but the quoting for path
> > could be "circumvented".
> 
> Are you looking for appendShellString(), or something different?

Looks like that'd be it. Thanks.
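
For the record, a sketch of what the call site could look like with
appendShellString() (assuming the PQExpBuffer helpers from
fe_utils/string_utils.h; illustrative only, not the committed code):

	PQExpBuffer cmd = createPQExpBuffer();

	appendShellString(cmd, pg_waldump_path);
	appendPQExpBufferStr(cmd, " --quiet --path=");
	appendShellString(cmd, wal_directory);
	appendPQExpBuffer(cmd, " --timeline=%u --start=%X/%X --end=%X/%X",
					  this_wal_range->tli,
					  (uint32) (this_wal_range->start_lsn >> 32),
					  (uint32) this_wal_range->start_lsn,
					  (uint32) (this_wal_range->end_lsn >> 32),
					  (uint32) this_wal_range->end_lsn);

	/* appendShellString() quotes each path, so no hand-placed quotes */
	if (system(cmd->data) != 0)
		report_backup_error(context,
							"WAL parsing failed for timeline %u",
							this_wal_range->tli);

	destroyPQExpBuffer(cmd);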

Greetings,

Andres Freund




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Robert Haas
On Wed, Apr 1, 2020 at 3:43 PM Andres Freund  wrote:
> The thing that makes me really worried is that the contents of the time
> mapping seem very wrong. I've reproduced query results in a REPEATABLE
> READ transaction changing (pruned without triggering an error). And I've
> reproduced rows not getting removed for much longer than they
> should, according to old_snapshot_threshold.

I think it would be a good idea to add a system view that shows the
contents of the mapping. We could make it a contrib module, if you
like, so that it can even be installed on back branches. We'd need to
move the structure definition from snapmgr.c to a header file, but
that doesn't seem like such a big deal.

Maybe that contrib module could even have some functions to simulate
aging without the passage of any real time. Like, say you have a
function or procedure old_snapshot_pretend_time_has_passed(integer),
and it moves oldSnapshotControl->head_timestamp backwards by that
amount. Maybe that would require updating some other fields in
oldSnapshotControl too but it doesn't seem like we'd need to do a
whole lot.
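
To make that concrete, usage of such a (purely hypothetical) module
might look like this; none of these objects exist today, the names are
made up:

    CREATE EXTENSION old_snapshot_debug;

    -- dump the xid-per-minute mapping kept in oldSnapshotControl
    SELECT * FROM old_snapshot_time_mapping();

    -- simulate 10 minutes of aging without actually waiting
    SELECT old_snapshot_pretend_time_has_passed(10);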

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Kevin Grittner
On Wed, Apr 1, 2020 at 2:43 PM Andres Freund  wrote:

> The thing that makes me really worried is that the contents of the time
> mapping seem very wrong. I've reproduced query results in a REPEATABLE
> READ transaction changing (pruned without triggering an error).


That is a very big problem.  On the sort-of bright side (ironic in light of
the fact that I'm a big proponent of using serializable transactions), none
of the uses that I have personally seen of this feature use anything other
than the default READ COMMITTED isolation level.  That might help explain
the lack of complaints for those using the feature.  But yeah, I REALLY
want to see a solid fix for that!


> And I've
> reproduced rows not getting removed for much longer than they
> should, according to old_snapshot_threshold.
>
> I suspect one reason for users not noticing either is that
>
> a) it's plausible that users of the feature would mostly have
>   long-running queries/transactions querying immutable or insert only
>   data. Those would not notice that, on other tables, rows are getting
>   removed, where access would not trigger the required error.
>
> b) I observe long-ish phases where no cleanup is happening (due to
>   oldSnapshotControl->head_timestamp getting updated more often than
>   correct). But if old_snapshot_threshold is small enough in relation to
>   the time the generated bloat becomes problematic, there will still be
>   occasions to actually perform cleanup.
>

Keep in mind that the real goal of this feature is not to eagerly _see_
"snapshot too old" errors, but to prevent accidental debilitating bloat due
to one misbehaving user connection.  This is particularly easy to see (and
therefore unnervingly common) for those using ODBC, which in my experience
tends to correspond to the largest companies which are using PostgreSQL.
In some cases, the snapshot which is preventing removal of the rows will
never be used again; removal of the rows will not actually affect the
result of any query, but only the size and performance of the database.
This is a "soft limit" -- kinda like max_wal_size.  Where there was a
trade-off between accuracy of the limit and performance, the less accurate
way was intentionally chosen.  I apologize for not making that more clear
in comments.

While occasional "snapshot too old" errors are an inconvenient side effect
of achieving the primary goal, it might be of interest to know that the
initial (very large corporate) user of this feature had, under Oracle,
intentionally used a cursor that would be held open as long as a user chose
to leave a list open for scrolling around.  They used cursor features for
as long as the cursor allowed.  This could be left open for days or weeks
(or longer?).  Their query ordered by a unique index, and tracked the ends
of the currently displayed portion of the list so that if they happened to
hit the "snapshot too old" error they could deallocate and restart the
cursor and reposition before moving forward or back to the newly requested
rows.  They were not willing to convert to PostgreSQL unless this approach
continued to work.

In Summary:
(1) It's not urgent that rows always be removed as soon as possible after
the threshold is crossed as long as they don't often linger too awfully far
past that limit and allow debilitating bloat.
(2) It _is_ a problem if results inconsistent with the snapshot are
returned -- a "snapshot too old" error is necessary.
(3) Obviously, wraparound problems need to be solved.

I hope this is helpful.

-- 
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/


Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

Nice to have you back for a bit! Even if the circumstances aren't
great...

It's very understandable that the lists are past your limits; I barely
keep up these days, even without any health issues.


On 2020-04-01 14:10:09 -0500, Kevin Grittner wrote:
> Perhaps the lack of evidence for usage in the archives indicates a low
> frequency of real-world failures due to the feature, rather than lack of
> use?  I'm not doubting that Andres found real issues that should be fixed,
> but perhaps not very many people who are using the feature have more than
> two billion transactions within the time threshold, and perhaps the other
> problems are not as big as the problems solved by use of the feature -- at
> least in some cases.

> To save readers who have not yet done the math some effort, at the 20
> minute threshold used by the initial user, they would need to have a
> sustained rate of consumption of transaction IDs of over 66 million per
> second to experience wraparound problems, and at the longest threshold I
> have seen it would need to exceed an average of 461,893 TPS for three days
> solid to hit wraparound.  Those aren't impossible rates to hit, but in
> practice it might not be a frequent occurrence yet on modern hardware with
> some real-world applications.  Hopefully we can find a way to fix this
> before those rates become common.

The wraparound issue on its own wouldn't be that bad - when I found it
I did play around with a few ideas for how to fix it. The most practical
would probably be to have MaintainOldSnapshotTimeMapping() scan all
buckets whenever a new oldSnapshotControl->oldest_xid is older than
RecentGlobalXmin. There's no benefit in the contents of those buckets
anyway, since we know that we can freeze those xids independently of
old_snapshot_threshold.
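
A rough sketch of that idea, illustrative only (names per snapmgr.c, and
assuming the caller already holds OldSnapshotTimeMapLock exclusively):

	/*
	 * Entries older than RecentGlobalXmin can't influence pruning
	 * decisions anymore, so clamping them avoids the wraparound
	 * comparison problem without losing anything useful.
	 */
	for (int off = 0; off < oldSnapshotControl->count_used; off++)
	{
		int		ringoff = (off + oldSnapshotControl->head_offset) %
			OLD_SNAPSHOT_TIME_MAP_ENTRIES;

		if (TransactionIdPrecedes(oldSnapshotControl->xid_by_minute[ringoff],
								  RecentGlobalXmin))
			oldSnapshotControl->xid_by_minute[ringoff] = RecentGlobalXmin;
	}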

The thing that makes me really worried is that the contents of the time
mapping seem very wrong. I've reproduced query results in a REPEATABLE
READ transaction changing (pruned without triggering an error). And I've
reproduced rows not getting removed for much longer than they
should, according to old_snapshot_threshold.


I suspect one reason for users not noticing either is that

a) it's plausible that users of the feature would mostly have
  long-running queries/transactions querying immutable or insert only
  data. Those would not notice that, on other tables, rows are getting
  removed, where access would not trigger the required error.

b) I observe long-ish phases where no cleanup is happening (due to
  oldSnapshotControl->head_timestamp getting updated more often than
  correct). But if old_snapshot_threshold is small enough in relation to
  the time the generated bloat becomes problematic, there will still be
  occasions to actually perform cleanup.

Greetings,

Andres Freund




Re: error context for vacuum to include block number

2020-04-01 Thread Alvaro Herrera
On 2020-Apr-01, Andres Freund wrote:

> On 2020-04-01 07:54:45 +0530, Amit Kapila wrote:
> > Pushed. I think we are done here.  The patch is marked as committed in
> > CF.  Thank you!
> 
> Awesome! Thanks for all your work on this, all. This'll make it a lot
> easier to debug errors during autovacuum.

Seconded!

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: error context for vacuum to include block number

2020-04-01 Thread Andres Freund
On 2020-04-01 07:54:45 +0530, Amit Kapila wrote:
> Pushed. I think we are done here.  The patch is marked as committed in
> CF.  Thank you!

Awesome! Thanks for all your work on this, all. This'll make it a lot
easier to debug errors during autovacuum.




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Kevin Grittner
On Wed, Apr 1, 2020 at 10:09 AM Andres Freund  wrote:

First off, many thanks to Andres for investigating this, and apologies for
the bugs.  Also thanks to Michael for making sure I saw the thread.  I must
also apologize that for not being able to track the community lists
consistently due to health issues that are exacerbated by stress, and the
fact that these lists often push past my current limits.  I'll try to help
in this as best I can.

> Do we actually have any evidence of this feature ever being used? I
> didn't find much evidence for that in the archives (except Thomas
> finding a problem).


This was added because a very large company trying to convert from Oracle
had a test that started to show some slowdown on PostgreSQL after 8 hours,
serious slowdown by 24 hours, and crashed hard before it could get to 48
hours -- due to lingering WITH HOLD cursors left by ODBC code.  They had
millions of lines of code that would need to be rewritten without this
feature.  With this feature (set to 20 minutes, if I recall correctly),
their unmodified code ran successfully for at least three months solid
without failure or corruption.  Last I heard, they were converting a large
number of instances from Oracle to PostgreSQL, and those would all fail
hard within days of running with this feature removed or disabled.

Also, VMware is using PostgreSQL as an embedded part of many products, and
this feature was enabled to deal with similar failures due to ODBC cursors;
so the number of instances running 24/7 under high load which have shown a
clear benefit from enabling this feature has a lot of zeros.

Perhaps the lack of evidence for usage in the archives indicates a low
frequency of real-world failures due to the feature, rather than lack of
use?  I'm not doubting that Andres found real issues that should be fixed,
but perhaps not very many people who are using the feature have more than
two billion transactions within the time threshold, and perhaps the other
problems are not as big as the problems solved by use of the feature -- at
least in some cases.

To save readers who have not yet done the math some effort, at the 20
minute threshold used by the initial user, they would need to have a
sustained rate of consumption of transaction IDs of over 66 million per
second to experience wraparound problems, and at the longest threshold I
have seen it would need to exceed an average of 461,893 TPS for three days
solid to hit wraparound.  Those aren't impossible rates to hit, but in
practice it might not be a frequent occurrence yet on modern hardware with
some real-world applications.  Hopefully we can find a way to fix this
before those rates become common.

I am reviewing the issue and patches now, and hope I can make some useful
contribution to the discussion.

-- 
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/


Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Robert Haas
On Wed, Apr 1, 2020 at 2:37 PM Andres Freund  wrote:
> Just continuing is easier said than done. Especially with the background
> of knowing that several users had hit the bug that allowed all of the
> above to be hit, and that advancing relfrozenxid further would make it
> worse.

Fair point, but it seems we're arguing over nothing here, or at least
nothing relevant to this thread, because it sounds like if we are
going to disable it, you're OK with doing that by just shutting it
off in the code rather than trying to remove it all. I had the opposite
impression from your first email.

Sorry to have derailed the thread, and for my poor choice of words.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: tweaking perfect hash multipliers

2020-04-01 Thread John Naylor
On Tue, Mar 31, 2020 at 4:05 PM John Naylor  wrote:
>
> On Tue, Mar 31, 2020 at 2:31 AM Andres Freund  wrote:
> > I think the form of lea generated here is among the ones that can only
> > be executed on port 1. Whereas e.g. an register+register/immediate add
> > can be executed on four different ports.
>
> I looked into slow vs. fast leas, and I think the above are actually
> fast because they have 2 operands.

No, scratch that, it seems the two forms of lea are:

leal (,%rdx,8), %ecx
leal (%rdx,%rdx,8), %ecx

The first operand in both is the implicit zero, so with 3 and 5 we do
get the slow lea on some architectures. So I've only kept the
shift-and-add multipliers in v2. I also changed the order of iteration
of the parameters, for speed. Before, it took over 30 seconds to build
the unicode quick check tables, now it takes under 2 seconds.

--
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v2-tweak-perfect-hash-gen.patch
Description: Binary data


Re: [PATCH] Check operator when creating unique index on partition table

2020-04-01 Thread Tom Lane
Guancheng Luo  writes:
> On Mar 26, 2020, at 01:00, Tom Lane  wrote:
>> This would reject, for example, a hash index associated with a btree-based
>> partition constraint, but I'm not sure we're losing anything much thereby.

> There are cases where a BTREE index is associated with a HASH partition key, but I
> think we should allow them,
> as long as their equality operators consider the same values as equal.

Ah, yeah, I see we already have regression test cases that require that.

I pushed the patch with some cosmetic revisions.  I left out the
regression test case though; it seemed pretty expensive considering
that the code is already being exercised by existing cases.

Thanks for the report and patch!

regards, tom lane




Re: BUG #16109: Postgres planning time is high across version (Expose buffer usage during planning in EXPLAIN)

2020-04-01 Thread Julien Rouhaud
On Wed, Apr 1, 2020 at 7:51 PM Fujii Masao  wrote:
>
>
> On 2020/03/31 10:31, Justin Pryzby wrote:
> > On Wed, Jan 29, 2020 at 12:15:59PM +0100, Julien Rouhaud wrote:
> >> Rebase due to conflict with 3ec20c7091e97.
> >
> > This is failing to apply probably since 
> > 4a539a25ebfc48329fd656a95f3c1eb2cda38af3.
> > Could you rebase?   (Also, not sure if this can be set as RFC?)
>
> I updated the patch. Attached.

Thanks a lot!  I'm sorry I missed Justin's ping, and I just
realized that my cron job that used to warn me about cfbot failures was
broken :(

> +/* Compute the difference between two BufferUsage */
> +BufferUsage
> +ComputeBufferCounters(BufferUsage *start, BufferUsage *stop)
>
> Since BufferUsageAccumDiff() was exported, ComputeBufferCounters() is
> no longer necessary. In the patched version, BufferUsageAccumDiff() is
> used to calculate the difference of buffer usage.

Indeed, exposing BufferUsageAccumDiff was definitely a good thing!

> +   if (es->summary && (planduration || es->buffers))
> +   ExplainOpenGroup("Planning", "Planning", true, es);
>
> Isn't it more appropriate to check "bufusage" instead of "es->buffers" here?
> The patch changes the code so that "bufusage" is checked.

AFAICS not unless ExplainOneQuery is also changed to pass a NULL
pointer if the BUFFERS option wasn't provided (and maybe also
optionally skip the planning buffer computation).  With this version
you now get:

=# explain (analyze, buffers off) update t1 set id = id;
  QUERY PLAN
---
 Update on t1  (cost=0.00..22.70 rows=1270 width=42) (actual
time=0.170..0.170 rows=0 loops=1)
   ->  Seq Scan on t1  (cost=0.00..22.70 rows=1270 width=42) (actual
time=0.050..0.054 rows=1 loops=1)
 Planning Time: 1.461 ms
   Buffers: shared hit=25
 Execution Time: 1.071 ms
(5 rows)

which seems wrong to me.

I reused es->buffers to avoid needing something like:

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b1f3fe13c6..9dbff97a32 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -375,7 +375,9 @@ ExplainOneQuery(Query *query, int cursorOptions,
BufferUsage bufusage_start,
bufusage;

-   bufusage_start = pgBufferUsage;
+   if (es->buffers)
+   bufusage_start = pgBufferUsage;
+
INSTR_TIME_SET_CURRENT(planstart);

/* plan the query */
@@ -384,13 +386,16 @@ ExplainOneQuery(Query *query, int cursorOptions,
INSTR_TIME_SET_CURRENT(planduration);
INSTR_TIME_SUBTRACT(planduration, planstart);

-   /* calc differences of buffer counters. */
-   memset(&bufusage, 0, sizeof(BufferUsage));
-   BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+   if (es->buffers)
+   {
+   /* calc differences of buffer counters. */
+   memset(&bufusage, 0, sizeof(BufferUsage));
+   BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+   }

/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
-  &planduration, &bufusage);
+  &planduration, (es->buffers ? &bufusage : NULL));
}

which seemed like a win, but I'm not opposed to doing that if you prefer.

>
> +   "Planning Time": N.N,+
> + "Shared Hit Blocks": N,+
> + "Shared Read Blocks": N,   +
> + "Shared Dirtied Blocks": N,+
>
> Doesn't this indent look strange? IMO no indent for buffer usage is
> necessary when the format is json, xml, or yaml. This looks
> better at least to me. OTOH, in text format, it seems better to indent
> the buffer usage for more readability. Thoughts?
> The patch changes the code so that "es->indent" is
> incremented/decremented only when the format is text.

Indeed, that's way better!




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 13:27:56 -0400, Robert Haas wrote:
> Perhaps "irresponsible" is the wrong word, but it's certainly caused
> problems for multiple EnterpriseDB customers, and in my view, those
> problems weren't necessary. Either a WARNING or an ERROR would have
> shown up in the log, but an ERROR terminates VACUUM for that table and
> thus basically causes autovacuum to be completely broken. That is a
> really big problem. Perhaps you will want to argue, as Andres did,
> that the value of having ERROR rather than WARNING in the log
> justifies that outcome, but I sure don't agree.

If that had been a really viable option, I would have done so. At the
very least in the back branches, but quite possibly also in master. Or
if somebody had brought them up as an issue at the time.

What is heap_prepare_freeze_tuple/FreezeMultiXactId supposed to do after
issuing a WARNING in these cases? Without the ERROR, e.g.,
if (!TransactionIdDidCommit(xid))
	ereport(ERROR,
			(errcode(ERRCODE_DATA_CORRUPTED),
			 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
							 xid, cutoff_xid)));
would make a deleted tuple visible.


if (TransactionIdPrecedes(xid, relfrozenxid))
	ereport(ERROR,
			(errcode(ERRCODE_DATA_CORRUPTED),
			 errmsg_internal("found xmin %u from before relfrozenxid %u",
							 xid, relfrozenxid)));
would go on to replace the xmin of a potentially uncommitted tuple with
relfrozenxid, making it appear visible.


if (TransactionIdPrecedes(xid, relfrozenxid))
	ereport(ERROR,
			(errcode(ERRCODE_DATA_CORRUPTED),
			 errmsg_internal("found xmax %u from before relfrozenxid %u",
							 xid, relfrozenxid)));
would replace the xmax indicating a potentially deleted tuple with ?, either
making the tuple become, potentially wrongly, visible/invisible

or
else if (MultiXactIdPrecedes(multi, relminmxid))
	ereport(ERROR,
			(errcode(ERRCODE_DATA_CORRUPTED),
			 errmsg_internal("found multixact %u from before relminmxid %u",
							 multi, relminmxid)));
or ...
or ...


Just continuing is easier said than done. Especially with the background
of knowing that several users had hit the bug that allowed all of the
above to be hit, and that advancing relfrozenxid further would make it
worse.

Greetings,

Andres Freund




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 11:04:43 -0700, Peter Geoghegan wrote:
> On Wed, Apr 1, 2020 at 10:28 AM Robert Haas  wrote:
> > Is there any chance that you're planning to look into the details?
> > That would certainly be welcome from my perspective.

+1

This definitely needs more eyes. I am not even close to understanding
the code fully.


> I had a few other things that I was going to work on this week, but
> those seems less urgent. I'll take a look into it, and report back
> what I find.

Thanks you!

I attached a slightly evolved version of my debugging patch.


Greetings,

Andres Freund
diff --git i/src/backend/utils/time/snapmgr.c w/src/backend/utils/time/snapmgr.c
index 1c063c592ce..6a722863bcf 100644
--- i/src/backend/utils/time/snapmgr.c
+++ w/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@ typedef struct OldSnapshotControlData
 	int			head_offset;	/* subscript of oldest tracked time */
 	TimestampTz head_timestamp; /* time corresponding to head xid */
 	int			count_used;		/* how many slots are in use */
+	TimestampTz starttime; // rounded to minute
 	TransactionId xid_by_minute[FLEXIBLE_ARRAY_MEMBER];
 } OldSnapshotControlData;
 
@@ -290,6 +291,8 @@ SnapMgrInit(void)
 		oldSnapshotControl->head_offset = 0;
 		oldSnapshotControl->head_timestamp = 0;
 		oldSnapshotControl->count_used = 0;
+		oldSnapshotControl->starttime =
+			AlignTimestampToMinuteBoundary(GetCurrentTimestamp());
 	}
 }
 
@@ -1762,6 +1765,39 @@ SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit)
 	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 }
 
+static void PrintOldSnapshotMapping(const char *head, bool already_locked)
+{
+	StringInfoData s;
+
+	initStringInfo(&s);
+
+	if (!already_locked)
+		LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
+
+
+	appendStringInfo(&s, "old snapshot mapping at \"%s\" with head ts: %lu, current entries: %d max entries: %d, offset: %d\n",
+	 head,
+	 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_MINUTE,
+	 oldSnapshotControl->count_used,
+	 OLD_SNAPSHOT_TIME_MAP_ENTRIES,
+	 oldSnapshotControl->head_offset);
+
+	for (int off = 0; off < oldSnapshotControl->count_used; off++)
+	{
+		int ringoff = (off + oldSnapshotControl->head_offset) % OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+
		appendStringInfo(&s, "  entry %d (ring %d): min %ld: xid %d\n",
+		 off, ringoff,
+		 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_MINUTE + off,
+		 oldSnapshotControl->xid_by_minute[ringoff]);
+	}
+
+	if (!already_locked)
+		LWLockRelease(OldSnapshotTimeMapLock);
+
+	elog(WARNING, "%s", s.data);
+}
+
 /*
  * TransactionIdLimitedForOldSnapshots
  *
@@ -1826,9 +1862,15 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 		{
 			if (ts == update_ts)
 			{
+PrintOldSnapshotMapping("non cached limit via update_ts", false);
+
 xlimit = latest_xmin;
 if (NormalTransactionIdFollows(xlimit, recentXmin))
+{
+	elog(LOG, "increasing threshold from %u to %u (via update_ts)",
+		 recentXmin, xlimit);
 	SetOldSnapshotThresholdTimestamp(ts, xlimit);
+}
 			}
 			else
 			{
@@ -1839,6 +1881,8 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 {
 	int			offset;
 
+	PrintOldSnapshotMapping("non cached limit via bins", true);
+
 	offset = ((ts - oldSnapshotControl->head_timestamp)
 			  / USECS_PER_MINUTE);
 	if (offset > oldSnapshotControl->count_used - 1)
@@ -1848,7 +1892,15 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 	xlimit = oldSnapshotControl->xid_by_minute[offset];
 
 	if (NormalTransactionIdFollows(xlimit, recentXmin))
+	{
+		elog(LOG, "increasing threshold from %u to %u (via bins)",
+			 recentXmin, xlimit);
 		SetOldSnapshotThresholdTimestamp(ts, xlimit);
+	}
+}
+else
+{
+	// currently debugging output here is pretty darn verbose
 }
 
 LWLockRelease(OldSnapshotTimeMapLock);
@@ -1869,7 +1921,11 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 			xlimit = latest_xmin;
 
 		if (NormalTransactionIdFollows(xlimit, recentXmin))
+		{
+			elog(LOG, "increasing prune threshold from %u to %u",
+ recentXmin, xlimit);
 			return xlimit;
+		}
 	}
 
 	return recentXmin;
@@ -1923,14 +1979,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 	 */
 	if (whenTaken < 0)
 	{
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with negative whenTaken = %ld",
 			 (long) whenTaken);
 		return;
 	}
 	if (!TransactionIdIsNormal(xmin))
 	{
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with xmin = %lu",
 			 (unsigned long) xmin);
 		return;
@@ -1938,6 +1994,8 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 
 	LWLockAcquire(OldSnapshotTimeMapLock, LW_EXCLUSIVE);
 
+	PrintOldSnapshotMapping("before update", true);
+
 	Assert(oldSnapshotControl->head_offset >= 0);
 	

Re: proposal \gcsv

2020-04-01 Thread Alvaro Herrera
On 2020-Apr-01, Pavel Stehule wrote:

> It can work, but it is not user friendly - my proposal was motivated by
> using some quick csv exports to gplot's pipe.

Can we fix that by adding some syntax to allow command aliases?
So you could add to your .psqlrc something like

\alias \gcsv \pset push all \; \cbuf; \; \pset pop

where the \cbuf is a hypothetical "function" that expands to the current
query buffer.  This needs some refining I guess, but it'd allow you to
create your own shortcuts for the most common features you want without
excessive typing effort.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Peter Geoghegan
On Wed, Apr 1, 2020 at 10:28 AM Robert Haas  wrote:
> Sure, but not all levels of risk are equal. Jumping out of a plane
> carries some risk of death whether or not you have a parachute, but
> that does not mean that we shouldn't worry about whether you have one
> or not before you jump.
>
> In this case, I think it is pretty clear that hard-disabling the
> feature by always setting old_snapshot_threshold to -1 carries less
> risk of breaking unrelated things than removing code that caters to
> the feature all over the code base. Perhaps it is not quite as
> dramatic as my parachute example, but I think it is pretty clear all
> the same that one is a lot more likely to introduce new bugs than the
> other. A carefully targeted modification of a few lines of code in 1
> file just about has to carry less risk than ~1k lines of code spread
> across 40 or so files.

Yeah, that's certainly true. But is that fine point really what
anybody disagrees about? I didn't think that Andres was focussed on
literally ripping it out rather than just disabling it.

> Is there any chance that you're planning to look into the details?
> That would certainly be welcome from my perspective.

I had a few other things that I was going to work on this week, but
those seems less urgent. I'll take a look into it, and report back
what I find.

-- 
Peter Geoghegan




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 12:02:18 -0400, Robert Haas wrote:
> On Wed, Apr 1, 2020 at 11:09 AM Andres Freund  wrote:
> > There's really no reason at all to have bins of one minute. As it's a
> > PGC_POSTMASTER GUC, it should just have divided time into bins of
> > (old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
> > of a week there's no need to keep 10k bins, and the minimum threshold of
> > 1 minute obviously is problematic.
>
> I am very doubtful that this approach would have been adequate. It
> would mean that, with old_snapshot_threshold set to a week, the
> threshold for declaring a snapshot "old" would jump forward 1.68 hours
> at a time. It's hard for me to make a coherent argument right now as
> to exactly what problems that would create, but it's not very
> granular, and a lot of bloat-related things really benefit from more
> granularity. I also don't really see what the problem with keeping a
> bucket per minute in memory is, even for a week. It's only 60 * 24 * 7
> = ~10k buckets, isn't it? That's not really insane for an in-memory
> data structure. I agree that the code that does that maintenance being
> buggy is a problem to whatever extent that is the case, but writing
> the code to have fewer buckets wouldn't necessarily have made it any
> less buggy.

My issue isn't really that it's too many buckets right now, but that it
doesn't scale down to smaller thresholds. I think to be able to develop
this reasonably, it'd need to be able to support thresholds in the
sub-second range. And I don't see how you can have the same binning for
such small thresholds, and for multi-day thresholds - we'd quickly go to
millions of buckets for longer thresholds.

I really think we'd need to support millisecond resolution to make this
properly testable.


> > GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
> > basis to detect conflicts seems dangerous to me. Isn't that ignoring
> > inserts that are already in progress?

> How so?

Because it returns the end of the reserved WAL, not how far we've
actually inserted. I.e. there can be in-progress, but not finished,
modifications that will have an LSN < GetXLogInsertRecPtr(). But the
whenTaken timestamp could reflect one that should throw an error for
these in-progress modifications (since the transaction limiting happens
before the WAL logging).

I am not 100%, but I suspect that that could lead to errors not being
thrown that should, because TestForOldSnapshot() will not see these
in-progress modifications as conflicting.

Hm, also, shouldn't
&& PageGetLSN(page) > (snapshot)->lsn)
in TestForOldSnapshot() be an >=?


> > It currently silently causes wrong query results. There's no
> > infrastructure to detect / protect against that (*).
>
> Sure, and what if you break more stuff ripping it out? Ripping this
> volume of code out in a supposedly-stable branch is totally insane
> almost no matter how broken the feature is.

For the backbranches I was just thinking of forcing the GUC to be off
(either by disallowing it to be set to on, or just warning when its set
to true, but not propagating the value).


> I have no objection to the idea that *if* the feature is hopelessly
> broken, it should be removed.

I would be a lot less inclined to go that way if old_snapshot_threshold

a) weren't explicitly about removing still-needed data - in contrast to
  a lot of other features, where the effects of bugs are temporary, here
  it can be much larger.
b) were a previously working feature, but as far as I can tell, it never really did
c) had tests that verify that my fixes actually do the right thing. As
  it stands, I'd not just have to fix the bugs, I'd also have to develop
  a test framework that can test this

While I wish I had been more forceful, and reviewed more of the code to
point out more of the quality issues, I did argue hard against the
feature going in. On account of it being architecturally bad and
impactful. Which I think it has proven to be several times over by
now. And now I'm kind of on the hook to fix it, it seems?


> I also think, and we've had this disagreement before, that you're far
> too willing to say "well, that's wrong so we need to hit it with a
> nuke." I complained when you added those error checks to vacuum in
> back-branches, and since that release went out people are regularly
> tripping those checks and taking prolonged outages for a problem that
> wasn't making them unhappy before. I know that in theory those people
> are better off because their database was always corrupted and now
> they know. But for some of them, those prolonged outages are worse
> than the problem they had before.

I think this is somewhat revisionist. Sure, the errors were added
after like the 10th data corruption bug around freezing that we didn't
find for a long time, because of the lack of errors being thrown. But
the error checks weren't primarily added to find further bugs, but to
prevent data loss due 

Re: BUG #16109: Postgres planning time is high across version (Expose buffer usage during planning in EXPLAIN)

2020-04-01 Thread Fujii Masao



On 2020/03/31 10:31, Justin Pryzby wrote:

On Wed, Jan 29, 2020 at 12:15:59PM +0100, Julien Rouhaud wrote:

Rebase due to conflict with 3ec20c7091e97.


This is failing to apply probably since 
4a539a25ebfc48329fd656a95f3c1eb2cda38af3.
Could you rebase?   (Also, not sure if this can be set as RFC?)


I updated the patch. Attached.

+/* Compute the difference between two BufferUsage */
+BufferUsage
+ComputeBufferCounters(BufferUsage *start, BufferUsage *stop)

Since BufferUsageAccumDiff() was exported, ComputeBufferCounters() is
no longer necessary. In the patched version, BufferUsageAccumDiff() is
used to calculate the difference of buffer usage.

+   if (es->summary && (planduration || es->buffers))
+   ExplainOpenGroup("Planning", "Planning", true, es);

Isn't it more appropriate to check "bufusage" instead of "es->buffers" here?
The patch changes the code so that "bufusage" is checked.

+   "Planning Time": N.N,+
+ "Shared Hit Blocks": N,+
+ "Shared Read Blocks": N,   +
+ "Shared Dirtied Blocks": N,+

Doesn't this indent look strange? IMO no indent for buffer usage is
necessary when the format is json, xml, or yaml. This looks
better at least to me. OTOH, in text format, it seems better to indent
the buffer usage for more readability. Thoughts?
The patch changes the code so that "es->indent" is
incremented/decremented only when the format is text.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ee0e638f33..b1f3fe13c6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -372,7 +372,10 @@ ExplainOneQuery(Query *query, int cursorOptions,
PlannedStmt *plan;
instr_time  planstart,
planduration;
+   BufferUsage bufusage_start,
+   bufusage;
 
+   bufusage_start = pgBufferUsage;
INSTR_TIME_SET_CURRENT(planstart);
 
/* plan the query */
@@ -381,9 +384,13 @@ ExplainOneQuery(Query *query, int cursorOptions,
INSTR_TIME_SET_CURRENT(planduration);
INSTR_TIME_SUBTRACT(planduration, planstart);
 
+   /* calc differences of buffer counters. */
+   memset(&bufusage, 0, sizeof(BufferUsage));
+   BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
-  &planduration);
+  &planduration, &bufusage);
}
 }
 
@@ -476,7 +483,8 @@ ExplainOneUtility(Node *utilityStmt, IntoClause *into, ExplainState *es,
 void
 ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
   const char *queryString, ParamListInfo params,
-  QueryEnvironment *queryEnv, const instr_time *planduration)
+  QueryEnvironment *queryEnv, const instr_time *planduration,
+  const BufferUsage *bufusage)
 {
DestReceiver *dest;
QueryDesc  *queryDesc;
@@ -560,6 +568,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
 
+   if (es->summary && (planduration || bufusage))
+   ExplainOpenGroup("Planning", "Planning", true, es);
+
if (es->summary && planduration)
{
double  plantime = INSTR_TIME_GET_DOUBLE(*planduration);
@@ -567,6 +578,19 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
ExplainPropertyFloat("Planning Time", "ms", 1000.0 * plantime, 
3, es);
}
 
+   /* Show buffer usage */
+   if (es->summary && bufusage)
+   {
+   if (es->format == EXPLAIN_FORMAT_TEXT)
+   es->indent++;
+   show_buffer_usage(es, bufusage);
+   if (es->format == EXPLAIN_FORMAT_TEXT)
+   es->indent--;
+   }
+
+   if (es->summary && (planduration || bufusage))
+   ExplainCloseGroup("Planning", "Planning", true, es);
+
/* Print info about runtime of triggers */
if (es->analyze)
ExplainPrintTriggers(es, queryDesc);
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 284a5bfbec..d54568e7b2 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -616,7 +616,10 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
EState *estate = NULL;
instr_time  planstart;
instr_time  planduration;
+   BufferUsage bufusage_start,
+ 

Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Robert Haas
On Wed, Apr 1, 2020 at 1:03 PM Peter Geoghegan  wrote:
> I don't think that it's fair to characterize Andres' actions in that
> situation as in any way irresponsible. We had an extremely complicated
> data corruption bug that he went to great lengths to fix, following
> two other incorrect fixes. He was jet lagged from travelling to India
> at the time. He went to huge lengths to make sure that the bug was
> correctly squashed.

I don't mean it as a personal attack on Andres, and I know and am glad
that he worked hard on the problem, but I don't agree that it was the
right decision. Perhaps "irresponsible" is the wrong word, but it's
certainly caused problems for multiple EnterpriseDB customers, and in
my view, those problems weren't necessary. Either a WARNING or an
ERROR would have shown up in the log, but an ERROR terminates VACUUM
for that table and thus basically causes autovacuum to be completely
broken. That is a really big problem. Perhaps you will want to argue,
as Andres did, that the value of having ERROR rather than WARNING in
the log justifies that outcome, but I sure don't agree.

> > Actually removing the code is unnecessary, protects
> > nobody, and has risk.
>
> Every possible approach has risk. We are deciding among several
> unpleasant and risky alternatives here, no?

Sure, but not all levels of risk are equal. Jumping out of a plane
carries some risk of death whether or not you have a parachute, but
that does not mean that we shouldn't worry about whether you have one
or not before you jump.

In this case, I think it is pretty clear that hard-disabling the
feature by always setting old_snapshot_threshold to -1 carries less
risk of breaking unrelated things than removing code that caters to
the feature all over the code base. Perhaps it is not quite as
dramatic as my parachute example, but I think it is pretty clear all
the same that one is a lot more likely to introduce new bugs than the
other. A carefully targeted modification of a few lines of code in 1
file just about has to carry less risk than ~1k lines of code spread
across 40 or so files.

However, I still think that without some more analysis, it's not clear
whether we should go this direction at all. Andres's results suggest
that there are some bugs here, but I think we need more senior hackers
to study the situation before we make a decision about what to do
about them. I certainly haven't had enough time to even fully
understand the problems yet, and nobody else has posted on that topic
at all. I have the highest respect for Andres and his technical
ability, and if he says this stuff has problems, I'm sure it does. Yet
I'm not willing to conclude that because he's tired and frustrated
with this stuff right now, it's unsalvageable. For the benefit of the
whole community, such a claim deserves scrutiny from multiple people.

Is there any chance that you're planning to look into the details?
That would certainly be welcome from my perspective.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-01 Thread Ashutosh Bapat
On Thu, 26 Mar 2020 at 00:35, Tomas Vondra 
wrote:

> Hi,
>
> I've started reviewing the patch a couple days ago. I haven't done any
> extensive testing, but I do have a bunch of initial comments that I can
> share now.
>
> 1) I wonder if this needs to update src/backend/optimizer/README, which
> does have a section about partitionwise joins. It seems formulated in a
> way that that probably covers even this more advanced algorithm, but
> maybe it should mention handling of default partitions etc.?
>

Done. Please check the wording. It might need some wordsmithing.


>
> There certainly needs to be some description of the algorithm somewhere,
> either in a README or before a suitable function. It doesn't have to be
> particularly detailed, a rough outline of the matching would be enough,
> so that readers don't have to rebuild the knowledge from pieces
> scattered around various comments.
>

The algorithm for list and range partitioned tables is slightly different,
so I have added a separate prologue to each of list_merge_bounds() and
range_merge_bounds(). Please check whether that serves the purpose.


>
> 2) Do we need another GUC enabling this more complex algorithm? In PG11
> the partitionwise join is disabled by default, on the grounds that it's
> expensive and not worth it by default. How much more expensive is this?
> Maybe it makes sense to allow enabling only the "simple" approach?
>

We have reduced the complexity of merging bounds quite a bit, so this
shouldn't be costly. Furthermore, we handle the usual case of equal bounds
quickly using the merge flag, so most cases should be fine. It's only
when two partitioned tables with the same partition scheme are joined but do
not have mergeable bounds that this algorithm would not yield a useful
result - but that should be rare in the field. enable_partitionwise_join =
false should take care of such scenarios easily. I am not in favour of
adding another GUC which we set to false by default and then take another
few releases to make it true by default.


>
> 3) This comment in try_partitionwise_join is now incorrect, because the
> condition may be true even for partitioned tables with (nparts == 0).
>
>  /* Nothing to do, if the join relation is not partitioned. */
>  if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
>  return;
>
>
If part_scheme == NULL, nparts should be 0; I fixed that in
build_joinrel_partition_info(). If part_scheme != NULL but the bounds cannot
be merged, nparts = 0, so this condition is correct. I encapsulated it in a
macro, IS_JOINREL_NOT_PARTITIONED(), and added comments for the macro. Given
that the macro is used in exactly one place, it may not be necessary to
define it, but it looks *nice* (see the sketch below).
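
For reference, the macro just wraps the condition quoted above
(reconstructed here; the exact definition in the patch may differ):

	#define IS_JOINREL_NOT_PARTITIONED(joinrel) \
		((joinrel)->part_scheme == NULL || (joinrel)->nparts == 0)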


> Moreover, the condition used to be
>
>  if (!IS_PARTITIONED_REL(joinrel))
>  return;
>
> which is way more readable. I think it's net negative to replace these
> "nice" macros with clear meaning with complex conditions. If needed, we
> can invent new macros. There are many other places where the patch
> replaces macros with less readable conditions.
>
>
The only other place where we have replaced a *nice* macro is in
build_joinrel_partition_info(). But I think it's a legit replacement. I
have added a comment there.


>
> 4) I'm a bit puzzled how we could get here with non-partitioned rels?
>
>  /*
>   * We can not perform partitionwise join if either of the joining
> relations
>   * is not partitioned.
>   */
>  if (!IS_PARTITIONED_REL(rel1) || !IS_PARTITIONED_REL(rel2))
>  return;
>

See the comment I have added in build_joinrel_partition_info(). Not all
joining pairs for a given relation are partitioned.


>
> 5) I find the "merged" flag in RelOptInfo rather unclear, because it
> does not clearly indicate what was merged. Maybe something like
> partbounds_merged would be better?
>

Done.


>
> 6) The try_partitionwise_join function is getting a bit too long and
> harder to understand. The whole block in
>
>  if (joinrel->nparts == -1)
>  {
>  ...
>  }
>
> seems rather well isolated, so I propose to move it to a separate
> function responsible only for the merging. We can simply call it on the
> joinrel, and make it return right away if (joinrel->nparts == -1).
>

Looks like you have already taken care of this one in one of your patches.


>
> 7) I'd suggest not to reference exact functions in comments unless
> abolutely necessary, because it's harder to maintain and it does not
> really explain purpose of the struct/code. E.g. consider this:
>
>  /* Per-partitioned-relation data for
> merge_list_bounds()/merge_range_bounds() */
>  typedef struct PartitionMap
>  { ... }
>
> Why does it matter where is the struct used? That's pretty trivial to
> find using 'git grep' or something. Instead the comment should explain
> the purpose of the struct.
>

Adjusted names and comments a bit.

-- 
Best Wishes,
Ashutosh


Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Peter Geoghegan
On Wed, Apr 1, 2020 at 9:02 AM Robert Haas  wrote:
> I complained
> when you added those error checks to vacuum in back-branches, and
> since that release went out people are regularly tripping those checks
> and taking prolonged outages for a problem that wasn't making them
> unhappy before. I know that in theory those people are better off
> because their database was always corrupted and now they know. But for
> some of them, those prolonged outages are worse than the problem they
> had before. I believe it was irresponsible to decide on behalf of our
> entire user base that they were better off with such a behavior change
> in a supposedly-stable branch, and I believe the same thing here.

I agreed with that decision, FWIW. Though I don't deny that there is
some merit in what you say. This is the kind of high level
philosophical question where large differences of opinion are quite
normal.

I don't think that it's fair to characterize Andres' actions in that
situation as in any way irresponsible. We had an extremely complicated
data corruption bug that he went to great lengths to fix, following
two other incorrect fixes. He was jet lagged from travelling to India
at the time. He went to huge lengths to make sure that the bug was
correctly squashed.

> Actually removing the code is unnecessary, protects
> nobody, and has risk.

Every possible approach has risk. We are deciding among several
unpleasant and risky alternatives here, no?

-- 
Peter Geoghegan




Re: proposal \gcsv

2020-04-01 Thread Tom Lane
Pavel Stehule  writes:
> It can work, but it is not user friendly - my proposal was motivated by
> using some quick csv exports to gplot's pipe.

I kind of liked the stack idea, myself.  It's simpler than what I was
suggesting and it covers probably 90% of the use-case.

However, if we prefer something closer to Plan A ... I took a look at
the psql lexer, and the only difference between OT_FILEPIPE and OT_NORMAL
parsing is if the argument starts with '|'.  So we could make it work
I think.  I'd modify my first proposal so far as to make it

\g ( pset-option pset-value ... ) filename-or-pipe

That is, require spaces around the parens, and require a value for each
pset-option (no fair using the shortcuts like "\pset expanded").  Then
it's easy to separate the option names and values from the paren markers.
The \g parser would consume its first argument in OT_FILEPIPE mode, and
then if it sees '(' it would consume arguments in OT_NORMAL mode until
it's found the ')'.

This way also narrows the backwards-compatibility problem from "fails if
your filename starts with '('" to "fails if your filename is exactly '('",
which seems acceptably improbable to me.
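
To illustrate, under that (not yet implemented) syntax a csv export
would look like:

	SELECT * FROM tab
	\g ( format csv csv_fieldsep '\t' ) /tmp/tab.csv

and the same trick would work for a pipe:

	SELECT * FROM tab
	\g ( format csv ) |some-command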

regards, tom lane




Re: wraparound dangers in snapshot too old

2020-04-01 Thread Andres Freund
Hi,

On 2020-03-31 21:53:04 -0700, Andres Freund wrote:
> I am trying to change the snapshot too old infrastructure so it
> cooperates with my snapshot scalability patch. While trying to
> understand the code sufficiently, I think I found a fairly serious
> issue:

I accidentally sent this email, I was intending to instead only send
https://www.postgresql.org/message-id/20200401064008.qob7bfnnbu4w5cw4%40alap3.anarazel.de

This happened because of some default keybinding changes between
mutt / emacs / magit, leading to emacs sending a buffer as an email,
which I didn't intend.

Sorry for that,

Andres Freund




Re: A bug when use get_bit() function for a long bytea string

2020-04-01 Thread Tom Lane
"movead...@highgo.ca"  writes:
> [ long_bytea_string_bug_fix_ver5.patch ]

I don't think this has really solved the overflow hazards.  For example,
in binary_encode we've got

resultlen = enc->encode_len(VARDATA_ANY(data), datalen);
result = palloc(VARHDRSZ + resultlen);

and all you've done about that is changed resultlen from int to int64.
On a 64-bit machine, sure, palloc will be able to detect if the
result exceeds what can be allocated --- but on a 32-bit machine
it'd be possible for the size_t argument to overflow altogether.
(Or if you want to argue that it can't overflow because no encoder
expands the data more than 4X, then we don't need to be making this
change at all.)

I don't think there's really any way to do that safely without an
explicit check before we call palloc.

I also still find the proposed signatures for the encoding-specific
functions to be just plain weird:

-   unsigned(*encode_len) (const char *data, unsigned dlen);
-   unsigned(*decode_len) (const char *data, unsigned dlen);
-   unsigned(*encode) (const char *data, unsigned dlen, char *res);
-   unsigned(*decode) (const char *data, unsigned dlen, char *res);
+   int64   (*encode_len) (const char *data, unsigned dlen);
+   int64   (*decode_len) (const char *data, unsigned dlen);
+   int64   (*encode) (const char *data, unsigned dlen, char *res);
+   int64   (*decode) (const char *data, unsigned dlen, char *res);

Why did you change the outputs from unsigned to signed?  Why didn't
you change the dlen inputs?  I grant that we know that the input
can't exceed 1GB in Postgres' usage, but these signatures are just
randomly inconsistent, and you didn't even add a comment explaining
why.  Personally I think I'd make them like

uint64  (*encode_len) (const char *data, size_t dlen);

which makes it explicit that the dlen argument describes the length
of a chunk of allocated memory, while the result might exceed that.
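
For illustration, the explicit pre-palloc check could look roughly like
this (a sketch only; MaxAllocSize is the existing ~1GB palloc limit from
memutils.h):

	uint64		resultlen;
	bytea	   *result;

	resultlen = enc->encode_len(VARDATA_ANY(data), datalen);

	if (resultlen > (uint64) (MaxAllocSize - VARHDRSZ))
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("encoded result would be too large")));

	result = palloc(VARHDRSZ + (Size) resultlen);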

Lastly, there is a component of this that can be back-patched and
a component that can't --- we do not change system catalog contents
in released branches.  Cramming both parts into the same patch
is forcing the committer to pull them apart, which is kind of
unfriendly.

regards, tom lane




Re: proposal \gcsv

2020-04-01 Thread Pavel Stehule
On Wed, Apr 1, 2020 at 17:52, Daniel Verite  wrote:

> Tom Lane wrote:
>
> >  I could see having a command to copy the current primary formatting
> > parameters to the alternate area, too.
>
> We could have a stack to store parameters before temporary
> changes, for instance if you want to do one csv export and
> come back to normal without assuming what "normal"
> values were.
>
> \pset push format csv_fieldsep
> \pset format csv
> \pset csv_fieldsep '\t'
> some command \g somefile
> \pset pop
>
> So \pset pop would reset the pushed parameters
> to their values when pushed, which also could be all
> parameters:
>
> \pset push all
> \pset param1 something
> \pset param2 something-else
> ...other commands...
> \pset pop
>
> or
>
> \pset push all
> \i somescript.sql
> \pset pop
>
>
It can work, but it is not user-friendly - my proposal was motivated by
doing some quick csv exports to gnuplot's pipe.

Regards

Pavel

>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


Re: pgbench - add pseudo-random permutation function

2020-04-01 Thread David Steele

Hi Fabien,

On 2/1/20 5:12 AM, Fabien COELHO wrote:


Attached is an attempt at improving things. I have added an explicit note
and hijacked an existing example to better illustrate the purpose of the
function.


This patch does not build on Linux due to some unused functions and 
variables: http://cfbot.cputube.org/patch_27_1712.log


The CF entry has been updated to Waiting on Author.

A few committers have expressed doubts about whether this patch is 
needed and it doesn't make sense to keep moving it from CF to CF.


I'm planning to mark this patch RwF on April 8 and I suggest you 
resubmit if you are able to get some consensus.


Regards,
--
-David
da...@pgmasters.net




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Robert Haas
On Wed, Apr 1, 2020 at 11:09 AM Andres Freund  wrote:
> That doesn't exist in all the back branches. Think it'd be easier to add
> code to explicitly prune it during MaintainOldSnapshotTimeMapping().

That's reasonable.

> There's really no reason at all to have bins of one minute. As it's a
> PGC_POSTMASTER GUC, it should just have divided time into bins of
> (old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
> of a week there's no need to keep 10k bins, and the minimum threshold of
> 1 minute obviously is problematic.

I am very doubtful that this approach would have been adequate. It
would mean that, with old_snapshot_threshold set to a week, the
threshold for declaring a snapshot "old" would jump forward about 1.7
hours (168 hours split into 100 bins) at a time. It's hard for me to
make a coherent argument right now as
to exactly what problems that would create, but it's not very
granular, and a lot of bloat-related things really benefit from more
granularity. I also don't really see what the problem with keeping a
bucket per minute in memory is, even for a week. It's only 60 * 24 * 7
= ~10k buckets, isn't it? That's not really insane for an in-memory
data structure. I agree that the code that does that maintenance being
buggy is a problem to whatever extent that is the case, but writing
the code to have fewer buckets wouldn't necessarily have made it any
less buggy.

> They probably are fixable. But there's a lot more, I think:
>
> Looking at TransactionIdLimitedForOldSnapshots() I think the ts ==
> update_ts threshold actually needs to be ts >= update_ts, right now we
> don't handle being newer than the newest bin correctly afaict (mitigated
> by autovacuum=on with naptime=1s doing a snapshot more often). It's hard
> to say, because there's no comments.

That test and the following one for "if (ts == update_ts)" both make
me nervous too.  If only two of <, >, and = are expected, there should
be an Assert() to that effect, at least. If all three values are
expected then we need an explanation of why we're only checking for
equality.

> The whole lock nesting is very hazardous. Most (all?)
> TestForOldSnapshot() calls happen with locks on buffers held, and can
> acquire lwlocks itself. In some older branches we do entire *catalog
> searches* with the buffer lwlock held (for RelationHasUnloggedIndex()).

The catalog searches are clearly super-bad, but I'm not sure that the
other ones have a deadlock risk or anything. They might, but I think
we'd need some evidence of that.

> GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
> basis to detect conflicts seems dangerous to me. Isn't that ignoring
> inserts that are already in progress?

How so?

> It currently silently causes wrong query results. There's no
> infrastructure to detect / protect against that (*).

Sure, and what if you break more stuff ripping it out? Ripping this
volume of code out in a supposedly-stable branch is totally insane
almost no matter how broken the feature is. I also think, and we've
had this disagreement before, that you're far too willing to say
"well, that's wrong so we need to hit it with a nuke." I complained
when you added those error checks to vacuum in back-branches, and
since that release went out people are regularly tripping those checks
and taking prolonged outages for a problem that wasn't making them
unhappy before. I know that in theory those people are better off
because their database was always corrupted and now they know. But for
some of them, those prolonged outages are worse than the problem they
had before. I believe it was irresponsible to decide on behalf of our
entire user base that they were better off with such a behavior change
in a supposedly-stable branch, and I believe the same thing here.

I have no objection to the idea that *if* the feature is hopelessly
broken, it should be removed. But I don't have confidence at this
point that you've established that, and I think ripping out thousands
of lines of codes in the back-branches is terrible. Even
hard-disabling the feature in the back-branches without actually
removing the code is an awfully strong reaction, but it could be
justified if we find out that things are actually super-bad and not
really fixable. Actually removing the code is unnecessary, protects
nobody, and has risk.

> Do we actually have any evidence of this feature ever beeing used? I
> didn't find much evidence for that in the archives (except Thomas
> finding a problem). Given that it currently will switch between not
> preventing bloat and causing wrong query results, without that being
> noticed...

I believe that at least one EnterpriseDB customer used it, and
possibly more than one. I am not sure how extensively, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: proposal \gcsv

2020-04-01 Thread Isaac Morland
On Wed, 1 Apr 2020 at 11:52, Daniel Verite  wrote:

> Tom Lane wrote:
>
> >  I could see having a command to copy the current primary formatting
> > parameters to the alternate area, too.
>
> We could have a stack to store parameters before temporary
> changes, for instance if you want to do one csv export and
> come back to normal without assuming what "normal"
> values were.
>

I think it might be a good idea to decide whether psql is to be a
programming environment, or just a command shell.

If it is to be a programming environment, we should either adopt an
existing language or strike a committee of programming language experts to
design a new one.

If it is to be a command shell, new features should be evaluated in part on
whether they move psql significantly closer to being a programming language
and rejected if they do.


Re: proposal \gcsv

2020-04-01 Thread Daniel Verite
Tom Lane wrote:

>  I could see having a command to copy the current primary formatting
> parameters to the alternate area, too.

We could have a stack to store parameters before temporary
changes, for instance if you want to do one csv export and
come back to normal without assuming what "normal"
values were.

\pset push format csv_fieldsep
\pset format csv
\pset csv_fieldsep '\t'
some command \g somefile
\pset pop

So \pset pop would reset the pushed parameters
to their values when pushed, which also could be all
parameters:

\pset push all
\pset param1 something
\pset param2 something-else
...other commands...
\pset pop

or

\pset push all
\i somescript.sql
\pset pop
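
Implementation-wise this could be fairly small; here is a hedged sketch
over psql's printQueryOpt (the helper names are invented for
illustration, and a real version would need to deep-copy the string
members rather than shallow-copy):

    #include "common/fe_memutils.h"
    #include "fe_utils/print.h"

    typedef struct PsetStackEntry
    {
        printQueryOpt saved;            /* copy of the formatting options */
        struct PsetStackEntry *next;
    } PsetStackEntry;

    static PsetStackEntry *pset_stack = NULL;

    static void
    pset_push(const printQueryOpt *cur)
    {
        PsetStackEntry *e = pg_malloc(sizeof(PsetStackEntry));

        e->saved = *cur;                /* shallow copy; strings need pg_strdup */
        e->next = pset_stack;
        pset_stack = e;
    }

    static bool
    pset_pop(printQueryOpt *cur)
    {
        PsetStackEntry *e = pset_stack;

        if (e == NULL)
            return false;               /* stack empty: nothing to restore */
        *cur = e->saved;
        pset_stack = e->next;
        free(e);
        return true;
    }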


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: BufFileRead() error signalling

2020-04-01 Thread David Steele

Hi Thomas,

On 11/29/19 9:46 PM, Thomas Munro wrote:


Ok.  Here is a first attempt at that.


It's been a few CFs since this patch received an update, though there 
has been plenty of discussion.


Perhaps it would be best to mark it RwF until you have a chance to 
produce an updated patch?


Regards,
--
-David
da...@pgmasters.net




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 11:15:14 -0400, Robert Haas wrote:
> On Wed, Apr 1, 2020 at 2:40 AM Andres Freund  wrote:
> > I added some debug output to print the mapping before/after changes by
> > MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
> > to the server start in minutes/seconds to make it easier to interpret).
> >
> > And the output turns out to be something like:
> >
> > WARNING:  old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
> >   entry 0 (ring 0): min 7: xid 582921233
> >   entry 1 (ring 1): min 8: xid 654154155
> >   entry 2 (ring 2): min 9: xid 661972949
> >   entry 3 (ring 3): min 10: xid 666899382
> >   entry 4 (ring 4): min 11: xid 644169619
> >   entry 5 (ring 5): min 12: xid 644169619
> >   entry 6 (ring 6): min 13: xid 644169619
> >   entry 7 (ring 7): min 14: xid 644169619
> >
> > WARNING:  head 420 s: updating existing bucket 4 for sec 660 with xmin 666899382
> >
> > WARNING:  old snapshot mapping at "after update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
> >   entry 0 (ring 0): min 7: xid 582921233
> >   entry 1 (ring 1): min 8: xid 654154155
> >   entry 2 (ring 2): min 9: xid 661972949
> >   entry 3 (ring 3): min 10: xid 666899382
> >   entry 4 (ring 4): min 11: xid 666899382
> >   entry 5 (ring 5): min 12: xid 644169619
> >   entry 6 (ring 6): min 13: xid 644169619
> >   entry 7 (ring 7): min 14: xid 644169619
> >
> > It's pretty obvious that the xids don't make a ton of sense, I think:
> > They're not monotonically ordered. The same values exist multiple times,
> > despite xids being constantly used. Also, despite the ringbuffer
> > supposedly having 15 entries (that's snapshot_too_old = 5min + the 10 we
> > always add), and the workload having run for 14min, we only have 8
> > entries.
> 
> The function header comment for MaintainOldSnapshotTimeMapping could
> hardly be more vague, as it's little more than a restatement of the
> function name. However, it looks to me like the idea is that this
> function might get called multiple times for the same or similar
> values of whenTaken. I suppose that's the point of this code:

Right. We enforce whenTaken to be monotonic
(cf. GetSnapshotCurrentTimestamp()), but since
GetSnapshotCurrentTimestamp() reduces the granularity of the timestamp
to one minute (the AlignTimestampToMinuteBoundary() call), it's
obviously possible to end up in the same bin as a previous snapshot.


> What I interpret this to be doing is saying - if we got a new call to
> this function with a rounded-to-the-minute timestamp that we've seen
> previously and for which we still have an entry, and if the XID passed
> to this function is newer than the one passed by the previous call,
> then advance the xid_by_minute[] bucket to the newer value. Now that
> begs the question - what does this XID actually represent? The
> comments don't seem to answer that question, not even the comments for
> OldSnapshotControlData, which say that we should "Keep one xid per
> minute for old snapshot error handling." but don't say which XIDs we
> should keep or how they'll be used. However, the only call to
> MaintainOldSnapshotTimeMapping() is in GetSnapshotData(). It appears
> that we call this function each time a new snapshot is taken and pass
> the current time (modulo some fiddling) and snapshot xmin. Given that,
> one would expect that any updates to the map would be tight races,
> i.e. a bunch of processes that all took their snapshots right around
> the same time would all update the same map entry in quick succession,
> with the newest value winning.

Right.


> And that make the debugging output which I quoted from your message
> above really confusing. At this point, the "head timestamp" is 7
> minutes after this facility started up. The first we entry we have is
> for minute 7, and the last is for minute 14. But the one we're
> updating is for minute 11. How the heck can that happen?

If I understand your reference correctly, I think that is because,
due to the bug, the "need a new bucket" branch doesn't just extend by
one bucket, it extends it by many in common cases. Basically filling
buckets "into the future".

the advance = ... variable in the branch will not always be 1, even when
we continually call Maintain*.  Here's some debug output showing that
(slightly modified from the patch I previously sent):

WARNING:  old snapshot mapping at "before update" with head ts: 1, current entries: 2 max entries: 15, offset: 0
  entry 0 (ring 0): min 1: xid 1089371384
  entry 1 (ring 1): min 2: xid 1099553206

WARNING:  head 1 min: filling 2 buckets starting at 0 for whenTaken 3 min, with xmin 1109840204

WARNING:  old snapshot mapping at "after update" with head ts: 3, current entries: 4 max entries: 15, offset: 0
  entry 0 (ring 0): min 3: xid 1089371384
  entry 1 (ring 1): min 4: xid 1099553206
  entry 2 (ring 2): min 5: xid 1109840204
  entry 3 (ring 3): min 6: xid 

Re: Removing unneeded self joins

2020-04-01 Thread David Steele



On 1/27/20 11:10 PM, Andrey Lepikhov wrote:

Rebased version v.22.
- Added enable_self_join_removal GUC (true is default)
- The joinquals of the relation that is being removed are redistributed
in accordance with the remove_rel_from_query() machinery.


This patch no longer applies cleanly on 
src/test/regress/sql/equivclass.sql: 
http://cfbot.cputube.org/patch_27_1712.log


The CF entry has been updated to Waiting on Author.

Regards,
--
-David
da...@pgmasters.net




Re: Online checksums patch - once again

2020-04-01 Thread David Steele

On 1/18/20 6:18 PM, Daniel Gustafsson wrote:


Attached is a v16 rebased on top of current master which addresses the above
commented points, and which I am basing the concurrency work on.


This patch no longer applies cleanly: 
http://cfbot.cputube.org/patch_27_2260.log


The CF entry has been updated to Waiting on Author.

Regards,

--
-David
da...@pgmasters.net




Re: proposal \gcsv

2020-04-01 Thread Tom Lane
Vik Fearing  writes:
> On 4/1/20 1:53 AM, Tom Lane wrote:
>> Consider some syntax along the lines of
>> \gpset (pset-option-name [pset-option-value]) ... filename

> If parens are going to be required, why don't we just add them to \g?
> TABLE blah \g (format csv) filename

Yeah, if we're willing to assume that nobody uses filenames beginning
with '(', we could just extend \g's syntax rather than adding a new
command.

After sleeping on it, though, I'm liking my Plan B idea better than
Plan A.  Plan B is very clearly implementable without needing surgery
on the backslash-command parser (I didn't look at the lexer to see
what paren-handling would involve, but it might be painful).  And it
doesn't put any new limits on what pset parameters can look like;
Plan A would likely result in some problems if anybody wants to use
parens in future pset options.

I think that maybe the best terminology for Plan B would be to say
that there's an "alternate" formatting parameter set, which is
manipulated by \apset and then used by \ga.

Another thought, bearing in mind the dictum that the only good numbers
in computer science are 0, 1, and N, is to introduce a concept of named
formatting parameter sets, which you'd manipulate with say
\npset set-name [param-name [param-value]]
and use with
\gn set-name filename-or-command
A likely usage pattern for that would be to set up a few favorite
formats in your ~/.psqlrc, and then they'd be available to just use
immediately in \gn.  (In this universe, \gn should not destroy or
reset the parameter set it uses.)

This is way beyond what anyone has asked for, so I'm not seriously
proposing that we do it right now, but it might be something to keep
in mind as a possible future direction.  The main thing that it calls
into question is whether we really want \ga to reset the alternate
parameter values after use.  Maybe it'd be better not to --- I can
think of about-equally-likely usage patterns where you would want
that or not.  We could invent an explicit "\apset reset" command
instead of auto-reset.  I could see having a command to copy the
current primary formatting parameters to the alternate area, too.

There's an argument that this is all way too complicated, of course,
and maybe it is.  But considering that we've already had two requests
for things you can't do with \gfmt as it stands, I think the patch
is too simple as it is.

regards, tom lane




Re: Verify true root on replicas with amcheck

2020-04-01 Thread David Steele

On 1/9/20 3:55 AM, godjan • wrote:

Hi, we have trouble detecting true root corruption on replicas. I made a patch
that resolves it by locking the meta page and the potential root page. I heard
that amcheck has an invariant of locking no more than one page at a time to
avoid deadlocks. Is a deadlock possible here?


This patch no longer applies cleanly: 
http://cfbot.cputube.org/patch_27_2418.log


The CF entry has been updated to Waiting on Author.

Also, it would be a good idea to answer Peter's questions down-thread if 
you are interested in moving this patch forward.


Regards,
--
-David
da...@pgmasters.net




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Robert Haas
On Wed, Apr 1, 2020 at 2:40 AM Andres Freund  wrote:
> I added some debug output to print the mapping before/after changes by
> MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
> to the server start in minutes/seconds to make it easier to interpret).
>
> And the output turns out to be something like:
>
> WARNING:  old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
>   entry 0 (ring 0): min 7: xid 582921233
>   entry 1 (ring 1): min 8: xid 654154155
>   entry 2 (ring 2): min 9: xid 661972949
>   entry 3 (ring 3): min 10: xid 666899382
>   entry 4 (ring 4): min 11: xid 644169619
>   entry 5 (ring 5): min 12: xid 644169619
>   entry 6 (ring 6): min 13: xid 644169619
>   entry 7 (ring 7): min 14: xid 644169619
>
> WARNING:  head 420 s: updating existing bucket 4 for sec 660 with xmin 666899382
>
> WARNING:  old snapshot mapping at "after update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
>   entry 0 (ring 0): min 7: xid 582921233
>   entry 1 (ring 1): min 8: xid 654154155
>   entry 2 (ring 2): min 9: xid 661972949
>   entry 3 (ring 3): min 10: xid 666899382
>   entry 4 (ring 4): min 11: xid 666899382
>   entry 5 (ring 5): min 12: xid 644169619
>   entry 6 (ring 6): min 13: xid 644169619
>   entry 7 (ring 7): min 14: xid 644169619
>
> It's pretty obvious that the xids don't make a ton of sense, I think:
> They're not monotonically ordered. The same values exist multiple times,
> despite xids being constantly used. Also, despite the ringbuffer
> supposedly having 15 entries (that's snapshot_too_old = 5min + the 10 we
> always add), and the workload having run for 14min, we only have 8
> entries.

The function header comment for MaintainOldSnapshotTimeMapping could
hardly be more vague, as it's little more than a restatement of the
function name. However, it looks to me like the idea is that this
function might get called multiple times for the same or similar
values of whenTaken. I suppose that's the point of this code:

    else if (ts <= (oldSnapshotControl->head_timestamp +
                    ((oldSnapshotControl->count_used - 1)
                     * USECS_PER_MINUTE)))
    {
        /* existing mapping; advance xid if possible */
        int         bucket = (oldSnapshotControl->head_offset
                              + ((ts - oldSnapshotControl->head_timestamp)
                                 / USECS_PER_MINUTE))
                    % OLD_SNAPSHOT_TIME_MAP_ENTRIES;

        if (TransactionIdPrecedes(oldSnapshotControl->xid_by_minute[bucket],
                                  xmin))
            oldSnapshotControl->xid_by_minute[bucket] = xmin;
    }

What I interpret this to be doing is saying - if we got a new call to
this function with a rounded-to-the-minute timestamp that we've seen
previously and for which we still have an entry, and if the XID passed
to this function is newer than the one passed by the previous call,
then advance the xid_by_minute[] bucket to the newer value. Now that
begs the question - what does this XID actually represent? The
comments don't seem to answer that question, not even the comments for
OldSnapshotControlData, which say that we should "Keep one xid per
minute for old snapshot error handling." but don't say which XIDs we
should keep or how they'll be used. However, the only call to
MaintainOldSnapshotTimeMapping() is in GetSnapshotData(). It appears
that we call this function each time a new snapshot is taken and pass
the current time (modulo some fiddling) and snapshot xmin. Given that,
one would expect that any updates to the map would be tight races,
i.e. a bunch of processes that all took their snapshots right around
the same time would all update the same map entry in quick succession,
with the newest value winning.

And that makes the debugging output which I quoted from your message
above really confusing. At this point, the "head timestamp" is 7
minutes after this facility started up. The first entry we have is
for minute 7, and the last is for minute 14. But the one we're
updating is for minute 11. How the heck can that happen? I might
suspect that you'd stopped a process inside GetSnapshotData() with a
debugger, but that can't explain it either, because GetSnapshotData()
gets the xmin first and only afterwards gets the timestamp - so if
you'd stopped it for ~3 minutes just before the call to
MaintainOldSnapshotTimeMapping(), it would've been updating the map
with an *old* XID. In reality, though, it changed the XID from
644169619 to 666899382, advancing over 22 million XIDs. I don't
understand what's going on there. How is this function getting called
with a 4-minute old value of whenTaken?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 10:01:07 -0400, Robert Haas wrote:
> On Wed, Apr 1, 2020 at 2:40 AM Andres Freund  wrote:
> > The problem is that there's no protection against the xids in the
> > ringbuffer getting old enough to wrap around. Given that practical uses
> > of old_snapshot_threshold are likely to be several hours to several
> > days, that's not particularly hard to hit.
> 
> Presumably this could be fixed by changing it to use FullTransactionId.

That doesn't exist in all the back branches. Think it'd be easier to add
code to explicitly prune it during MaintainOldSnapshotTimeMapping().


> > The problem, as far as I can tell, is that
> > oldSnapshotControl->head_timestamp appears to be intended to be the
> > oldest value in the ring. But we update it unconditionally in the "need
> > a new bucket, but it might not be the very next one" branch of
> > MaintainOldSnapshotTimeMapping().
> 
> I agree, that doesn't look right. It's correct, I think, for the "if
> (advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)" case, but not in the
> "else" case. In the "else" case, it should advance by 1 (wrapping if
> needed) each time we take the "if (oldSnapshotControl->count_used ==
> OLD_SNAPSHOT_TIME_MAP_ENTRIES)" branch, and should remain unchanged in
> the "else" branch for that if statement.

Yea.


> > As far as I can tell, this code has been wrong since the feature has
> > been committed.  The tests don't show a problem, because none of this
> > code is reached when old_snapshot_threshold = 0 (which has no real world
> > use, it's purely for testing).
> 
> I'm pretty sure I complained about the fact that only the
> old_snapshot_threshold = 0 case was tested at the time this went in,
> but I don't think Kevin was too convinced that we needed to do
> anything else, and realistically, if he'd tried for a regression test
> that ran for 15 minutes, Tom would've gone ballistic.

I think it's not just Tom that'd have gone ballistic. I think it's the
reason why, as I think is pretty clear, the feature was *never* actually
tested. The results of whats being removed are not quite random, but
it's not far from it. And there's long stretches of time where it never
removes things.

It's also a completely self-made problem.

There's really no reason at all to have bins of one minute. As it's a
PGC_POSTMASTER GUC, it should just have divided time into bins of
(old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
of a week there's no need to keep 10k bins, and the minimum threshold of
1 minute obviously is problematic.
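
Something like this hedged sketch (N_BINS and the function name are
made up here; old_snapshot_threshold is stored in minutes, hence
USECS_PER_MINUTE):

    /* ~100 bins regardless of threshold; width fixed at postmaster start */
    #define N_BINS 100

    static int64
    old_snapshot_bin_width_usecs(int threshold_minutes)
    {
        return ((int64) threshold_minutes * USECS_PER_MINUTE) / N_BINS;
    }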


> > I really don't know what to do here. The feature never worked and will
> > silently cause wrong query results. Fixing it seems like a pretty large
> > task - there's a lot more bugs. But ripping out a feature in stable
> > branches is pretty bad too.
> 
> I don't know what other bugs there are, but the two you mention above
> look fixable.

They probably are fixable. But there's a lot more, I think:

Looking at TransactionIdLimitedForOldSnapshots() I think the ts ==
update_ts threshold actually needs to be ts >= update_ts, right now we
don't handle being newer than the newest bin correctly afaict (mitigated
by autovacuum=on with naptime=1s doing a snapshot more often). It's hard
to say, because there's no comments.

The whole lock nesting is very hazardous. Most (all?)
TestForOldSnapshot() calls happen with locks on buffers held, and can
acquire lwlocks itself. In some older branches we do entire *catalog
searches* with the buffer lwlock held (for RelationHasUnloggedIndex()).

GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
basis to detect conflicts seems dangerous to me. Isn't that ignoring
inserts that are already in progress?


> Even if we decide that the feature can't be salvaged, I would vote
> against ripping it out in back branches. I would instead argue for
> telling people not to use it and ripping it out in master.

It currently silently causes wrong query results. There's no
infrastructure to detect / protect against that (*).

I'm sure we can fix individual instances of problems. But I don't know
how one is supposed to verify that the fixes actually work. There's
currently no tests for the actual feature. And manual tests are painful
due to the multi-minute thresholds needed, and it's really hard to
manually verify that only the right rows are removed due to the feature,
and that all necessary errors are thrown. Given e.g. the bugs in my
email upthread, there's periods of several minutes where we'd not see
any row removed and then periods where the wrong ones would be removed,
so the manual tests would have to be repeated numerous times to actually
ensure anything.

If somebody wants to step up to the plate and fix these, it'd perhaps be
more realistic to say that we'll keep the feature. But even if somebody
does, I think it'd require a lot of development in the back branches. On
a feature whose purpose is to eat data that is still required.

I think even if we 

Re: control max length of parameter values logged

2020-04-01 Thread Tom Lane
Justin Pryzby  writes:
> On Wed, Apr 01, 2020 at 10:10:55AM +0100, Alexey Bashtanov wrote:
>>> Could you make zero a normal value and -1 the "special" value to disable
>>> trimming ?

>> I can, but then for the sake of consistency I'll have to do the same for
>> log_parameter_max_length.
>> Then we'll end up with two ways to enable/disable parameter logging on
>> error:
>> using the primary boolean setting and setting length to 0.
>> One of them will require superuser privileges, the other one won't.

> I guess you're referring to log_parameters_on_error.
> Does it have to be SUSET ?
> Or maybe log_parameters_on_error doesn't need to exist at all, and setting
> log_parameter_max_length=0 can be used to disable parameter logging.
> I showed up late to this thread, so let's see what others think.

I think Justin's got a point: defining zero this way is weirdly
inconsistent.  -1, being clearly outside the domain of possible
length limits, makes more sense as a marker for "don't trim".

Alexey's right that having a separate boolean flag is pointless, but
I think we could just drop the boolean; we haven't shipped that yet.
The privilege argument seems irrelevant to me.  We already decided
that the plan is (a) SUSET for non-error statement logging purposes and
(b) USERSET for logging caused by errors, and that would have to apply
to length limits as well as enable/disable ability.  Otherwise a user
could pretty effectively disable logging by setting the length to 1.
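
Concretely, dropping the boolean would leave a single guc.c entry
shaped roughly like this (an illustrative sketch, not the committed
table entry; -1 disables trimming, 0 disables the logging entirely):

    {
        {"log_parameter_max_length_on_error", PGC_USERSET, LOGGING_WHAT,
            gettext_noop("When logging an error, limit logged parameter values "
                         "to first N bytes."),
            NULL,
            GUC_UNIT_BYTE
        },
        &log_parameter_max_length_on_error,
        0, -1, INT_MAX / 2,
        NULL, NULL, NULL
    },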

regards, tom lane




Re: Less-silly selectivity for JSONB matching operators

2020-04-01 Thread Tom Lane
Alexey Bashtanov  writes:
> On 31/03/2020 18:53, Tom Lane wrote:
>> Renamed "matchsel" to "matchingsel" etc, added DEFAULT_MATCHING_SEL,
>> rebased over commit 911e70207.  Since that commit already created
>> new versions of the relevant contrib modules, I think we can just
>> redefine what those versions contain, rather than making yet-newer
>> versions.  (Of course, that assumes we're going to include this in
>> v13.)

> Looks good to me.

Pushed, thanks for reviewing!

regards, tom lane




Re: WAL usage calculation patch

2020-04-01 Thread Julien Rouhaud
Hi,

I'm replying here to all reviews that have been sent, thanks a lot!

On Wed, Apr 01, 2020 at 04:29:16PM +0530, Amit Kapila wrote:
> On Wed, Apr 1, 2020 at 1:32 PM Julien Rouhaud  wrote:
> >
> > So here's a v9, rebased on top of the latest versions of Sawada-san's bug fixes
> > (Amit's v6 for vacuum and Sawada-san's v2 for create index), with all
> > previously mentioned changes.
> >
> 
> Few other comments:
> v9-0003-Add-infrastructure-to-track-WAL-usage
> 1.
>  static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
> -
> +static void WalUsageAdd(WalUsage *dst, WalUsage *add);
> 
> Looks like a spurious line removal


Fixed.


> 2.
> + /* Report a full page imsage constructed for the WAL record */
> + *num_fpw += 1;
> 
> Typo. /imsage/image


Ah sorry, I thought I had fixed it previously; fixed now.


> 3. Doing some testing with and without parallelism to ensure WAL usage
> data is correct would be great and if possible, share the results?


I just saw that Dilip did some testing, but just in case here is some
additional testing

- vacuum, after a truncate, loading 1M rows and a "UPDATE t1 SET id = id"

=# select query, calls, wal_bytes, wal_records, wal_num_fpw from pg_stat_statements where query ilike '%vacuum%';
         query          | calls | wal_bytes | wal_records | wal_num_fpw
------------------------+-------+-----------+-------------+-------------
 vacuum (parallel 3) t1 |     1 |  20098962 |       34104 |           2
 vacuum (parallel 0) t1 |     1 |  20098962 |       34104 |           2
(2 rows)

- create index, overriding t1's parallel_workers, using the 1M rows just
  vacuumed:

=# alter table t1 set (parallel_workers = 2);
ALTER TABLE

=# create index t1_parallel_2 on t1(id);
CREATE INDEX

=# alter table t1 set (parallel_workers = 0);
ALTER TABLE

=# create index t1_parallel_0 on t1(id);
CREATE INDEX

=# select query, calls, wal_bytes, wal_records, wal_num_fpw from pg_stat_statements where query ilike '%create index%';
                query                 | calls | wal_bytes | wal_records | wal_num_fpw
--------------------------------------+-------+-----------+-------------+-------------
 create index t1_parallel_0 on t1(id) |     1 |  20355540 |        2762 |        2745
 create index t1_parallel_2 on t1(id) |     1 |  20406811 |        2762 |        2758
(2 rows)

It all looks good to me.


> v9-0005-Keep-track-of-WAL-usage-in-pg_stat_statements
> 4.
> +-- SELECT usage data, check WAL usage is reported, wal_records equal rows count for INSERT/UPDATE/DELETE
> +SELECT query, calls, rows,
> +wal_bytes > 0 as wal_bytes_generated,
> +wal_records > 0 as wal_records_generated,
> +wal_records = rows as wal_records_as_rows
> +FROM pg_stat_statements ORDER BY query COLLATE "C";
> +                              query                               | calls | rows | wal_bytes_generated | wal_records_generated | wal_records_as_rows
> +------------------------------------------------------------------+-------+------+---------------------+-----------------------+---------------------
> + DELETE FROM pgss_test WHERE a > $1                               |     1 |    1 | t                   | t                     | t
> + DROP TABLE pgss_test                                             |     1 |    0 | t                   | t                     | f
> + INSERT INTO pgss_test (a, b) VALUES ($1, $2), ($3, $4), ($5, $6) |     1 |    3 | t                   | t                     | t
> + INSERT INTO pgss_test VALUES(generate_series($1, $2), $3)        |     1 |   10 | t                   | t                     | t
> + SELECT * FROM pgss_test ORDER BY a                               |     1 |   12 | f                   | f                     | f
> + SELECT * FROM pgss_test WHERE a > $1 ORDER BY a                  |     2 |    4 | f                   | f                     | f
> + SELECT * FROM pgss_test WHERE a IN ($1, $2, $3, $4, $5)          |     1 |    8 | f                   | f                     | f
> + SELECT pg_stat_statements_reset()                                |     1 |    1 | f                   | f                     | f
> + SET pg_stat_statements.track_utility = FALSE                     |     1 |    0 | f                   | f                     | t
> + UPDATE pgss_test SET b = $1 WHERE a = $2                         |     6 |    6 | t                   | t                     | t
> + UPDATE pgss_test SET b = $1 WHERE a > $2                         |     1 |    3 | t                   | t                     | t
> +(11 rows)
> +
> 
> I am not sure if the above tests make much sense as they are just
> testing whether WAL is generated for these commands.  I understand it
> is not easy to make these tests reliable but in that case, we can
> think of some simple tests.  It seems to me that the difficulty is due
> to full_page_writes as that depends on the checkpoint.  Can we make
> full_page_writes = off for these tests and check some simple
> Insert/Update/Delete cases? 

Re: potential stuck lock in SaveSlotToPath()

2020-04-01 Thread Peter Eisentraut

On 2020-03-27 08:48, Michael Paquier wrote:

On Thu, Mar 26, 2020 at 02:16:05PM +0100, Peter Eisentraut wrote:

committed and backpatched


The patch committed does that in three places:
 /* rename to permanent file, fsync file and directory */
 if (rename(tmppath, path) != 0)
 {
+   LWLockRelease(&slot->io_in_progress_lock);
ereport(elevel,
 (errcode_for_file_access(),
  errmsg("could not rename file \"%s\" to \"%s\": %m",

But why do you assume that LWLockRelease() never changes errno?  It
seems to me that you should save errno before calling LWLockRelease(),
and then restore it back before using %m in the log message, no?  See
for example the case where trace_lwlocks is set.


Good catch.  How about the attached patch?

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 484d734e9a985fa3e9242644ca9173658b9efe2c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 1 Apr 2020 16:24:47 +0200
Subject: [PATCH] Save errno across LWLockRelease() calls

Fixup for "Drop slot's LWLock before returning from SaveSlotToPath()"

Reported-by: Michael Paquier 
---
 src/backend/replication/slot.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index d90c7235e9..47851ec4c1 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1259,9 +1259,13 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 		/*
 		 * If not an ERROR, then release the lock before returning.  In case
 		 * of an ERROR, the error recovery path automatically releases the
-		 * lock, but no harm in explicitly releasing even in that case.
+		 * lock, but no harm in explicitly releasing even in that case.  Note
+		 * that LWLockRelease() could affect errno.
 		 */
+		int			save_errno = errno;
+
 		LWLockRelease(&slot->io_in_progress_lock);
+		errno = save_errno;
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m",
@@ -1325,7 +1329,10 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 
 	if (CloseTransientFile(fd) != 0)
 	{
+		int			save_errno = errno;
+
 		LWLockRelease(&slot->io_in_progress_lock);
+		errno = save_errno;
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m",
@@ -1336,7 +1343,10 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	/* rename to permanent file, fsync file and directory */
 	if (rename(tmppath, path) != 0)
 	{
+		int			save_errno = errno;
+
 		LWLockRelease(&slot->io_in_progress_lock);
+		errno = save_errno;
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-- 
2.26.0



Re: Commitfest 2020-03 Now in Progress

2020-04-01 Thread David Steele

On 3/17/20 8:10 AM, David Steele wrote:

On 3/1/20 4:10 PM, David Steele wrote:

The last Commitfest for v13 is now in progress!

Current stats for the Commitfest are:

Needs review: 192
Waiting on Author: 19
Ready for Committer: 4
Total: 215


Halfway through, here's where we stand.  Note that a CF entry was added 
after the stats above were recorded so the total has increased.


Needs review: 151 (-41)
Waiting on Author: 20 (+1)
Ready for Committer: 9 (+5)
Committed: 25
Moved to next CF: 1
Withdrawn: 4
Returned with Feedback: 6
Total: 216


After regulation time here's where we stand:

Needs review: 111 (-40)
Waiting on Author: 26 (+6)
Ready for Committer: 11 (+2)
Committed: 51 (+11)
Moved to next CF: 2 (+1)
Returned with Feedback: 10 (+4)
Rejected: 1
Withdrawn: 5 (+1)
Total: 217

We have one more patch than before, so it appears that one that was in a
completed state (committed, etc.) at the beginning of the CF has been
moved back to an uncompleted state. Or perhaps my math is just bad.


The RMT has determined that the CF will be extended for one week so I'll 
hold off on moving and marking patches until April 8.


Regards,
--
-David
da...@pgmasters.net




Re: allow online change primary_conninfo

2020-04-01 Thread Peter Eisentraut

On 2020-03-28 11:49, Sergei Kornilov wrote:

I attached an updated patch: walreceiver will use the configured primary_slot_name
as the temporary slot name if wal_receiver_create_temp_slot is enabled.


The original setup worked consistently with pg_basebackup.  In 
pg_basebackup, if you specify -S/--slot, then it uses a permanent slot 
with the given name.  This is the same behavior as primary_slot_name has 
now.  We should be careful that we don't create two sets of 
options/settings that look similar but combine in different ways.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-03-31 23:40:08 -0700, Andres Freund wrote:
> I added some debug output to print the mapping before/after changes by
> MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
> to the server start in minutes/seconds to make it easier to interpret).

Now attached.

Greetings,

Andres Freund
diff --git i/src/backend/utils/time/snapmgr.c w/src/backend/utils/time/snapmgr.c
index 1c063c592ce..cbfc462ee26 100644
--- i/src/backend/utils/time/snapmgr.c
+++ w/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@ typedef struct OldSnapshotControlData
 	int			head_offset;	/* subscript of oldest tracked time */
 	TimestampTz head_timestamp; /* time corresponding to head xid */
 	int			count_used;		/* how many slots are in use */
+	TimestampTz starttime; // rounded to minute
 	TransactionId xid_by_minute[FLEXIBLE_ARRAY_MEMBER];
 } OldSnapshotControlData;
 
@@ -290,6 +291,8 @@ SnapMgrInit(void)
 		oldSnapshotControl->head_offset = 0;
 		oldSnapshotControl->head_timestamp = 0;
 		oldSnapshotControl->count_used = 0;
+		oldSnapshotControl->starttime =
+			AlignTimestampToMinuteBoundary(GetCurrentTimestamp());
 	}
 }
 
@@ -1762,6 +1765,39 @@ SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit)
 	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 }
 
+static void PrintOldSnapshotMapping(const char *head, bool already_locked)
+{
+	StringInfoData s;
+
+	initStringInfo(&s);
+
+	if (!already_locked)
+		LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
+
+
+	appendStringInfo(&s, "old snapshot mapping at \"%s\" with head ts: %lu, current entries: %d max entries: %d, offset: %d\n",
+	 head,
+	 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_MINUTE,
+	 oldSnapshotControl->count_used,
+	 OLD_SNAPSHOT_TIME_MAP_ENTRIES,
+	 oldSnapshotControl->head_offset);
+
+	for (int off = 0; off < oldSnapshotControl->count_used; off++)
+	{
+		int ringoff = (off + oldSnapshotControl->head_offset) % OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+
+		appendStringInfo(&s, "  entry %d (ring %d): min %ld: xid %d\n",
+		 off, ringoff,
+		 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_MINUTE + off,
+		 oldSnapshotControl->xid_by_minute[ringoff]);
+	}
+
+	if (!already_locked)
+		LWLockRelease(OldSnapshotTimeMapLock);
+
+	elog(WARNING, "%s", s.data);
+}
+
 /*
  * TransactionIdLimitedForOldSnapshots
  *
@@ -1824,6 +1860,7 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 
 		if (!same_ts_as_threshold)
 		{
+			PrintOldSnapshotMapping("non cached limit", false);
 			if (ts == update_ts)
 			{
 xlimit = latest_xmin;
@@ -1923,14 +1960,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 	 */
 	if (whenTaken < 0)
 	{
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with negative whenTaken = %ld",
 			 (long) whenTaken);
 		return;
 	}
 	if (!TransactionIdIsNormal(xmin))
 	{
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with xmin = %lu",
 			 (unsigned long) xmin);
 		return;
@@ -1938,6 +1975,8 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 
 	LWLockAcquire(OldSnapshotTimeMapLock, LW_EXCLUSIVE);
 
+	PrintOldSnapshotMapping("before update", true);
+
 	Assert(oldSnapshotControl->head_offset >= 0);
 	Assert(oldSnapshotControl->head_offset < OLD_SNAPSHOT_TIME_MAP_ENTRIES);
 	Assert((oldSnapshotControl->head_timestamp % USECS_PER_MINUTE) == 0);
@@ -1956,7 +1995,7 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 	{
 		/* old ts; log it at DEBUG */
 		LWLockRelease(OldSnapshotTimeMapLock);
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with old whenTaken = %ld",
 			 (long) whenTaken);
 		return;
@@ -1971,6 +2010,12 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  / USECS_PER_MINUTE))
 		% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
 
+		elog(WARNING, "head %ld s: updating existing bucket %d for sec %ld with xmin %u",
+			 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_SEC,
+			 bucket,
+			 (ts - oldSnapshotControl->starttime) / USECS_PER_SEC,
+			 xmin);
+
 		if (TransactionIdPrecedes(oldSnapshotControl->xid_by_minute[bucket], xmin))
 			oldSnapshotControl->xid_by_minute[bucket] = xmin;
 	}
@@ -1980,6 +2025,13 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 		int			advance = ((ts - oldSnapshotControl->head_timestamp)
 			   / USECS_PER_MINUTE);
 
+		elog(WARNING, "head %ld s: filling %d buckets starting at %d for sec %ld with xmin %u",
+			 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_SEC,
+			 advance,
+			 oldSnapshotControl->head_offset,
+			 (ts - oldSnapshotControl->starttime) / USECS_PER_SEC,
+			 xmin);
+
 		oldSnapshotControl->head_timestamp = ts;
 
 		if (advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)
@@ -2021,6 +2073,8 @@ 

Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Robert Haas
On Wed, Apr 1, 2020 at 2:40 AM Andres Freund  wrote:
> The problem is that there's no protection against the xids in the
> ringbuffer getting old enough to wrap around. Given that practical uses
> of old_snapshot_threshold are likely to be several hours to several
> days, that's not particularly hard to hit.

Presumably this could be fixed by changing it to use FullTransactionId.

> The problem, as far as I can tell, is that
> oldSnapshotControl->head_timestamp appears to be intended to be the
> oldest value in the ring. But we update it unconditionally in the "need
> a new bucket, but it might not be the very next one" branch of
> MaintainOldSnapshotTimeMapping().

I agree, that doesn't look right. It's correct, I think, for the "if
(advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)" case, but not in the
"else" case. In the "else" case, it should advance by 1 (wrapping if
needed) each time we take the "if (oldSnapshotControl->count_used ==
OLD_SNAPSHOT_TIME_MAP_ENTRIES)" branch, and should remain unchanged in
the "else" branch for that if statement.

> As far as I can tell, this code has been wrong since the feature has
> been committed.  The tests don't show a problem, because none of this
> code is reached when old_snapshot_threshold = 0 (which has no real world
> use, it's purely for testing).

I'm pretty sure I complained about the fact that only the
old_snapshot_threshold = 0 case was tested at the time this went in,
but I don't think Kevin was too convinced that we needed to do
anything else, and realistically, if he'd tried for a regression test
that ran for 15 minutes, Tom would've gone ballistic.

> I really don't know what to do here. The feature never worked and will
> silently cause wrong query results. Fixing it seems like a pretty large
> task - there's a lot more bugs. But ripping out a feature in stable
> branches is pretty bad too.

I don't know what other bugs there are, but the two you mention above
look fixable. Even if we decide that the feature can't be salvaged, I
would vote against ripping it out in back branches. I would instead
argue for telling people not to use it and ripping it out in master.
However, much as I'm not in love with all of the complexity this
feature adds, I don't see the problems you've reported here as serious
enough to justify ripping it out.

What exactly is the interaction of this patch with your snapshot
scalability work?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



