Re: Detecting File Damage & Inconsistencies

2020-11-11 Thread Simon Riggs
On Thu, 12 Nov 2020 at 06:42, tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Simon Riggs 
> > I would like to propose a few points that will help us detect file
> > damage, inconsistencies in files and track actions of users.
>
> Hello, Simon san.  Long time no see.  I'm happy to see you be back here 
> recently.

Thank you, happy to be back. It's good to have the time to contribute again.

> What kind of improvement do you expect?  What problems would this make 
> detectable?

If a rogue user/process is suspected, this would allow you to identify
more easily the changes made by specific sessions/users.

> > * 2-byte pid (from MyProcPid)
>
> pid is 4 bytes on Windows.  Isn't it also 4 byte on Linux when some kernel 
> parameter is set to a certain value?

4 bytes is no problem, thanks for pointing that out.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: proposal: possibility to read dumped table's name from file

2020-11-11 Thread Pavel Stehule
On Wed, Nov 11, 2020 at 4:17 PM, Alvaro Herrera wrote:

> On 2020-Nov-11, Pavel Stehule wrote:
>
> > I think the proposed feature is very far to be the config file for
> pg_dump
> > (it implements a option "--filter"). This is not the target. It is not
> > designed for this. This is just an alternative for options like -t, -T,
> ...
> > and I am sure so nobody will generate this file manually. Main target of
> > this patch is eliminating problems with the max length of the command
> line.
> > So it is really not designed to be the config file for pg_dump.
>
> I agree that a replacement for existing command line arguments is a good
> goal, but at the same time it's good to keep in mind the request that
> more object types are supported as dumpable.  While it's not necessary
> that this infrastructure supports all object types in the first cut,
> it'd be good to have it support that.  I would propose that instead of a
> single letter 't' etc we support keywords, maybe similar to those
> returned by getObjectTypeDescription() (with additions -- for example
> for "table data").  Then we can extend for other object types later
> without struggling to find good letters for each.
>
> Of course we could allow abbrevations for common cases, such that "t"
> means "table".
>
> For example: it'll be useful to support selective dumping of functions,
> materialized views, foreign objects, etc.
>

Implementation of this is trivial.

The hard work is mapping pg_dump options onto database objects. t -> table is
simple, but n -> schema looks a little inconsistent - although it is
consistent with pg_dump. For d or D there is no system object like "data". I am
afraid there are two independent systems - pg_dump options and database
objects - and it can be hard, or not very practical, to join them.
Unfortunately there is no good consistency in pg_dump's short options
today. Moreover, a lot of object type names are multiple words with inner
spaces, which is not very practical.

What about supporting two syntaxes?

1. the current short syntax, +-tndf filter - but the implementation should not
be limited to one character; any string up to a space can be accepted

2. a long syntax - all of these pg_dump options have long options, so we
can extend this feature without any problem in the future:

table|exclude-table|exclude-table-data|schema|exclude-schema|include-foreign-data=PATTERN

so the content of the filter file can look like:

+t mytable
+t tabprefix*
-t bigtable

table=mytable2
exclude-table=mytable2

This format allows quick work for most common database objects, and it is
extensible and consistent with pg_dump's long options.

What do you think about it?

Personally, I think that is a little bit of over-engineering; maybe we can
implement this feature by just testing the first string after the +- symbols
(instead of the first char like now) - and any enhanced syntax can be
implemented in the future when there is a requirement for it. The second
syntax can be implemented very simply, because it can be identified by
processing the first char. We could also implement only the second syntax.
That would work too, but I think the short syntax is more practical for daily
work (for common options). I expect 99% of these entries will be "+t tablename".
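
A minimal sketch of what that first-token parsing could look like (the function and its name are hypothetical, not part of the actual patch):

```
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sketch, not part of the actual patch: read the +/- prefix,
 * take the keyword as everything up to the first space (so both "t" and
 * "table" work), and treat the rest of the line as the object name pattern.
 */
static bool
parse_filter_line(const char *line, char *keyword, size_t keyword_size,
                  const char **pattern)
{
    bool        include = true;
    size_t      len = 0;

    if (*line == '+')
        line++;
    else if (*line == '-')
    {
        include = false;
        line++;
    }

    /* keyword is everything up to the first space */
    while (*line && *line != ' ' && len < keyword_size - 1)
        keyword[len++] = *line++;
    keyword[len] = '\0';

    /* skip separating spaces; the remainder is the pattern, e.g. "tabprefix*" */
    while (*line == ' ')
        line++;
    *pattern = line;

    return include;
}
```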


Regards

Pavel



Re: abstract Unix-domain sockets

2020-11-11 Thread Michael Paquier
On Wed, Nov 11, 2020 at 01:39:17PM +0100, Peter Eisentraut wrote:
> Thinking about it further, I think the hint in the Unix-domain socket case
> is bogus.  A socket in the file-system namespace never reports EADDRINUSE
> anyway, it just overwrites the file.  For sockets in the abstract namespace,
> you can get this error, but of course there is no file to remove.
> 
> Perhaps we should change the hint in both the Unix and the IP cases to:
> 
> "Is another postmaster already running at this address?"
> (This also resolves the confusing reference to "port" in the Unix case.)

Er, it is perfectly possible for two postmasters to use the same unix
socket path, abstract or not, as long as they listen to different
ports (all nodes in a single TAP test do that for example).  So we
should keep a reference to the port used in the log message, no?
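
For context, the Unix-domain socket file name conventionally embeds the port number, which is why postmasters on different ports do not collide even when they share a socket directory; a minimal, purely illustrative sketch of that naming:

```
#include <stdio.h>

/*
 * Sketch only: PostgreSQL's Unix-domain socket file name conventionally
 * embeds the port number (e.g. /tmp/.s.PGSQL.5432), so two postmasters can
 * share a socket directory as long as they listen on different ports.
 */
static void
format_socket_path(char *buf, size_t buflen, const char *sockdir, int port)
{
    snprintf(buf, buflen, "%s/.s.PGSQL.%d", sockdir, port);
}
```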

> Or we just drop the hint in the Unix case.  The primary error message is
> clear enough.

Dropping the hint for the abstract case sounds fine to me.
--
Michael


signature.asc
Description: PGP signature


Re: Clean up optional rules in grammar

2020-11-11 Thread Peter Eisentraut

On 2020-11-11 11:54, Heikki Linnakangas wrote:

On 11/11/2020 11:12, Peter Eisentraut wrote:

The attached patch cleans this up to make them all look like the first
style.


+1


done




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-11 Thread Peter Smith
The subscriber tests are updated to include test cases for "cascading"
pub/sub scenarios.

i.e. NODE_A publisher => subscriber NODE_B publisher => subscriber NODE_C

PSA only the modified v20-0006 patch (the other 5 patches remain unchanged)

Kind Regards,
Peter Smith.
Fujitsu Australia.


v20-0006-Support-2PC-txn-subscriber-tests.patch
Description: Binary data


Re: In-place persistance change of a relation

2020-11-11 Thread Kyotaro Horiguchi
At Wed, 11 Nov 2020 09:56:44 -0500, Stephen Frost  wrote in 
> Greetings,
> 
> * Kyotaro Horiguchi (horikyota@gmail.com) wrote:
> > At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost  
> > wrote in 
> > > * Kyotaro Horiguchi (horikyota@gmail.com) wrote:
> > > > For fuel(?) of the discussion, I tried a very-quick PoC for in-place
> > > > ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
> > > > trials of several ways, I drifted to the following way after poking
> > > > several ways.
> > > > 
> > > > 1. Flip BM_PERMANENT of active buffers
> > > > 2. adding/removing init fork
> > > > 3. sync files,
> > > > 4. Flip pg_class.relpersistence.
> > > > 
> > > > It always skips table copy in the SET UNLOGGED case, and only when
> > > > wal_level=minimal in the SET LOGGED case.  Crash recovery seems
> > > > working by some brief testing by hand.
> > > 
> > > Somehow missed that this patch more-or-less does what I was referring to
> > > down-thread, but I did want to mention that it looks like it's missing a
> > > necessary FlushRelationBuffers() call before the sync, otherwise there
> > > could be dirty buffers for the relation that's being set to LOGGED (with
> > > wal_level=minimal), which wouldn't be good.  See the comments above
> > > smgrimmedsync().
> > 
> > Right. Thanks.  However, since SetRelFileNodeBuffersPersistence()
> > called just above scans shared buffers so I don't want to just call
> > FlushRelationBuffers() separately.  Instead, I added buffer-flush to
> > SetRelFileNodeBuffersPersistence().
> 
> Maybe I'm missing something, but it sure looks like in the patch that
> SetRelFileNodeBuffersPersistence() is being called after the
> smgrimmedsync() call, and I don't think you get to just switch the order
> of those- the sync is telling the kernel to make sure it's written to
> disk, while the FlushBuffer() is just writing it into the kernel but
> doesn't provide any guarantee that the data has actually made it to
> disk.  We have to FlushBuffer() first, and then call smgrimmedsync().
> Perhaps there's a way to avoid having to go through shared buffers
> twice, and I generally agreed it'd be good if we could avoid doing so,
> but this approach doesn't look like it actually works.
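
In code terms, the ordering described above amounts to the following sketch (illustrative only, not the actual patch): FlushRelationBuffers() first, then smgrimmedsync().

```
#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/smgr.h"
#include "utils/rel.h"

/*
 * Sketch only, not the actual patch: FlushRelationBuffers() writes the
 * relation's dirty shared buffers into the kernel; smgrimmedsync() then
 * tells the kernel to force them to disk.
 */
static void
flush_then_sync(Relation rel)
{
    FlushRelationBuffers(rel);                  /* dirty buffers -> kernel */
    RelationOpenSmgr(rel);
    smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);  /* kernel -> stable storage */
}
```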

Yeah, sorry for the half-baked version. I was confused about the
ordering at the time.  The next version works like this:

LOGGED->UNLOGGED


for each relations:
   minimal>
  
  if it is index call ambuildempty() (which syncs the init fork)
  else WAL-log smgr_create then sync the init file.
  
...
commit time:
  
abort time:
  
  

UNLOGGED->LOGGED


for each relations:
   minimal>
  
  
  
...
commit time:
  
  
abort time:
  


> > FWIW this is a revised version of the PoC, which has some known
> > problems.
> > 
> > - Flipping of Buffer persistence is not WAL-logged nor even be able to
> >   be safely roll-backed. (It might be better to drop buffers).
> 
> Not sure if it'd be better to drop buffers or not, but figuring out how
> to deal with rollback seems pretty important.  How is the persistence
> change in the catalog not WAL-logged though..?

Rollback works as described above. The buffer persistence change is registered
in pending-deletes.  The persistence change in the catalog is rolled back in
the ordinary way (or automatically).

If wal_level > minimal, the persistence change of buffers is propagated to
standbys by WAL.  I'm not sure we need WAL-logging otherwise; however, the
next version emits WAL anyway, since SMGR_CREATE is always logged by the
existing code.

> > - This version handles indexes but not yet handle toast relatins.
> 
> Would need to be fixed, of course.

Fixed.

> > - tableAMs are supposed to support this feature. (but I'm not sure
> >   it's worth allowing them not to do so).
> 
> Seems like they should.

The init fork of an index relation needs a call to ambuildempty() instead of
"log_smgrcreate + smgrimmedsync" after smgrcreate. Instead of adding a
similar interface to indexAm, I reverted the tableam changes and made
RelationCreate/DropInitFork() do that directly.  That introduces a new
include of amapi.h into storage.c, which makes me a bit uneasy.

The previous version gave up the in-place persistence change in the
case where wal_level > minimal and SET LOGGED, since that needs WAL to
be emitted.  However, the in-place change still has the advantage of not
running a table copy, so the next version always runs the persistence
change in-place.

As suggested by Andres, I'll send a summary of this patch. The patch
will be attached to the coming mail.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-place persistance change of a relation

2020-11-11 Thread Kyotaro Horiguchi
At Wed, 11 Nov 2020 14:18:04 -0800, Andres Freund  wrote in 
> Hi,
> 
> I suggest outlining what you are trying to achieve here. Starting a new
> thread and expecting people to dig through another thread to infer what
> you are actually trying to achive isn't great.

Agreed. I'll post that. Thanks.

> FWIW, I'm *extremely* doubtful it's worth adding features that depend on
> a PGC_POSTMASTER wal_level=minimal being used. Which this does, a far as
> I understand.  If somebody added support for dynamically adapting
> wal_level (e.g. wal_level=auto, that increases wal_level to
> replica/logical depending on the presence of replication slots), it'd
> perhaps be different.

Yes, this depends on wal_level=minimal for switching from UNLOGGED to
LOGGED; that's similar to the wal_level=minimal optimization for
COPY/INSERT into tables created in the same transaction.  And it expands
that optimization to COPY/INSERT into existing tables, which seems worth doing.

Switching to LOGGED otherwise needs to emit the initial state to WAL... Hmm. I
have come to think that even in that case skipping the table copy reduces I/O
significantly, even though FPI WAL is emitted.


> On 2020-11-11 17:33:17 +0900, Kyotaro Horiguchi wrote:
> > FWIW this is a revised version of the PoC, which has some known
> > problems.
> > 
> > - Flipping of Buffer persistence is not WAL-logged nor even be able to
> >   be safely roll-backed. (It might be better to drop buffers).
> 
> That's obviously a no-go. I think you might be able to address this if
> you accept that the command cannot be run in a transaction (like
> CONCURRENTLY). Then you can first do the catalog changes, change the
> persistence level, and commit.

Of course.  The next version reverts persistence change at abort.

Thanks!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Detecting File Damage & Inconsistencies

2020-11-11 Thread tsunakawa.ta...@fujitsu.com
From: Simon Riggs 
> I would like to propose a few points that will help us detect file
> damage, inconsistencies in files and track actions of users.

Hello, Simon san.  Long time no see.  I'm happy to see you be back here 
recently.

What kind of improvement do you expect?  What problems would this make 
detectable?


> * 2-byte pid (from MyProcPid)

pid is 4 bytes on Windows.  Isn't it also 4 bytes on Linux when some kernel 
parameter is set to a certain value?


Regards
Takayuki Tsunakawa



Re: logical streaming of xacts via test_decoding is broken

2020-11-11 Thread Dilip Kumar
On Thu, 12 Nov 2020 at 11:37 AM, Amit Kapila 
wrote:

> On Thu, Nov 12, 2020 at 11:29 AM Dilip Kumar 
> wrote:
> >
> > On Thu, Nov 12, 2020 at 8:45 AM Amit Kapila 
> wrote:
> > >
> >
> > > Another thing I am thinking let's just not expose skip_empty_stream to
> > > the user and consider the behavior based on the value of
> > > skip_empty_xacts. Internally, in the code, we can still have different
> > > variables to distinguish between empty_xacts and empty_streams.
> >
> > Yeah,  even I think in most of the cases it makes more sense to have
> > skip_empty_xacts and skip_empty_stream similar values.  So better we
> > don't expose skip_empty_stream.  I agree that we need to keep two
> > variables to track the empty stream and empty xacts.
> >
>
> So, let's try to do this way and if we see any problems then we can
> re-think.


Sounds good to me, I will send the updated patch.

> --
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: Feature request: Improve allowed values for generate series

2020-11-11 Thread David G. Johnston
On Wed, Nov 11, 2020 at 7:54 PM Eugen Konkov  wrote:

> Hello David,
>
> I have a table with services, each service have a period. After which
> service is auto renewal
>
> Services also could be one-time. At this case its interval is '00:00:00'
>

In which case the concept of interval is undefined - there is no meaningful
"second date" here, just the one expiration date - yet you are choosing to
keep it in order to introduce an artificial similarity between one-time
service and auto-renewal service.  This use case isn't convincing for me.
Writing the one-time service query without generate_series leaves out
extraneous stuff that isn't important, which I would recommend even if
generate_series were to work as described.

If you are going to introduce code-specific stuff to make this work just
write: SELECT * FROM generate_series( '2020-11-09', '2020-11-09', INTERVAL
'1000 years' );  It is just as much a construction of code as the other.

David J.


Re: logical streaming of xacts via test_decoding is broken

2020-11-11 Thread Amit Kapila
On Thu, Nov 12, 2020 at 11:29 AM Dilip Kumar  wrote:
>
> On Thu, Nov 12, 2020 at 8:45 AM Amit Kapila  wrote:
> >
>
> > Another thing I am thinking let's just not expose skip_empty_stream to
> > the user and consider the behavior based on the value of
> > skip_empty_xacts. Internally, in the code, we can still have different
> > variables to distinguish between empty_xacts and empty_streams.
>
> Yeah,  even I think in most of the cases it makes more sense to have
> skip_empty_xacts and skip_empty_stream similar values.  So better we
> don't expose skip_empty_stream.  I agree that we need to keep two
> variables to track the empty stream and empty xacts.
>

So, let's try to do it this way and if we see any problems then we can re-think.


-- 
With Regards,
Amit Kapila.




Re: logical streaming of xacts via test_decoding is broken

2020-11-11 Thread Dilip Kumar
On Thu, Nov 12, 2020 at 8:45 AM Amit Kapila  wrote:
>
> On Wed, Nov 11, 2020 at 7:05 PM Dilip Kumar  wrote:
> >
> > On Wed, Nov 11, 2020 at 6:59 PM Amit Kapila  wrote:
> > >
> > > On Wed, Nov 11, 2020 at 10:00 AM Dilip Kumar  
> > > wrote:
> > > >
> > > > On Tue, Nov 10, 2020 at 7:20 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Tue, Nov 10, 2020 at 2:25 PM Dilip Kumar  
> > > > > wrote:
> > > > > >
> > > > >
> > > > > You can probably add a comment as to why we are disallowing this case.
> > > > > I thought of considering 'stream-changes' parameter here because it
> > > > > won't make sense to give this parameter without it, however, it seems
> > > > > that is not necessary but maybe adding a comment
> > > > > here in that regard would be a good idea.
> > > >
> > > > Should we also consider the case that if the user just passed
> > > > skip_empty_streams to true then we should automatically set
> > > > skip_empty_xacts to true?
> > > >
> > >
> > > Is there any problem if we don't do this? Actually, adding
> > > dependencies on parameters is confusing so I want to avoid that unless
> > > it is really required.
> > >
> > > > And we will allow the 'skip_empty_streams' parameter only if
> > > > stream-changes' is true.
> >
> > The reason behind this thought is that if the user doesn't pass any
> > value for 'skip_empty_xacts' then the default value will be false and
> > if the user only pass 'skip_empty_streams' to true then we will error
> > out assuming that skip_empty_xacts is false but skip_empty_streams is
> > true.  So it seems instead of error out we can assume that
> > skip_empty_streams true mean skip_empty_xacts is also true if nothing
> > is passed for that.
> >
>
> So, let's see the overall picture here. We can have four options:
> skip_empty_xacts = true, skip_empty_stream = false;
> skip_empty_xacts = true, skip_empty_stream = true;
> skip_empty_xacts = false, skip_empty_stream = false;
> skip_empty_xacts = false, skip_empty_stream = true;
>
> I think we want to say the first three could be supported and for the
> last one either we can either give an error or make its behavior
> similar to option-2? Is this what your understanding as well?

For the last one, if the user has specifically passed false for
skip_empty_xacts then error out, and if the user did not pass anything for
skip_empty_xacts then make its behavior similar to option-2.
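
A minimal sketch of that resolution rule (hypothetical struct and variable names, not the actual test_decoding patch):

```
#include "postgres.h"

/*
 * Sketch only, with hypothetical names (not the actual test_decoding patch):
 * resolve the two options according to the rule described above.
 */
typedef struct SkipEmptyOptions
{
    bool        skip_empty_xacts;
    bool        skip_empty_streams;
} SkipEmptyOptions;

static void
resolve_skip_empty_options(SkipEmptyOptions *opts,
                           bool xacts_given, bool streams_given)
{
    if (!streams_given)
    {
        /* default: skip_empty_streams simply follows skip_empty_xacts */
        opts->skip_empty_streams = opts->skip_empty_xacts;
    }
    else if (opts->skip_empty_streams && !opts->skip_empty_xacts)
    {
        if (xacts_given)
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("cannot skip empty streams when empty transactions are not skipped")));

        /* skip_empty_streams = true implies skip_empty_xacts = true */
        opts->skip_empty_xacts = true;
    }
}
```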

> Another thing I am thinking let's just not expose skip_empty_stream to
> the user and consider the behavior based on the value of
> skip_empty_xacts. Internally, in the code, we can still have different
> variables to distinguish between empty_xacts and empty_streams.

Yeah, I too think that in most cases it makes more sense for
skip_empty_xacts and skip_empty_stream to have similar values.  So better
that we don't expose skip_empty_stream.  I agree that we need to keep two
variables to track the empty streams and empty xacts.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-11-11 Thread Fujii Masao




On 2020/11/06 10:25, Masahiro Ikeda wrote:

On 2020-10-30 11:50, Fujii Masao wrote:

On 2020/10/29 17:03, Masahiro Ikeda wrote:

Hi,

Thanks for your comments and advice. I updated the patch.

On 2020-10-21 18:03, Kyotaro Horiguchi wrote:

At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda
 wrote in

On 2020-10-20 12:46, Amit Kapila wrote:
> I see that we also need to add extra code to capture these stats (some
> of which is in performance-critical path especially in
> XLogInsertRecord) which again makes me a bit uncomfortable. It might
> be that it is all fine as it is very important to collect these stats
> at cluster-level in spite that the same information can be gathered at
> statement-level to help customers but I don't see a very strong case
> for that in your proposal.


We should avoid that duplication as much as possible, even if both numbers
are important.


Also, about performance, I thought there would be little impact because it
increments stats in memory. If I can implement it to reuse pgWalUsage's
values, which already collect these stats, there is no impact in
XLogInsertRecord.
For example, how about having pg_stat_wal() calculate the accumulated
values of wal_records, wal_fpi, and wal_bytes from pgWalUsage's
values?


I don't think that works, but it would work if pgstat_send_wal()
took the difference of those values between two successive calls.

WalUsage prevWalUsage;
...
pgstat_send_wal()
{
..
   /* fill in some values using pgWalUsage */
   WalStats.m_wal_bytes   = pgWalUsage.wal_bytes   - prevWalUsage.wal_bytes;
   WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
   WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;
...
   pgstat_send(&WalStats, sizeof(WalStats));

   /* remember the current numbers */
   prevWalUsage = pgWalUsage;


Thanks to Horiguchi-san's advice, I changed it to reuse pgWalUsage,
which is already defined, and eliminated the extra overhead.


+    /* fill in some values using pgWalUsage */
+    WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes;
+    WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
+    WalStats.m_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;

It's better to use WalUsageAccumDiff() here?


Yes, thanks. I fixed it.
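
For reference, a minimal sketch of using WalUsageAccumDiff() for this delta (the WalStatsDelta struct stands in for the message fields added by the proposed patch; pgWalUsage and WalUsageAccumDiff() are existing core code):

```
#include "postgres.h"
#include "executor/instrument.h"

/*
 * Sketch only: mirrors the pseudocode quoted earlier in this thread.  The
 * m_wal_* fields come from the proposed patch, not existing core code.
 */
typedef struct WalStatsDelta
{
    int64       m_wal_records;
    int64       m_wal_fpi;
    uint64      m_wal_bytes;
} WalStatsDelta;

static WalUsage prevWalUsage;

static void
fill_wal_stats_delta(WalStatsDelta *msg)
{
    WalUsage    diff;

    /* diff = pgWalUsage - prevWalUsage */
    memset(&diff, 0, sizeof(diff));
    WalUsageAccumDiff(&diff, &pgWalUsage, &prevWalUsage);

    msg->m_wal_records = diff.wal_records;
    msg->m_wal_fpi = diff.wal_fpi;
    msg->m_wal_bytes = diff.wal_bytes;

    /* remember the current numbers for the next cycle */
    prevWalUsage = pgWalUsage;
}
```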


prevWalUsage needs to be initialized with pgWalUsage?

+    if (AmWalWriterProcess()){
+    WalStats.m_wal_write_walwriter++;
+    }
+    else
+    {
+    WalStats.m_wal_write_backend++;
+    }

I think that it's better not to separate m_wal_write_xxx into two for
walwriter and other processes. Instead, we can use one m_wal_write_xxx
counter and make pgstat_send_wal() send also the process type to
the stats collector. Then the stats collector can accumulate the counters
per process type if necessary. If we adopt this approach, we can easily
extend pg_stat_wal so that any fields can be reported per process type.


I'll remove the above source code because these counters are not useful.


On 2020-10-30 12:00, Fujii Masao wrote:

On 2020/10/20 11:31, Masahiro Ikeda wrote:

Hi,

I think we need to add some statistics to pg_stat_wal view.

Although there are some parameters related to WAL,
there are few statistics for tuning them.

I think it's better to provide the following statistics.
Please let me know your comments.

```
postgres=# SELECT * from pg_stat_wal;
-[ RECORD 1 ]-------+------------------------------
wal_records         | 2000224
wal_fpi             | 47
wal_bytes           | 248216337
wal_buffers_full    | 20954
wal_init_file       | 8
wal_write_backend   | 20960
wal_write_walwriter | 46
wal_write_time      | 51
wal_sync_backend    | 7
wal_sync_walwriter  | 8
wal_sync_time       | 0
stats_reset         | 2020-10-20 11:04:51.307771+09
```

1. Basic statistics of WAL activity

- wal_records: Total number of WAL records generated
- wal_fpi: Total number of WAL full page images generated
- wal_bytes: Total amount of WAL bytes generated

To understand the DB's performance, we will first check the performance
trends for the entire database instance.
For example, if the number of wal_fpi becomes higher, users may tune
"wal_compression", "checkpoint_timeout" and so on.

Although users can already check the above statistics via EXPLAIN, auto_explain,
autovacuum and pg_stat_statements,
if they want to see the performance trends for the entire database,
they must recalculate the statistics themselves.

I think it is useful to add the sums of these basic statistics.


2.  WAL segment file creation

- wal_init_file: Total number of WAL segment files created.

Creating a new WAL file may have an impact on the performance of
a write-heavy workload generating lots of WAL. If this number is reported high,
we can tune WAL-related parameters so that more "recycled" WAL files are kept
around, reducing the number of these initializations.



3. Number of times WAL is flushed

- wal_write_backend : Total num

Re: Proposition for autoname columns

2020-11-11 Thread David G. Johnston
On Wed, Nov 11, 2020 at 5:56 PM Bruce Momjian  wrote:

> On Thu, Nov 12, 2020 at 12:18:49AM +, Dagfinn Ilmari Mannsåker wrote:
> > Bruce Momjian  writes:
> > > I think we could do it, but it would only work if the column was output
> > > as a single json value, and not a multi-key/value field.  I am afraid
> if
> > > we tried to do it, the result would be too inconsistent to be useful.
> >
> > Could this be done via the support function, so that the top-level
> > operator/function in each select list item can return a suggested column
> > name if the relevant arguments are constants?
>
> Yes, the user explicitly calling a function would be much easier to
> predict.
>
>
For the user an operator and a function are different ways to invoke the
same underlying thing using different syntax.  I'm not seeing how this
syntax difference makes this any easier to implement for explicit function
invocation compared to operator function invocation.

David J.


Re: Is it useful to record whether plans are generic or custom?

2020-11-11 Thread Pavel Stehule
On Thu, Nov 12, 2020 at 2:50 AM, torikoshia wrote:

> On 2020-09-29 02:39, legrand legrand wrote:
> > Hi Atsushi,
> >
> > +1: Your proposal is a good answer for time based performance analysis
> > (even if parsing durationor blks are not differentiated) .
> >
> > As it makes pgss number of columns wilder, may be an other solution
> > would be to create a pg_stat_statements_xxx view with the same key
> > as pgss (dbid,userid,queryid) and all thoses new counters.
>
> Thanks for your ideas and sorry for my late reply.
>
> It seems creating pg_stat_statements_xxx views both for generic and
> custom plans is better than my PoC patch.
>
> However, I also began to wonder how effective it would be to just
> distinguish between generic and custom plans.  Custom plans can
> include all sorts of plans. and thinking cache validation, generic
> plans can also include various plans.
>
> Considering this, I'm starting to feel that it would be better to
> not just keeping whether generic or cutom but the plan itself as
> discussed in the below thread.
>
>
> https://www.postgresql.org/message-id/flat/CAKU4AWq5_jx1Vyai0_Sumgn-Ks0R%2BN80cf%2Bt170%2BzQs8x6%3DHew%40mail.gmail.com#f57e64b8d37697c808e4385009340871
>
>
> Any thoughts?
>

Yes, the plan itself is very interesting information - and whether the
plan was generic or not is interesting too. It is another dimension of the query
- maybe there could be a rule to store, for any query, at most the 100 slowest
plans with all their attributes. The next issue is the fact that the first 5
executions of a generic plan are not actually generic. This fact should be
visible too.

Regards

Pavel



>
> Regards,
>
> --
> Atsushi Torikoshi
>
>
>


Re: Detecting File Damage & Inconsistencies

2020-11-11 Thread Jose Luis Tallon

On 11/11/20 21:56, Simon Riggs wrote:

[ŝnip]

REINDEX VERIFY
After the new index is created, but before we drop the old index:
Check whether the two indexes match:
* checks whether the previous index had pointers to row versions that
don't exist
* checks whether the heap has rows that were not in the old index
This approach piggybacks on existing operations. AccessShareLock is
held on both indexes before the old one is dropped.


FWIW, as long as it's optional (due to the added runtime), it'd be a 
welcome feature.


Maybe something along the lines of:

    REINDEX (verify yes) 



Other ideas are welcome.


I still have nightmares from a specific customer case with shared storage 
(using VxFS) among two postmaster instances ---supposedly they could never be 
active concurrently, though I'm not completely sure that didn't actually 
happen--- and the corruption that we found there. I seem to remember 
that they even had scripts to remove the locking when switching over and 
back :S


I don't think Postgres can do much about this, but maybe someone can 
come up with a suitable countermeasure.



Just my .02€

Thanks,

    / J.L.






Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-11-11 Thread Fujii Masao




On 2020/11/10 21:30, Bharath Rupireddy wrote:

On Tue, Nov 10, 2020 at 3:04 PM Fujii Masao  wrote:



The main reason for having SetLatch() in
SignalHandlerForConfigReload() is to wake up the calling process if
waiting in WaitLatchOrSocket() or WaitLatch() and reload the new
config file and use the reloaded config variables. Maybe we should
give a thought on the scenarios in which the walreceiver process
waits, and what happens in case the latch is set when SIGHUP is
received.


The difference is whether the config file is processed at the next
wakeup (by force-reply-request or SIGTERM) of walreceiver or
immediately. If server-reload happened frequently, say, several times
per second(?), we should consider to reduce the useless reloading, but
actually that's not the case.


So, attached is the patch that makes walreceiver use both standard
SIGTERM and SIGHUP handlers. Currently I've not found any actual
issues by making walreceiver use standard SIGHUP handler, yet.



I think it makes sense to replace WalRcv->latch with
MyProc->procLatch (as they point to the same latch) in the functions
that are called in the walreceiver process. However, we are using
walrcv->latch with a spinlock, say in WalRcvForceReply() and
RequestXLogStreaming(), both of which are called from the startup process. See
commit 45f9d08684, which made the access to the walreceiver's
latch (WalRcv->latch) by the other process (startup) spinlock-protected.

And looks like, in general it's a standard practice to set latch to
wake up the process if waiting in case a SIGHUP signal is received and
reload the relevant config variables.

Going by the above analysis, the v3 patch looks good to me.


Thanks for the analysis! I pushed the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Background writer and checkpointer in crash recovery

2020-11-11 Thread Thomas Munro
On Wed, Nov 11, 2020 at 9:57 PM Simon Riggs  wrote:
> Having said that, we did raise the checkpoint_timeout by a lot, so the
> situation today might be quite different. A large checkpoint_timeout
> could eventually overflow shared buffers, with the right workload.

FWIW Jakub Wartak did manage to show a 1.64x speedup while running
crash recovery of an insert-only workload (with a variant of this
patch that I shared in another thread), albeit with aggressive tuning:

https://www.postgresql.org/message-id/VI1PR0701MB6960EEB838D53886D8A180E3F6520%40VI1PR0701MB6960.eurprd07.prod.outlook.com

> We don't have any stats to show whether this patch is worthwhile or
> not, so I suggest adding the attached instrumentation patch as well so
> we can see on production systems whether checkpoint_timeout is too
> high by comparison with pg_stat_bgwriter. The patch is written in the
> style of log_checkpoints.

Very useful.  I've also been wondering how to get that sort of
information in hot standby.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-11 Thread tsunakawa.ta...@fujitsu.com
The patch looks OK.  I think, as Thomas-san suggested, we can remove the 
modification to smgrnblocks() and not care whether the size is cached or not.  
But I think the current patch is good too, so I'd like to leave it up to a 
committer to decide which to choose.

I measured performance from a different angle -- the time 
DropRelFileNodeBuffers() and DropRelFileNodeAllBuffers() took.  That reveals 
the direct improvement or degradation.

I used 1,000 tables, each of which is 1 MB.  I used shared_buffers = 128 MB for 
the case where the traditional full buffer scan is done, and shared_buffers = 
100 GB for the case where the optimization path takes effect.

The results are almost good as follows:

A. UNPATCHED

128 MB shared_buffers
1. VACUUM = 0.04 seconds
2. TRUNCATE = 0.04 seconds

100 GB shared_buffers
3. VACUUM = 69.4 seconds
4. TRUNCATE = 69.1 seconds


B. PATCHED

128 MB shared_buffers (full scan)
5. VACUUM = 0.04 seconds
6. TRUNCATE = 0.07 seconds

100 GB shared_buffers (optimized path)
7. VACUUM = 0.02 seconds
8. TRUNCATE = 0.08 seconds


So, I'd like to mark this as ready for committer.


Regards
Takayuki Tsunakawa





RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-11 Thread k.jami...@fujitsu.com
On Tuesday, November 10, 2020 3:10 PM, Tsunakawa-san wrote:
> From: Jamison, Kirk/ジャミソン カーク 
> > So I proceeded to update the patches using the "cached" parameter and
> > updated the corresponding comments to it in 0002.
> 
> OK, I'm in favor of the name "cached" now, although I first agreed with
> Horiguchi-san in that it's better to use a name that represents the nature
> (accurate) of information rather than the implementation (cached).  Having
> a second thought, since smgr is a component that manages relation files on
> storage (file system), lseek(SEEK_END) is the accurate value for smgr.  The
> cached value holds a possibly stale size up to which the relation has
> extended.
> 
> 
> The patch looks almost good except for the minor ones:

Thank you for the review!

> (1)
> +extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber
> forknum,
> +bool *accurate);
> 
> It's still accurate here.

Already fixed in 0002.

> (2)
> + *   This is only called in recovery when the block count of any
> fork is
> + *   cached and the total number of to-be-invalidated blocks per
> relation
> 
> count of any fork is
> -> counts of all forks are

Fixed in 0003.
 
> (3)
> In 0004, I thought you would add the invalidated block counts of all relations
> to determine if the optimization is done, as Horiguchi-san suggested.  But I
> find the current patch okay too.

Yeah, I found my approach easier to implement. The new change in 0004 is that
when entering the optimized path we now call FindAndDropRelFileNodeBuffers()
instead of DropRelFileNodeBuffers().

I have attached all the updated patches.
I'd appreciate your feedback.

Regards,
Kirk Jamison


v31-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v31-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v31-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  v31-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


v31-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  v31-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v31-0004-TRUNCATE-optimization.patch
Description: v31-0004-TRUNCATE-optimization.patch


Re: Strange GiST logic leading to uninitialized memory access in pg_trgm gist code

2020-11-11 Thread Alexander Korotkov
Hi!

On Wed, Nov 11, 2020 at 12:53 PM Andrew Gierth
 wrote:
> Now the obvious simple fix is just to reorder those last two operations,
> and the original reporter verified that doing so fixed their problem
> (patch attached). But I'd really like to understand the logic here and
> whether there is any reason to have this special treatment at all - why
> would it not be better to just cache all N items upfront and consider
> them all as potential seeds?

I think this comes from the idea that when N items are passed to the
picksplit method, then the first N-1 are existing items on the page,
while the last Nth is the new item to be inserted.  So, we are trying
to split first N-1 items and then insert the last Nth item there.  But
this is wrong for two reasons.

1) As you've pointed out, GiST code doesn't necessarily pass items to
the picksplit method in that way.
2) Even if items are passed as assumed, there is no point in having
special handling of the item to be inserted.  It's better to consider
the whole set of items to produce a better split.

> Another issue I don't understand yet is that even though this code is
> largely unchanged since 8.x, the original reporter could not reproduce
> the crash on any version before 13.0.

I think this is related to my commit 911e702077.  It has changed the
memory allocation for the signatures to support the signatures of
variable length.  So, it seems that despite the error existing since
8.x, it started causing segfaults only since 911e702077.

> Anyone have any ideas? (If not, I'll commit and backpatch something like
> the attached patch at some suitable time.)

I would rather propose to rip off special handling of the last item
completely (see the attached patch).

--
Regards,
Alexander Korotkov


0001-Fix-handling-of-the-last-item-in-gtrgm_picksplit.patch
Description: Binary data


Re: Delay of standby shutdown

2020-11-11 Thread Fujii Masao



On 2020/11/04 9:35, Soumyadeep Chakraborty wrote:

Hello Fujii,

On Thu, Sep 17, 2020 at 6:49 AM Fujii Masao  wrote:

As far as I understand after debugging, the problem is as follows:


Yes.




1. After the primary is stopped, and the standby startup process is
waiting inside:

(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT |
WL_EXIT_ON_PM_DEATH,
wait_time,
WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);

it receives SIGTERM on account of stopping the standby and it leads to
the WaitLatch call returning with WL_LATCH_SET.

2. WaitForWALToBecomeAvailable() then will return true after calling
XLogFileReadAnyTLI() and eventually, XLogReadRecord() will return NULL
since there is no new WAL to read, which means ReadRecord() will loop
back and perform another XLogReadRecord().

3. The additional XLogReadRecord() will lead to a 5s wait inside
WaitForWALToBecomeAvailable() - a different WaitLatch() call this time:

/*
  * Wait for more WAL to arrive. Time out after 5 seconds
  * to react to a trigger file promptly and to check if the
  * WAL receiver is still active.
  */
(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT |
WL_EXIT_ON_PM_DEATH,
5000L, WAIT_EVENT_RECOVERY_WAL_STREAM);

4. And then eventually, the code will handle interrupts here inside
WaitForWALToBecomeAvailable() after the 5s wait:

/*
  * This possibly-long loop needs to handle interrupts of startup
  * process.
  */
HandleStartupProcInterrupts();


To avoid this issue, I think that ReadRecord() should call
HandleStartupProcInterrupts() whenever looping back to retry.


Alternatively, perhaps we can place it inside
WaitForWALToBecomeAvailable() (which already has a similar call),
since it is more semantically aligned to checking for interrupts, rather
than ReadRecord()? Like this:


Yes. Your approach looks better to me.
Attached is the updated version of the patch implementing that.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index aa63f37615..7d97b96e72 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12225,6 +12225,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										 WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);
 					ResetLatch(MyLatch);
 					now = GetCurrentTimestamp();
+
+					/* Handle interrupt signals of startup process */
+					HandleStartupProcInterrupts();
 				}
 				last_fail_time = now;
 				currentSource = XLOG_FROM_ARCHIVE;


Re: logical streaming of xacts via test_decoding is broken

2020-11-11 Thread Amit Kapila
On Wed, Nov 11, 2020 at 7:05 PM Dilip Kumar  wrote:
>
> On Wed, Nov 11, 2020 at 6:59 PM Amit Kapila  wrote:
> >
> > On Wed, Nov 11, 2020 at 10:00 AM Dilip Kumar  wrote:
> > >
> > > On Tue, Nov 10, 2020 at 7:20 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Tue, Nov 10, 2020 at 2:25 PM Dilip Kumar  
> > > > wrote:
> > > > >
> > > >
> > > > You can probably add a comment as to why we are disallowing this case.
> > > > I thought of considering 'stream-changes' parameter here because it
> > > > won't make sense to give this parameter without it, however, it seems
> > > > that is not necessary but maybe adding a comment
> > > > here in that regard would be a good idea.
> > >
> > > Should we also consider the case that if the user just passed
> > > skip_empty_streams to true then we should automatically set
> > > skip_empty_xacts to true?
> > >
> >
> > Is there any problem if we don't do this? Actually, adding
> > dependencies on parameters is confusing so I want to avoid that unless
> > it is really required.
> >
> > > And we will allow the 'skip_empty_streams' parameter only if
> > > stream-changes' is true.
>
> The reason behind this thought is that if the user doesn't pass any
> value for 'skip_empty_xacts' then the default value will be false and
> if the user only pass 'skip_empty_streams' to true then we will error
> out assuming that skip_empty_xacts is false but skip_empty_streams is
> true.  So it seems instead of error out we can assume that
> skip_empty_streams true mean skip_empty_xacts is also true if nothing
> is passed for that.
>

So, let's see the overall picture here. We can have four options:
skip_empty_xacts = true, skip_empty_stream = false;
skip_empty_xacts = true, skip_empty_stream = true;
skip_empty_xacts = false, skip_empty_stream = false;
skip_empty_xacts = false, skip_empty_stream = true;

I think we want to say the first three could be supported, and for the
last one we can either give an error or make its behavior
similar to option-2? Is this your understanding as well?

Another thing I am thinking let's just not expose skip_empty_stream to
the user and consider the behavior based on the value of
skip_empty_xacts. Internally, in the code, we can still have different
variables to distinguish between empty_xacts and empty_streams.

-- 
With Regards,
Amit Kapila.




RE: Disable WAL logging to speed up data loading

2020-11-11 Thread tsunakawa.ta...@fujitsu.com
From: Stephen Frost 
> I'm not sure that I see this as really being much of an issue.  Perhaps there 
> are
> some things we can do, as I mentioned before, to make it easier for users to
> have tables be created as unlogged from the start, or to be able to ALTER
> TABLE a bunch of tables at once (using all in tablespace, or maybe having an
> ALTER TABLE on a partitioned table cascade to the partitions), but overall the
> risk here seems very low- clearly whatever processing is running to load the
> data into a particular table knows what the table is and adding an ALTER
> TABLE into it would be trivial.

Thank you for your opinion.  I'm glad to see that great people like Robert, Magnus 
and you have given comments (honestly, I didn't expect this trivial feature to get 
this much attention).  I'll read those mails from the oldest, including this one, 
and reply to them if needed.


> > Likewise, can't we do ALTER TABLE SET UNLOGGED/LOGGED against a
> partitioned table?  Currently, the statement succeeds but none of the
> partitioned table nor its partitions is set unlogged (pg_class.relpersistence
> remains 'p').  Is this intended?  If it's a bug, I'm willing to fix it so 
> that it reports
> an eror.  Of course, it's good to make all partitions unlogged at once.
> 
> I agree that this doesn't seem quite right and considering the way other
> commands work like CREATE INDEX, I would think that doing such an ALTER
> TABLE would recurse to the individual partitions (skipping over any which are
> already set to the persistance desired..).

OK, I'll try to propagate the alteration to the partitions.  (I hope the fix will 
be easy, but I'm afraid some devil is lurking...)  I'd appreciate it if someone 
who knows the past history could kindly point out that devil.


Regards
Takayuki Tsunakawa





Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-11-11 Thread David Rowley
On Tue, 10 Nov 2020 at 12:55, David Rowley  wrote:
>
> On Tue, 10 Nov 2020 at 12:49, Tom Lane  wrote:
> >
> > Alvaro Herrera  writes:
> > > Are you taking into account the possibility that generated machine code
> > > is a small percent slower out of mere bad luck?  I remember someone
> > > suggesting that they can make code 2% faster or so by inserting random
> > > no-op instructions in the binary, or something like that.  So if the
> > > difference between v8 and v9 is that small, then it might be due to this
> > > kind of effect.
> >
> > Yeah.  I believe what this arises from is good or bad luck about relevant
> > tight loops falling within or across cache lines, and that sort of thing.
> > We've definitely seen performance changes up to a couple percent with
> > no apparent change to the relevant code.
>
> It possibly is this issue.
>
> Normally how I build up my confidence in which is faster is why just
> rebasing on master as it advances and see if the winner ever changes.
> The theory here is if one patch is consistently the fastest, then
> there's more chance if there being a genuine reason for it.

I kicked off a script last night that ran benchmarks on master, v8 and
v9 of the patch on 1 commit per day for the past 30 days since
yesterday.  The idea here is that, as the code changes, if the
performance differences are due to code alignment then there should be
enough churn in 30 days to show it.

The quickly put together script is attached. It would need quite a bit
of modification to run on someone else's machine.

This took about 20 hours to run.  I found that v8 is faster on 28 out
of 30 commits. In the two cases where v9 was faster, v9 took 99.8% and
98.5% of the time of v8.   In the 28 cases where v8 was faster it was
generally about 2-4% faster, but a couple of times 8-10% faster. Full
results attached in .csv file. Also the query I ran to compare the
results once loaded into Postgres.

David
#!/bin/bash
seconds=30
scale=10

pg_ctl stop -D pgdata -w

for sha in e578c17d8 ae0f7b11f b94109ce3 7f4708818 bc49f8780 540849814 929c69aa1 f49b85d78 bbb927b4d 555eb1a4f f8721bd75 83d727e5b 21d36747d ba9f18abd 20d3fe900 f893e68d7 ad1c36b07 f90149e62 b401fa206 f90e80b91 b17ff07aa 5b3dca096 bf797a8d9 113d3591b 5b7bfc397 5ee180a39 b4c9695e7 8b39345a9 8f113698b d2e4bf688
do
    cd ~/pg_src
    git reset --hard
    git clean -f
    git checkout $sha
    ./configure --prefix=/home/drowley/pg > /dev/null
    for branch in master resultcache_v8 resultcache_v9
    do
        cd ~/pg_src
        git reset --hard
        git clean -f
        for file in /home/drowley/$branch/*
        do
            patch -p1 < $file
        done
        make clean -s
        make -j -s
        make install -s
        cd contrib/pg_prewarm
        make -j -s
        make install -s
        cd
        sleep 1
        # create database and load data when doing the first branch
        if [ $branch = master ]
        then
            rm -rf pgdata
            initdb -D pgdata > /dev/null
            cp postgresql.conf pgdata
            pg_ctl start -D pgdata -l pg.log

            psql -c "drop table if exists hundredk, lookup1, lookup100;" postgres
            psql -c "create table hundredk (hundredk int, tenk int, thousand int, hundred int, ten int, one int);" postgres
            psql -c "insert into hundredk select x%10,x%1,x%1000,x%100,x%10,1 from generate_Series(1,$scale) x;" postgres

            psql -c "create table lookup100 (a int);" postgres
            psql -c "insert into lookup100 select x from generate_Series(1,$scale)x,generate_Series(1,100);" postgres
            psql -c "create index on lookup100 (a);" postgres

            psql -c "create table lookup1 (a int);" postgres
            psql -c "insert into lookup1 select x from generate_Series(1,$scale)x;" postgres
            psql -c "create index on lookup1 (a);" postgres

            psql -c "vacuum analyze lookup1, lookup100, hundredk;" postgres
            psql -c "create extension pg_prewarm;" postgres
            pg_ctl stop -D pgdata -w
        fi

        pg_ctl start -D pgdata -l pg.log -w
        psql -c "select pg_prewarm('lookup1'), pg_prewarm('lookup100'), pg_prewarm('hundredk');" postgres
        for tbl in lookup1 lookup100
        do
            for target in "count(*)" "count(l.a)" "'*'"
            do
                for col in thousand hundred ten one
                do
                    echo "select $ta

Re: enable pg_stat_statements to track rows processed by REFRESH MATERIALIZED VIEW

2020-11-11 Thread Fujii Masao




On 2020/11/10 17:29, Fujii Masao wrote:



On 2020/11/05 23:54, Seino Yuki wrote:

On 2020-11-02 20:01, Fujii Masao wrote:

On 2020/11/02 14:02, Yuki Seino wrote:

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:    tested, passed

+1.
I checked the patch and there were no problems.


+    PG_END_TRY();
+    SetQueryCompletion(qc, CMDTAG_REFRESH_MATERIALIZED_VIEW, processed);

Isn't it better to call SetQueryCompletion() in ExecRefreshMatView()
instead of ProcessUtilitySlow() (e.g., ExecCreateTableAs() does)?

Regards,



Sorry. I missed it.
I've incorporated your point into this patch.
So the changes to "matview.h" and "utility.c" have been canceled.

We also confirmed that the new patch passed the regression test.


Thanks for updating the patch!

+    /* save the rowcount if we're given a qc to fill */
+    SetQueryCompletion(qc, CMDTAG_REFRESH_MATERIALIZED_VIEW, processed);

I added the check "if (qc)" into the above. I also added the following
comments explaining that we don't display the rowcount in the command
completion tag output even though we save it in qc. There is a discussion
related to this topic at [1]. Thoughts?

+ * Save the rowcount so that pg_stat_statements can track the total number
+ * of rows processed by REFRESH MATERIALIZED VIEW command. Note that we
+ * still don't display the rowcount in the command completion tag output,
+ * i.e., the display_rowcount flag of CMDTAG_REFRESH_MATERIALIZED_VIEW
+ * command tag is left false in cmdtaglist.h. Otherwise, the change of
+ * completion tag output might break applications using it.

Attached is the updated version of the patch.
Barring no objection, I will commit that.


Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Add docs stub for recovery.conf

2020-11-11 Thread Craig Ringer
On Thu, Nov 12, 2020 at 4:01 AM Bruce Momjian  wrote:

> On Wed, Nov 11, 2020 at 08:59:40PM +0100, Daniel Gustafsson wrote:
> > > On 11 Nov 2020, at 20:44, Bruce Momjian  wrote:
> > > On Tue, Nov 10, 2020 at 01:38:14PM +0800, Craig Ringer wrote:
> >
> > >> I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!)
> the docs for
> > >> it were removed completely as well. That's largely sensible, but is
> confusing
> > >> when users have upgraded and are trying to find out what happened, or
> how to
> > >> configure equivalent functionality.
> > >
> > > I don't see the logic in carrying doc stuff that we don't have anymore.
> >
> > Well, we do have that already sprinkled across the docs where it
> > makes sense to help transitioning users, like this one in func.sgml:
> >
> > "Prior to PostgreSQL 12, it was possible to skip arbitrary text in
> the
> > input string using non-letter or non-digit characters..."
> >
> > It doesn't seem like a terrible idea to do a similar one for
> recovery.conf.
>
> I am fine with a tip.  The patch looked like it was creating a new
> chapter for it.
>
>
It is. Or rather, an appendix right at the end to hold info on things we
removed or renamed and where to find them now.

You can't AFAICS make docbook create a toplevel linkable file for a tip.
A tip won't un-break
https://www.postgresql.org/docs/current/recovery-config.html or help
people who visit https://www.postgresql.org/docs/11/recovery-config.html
figure out what's going on if they're using pg12 and there's no link to
version 12 in the nav section. A tip won't add index entries for renamed
settings, so someone looking up "standby_mode" can find out that we've
switched to a file called "standby.signal" instead.

Pretend you're a user who has upgraded from pg 11. You're looking at the Pg
12 docs. How long does it take you to find out how to make a server into a
standby now? It took me longer than I would've expected...


Re: Add docs stub for recovery.conf

2020-11-11 Thread Craig Ringer
On Thu, Nov 12, 2020 at 3:44 AM Bruce Momjian  wrote:

> On Tue, Nov 10, 2020 at 01:38:14PM +0800, Craig Ringer wrote:
> > Hi all
> >
> > I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!) the
> docs for
> > it were removed completely as well. That's largely sensible, but is
> confusing
> > when users have upgraded and are trying to find out what happened, or
> how to
> > configure equivalent functionality.
>
> I don't see the logic in carrying doc stuff that we don't have anymore.
>

I explained why.

Here's how the rendered docs look: https://imgur.com/a/VyjzEw5

Think. You're used to recovery.conf. You've recently switched to pg 12. You
search for "recovery.conf" or "primary_slot_name" or "standby_mode" or
something. You of course land up at
https://www.postgresql.org/docs/11/recovery-config.html or another version.

What do you do now? There's no "12" or "current" link for your version. You
don't follow postgres development closely, and you have no idea we removed
the file. You might work it out from the release notes. But even then,
searching for "standby_mode" in the updated docs will tell you basically
nothing; it has just vanished.

Also by simply deleting the page, we've broken web links to
https://www.postgresql.org/docs/current/recovery-config.html

So there are three really good reasons:

* Help users of the web docs navigate to the right place when we remove
things
* Don't break web links (breaking links without redirects is bad, the web
is sad)
* Help upgrading users who know the old terms find the new terms

I'd welcome your suggestions on other ways to arrange this, so long as it
meets the basic requirement "retain the link target 'recovery-config'".

In general I think it's quite unpleasant for users to have docs sections
just vanish. I strongly suggest that we enact a policy going forwards that
any section or chapter removal should be accompanied by a redirect or
stub that helps users find the new contents. It regularly annoys me when
I'm trying to navigate around various versions of the docs and things just
vanish.


RE: POC: postgres_fdw insert batching

2020-11-11 Thread tsunakawa.ta...@fujitsu.com
From: t...@corona.is.ed.ac.uk  On Behalf Of
> Does this patch affect trigger semantics on the base table?
> 
> At the moment when I insert 1000 rows into a postgres_fdw table using a
> single insert statement (e.g. INSERT INTO fdw_foo SELECT ... FROM bar) I
> naively expect a "statement level" trigger on the base table to trigger
> once. But this is not the case. The postgres_fdw implements this
> operation as 1000 separate insert statements on the base table, so the
> trigger happens 1000 times instead of once. Hence there is no
> distinction between using a statement level and a row level trigger on
> the base table in this context.
> 
> So would this patch change the behaviour so only 10 separate insert
> statements (each of 100 rows) would be made against the base table?
> If so thats useful as it means improving performance using statement
> level triggers becomes possible. But it would also result in more
> obscure semantics and might break user processes dependent on the
> existing behaviour after the patch is applied.

Yes, the number of times the statement trigger defined on the base (remote)
table fires will be reduced, as you said.
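
To make that concrete, here is a minimal sketch of a statement-level trigger
on the remote base table (all names are illustrative; base_foo is the remote
table that the foreign table fdw_foo points at):

    -- on the remote server
    CREATE TABLE trig_log (fired_at timestamptz DEFAULT now());

    CREATE FUNCTION note_statement() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        INSERT INTO trig_log DEFAULT VALUES;
        RETURN NULL;
    END;
    $$;

    CREATE TRIGGER base_foo_stmt
        AFTER INSERT ON base_foo
        FOR EACH STATEMENT
        EXECUTE FUNCTION note_statement();

With today's postgres_fdw, "INSERT INTO fdw_foo SELECT ..." inserting 1000
rows leaves 1000 rows in trig_log (one remote INSERT per row); with batching
at a batch size of 100 one would expect only 10.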


> BTW is this subtlety documented, I haven't found anything but happy
> to be proved wrong?

Unfortunately, there doesn't seem to be any description of triggers on base
tables.  For example, if the local foreign table has an AFTER ROW trigger and 
its remote base table has a BEFORE ROW trigger that modifies the input record, 
it seems that the AFTER ROW trigger doesn't see the modified record.


Regards
Takayuki Tsunakawa





Re: Avoiding useless SHA256 initialization with backup manifests, breaking base backups with FIPS

2020-11-11 Thread Michael Paquier
On Tue, Nov 10, 2020 at 11:00:14AM +0900, Michael Paquier wrote:
> Attached is a patch that I would like to back-patch down to v13 to
> avoid this useless initialization, giving users the possibility to
> take base backups with FIPS when not using a backup manifest.  Without
> the solution in the first paragraph, you cannot make use of backup 
> manifests at all with OpenSSL+FIPS (one can still enforce the use of
> the in-core SHA2 implementation even if building with OpenSSL), but at
> least it gives an escape route with 13.

Okay.  Hearing nothing, I have applied that.
--
Michael




Re: Is it useful to record whether plans are generic or custom?

2020-11-11 Thread torikoshia

On 2020-09-29 02:39, legrand legrand wrote:

Hi Atsushi,

+1: Your proposal is a good answer for time-based performance analysis
(even if parsing duration or blks are not differentiated).

As it makes the number of pgss columns wider, maybe another solution
would be to create a pg_stat_statements_xxx view with the same key
as pgss (dbid,userid,queryid) and all those new counters.


Thanks for your ideas and sorry for my late reply.

It seems creating pg_stat_statements_xxx views both for generic and
custom plans is better than my PoC patch.

However, I also began to wonder how effective it would be to just
distinguish between generic and custom plans.  Custom plans can
include all sorts of plans, and considering plan cache invalidation,
generic plans can also include various plans.

Considering this, I'm starting to feel that it would be better to
keep not just whether the plan was generic or custom but the plan
itself, as discussed in the thread below.

https://www.postgresql.org/message-id/flat/CAKU4AWq5_jx1Vyai0_Sumgn-Ks0R%2BN80cf%2Bt170%2BzQs8x6%3DHew%40mail.gmail.com#f57e64b8d37697c808e4385009340871


Any thoughts?
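
As a side note (not a substitute for the proposed counters), plan_cache_mode
can already force one plan type per session while investigating a particular
query, though it records nothing:

    SET plan_cache_mode = force_generic_plan;   -- or force_custom_plan, auto
    PREPARE q(int) AS SELECT count(*) FROM pg_class WHERE relpages > $1;
    EXPLAIN EXECUTE q(10);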


Regards,

--
Atsushi Torikoshi




Re: recovery_target immediate timestamp

2020-11-11 Thread Fujii Masao




On 2020/11/12 6:00, Euler Taveira wrote:

Hi,

While restoring a backup using recovery_target immediate, I noticed that there
isn't a timestamp available.

LOG:  consistent recovery state reached at 0/A000100
LOG:  recovery stopping after reaching consistency
LOG:  pausing at the end of recovery
HINT:  Execute pg_wal_replay_resume() to promote.
LOG:  database system is ready to accept read only connections

if you decide to use one of the targets or just recover until the end of the
WAL, you get a (last completed transaction) timestamp.

LOG:  redo done at 0/10FFEC38 system usage: CPU: user: 0.10 s, system: 0.05 s, 
elapsed: 1.65 s
LOG:  last completed transaction was at log time 2020-11-11 17:27:31.715251-03
LOG:  restored log file "00010010" from archive
cp: cannot stat '/a/pgarchive/0002.history': No such file or directory
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
cp: cannot stat '/a/pgarchive/0001.history': No such file or directory
LOG:  database system is ready to accept connections

I dug into the pg_waldump output to figure out the timestamp; however, the
checkpoint timestamp isn't printed by pg_waldump. The checkpoint timestamp
might be useful information at least when you set large values for a checkpoint
or need to investigate a performance/corruption issue.

The first patch adds a new message that prints the latest completed checkpoint
when the consistent state is reached.


I'm not sure how useful this information is in practice.


It also exposes the checkpoint timestamp
in debug messages.


    ereport(DEBUG1,
            (errmsg("checkpoint record is at %X/%X",
                    (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
+   ereport(DEBUG1,
+           (errmsg("checkpoint time is %s", str_time(checkPoint.time))));

The above first debug message displays the LSN of the checkpoint record.
OTOH, the second message displays the time when the checkpoint started
(not the time when checkpoint record was written at the end of checkpoint).
So isn't it confusing to display such inconsistent information together?



LOG:  consistent recovery state reached at 0/A000100
DETAIL:  Last completed checkpoint was at log time 2020-11-11 17:31:50 -03.
LOG:  recovery stopping after reaching consistency
LOG:  pausing at the end of recovery
HINT:  Execute pg_wal_replay_resume() to promote.
LOG:  database system is ready to accept read only connections
.
.
.
DEBUG:  checkpoint record is at 0/A60
DEBUG:  checkpoint time is 2020-11-11 17:34:19 -03

The second patch provides the checkpoint timestamp in the pg_waldump output and
also when you enable wal_debug parameter. The pg_waldump output looks like


+1

+#ifdef FRONTEND
+   strftime(checkpointstr, sizeof(checkpointstr), "%c", localtime(&time_tmp));
+#else
+   pg_strftime(checkpointstr, sizeof(checkpointstr), "%c", pg_localtime(&time_tmp, log_timezone));
+#endif

You can simplify the code by using timestamptz_to_str() here instead, like 
xact_desc_commit() does.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pg_upgrade analyze script

2020-11-11 Thread Michael Paquier
On Wed, Nov 11, 2020 at 04:51:27PM +0100, Magnus Hagander wrote:
> Gah. For some reason, I did not spot this email until after I had
> applied the other patch. And not only that, I also missed one line in
> my patch that you included in yours :/
> 
> My apologies.

No worries, thanks.
--
Michael




Re: Proposition for autoname columns

2020-11-11 Thread Bruce Momjian
On Thu, Nov 12, 2020 at 12:18:49AM +, Dagfinn Ilmari Mannsåker wrote:
> Bruce Momjian  writes:
> > I think we could do it, but it would only work if the column was output
> > as a single json value, and not a multi-key/value field.  I am afraid if
> > we tried to do it, the result would be too inconsistent to be useful.
> 
> Could this be done via the support function, so that the top-level
> operator/function in each select list item can return a suggested column
> name if the relevant arguments are constants?

Yes, the user explicitly calling a function would be much easier to
predict.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Proposition for autoname columns

2020-11-11 Thread Dagfinn Ilmari Mannsåker
Bruce Momjian  writes:

> On Mon, Nov  2, 2020 at 05:05:29PM +0200, Eugen Konkov wrote:
>> Hello Pgsql-hackers,
>> 
>> When selecting data from json column it named as '?column?'
>> tucha=# select info->>'suma', docn from document order by id desc limit 5;
>>  ?column? | docn 
>> --+--
>>  665.97   | 695
>>  513.51   | 632
>>  665.97   | 4804
>>  492.12   | 4315
>>  332.98   | 1302
>> (5 rows)
>> 
>> It would be useful if the name of column will be autoassigned based on
>> name of json key. Like at next query:
>> 
>> tucha=# select info->>'suma' as suma, docn from document order by id desc 
>> limit 5;
>>   suma  | docn 
>> +--
>>  665.97 | 695
>>  513.51 | 632
>>  665.97 | 4804
>>  492.12 | 4315
>>  332.98 | 1302
>> (5 rows)
>> 
>> 
>> Would it be useful this auto assigned name for column from json?
>
> I think we could do it, but it would only work if the column was output
> as a single json value, and not a multi-key/value field.  I am afraid if
> we tried to do it, the result would be too inconsistent to be useful.

Could this be done via the support function, so that the top-level
operator/function in each select list item can return a suggested column
name if the relevant arguments are constants?

- ilmari
-- 
- Twitter seems more influential [than blogs] in the 'gets reported in
  the mainstream press' sense at least.   - Matt McLeod
- That'd be because the content of a tweet is easier to condense down
  to a mainstream media article.  - Calle Dybedahl




Returning NULL from satisfies_hash_partition() is a bad idea

2020-11-11 Thread Tom Lane
So far as I can tell, the partition machinery is critically dependent
on the idea that partition constraint conditions can never return
NULL, only true or false.  This is explicitly noted in the comments in
get_qual_for_list, for instance, and it's visibly true by construction
for both list and range constraints.

But hash partition constraints are just calls to
satisfies_hash_partition(), and look what we've got there:

/* Return null if the parent OID, modulus, or remainder is NULL. */
if (PG_ARGISNULL(0) || PG_ARGISNULL(1) || PG_ARGISNULL(2))
PG_RETURN_NULL();

parent = try_relation_open(parentId, AccessShareLock);
if (parent == NULL)
PG_RETURN_NULL();

OK, the first one probably shouldn't happen, but it's far from clear that
the second cannot.  If we did return NULL, that would be taken as a "pass"
of the constraint, possibly allowing a non-matching row to be inserted
into a partition.

So this seems like a seriously bad idea.  I don't have a strong position
on whether it'd be better to return FALSE (causing a constraint failure
error) or just throw an error outright.  But we can't leave it like this.
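
As an aside, the way a NULL result acts as a "pass" is the same thing one
sees with an ordinary CHECK constraint (example added purely for
illustration):

    CREATE TABLE c (a int CHECK (a > 0));
    INSERT INTO c VALUES (NULL);   -- accepted: the check yields NULL, not false
    INSERT INTO c VALUES (-1);     -- rejected: the check yields false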

regards, tom lane




Re: PATCH: Batch/pipelining support for libpq

2020-11-11 Thread Andres Freund
Hi,

On 2020-11-03 10:42:34 -0300, Alvaro Herrera wrote:
> It would definitely help if you (and others) could think about the API
> being added: Does it fulfill the promises being made?  Does it offer the
> guarantees that real-world apps want to have?  I'm not much of an
> application writer myself -- particularly high-traffic apps that would
> want to use this.

Somewhere earlier in this thread there was a patch with support for
batching in pgbench. I think it'd be good to refresh that. Both because
it shows at least some real-world-lite usage of the feature and because
we need a way to stress it to see whether it has unnecessary
bottlenecks.

Greetings,

Andres Freund




Tracking cluster upgrade and configuration history

2020-11-11 Thread Mark Dilger
Hello Hackers,

While supporting customers, it would frequently be useful to have more 
information about the history of a cluster.  For example, which prior versions 
were ever installed and running on the cluster?  Has the cluster ever been run 
with fsync=off?  When did the server last enter recovery, if ever?  Was a 
backup_label file present at that time?

Some of this type of information could strictly be fixed size, such as a fixed 
set of timestamps for the time at which a fixed set of things last occurred, or 
a fixed set of bits indicating whether a fixed set of things ever happened.

Some other types would be variable size, but hopefully short in practice, like 
a list of all postgres versions that have ever been run on the cluster.

Logging the information via the usual log mechanism seems insufficient, as log 
files may get rotated and this information lost.

Would it be acceptable to store some fixed set of flag bits and timestamps in 
pg_control?  Space there is at a premium.
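
For context only: what is queryable from pg_control today is just the current
values, with no history kept, e.g.:

    SELECT pg_control_version, system_identifier FROM pg_control_system();
    SELECT timeline_id, checkpoint_time, redo_lsn FROM pg_control_checkpoint();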

Would it make sense to alternately, or additionally, store some of this 
information in a flat text file in pg_data, say a new file named 
"cluster_history" or such?

I'm happy to put together a more concrete proposal, but solicit your opinions 
first on the merits of the idea generally, and if you think the idea good, on 
the specifics you'd like to see included.

Thanks!

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: In-placre persistance change of a relation

2020-11-11 Thread Andres Freund
Hi,

I suggest outlining what you are trying to achieve here. Starting a new
thread and expecting people to dig through another thread to infer what
you are actually trying to achieve isn't great.

FWIW, I'm *extremely* doubtful it's worth adding features that depend on
a PGC_POSTMASTER wal_level=minimal being used. Which this does, as far as
I understand.  If somebody added support for dynamically adapting
wal_level (e.g. wal_level=auto, that increases wal_level to
replica/logical depending on the presence of replication slots), it'd
perhaps be different.


On 2020-11-11 17:33:17 +0900, Kyotaro Horiguchi wrote:
> FWIW this is a revised version of the PoC, which has some known
> problems.
> 
> - Flipping of buffer persistence is not WAL-logged, nor can it even be
>   safely rolled back. (It might be better to drop the buffers).

That's obviously a no-go. I think you might be able to address this if
you accept that the command cannot be run in a transaction (like
CONCURRENTLY). Then you can first do the catalog changes, change the
persistence level, and commit.

Greetings,

Andres Freund




Re: Allow matching whole DN from a client certificate

2020-11-11 Thread Stephen Frost
Greetings,

* Andrew Dunstan (and...@dunslane.net) wrote:
> Currently we only match the Common Name (CN) of a client certificate
> when authenticating a user. The attached patch allows matching the
> entire Distinguished Name (DN) of the certificate. This is enabled by
> the HBA line option "clientname", which can take the values "CN" or
> "DN". "CN" is the default.
> 
> The idea is that you might have a role with a CN of, say, "dbauser" in
> two different parts of the organization, say one with "OU=marketing" and
> the other with "OU=engineering", and you only want to allow access to
> one of them.
> 
> This feature is best used in conjunction with a map. e.g. in testing I
> have this pg_hba.conf line:
> 
> hostssl all all 127.0.0.1/32 cert clientname=DN map=dn
> 
> and this pg_ident.conf line:
> 
> dn /^C=US,ST=North.Carolina,O=test,OU=eng,CN=andrew$ andrew
> 
> If people like this idea I'll add tests and docco and add it to the next CF.

Yeah, this is definitely a worthwhile feature.

Thanks,

Stephen




recovery_target immediate timestamp

2020-11-11 Thread Euler Taveira
Hi,

While restoring a backup using recovery_target immediate, I noticed that
there
isn't a timestamp available.

LOG:  consistent recovery state reached at 0/A000100
LOG:  recovery stopping after reaching consistency
LOG:  pausing at the end of recovery
HINT:  Execute pg_wal_replay_resume() to promote.
LOG:  database system is ready to accept read only connections

if you decide to use one of the targets or just recover until the end of the
WAL, you get a (last completed transaction) timestamp.

LOG:  redo done at 0/10FFEC38 system usage: CPU: user: 0.10 s, system: 0.05
s, elapsed: 1.65 s
LOG:  last completed transaction was at log time 2020-11-11
17:27:31.715251-03
LOG:  restored log file "00010010" from archive
cp: cannot stat '/a/pgarchive/0002.history': No such file or directory
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
cp: cannot stat '/a/pgarchive/0001.history': No such file or directory
LOG:  database system is ready to accept connections

I dug into the pg_waldump output to figure out the timestamp; however, the
checkpoint timestamp isn't printed by pg_waldump. The checkpoint timestamp
might be useful information at least when you set large values for a
checkpoint
or need to investigate a performance/corruption issue.

The first patch adds a new message that prints the latest completed
checkpoint
when the consistent state is reached. It also exposes the checkpoint
timestamp
in debug messages.

LOG:  consistent recovery state reached at 0/A000100
DETAIL:  Last completed checkpoint was at log time 2020-11-11 17:31:50 -03.
LOG:  recovery stopping after reaching consistency
LOG:  pausing at the end of recovery
HINT:  Execute pg_wal_replay_resume() to promote.
LOG:  database system is ready to accept read only connections
.
.
.
DEBUG:  checkpoint record is at 0/A60
DEBUG:  checkpoint time is 2020-11-11 17:34:19 -03

The second patch provides the checkpoint timestamp in the pg_waldump output
and
also when you enable wal_debug parameter. The pg_waldump output looks like

rmgr: XLOGlen (rec/tot):114/   114, tx:  0, lsn:
0/0A60, prev 0/0A28, desc: CHECKPOINT_ONLINE redo 0/A28;
timestamp qua 11 nov 2020 17:34:19 -03; tli 1; prev tli 1; fpw true; xid
0:519; oid 24576; multi 1; offset 0; oldest xid 501 in DB 1; oldest multi 1
in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 519;
online

and the debug messages are

DEBUG:  checkpoint record is at 0/A60
DEBUG:  checkpoint time is 2020-11-11 17:37:47 -03
LOG:  REDO @ 0/A60; LSN 0/AD8: prev 0/A28; xid 0; len 88 -
XLOG/CHECKPOINT_ONLINE: redo 0/A28; timestamp Wed Nov 11 17:37:47 2020;
tli 1; prev tli 1; fpw true; xid 0:519; oid 24576; multi 1; offset 0;
oldest xid 501 in DB 1; oldest multi 1 in DB 1; oldest/newest commit
timestamp xid: 0/0; oldest running xid 519; online


Regards,

-- 
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 4b725a45dfb89ee9a463e6ee7987cd8b660053cd Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Wed, 11 Nov 2020 11:24:25 -0300
Subject: [PATCH 1/2] Report latest completed checkpoint timestamp

A new detail message prints the latest completed checkpoint when
reaching the consistent state. This information is useful when you use
recovery_target = immediate because it is not clear at what timestamp it
stops applying WAL.

It also adds new debug messages that report the checkpoint timestamp.
They might be useful for debugging purposes.
---
 src/backend/access/transam/xlog.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index aa63f37615..cf70950932 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6524,6 +6524,8 @@ StartupXLOG(void)
 			ereport(DEBUG1,
 	(errmsg("checkpoint record is at %X/%X",
 			(uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
+			ereport(DEBUG1,
+			ereport(DEBUG1,
+					(errmsg("checkpoint time is %s", str_time(checkPoint.time))));
 			InRecovery = true;	/* force recovery even if SHUTDOWNED */
 
 			/*
@@ -6657,6 +6659,8 @@ StartupXLOG(void)
 			ereport(DEBUG1,
 	(errmsg("checkpoint record is at %X/%X",
 			(uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
+			ereport(DEBUG1,
+			ereport(DEBUG1,
+					(errmsg("checkpoint time is %s", str_time(checkPoint.time))));
 		}
 		else
 		{
@@ -8033,7 +8037,9 @@ CheckRecoveryConsistency(void)
 		ereport(LOG,
 (errmsg("consistent recovery state reached at %X/%X",
 		(uint32) (lastReplayedEndRecPtr >> 32),
-		(uint32) lastReplayedEndRecPtr)));
+		(uint32) lastReplayedEndRecPtr),
+ errdetail("Last completed checkpoint was at log time %s.",
+		   str_time(ControlFile->checkPointCopy.time))));
 	}
 
 	/*
-- 
2.20.1

From 9b7b3effad8c860a003dc60eb1ea53aa15a26ad9 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Wed, 11 Nov 202

Detecting File Damage & Inconsistencies

2020-11-11 Thread Simon Riggs
I would like to propose a few points that will help us detect file
damage, inconsistencies in files and track actions of users.

Optionally, we could add these fields onto the WAL commit record:
* 8-byte session start time (from MyStartTime)
* 2-byte pid (from MyProcPid)
* 4-byte user oid
Identified by a new flag bit, XACT_XINFO_HAS_USERINFO.
That allows us to match up transactions with the server log %c option.
Another option might allow text username to be added to each commit as well.
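
For reference, the session ID emitted by %c is already built from the backend
start time and PID, so the proposed fields would line up with log lines
produced by settings such as (values illustrative):

    ALTER SYSTEM SET log_line_prefix = '%m [%p] %c %u@%d ';
    SELECT pg_reload_conf();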

XLogLongPageHeaderData has 4 bytes of unused data because of
alignment. We could use that space for 1) the Xid Epoch value or 2) a
CRC value - since only WAL records are covered by a CRC, not page or
file headers. Perhaps we should add both?

There are also 2 bytes unused in the XLogRecord header (with 8 byte
alignment). We could optionally use that space for the pid that wrote
the record, but that's not compelling. What can we use those 2 bytes
for?

REINDEX VERIFY
After the new index is created, but before we drop the old index:
Check whether the two indexes match:
* checks whether the previous index had pointers to row versions that
don't exist
* checks whether the heap has rows that were not in the old index
This approach piggybacks on existing operations. AccessShareLock is
held on both indexes before the old one is dropped.
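
For comparison, amcheck (contrib) can already do a related check today, though
it verifies a single index against the heap rather than comparing the old and
new indexes; a sketch with an illustrative index name:

    CREATE EXTENSION IF NOT EXISTS amcheck;
    SELECT bt_index_check('some_btree_index'::regclass, heapallindexed => true);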

Other ideas are welcome.

Thoughts?

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: cutting down the TODO list thread

2020-11-11 Thread John Naylor
Here is the next section on data types, proposed to be moved to the "not
worth doing" page. As before, if there are any objections, do speak up.
I'll make the move in a few days.

**Datatypes

- Fix data types where equality comparison is not intuitive, e.g. box
  There is likely no way to do this without breaking applications. We
already have a big warning about this in the docs.

- Add IMMUTABLE column attribute
  Controversial

*Domains (this entire section would go)

- Allow functions defined as casts to domains to be called during casting
- Allow values to be cast to domain types
- Make domains work better with polymorphic functions
  Old and difficult

*Date/Time

- Allow TIMESTAMP WITH TIME ZONE to store the original timezone information
- Have timestamp subtraction not call justify_hours()
  Very old

- Allow a comma to denote fractional seconds in ISO-8601-compliant times
(and timestamps)
  Apparent lack of interest

*Arrays

- Add function to detect if an array is empty
- Improve handling of NULLs in arrays
  Lack of interest

*Money (this entire section would go)

- Add locale-aware MONEY type, and support multiple currencies
- MONEY dumps in a locale-specific format making it difficult to restore to
a system with a different locale
  The money datatype seems kind of obsolete anyway, and there doesn't seem
to be demand to improve it.

*Text Search

- Allow dictionaries to change the token that is passed on to later
dictionaries
- Consider a function-based API for '@@' searches
  Very old

- Improve text search error messages
  One of the gripes has been fixed already, in any case it's very old

- tsearch and tsdicts regression tests fail in Turkish locale on glibc
  Bug report that refers to locale behavior from 2009

- Improve handling of dash and plus signs in email address user names, and
perhaps improve URL parsing
  Difficult

*XML (difficult section -- plenty of bugs which should be fixed, but also
old and low interest)

- Allow XML arrays to be cast to other data types
  Very old

- Allow reliable XML operation non-UTF8 server encodings
  Difficult

- Move XSLT from contrib/xml2 to a more reasonable location
  Lack of consensus

- Improve the XSLT parameter passing API
  Lack of consensus

- xpath_table needs to be implemented/implementable to get rid of
contrib/xml2
- xpath_table is pretty broken anyway
  Unclear path forward

- better handling of XPath data types
- Improve handling of PIs and DTDs in xmlconcat()
  Zero interest

- Restructure XML and /contrib/xml2 functionality
  As discussed in the thread, it's an unrealistically large project


-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Allow matching whole DN from a client certificate

2020-11-11 Thread Andrew Dunstan

Currently we only match the Common Name (CN) of a client certificate
when authenticating a user. The attached patch allows matching the
entire Distinguished Name (DN) of the certificate. This is enabled by
the HBA line option "clientname", which can take the values "CN" or
"DN". "CN" is the default.

The idea is that you might have a role with a CN of, say, "dbauser" in
two different parts of the organization, say one with "OU=marketing" and
the other with "OU=engineering", and you only want to allow access to
one of them.

This feature is best used in conjunction with a map. e.g. in testing I
have this pg_hba.conf line:

hostssl all all 127.0.0.1/32 cert clientname=DN map=dn

and this pg_ident.conf line:

dn /^C=US,ST=North.Carolina,O=test,OU=eng,CN=andrew$ andrew

If people like this idea I'll add tests and docco and add it to the next CF.
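
As a side note, for checking which DN the server actually extracts from a
given client certificate, the sslinfo contrib module can already show it from
within a session:

    CREATE EXTENSION IF NOT EXISTS sslinfo;
    SELECT ssl_is_used(), ssl_client_dn();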


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
"

diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index d132c5cb48..7e40de82e0 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -2869,12 +2869,15 @@ static int
 CheckCertAuth(Port *port)
 {
 	int			status_check_usermap = STATUS_ERROR;
+	char	   *peer_username;
 
 	Assert(port->ssl);
 
 	/* Make sure we have received a username in the certificate */
-	if (port->peer_cn == NULL ||
-		strlen(port->peer_cn) <= 0)
+	peer_username = port->hba->clientcertname == clientCertCN ? port->peer_cn : port->peer_dn;
+
+	if (peer_username == NULL ||
+		strlen(peer_username) <= 0)
 	{
 		ereport(LOG,
 (errmsg("certificate authentication failed for user \"%s\": client certificate contains no user name",
@@ -2882,8 +2885,8 @@ CheckCertAuth(Port *port)
 		return STATUS_ERROR;
 	}
 
-	/* Just pass the certificate cn to the usermap check */
-	status_check_usermap = check_usermap(port->hba->usermap, port->user_name, port->peer_cn, false);
+	/* Just pass the certificate cn/dn to the usermap check */
+	status_check_usermap = check_usermap(port->hba->usermap, port->user_name, peer_username, false);
 	if (status_check_usermap != STATUS_OK)
 	{
 		/*
@@ -2894,7 +2897,7 @@ CheckCertAuth(Port *port)
 		if (port->hba->clientcert == clientCertFull && port->hba->auth_method != uaCert)
 		{
 			ereport(LOG,
-	(errmsg("certificate validation (clientcert=verify-full) failed for user \"%s\": CN mismatch",
+	(errmsg("certificate validation (clientcert=verify-full) failed for user \"%s\": CN/DN mismatch",
 			port->user_name)));
 		}
 	}
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index e10260051f..dbb32c86bb 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -520,22 +520,30 @@ aloop:
 	/* Get client certificate, if available. */
 	port->peer = SSL_get_peer_certificate(port->ssl);
 
-	/* and extract the Common Name from it. */
+	/* and extract the Common Name / Distinguished Name from it. */
 	port->peer_cn = NULL;
+	port->peer_dn = NULL;
 	port->peer_cert_valid = false;
 	if (port->peer != NULL)
 	{
 		int			len;
+		X509_NAME  *x509name = X509_get_subject_name(port->peer);
+		char	   *peer_cn;
+		char	   *peer_dn;
+		BIO		   *bio = NULL;
+		BUF_MEM*bio_buf = NULL;
 
-		len = X509_NAME_get_text_by_NID(X509_get_subject_name(port->peer),
-		NID_commonName, NULL, 0);
-		if (len != -1)
+		if (!x509name)
 		{
-			char	   *peer_cn;
+			return -1;
+		}
 
+		len = X509_NAME_get_text_by_NID(x509name, NID_commonName, NULL, 0);
+		if (len != -1)
+		{
 			peer_cn = MemoryContextAlloc(TopMemoryContext, len + 1);
-			r = X509_NAME_get_text_by_NID(X509_get_subject_name(port->peer),
-		  NID_commonName, peer_cn, len + 1);
+			r = X509_NAME_get_text_by_NID(x509name, NID_commonName, peer_cn,
+		  len + 1);
 			peer_cn[len] = '\0';
 			if (r != len)
 			{
@@ -559,6 +567,32 @@ aloop:
 
 			port->peer_cn = peer_cn;
 		}
+
+		bio = BIO_new(BIO_s_mem());
+		if (!bio)
+		{
+			return -1;
+		}
+		/* use commas instead of slashes */
+		X509_NAME_print_ex(bio, x509name, 0, XN_FLAG_SEP_COMMA_PLUS);
+		BIO_get_mem_ptr(bio, &bio_buf);
+		peer_dn = MemoryContextAlloc(TopMemoryContext, bio_buf->length + 1);
+		memcpy(peer_dn, bio_buf->data, bio_buf->length);
+		peer_dn[bio_buf->length] = '\0';
+		if (bio_buf->length != strlen(peer_dn))
+		{
+			ereport(COMMERROR,
+	(errcode(ERRCODE_PROTOCOL_VIOLATION),
+	 errmsg("SSL certificate's distinguished name contains embedded null")));
+			BIO_free(bio);
+			pfree(peer_dn);
+			return -1;
+		}
+
+		BIO_free(bio);
+
+		port->peer_dn = peer_dn;
+
 		port->peer_cert_valid = true;
 	}
 
@@ -590,6 +624,12 @@ be_tls_close(Port *port)
 		pfree(port->peer_cn);
 		port->peer_cn = NULL;
 	}
+
+	if (port->peer_dn)
+	{
+		pfree(port->peer_dn);
+		port->peer_dn = NULL;
+	}
 }
 
 ssize_t
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index 2ae507a902..ef1320e686 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/

Re: Add docs stub for recovery.conf

2020-11-11 Thread Daniel Gustafsson
> On 11 Nov 2020, at 21:01, Bruce Momjian  wrote:
> 
> On Wed, Nov 11, 2020 at 08:59:40PM +0100, Daniel Gustafsson wrote:
>>> On 11 Nov 2020, at 20:44, Bruce Momjian  wrote:
>>> On Tue, Nov 10, 2020 at 01:38:14PM +0800, Craig Ringer wrote:
>> 
 I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!) the 
 docs for
 it were removed completely as well. That's largely sensible, but is 
 confusing
 when users have upgraded and are trying to find out what happened, or how 
 to
 configure equivalent functionality.
>>> 
>>> I don't see the logic in carrying doc stuff that we don't have anymore.
>> 
>> Well, we do have that already in <tip>'s sprinkled across the docs where it
>> makes sense to help transitioning users, like this one in func.sgml:
>> 
>>"Prior to PostgreSQL 12, it was possible to skip arbitrary text in the
>>input string using non-letter or non-digit characters..."
>> 
>> It doesn't seem like a terrible idea to do a similar one for recovery.conf.
> 
> I am fine with a tip.  The patch looked like it was creating a new
> chapter for it.

I admittedly hadn't looked at the patch, but now that I have I agree with not
adding a separate "obsolete" topic for it.  I'd prefer to use tips within the
docs; they will also help guide users who search for recovery.conf and land in
the tip which is next to the relevant updated documentation on the topic.

cheers ./daniel



Re: Add docs stub for recovery.conf

2020-11-11 Thread Bruce Momjian
On Wed, Nov 11, 2020 at 08:59:40PM +0100, Daniel Gustafsson wrote:
> > On 11 Nov 2020, at 20:44, Bruce Momjian  wrote:
> > On Tue, Nov 10, 2020 at 01:38:14PM +0800, Craig Ringer wrote:
> 
> >> I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!) the 
> >> docs for
> >> it were removed completely as well. That's largely sensible, but is 
> >> confusing
> >> when users have upgraded and are trying to find out what happened, or how 
> >> to
> >> configure equivalent functionality.
> > 
> > I don't see the logic in carrying doc stuff that we don't have anymore.
> 
> Well, we do have that already in <tip>'s sprinkled across the docs where it
> makes sense to help transitioning users, like this one in func.sgml:
> 
> "Prior to PostgreSQL 12, it was possible to skip arbitrary text in the
> input string using non-letter or non-digit characters..."
> 
> It doesn't seem like a terrible idea to do a similar one for recovery.conf.

I am fine with a tip.  The patch looked like it was creating a new
chapter for it.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Add docs stub for recovery.conf

2020-11-11 Thread Daniel Gustafsson
> On 11 Nov 2020, at 20:44, Bruce Momjian  wrote:
> On Tue, Nov 10, 2020 at 01:38:14PM +0800, Craig Ringer wrote:

>> I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!) the docs 
>> for
>> it were removed completely as well. That's largely sensible, but is confusing
>> when users have upgraded and are trying to find out what happened, or how to
>> configure equivalent functionality.
> 
> I don't see the logic in carrying doc stuff that we don't have anymore.

Well, we do have that already in <tip>'s sprinkled across the docs where it
makes sense to help transitioning users, like this one in func.sgml:

"Prior to PostgreSQL 12, it was possible to skip arbitrary text in the
input string using non-letter or non-digit characters..."

It doesn't seem like a terrible idea to do a similar one for recovery.conf.

cheers ./daniel



Re: Add docs stub for recovery.conf

2020-11-11 Thread David G. Johnston
On Wed, Nov 11, 2020 at 12:44 PM Bruce Momjian  wrote:

> On Tue, Nov 10, 2020 at 01:38:14PM +0800, Craig Ringer wrote:
> > Hi all
> >
> > I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!) the
> docs for
> > it were removed completely as well. That's largely sensible, but is
> confusing
> > when users have upgraded and are trying to find out what happened, or
> how to
> > configure equivalent functionality.
>
> I don't see the logic in carrying doc stuff that we don't have anymore.
>
>
I do.  Saying why something went away has value.  For small stuff you have
commit messages.  For user-facing documentation stuff that warranted its
own page, having said page remain and describe the change seems worthwhile.

David J.


Re: Parallel bitmap index scan

2020-11-11 Thread Tomas Vondra
Hi,

I took a look at this today, doing a bit of stress-testing, and I can
get it to crash because of segfaults in pagetable_create (not sure if
the issue is there, it might be just a symptom of an issue elsewhere).

Attached is a shell script I use to run the stress test - it's using
'test' database, generates tables of different size and then runs
queries with various parameter combinations. It takes a while to trigger
the crash, so it might depend on timing or something like that.

I've also attached two examples of backtraces. I've also seen an infinite
loop in pagetable_create, but the crashes are much more common.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Core was generated by `postgres: postgres test [local] SELECT   
 '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  MemoryContextAllocZero (context=context@entry=0x561c0598cf40, 
size=size@entry=48) at mcxt.c:852
852 ret = context->methods->alloc(context, size);
(gdb) bt
#0  MemoryContextAllocZero (context=context@entry=0x561c0598cf40, 
size=size@entry=48) at mcxt.c:852
#1  0x561c0467b95c in pagetable_create (nelements=128, 
private_data=0x7f155f7ef000, ctx=0x561c0598cf40) at 
../../../src/include/lib/simplehash.h:457
#2  tbm_create_pagetable (tbm=tbm@entry=0x7f155f7ef000) at tidbitmap.c:296
#3  0x561c0467bae3 in tbm_get_pageentry (tbm=tbm@entry=0x7f155f7ef000, 
pageno=3779) at tidbitmap.c:1303
#4  0x561c0467bed5 in tbm_union_page (a=a@entry=0x7f155f7ef000, 
bpage=0x7f155f7e21b8) at tidbitmap.c:514
#5  0x561c0467c5a0 in tbm_union (b=0x561c059cd468, a=0x7f155f7ef000) at 
tidbitmap.c:474
#6  tbm_union (a=0x7f155f7ef000, b=0x561c059cd468) at tidbitmap.c:457
#7  0x561c0467cb77 in tbm_merge (tbm=0x561c059cd468, dp_tbm=, dp_pagetable=0x7f155f8d4418) at tidbitmap.c:822
#8  0x561c0461c80f in BitmapHeapNext (node=node@entry=0x561c05996400) at 
nodeBitmapHeapscan.c:228
#9  0x561c0460f611 in ExecScanFetch (recheckMtd=0x561c0461bf10 
, accessMtd=0x561c0461bfa0 , 
node=0x561c05996400)
at execScan.c:133
#10 ExecScan (node=0x561c05996400, accessMtd=0x561c0461bfa0 , 
recheckMtd=0x561c0461bf10 ) at execScan.c:182
#11 0x561c04615d31 in ExecProcNode (node=0x561c05996400) at 
../../../src/include/executor/executor.h:247
#12 fetch_input_tuple (aggstate=aggstate@entry=0x561c05995d98) at nodeAgg.c:589
#13 0x561c046188c8 in agg_retrieve_direct (aggstate=) at 
nodeAgg.c:2356
#14 ExecAgg (pstate=) at nodeAgg.c:2171
#15 0x561c0461f487 in ExecProcNode (node=0x561c05995d98) at 
../../../src/include/executor/executor.h:247
#16 gather_getnext (gatherstate=0x561c05995bf8) at nodeGather.c:295
#17 ExecGather (pstate=0x561c05995bf8) at nodeGather.c:227
#18 0x561c04615d31 in ExecProcNode (node=0x561c05995bf8) at 
../../../src/include/executor/executor.h:247
#19 fetch_input_tuple (aggstate=aggstate@entry=0x561c059955d0) at nodeAgg.c:589
#20 0x561c046188c8 in agg_retrieve_direct (aggstate=) at 
nodeAgg.c:2356
#21 ExecAgg (pstate=) at nodeAgg.c:2171
#22 0x561c04606b4b in ExecProcNode (node=0x561c059955d0) at 
../../../src/include/executor/executor.h:247
#23 ExecutePlan (execute_once=, dest=0x561c059c6550, 
direction=, numberTuples=0, sendTuples=, 
operation=CMD_SELECT, use_parallel_mode=, 
planstate=0x561c059955d0, estate=0x561c05995378) at execMain.c:1542
#24 standard_ExecutorRun (queryDesc=0x561c058fdac8, direction=, 
count=0, execute_once=) at execMain.c:364
#25 0x561c0475f39c in PortalRunSelect (portal=0x561c0593f758, 
forward=, count=0, dest=) at pquery.c:912
#26 0x561c04760546 in PortalRun (portal=portal@entry=0x561c0593f758, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, 
run_once=run_once@entry=true, dest=dest@entry=0x561c059c6550, 
altdest=altdest@entry=0x561c059c6550, qc=0x7ffd52084c80) at pquery.c:756
#27 0x561c0475c33c in exec_simple_query (query_string=0x561c058dbce8 
"select count(b)from t where a between 3387 and 4027;") at postgres.c:1239
#28 0x561c0475dee0 in PostgresMain (argc=argc@entry=1, 
argv=argv@entry=0x7ffd520850e0, dbname=, username=)
at postgres.c:4305
#29 0x561c046e709c in BackendRun (port=, port=) at postmaster.c:4488
#30 BackendStartup (port=) at postmaster.c:4210
#31 ServerLoop () at postmaster.c:1727
#32 0x561c046e7f3f in PostmasterMain (argc=, 
argv=0x561c058d6550) at postmaster.c:1400
#33 0x561c0448d970 in main (argc=3, argv=0x561c058d6550) at main.c:209


Core was generated by `postgres: postgres test [local] EXPLAIN  
 '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  MemoryContextAllocZero (context=context@entry=0x561c0598ed90, 
size=size@entry=48) at mcxt.c:852
852 ret = context->methods->alloc(context, size);
(gdb) bt
#0  MemoryContextAllocZero (context=context@entry=0x561c0598ed90, 
size=size@entry=48) at mcxt.c:852
#1  0x561c0467b95c in 

Re: Split copy.c

2020-11-11 Thread Soumyadeep Chakraborty
Hey Heikki,

On Tue, Nov 3, 2020 at 1:30 AM Heikki Linnakangas  wrote:

Thanks for working on this refactor. LGTM! I had a few minor comments:

1. We should rename the CopyFormatOptions *cstate param in
ProcessCopyOptions() to CopyFormatOptions *options and List *options to
List *raw_options IMO, to make it more readable.

2. We need to update the header comment for Copy{From,To}StateData. It is
currently the old comment from CopyStateData.

3. Can we add docs for the missing fields in the header comment for
BeginCopyFrom()?

4.

> /*
>  * Working state for COPY TO/FROM
>  */
> MemoryContext copycontext; /* per-copy execution context */

Comment needs to be updated for the COPY operation.

5.

> I also split/duplicated the CopyStateData struct into CopyFromStateData
> and CopyToStateData. Many of the fields were common, but many were not,
> and I think some duplication is nicer than a struct where you use some
> fields and others are unused. I put the common formatting options into a
> new CopyFormatOptions struct.

Would we be better off if we sub-struct CopyState <- Copy{From,To}State?
Like this:
typedef struct Copy{From|To}StateData
{
CopyState cs;
// Fields specific to COPY FROM/TO follow..
}

6.

> /* create workspace for CopyReadAttributes results */
> if (!cstate->opts.binary)

Can we replace this if with an else?

Regards,
Soumyadeep (VMware)




Re: Add docs stub for recovery.conf

2020-11-11 Thread Bruce Momjian
On Tue, Nov 10, 2020 at 01:38:14PM +0800, Craig Ringer wrote:
> Hi all
> 
> I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!) the docs 
> for
> it were removed completely as well. That's largely sensible, but is confusing
> when users have upgraded and are trying to find out what happened, or how to
> configure equivalent functionality.

I don't see the logic in carrying doc stuff that we don't have anymore.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Allow some recovery parameters to be changed with reload

2020-11-11 Thread Sergei Kornilov
Hello

> Anyway, for now I think that your first patch would be enough, i.e.,
> just change the context of restore_command to PGC_SIGHUP.

Glad to hear. Attached a rebased version of the original proposal.

regards, Sergei

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f043433e31..ec74cb43ad 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3542,7 +3542,8 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows

 

-This parameter can only be set at server start.
+This parameter can only be set in the postgresql.conf
+file or on the server command line.

   
  
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index bb34630e8e..a732279f52 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3699,7 +3699,7 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
-		{"restore_command", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+		{"restore_command", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
 			gettext_noop("Sets the shell command that will be called to retrieve an archived WAL file."),
 			NULL
 		},
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..9c9091e601 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -253,7 +253,6 @@
 # placeholders: %p = path of file to restore
 #   %f = file name only
 # e.g. 'cp /mnt/server/archivedir/%f %p'
-# (change requires restart)
 #archive_cleanup_command = ''	# command to execute at every restartpoint
 #recovery_end_command = ''	# command to execute at completion of recovery
 


Re: Feature request: Improve allowed values for generate series

2020-11-11 Thread David G. Johnston
On Wed, Nov 11, 2020 at 12:12 PM Eugen Konkov  wrote:

>
> > So   I   feature  request  to  allow  zero size step for cases when
> start point is equest to finish
>
> > What do you think?
>
>
>
> hm  probably  with  step 0 we always should generate series of one
> value and exit, despite on finish value.
> Because  with  step  0 we always stay at current position, so there is
> always should be just one value.
>
>
How is this better than writing "VALUES (start date)"?
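
For concreteness, the two spellings being compared (dates illustrative):

    -- single-row result when start = finish
    VALUES ('2020-11-09'::timestamptz);

    -- the existing function already yields exactly one row for equal
    -- endpoints, as long as the step is nonzero
    SELECT * FROM generate_series('2020-11-09'::timestamptz,
                                  '2020-11-09'::timestamptz,
                                  interval '1 day');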

David J.


Re: Add session statistics to pg_stat_database

2020-11-11 Thread Laurenz Albe
On Tue, 2020-11-10 at 15:03 +, Georgios Kokolatos wrote:
> I noticed that the cfbot fails for this patch.
> 
> For this, I am setting the status to: 'Waiting on Author'.

Thanks for noticing, it was only the documentation build.

Version 5 attached, status changed back to "waiting for review".

Yours,
Laurenz Albe
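
For reviewers who want to eyeball the new counters, here is a quick query
against the columns the patch below adds (a sketch; it only works with the
patch applied):

    SELECT datname,
           connections,
           aborted_sessions,
           session_time,
           active_time,
           idle_in_transaction_time,
           round((100 * active_time / nullif(session_time, 0))::numeric, 1)
               AS pct_active
      FROM pg_stat_database
     WHERE datname IS NOT NULL;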
From afc37856c12fd0a85587c638fca291a0b5652d9b Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Wed, 11 Nov 2020 20:14:28 +0100
Subject: [PATCH] Add session statistics to pg_stat_database

If "track_counts" is active, track the following per database:
- total number of connections
- number of sessions that ended other than with a client disconnect
- total time spent in database sessions
- total time spent executing queries
- total idle in transaction time

This is useful to check if connection pooling is working.
It also helps to estimate the size of the connection pool
required to keep the database busy, which depends on the
percentage of the transaction time that is spent idling.

Discussion: https://postgr.es/m/b07e1f9953701b90c66ed368656f2aef40cac4fb.ca...@cybertec.at
Reviewed-By: Soumyadeep Chakraborty, Justin Pryzby, Masahiro Ikeda

(This requires a catversion bump, as well as an update to
PGSTAT_FILE_FORMAT_ID)
---
 doc/src/sgml/monitoring.sgml |  49 +
 src/backend/catalog/system_views.sql |   5 ++
 src/backend/postmaster/pgstat.c  | 105 ++-
 src/backend/tcop/postgres.c  |   5 ++
 src/backend/utils/adt/pgstatfuncs.c  |  68 +
 src/include/catalog/pg_proc.dat  |  20 +
 src/include/pgstat.h |  27 +++
 src/test/regress/expected/rules.out  |   5 ++
 8 files changed, 283 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 98e1995453..89610d1010 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3704,6 +3704,55 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   session_time double precision
+  
+  
+   Time spent by database sessions in this database, in milliseconds
+   (note that statistics are only updated when the state of a session
+   changes, so if sessions have been idle for a long time, this idle time
+   won't be included)
+  
+ 
+
+ 
+  
+   active_time double precision
+  
+  
+   Time spent executing SQL statements in this database, in milliseconds
+  
+ 
+
+ 
+  
+   idle_in_transaction_time double precision
+  
+  
+   Time spent idling while in a transaction in this database, in milliseconds
+  
+ 
+
+ 
+  
+   connections bigint
+  
+  
+   Total number of connections established to this database
+  
+ 
+
+ 
+  
+   aborted_sessions bigint
+  
+  
+   Number of database sessions to this database that were terminated
+   by something else than a regular client disconnection
+  
+ 
+
  
   
stats_reset timestamp with time zone
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2e4aa1c4b6..998b4d542a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -924,6 +924,11 @@ CREATE VIEW pg_stat_database AS
 pg_stat_get_db_checksum_last_failure(D.oid) AS checksum_last_failure,
 pg_stat_get_db_blk_read_time(D.oid) AS blk_read_time,
 pg_stat_get_db_blk_write_time(D.oid) AS blk_write_time,
+pg_stat_get_db_session_time(D.oid) AS session_time,
+pg_stat_get_db_active_time(D.oid) AS active_time,
+pg_stat_get_db_idle_in_transaction_time(D.oid) AS idle_in_transaction_time,
+pg_stat_get_db_connections(D.oid) AS connections,
+pg_stat_get_db_aborted_sessions(D.oid) AS aborted_sessions,
 pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset
 FROM (
 SELECT 0 AS oid, NULL::name AS datname
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e76e627c6b..9978aab60a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -249,6 +249,9 @@ static int	pgStatXactCommit = 0;
 static int	pgStatXactRollback = 0;
 PgStat_Counter pgStatBlockReadTime = 0;
 PgStat_Counter pgStatBlockWriteTime = 0;
+static PgStat_Counter pgStatActiveTime = 0;
+static PgStat_Counter pgStatTransactionIdleTime = 0;
+bool pgStatSessionDisconnected = false;
 
 /* Record that's written to 2PC state file when pgstat state is persisted */
 typedef struct TwoPhasePgStatRecord
@@ -334,6 +337,7 @@ static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
 static void pgstat_send_funcstats(void);
 static void pgstat_send_slru(void);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
+static void pgstat_send_connstats(bool dis

Re: Proposition for autoname columns

2020-11-11 Thread Eugen Konkov
Hello Bruce,

Wednesday, November 11, 2020, 5:56:08 PM, you wrote:

> On Mon, Nov  2, 2020 at 05:05:29PM +0200, Eugen Konkov wrote:
>> Hello Pgsql-hackers,
>> 
>> When selecting data from json column it named as '?column?'
>> tucha=# select info->>'suma', docn from document order by id desc limit 5;
>>  ?column? | docn 
>> --+--
>>  665.97   | 695
>>  513.51   | 632
>>  665.97   | 4804
>>  492.12   | 4315
>>  332.98   | 1302
>> (5 rows)
>> 
>> It would be useful if the name of column will be autoassigned based on
>> name of json key. Like at next query:
>> 
>> tucha=# select info->>'suma' as suma, docn from document order by id desc 
>> limit 5;
>>   suma  | docn 
>> +--
>>  665.97 | 695
>>  513.51 | 632
>>  665.97 | 4804
>>  492.12 | 4315
>>  332.98 | 1302
>> (5 rows)
>> 
>> 
>> Would it be useful this auto assigned name for column from json?

> I think we could do it, but it would only work if the column was output
> as a single json value, and not a multi-key/value field.  I am afraid if
> we tried to do it, the result would be too inconsistent to be useful.


cool, thank you.

-- 
Best regards,
Eugen Konkov





Re: Feature request: Improve allowed values for generate series

2020-11-11 Thread Eugen Konkov
Hello Eugen,

Wednesday, November 11, 2020, 8:50:59 PM, you wrote:

> Hello Pgsql-hackers,

> Seems I fall into corner case:

test=>> SELECT * FROM generate_series( '2020-11-09', '2020-11-09', INTERVAL 
'00:00:00' );
> ERROR:  step size cannot equal zero

> But:
test=>> SELECT * FROM generate_series( '2020-11-09', '2020-11-10', INTERVAL '1 
day' );
> generate_series 
> 
>  2020-11-09 00:00:00+02
>  2020-11-10 00:00:00+02
> (2 rows)

> Here  we  start  at  2020-11-09, add interval of one day and finish at 
> 2020-11-10
> Done! series is generated.

> In  first  case  I  expect that I start at 2020-11-09, add interval of zero 
> and finish at 2020-11-09
> Everything is consistent.

test=>> SELECT * FROM generate_series( '2020-11-09', '2020-11-09', INTERVAL 
'00:00:00' );
> generate_series
> 
>  2020-11-09 00:00:00+02
> (1 row)


> So   I   feature  request  to  allow  zero size step for cases when start 
> point is equest to finish

> What do you think?



Hmm, probably with step 0 we should always generate a series of one
value and exit, regardless of the finish value.
Because with step 0 we always stay at the current position, there
should always be just one value.

-- 
Best regards,
Eugen Konkov





Re: Feature request: Improve allowed values for generate series

2020-11-11 Thread David G. Johnston
On Wed, Nov 11, 2020 at 11:59 AM Eugen Konkov  wrote:

> So   I   feature  request  to  allow  zero size step for cases when  start
> point is equest to finish
>
> What do you think?
>

I don't see how this is useful.  If they are equal and you use a non-zero
step you get back the one record you are looking for anyway, plus the
non-zero step allows them to be unequal.  If zero step is allowed it is
only useful for when they are equal, being undefined when they are unequal.

David J.


Re: Support for NSS as a libpq TLS backend

2020-11-11 Thread Jacob Champion
On Nov 11, 2020, at 10:17 AM, Jacob Champion  wrote:
> 
> (Two failures left on my machine.)

False alarm -- the stderr debugging I'd added in to track down the
assertion tripped up the "no stderr" tests. Zero failing tests now.

--Jacob




Re: Feature request: Improve allowed values for generate series

2020-11-11 Thread Pavel Stehule
st 11. 11. 2020 v 19:59 odesílatel Eugen Konkov  napsal:

> Hello Pgsql-hackers,
>
> Seems I fall into corner case:
>
> test=> SELECT * FROM generate_series( '2020-11-09', '2020-11-09', INTERVAL
> '00:00:00' );
> ERROR:  step size cannot equal zero
>
> But:
> test=> SELECT * FROM generate_series( '2020-11-09', '2020-11-10', INTERVAL
> '1 day' );
> generate_series
> 
>  2020-11-09 00:00:00+02
>  2020-11-10 00:00:00+02
> (2 rows)
>
> Here  we  start  at  2020-11-09, add interval of one day and finish at
> 2020-11-10
> Done! series is generated.
>
> In  first  case  I  expect that I start at 2020-11-09, add interval of
> zero and finish at 2020-11-09
> Everything is consistent.
>
> test=> SELECT * FROM generate_series( '2020-11-09', '2020-11-09', INTERVAL
> '00:00:00' );
> generate_series
> 
>  2020-11-09 00:00:00+02
> (1 row)
>
>
> So   I   feature  request  to  allow  zero size step for cases when  start
> point is equest to finish
>
> What do you think?
>

What is the real use case?  The current implementation is very simple:
the increment must not be zero, and then we know there is no infinite
cycle.

Regards

Pavel




>
> --
> Best regards,
> Eugen Konkov
>
>
>
>


Feature request: Improve allowed values for generate series

2020-11-11 Thread Eugen Konkov
Hello Pgsql-hackers,

It seems I fell into a corner case:

test=> SELECT * FROM generate_series( '2020-11-09', '2020-11-09', INTERVAL 
'00:00:00' );
ERROR:  step size cannot equal zero

But:
test=> SELECT * FROM generate_series( '2020-11-09', '2020-11-10', INTERVAL '1 
day' );
generate_series 

 2020-11-09 00:00:00+02
 2020-11-10 00:00:00+02
(2 rows)

Here we start at 2020-11-09, add an interval of one day, and finish at
2020-11-10.
Done! The series is generated.

In the first case I expect to start at 2020-11-09, add an interval of zero, and
finish at 2020-11-09.
Everything is consistent.

test=> SELECT * FROM generate_series( '2020-11-09', '2020-11-09', INTERVAL 
'00:00:00' );
generate_series

 2020-11-09 00:00:00+02
(1 row)


So I am requesting a feature to allow a zero-size step for cases when the start
point is equal to the finish.

What do you think?


-- 
Best regards,
Eugen Konkov





Re: Allow some recovery parameters to be changed with reload

2020-11-11 Thread Fujii Masao




On 2020/11/07 6:36, Sergei Kornilov wrote:

Hello


I'm wondering if it's safe to allow restore_command to be emptied during
archive recovery. Even when it's emptied, archive recovery can proceed
by reading WAL files from pg_wal directory. This is the same behavior as
when restore_command is set to, e.g., /bin/false.


I am always confused by this implementation detail. restore_command fails?
Fine, let's just read the file from pg_wal. But this is a different topic...

I do not know the history of this fatal ereport. It looks like the "must specify
restore_command when standby mode is not enabled" check is only intended to protect
the user from misconfiguration, and the rest of the code will treat an empty
restore_command correctly, just like /bin/false.


Maybe.

Anyway, for now I think that your first patch would be enough, i.e.,
just change the context of restore_command to PGC_SIGHUP.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Support for NSS as a libpq TLS backend

2020-11-11 Thread Jacob Champion
On Nov 10, 2020, at 2:28 PM, Daniel Gustafsson  wrote:
> 
> Digging through the archives from when this landed in curl, the assertion
> failure was never fully identified back then but happened spuriously.  Which
> version of NSPR is this happening with?

This is NSPR 4.29, with debugging enabled. The fd that causes the
assertion is the custom layer that's added during be_tls_open_server(),
which connects a Port as the layer secret. It looks like NSPR is trying
to help surface potential memory leaks by asserting if the secret is
non-NULL at the time the stack is being closed.

In this case, it doesn't matter since the Port lifetime is managed
elsewhere, but it looks easy enough to add a custom close in the way
that cURL and the NSPR test programs [1] do. Sample patch attached,
which gets me to the end of the tests without any assertions. (Two
failures left on my machine.)

--Jacob

[1] https://hg.mozilla.org/projects/nspr/file/bf6620c143/pr/tests/nblayer.c#l354
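For reference, a minimal sketch of the custom-close idiom described above, along
the lines of what cURL and the NSPR test program do. The names here are
illustrative only and the attached patch may well differ; the point is that the
layer's close callback detaches the Port (which is owned and freed elsewhere)
before handing the teardown to the default implementation:

#include <prio.h>

/* Illustrative names only; not necessarily what the attached patch uses. */
static PRDescIdentity pq_layer_id;
static PRIOMethods    pq_layer_methods;

/*
 * Custom close for the identity layer: clear the secret so NSPR's
 * debug-build assertion about a non-NULL layer secret does not fire,
 * then let the default close tear the layer down as usual.
 */
static PRStatus PR_CALLBACK
pq_layer_close(PRFileDesc *fd)
{
    fd->secret = NULL;
    return PR_GetDefaultIOMethods()->close(fd);
}

/* One-time setup, e.g. alongside creating the layer in be_tls_open_server() */
static void
pq_layer_methods_init(void)
{
    pq_layer_id = PR_GetUniqueIdentity("PostgreSQL");
    pq_layer_methods = *PR_GetDefaultIOMethods();
    pq_layer_methods.close = pq_layer_close;
}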



patch
Description: patch


Re: PATCH: Attempt to make dbsize a bit more consistent

2020-11-11 Thread Soumyadeep Chakraborty
Hey Georgios,

On Tue, Nov 10, 2020 at 6:20 AM  wrote:
>
>
>
>
>
>
> ‐‐‐ Original Message ‐‐‐
> On Monday, November 9, 2020 7:50 PM, Soumyadeep Chakraborty 
>  wrote:
>
> > Hey Georgios,
> >
> > Thanks for looking for more avenues to invoke tableAM APIS! Please find
> > my review below:
>
> A great review Soumyadeep, it is much appreciated.
> Please remember to add yourself as a reviewer in the commitfest
> [https://commitfest.postgresql.org/30/2701/]

Ah yes. Sorry, I forgot that!

> >
> > On Tue, Oct 13, 2020 at 6:28 AM gkokola...@pm.me wrote:
> >
> > 1.
> >
> > > /*
> > >
> > > -   -   heap size, including FSM and VM
> > >
> > > -   -   table size, including FSM and VM
> > > */
> > >
> >
> > We should not mention FSM and VM in dbsize.c at all as these are
> > heap AM specific. We can say:
> > table size, excluding TOAST relation
>
> Yeah, I was thinking that the notion that FSM and VM are still taken
> into account should be stated. We are iterating over ForkNumber
> after all.
>
> How about phrasing it as:
>
> + table size, including all implemented forks from the AM (e.g. FSM, VM)
> + excluding TOAST relations
>
> Thoughts?

Yes, I was thinking along the same lines. The concept of a "fork"
should not be forced down into the tableAM. But that is a discussion for
another day. We can probably say:

+ table size, including all implemented forks from the AM (e.g. FSM, VM
+ for the heap AM) excluding TOAST relations

> >
> > 2.
> >
> > > /*
> > >
> > > -   Size of toast relation
> > > */
> > > if (OidIsValid(rel->rd_rel->reltoastrelid))
> > >
> > >
> > > -   size += calculate_toast_table_size(rel->rd_rel->reltoastrelid);
> > >
> > > -   {
> > > -   Relation toastRel;
> > > -
> > > -   toastRel = relation_open(rel->rd_rel->reltoastrelid, AccessShareLock);
> >
> > We can replace the OidIsValid check with relation_needs_toast_table()
> > and then have the OidIsValid() check in an Assert. Perhaps, that will
> > make things more readable.
>
> Please, allow me to kindly disagree.
>
> Relation is already open at this stage. Even create_toast_table(), the
> internal workhorse for creating toast relations, does check reltoastrelid
> to test if the relation is already toasted.
>
> Furthermore, we do call:
>
> + toastRel = relation_open(rel->rd_rel->reltoastrelid, AccessShareLock);
>
> and in order to avoid elog() errors underneath us, we ought to have
> verified the validity of reltoastrelid.
>
> In short, I think that the code in the proposal is not too unreadable
> nor that it breaks the coding patterns throughout the codebase.
>
> Am I too wrong?

No, not at all. The code in the proposal indeed adheres to the
codebase. What I was going for here was to increase the usage of
relation_needs_toast_table(). What I meant was:

if (table_relation_needs_toast_table(rel))
{
    if (!OidIsValid(rel->rd_rel->reltoastrelid))
    {
        elog(ERROR, "...");    /* report the missing TOAST table */
    }
    /* size calculation here... */
}

We want to error out if the toast table can't be found and the relation
is expected to have one, which the existing code doesn't handle.


>
> >
> > 3.
> >
> > > -   if (rel->rd_rel->relkind == RELKIND_RELATION ||
> > > -   rel->rd_rel->relkind == RELKIND_TOASTVALUE ||
> > > -   rel->rd_rel->relkind == RELKIND_MATVIEW)
> > > -   size = calculate_table_size(rel);
> > > -   else
> > > -   {
> > > -   relation_close(rel, AccessShareLock);
> > > -   PG_RETURN_NULL();
> > > -   }
> >
> > This leads to behavioral changes:
> >
> > I am talking about the case where one calls pg_table_size() on an index.
> > W/o your patch, it returns the size of the index. W/ your patch it
> > returns NULL. Judging by the documentation, this function should not
> > ideally apply to indexes but it does! I have a sinking feeling that lots
> > of users use it for this purpose, as there is no function to calculate
> > the size of a single specific index (except pg_relation_size()).
> > The same argument I have made above applies to sequences. Users may have
> > trial-and-errored their way into finding that pg_table_size() can tell
> > them the size of a specific index/sequence! I don't know how widespread
> > the use is in the user community, so IMO maybe we should be conservative
> > and not introduce this change? Alternatively, we could call out that
> > pg_table_size() is only for tables by throwing an error if anything
> > other than a table is passed in.
> >
> > If we decide to preserve the existing behavior of the pg_table_size():
> > It seems that for things not backed by the tableAM (indexes and
> > sequences), they should still go through calculate_relation_size().
> > We can call table_relation_size() based on if relkind is
> > RELKIND_RELATION, RELKIND_TOASTVALUE or RELKIND_MATVIEW. Perhaps it
> > might be worthwhile making a new macro RELKIND_HAS_TABLE_STORAGE to
> > capture these three cases (See RELKIND_HAS_STORAGE). This also ensures
> > that we return 0 for things that don't qualify as RELKIND_HAS_STORAGE,
> > such as a partitioned table (Cu
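
(For reference, a minimal sketch of the RELKIND_HAS_TABLE_STORAGE macro
suggested in the quoted review above, mirroring the existing
RELKIND_HAS_STORAGE in catalog/pg_class.h; this is illustrative only, not
the patch's actual definition or placement.)

/*
 * Relkinds whose storage is managed through the table AM
 * (cf. RELKIND_HAS_STORAGE); partitioned tables, indexes, sequences,
 * views and foreign tables are deliberately excluded.
 */
#define RELKIND_HAS_TABLE_STORAGE(relkind) \
    ((relkind) == RELKIND_RELATION || \
     (relkind) == RELKIND_TOASTVALUE || \
     (relkind) == RELKIND_MATVIEW)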

Re: Parallel copy

2020-11-11 Thread vignesh C
On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila  wrote:
>
> On Tue, Nov 10, 2020 at 7:12 PM vignesh C  wrote:
> >
> > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  wrote:
> > >
> >
> > I have worked to provide a patch for the parallel safety checks. It
> > checks if parallel copy can be performed. Parallel copy cannot be
> > performed in the following cases: a) If relation is a temporary table b) If
> > relation is a foreign table c) If relation has non-parallel-safe index
> > expressions d) If relation has triggers whose type is not a
> > before-statement trigger type e) If relation has check constraints
> > which are not parallel safe f) If relation is partitioned and any
> > partition has any of the above. This patch has the checks for it. This
> > patch will be used by the parallel copy implementation.
> >
>
> How did you ensure that this is sufficient? For parallel-insert's
> patch we have enabled parallel-mode for Inserts and ran the tests with
> force_parallel_mode to see if we are not missing anything. Also, it
> seems there are many common things here w.r.t parallel-insert patch,
> is it possible to prepare this atop that patch or do you have any
> reason to keep this separate?
>

I have done similar testing for copy too: I set force_parallel_mode
to regress, hardcoded the code to pick parallel workers for the
copy operation, and ran make installcheck-world to verify. Many checks
in this patch are common between both patches, but I was not sure how
to handle it as both the projects are in-progress and are being
updated based on the reviewer's opinion. How to handle this?
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Implementing Incremental View Maintenance

2020-11-11 Thread legrand legrand
Hello Konstantin,
I remember testing it with pg_stat_statements (and planning counters
enabled). Maybe identifying the internal queries associated with this (simple)
test case could help the dev team?
Regards
PAscal



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html




Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-11 Thread Bruce Momjian
On Wed, Nov 11, 2020 at 05:39:10PM +0100, Magnus Hagander wrote:
> On Wed, Nov 11, 2020 at 5:38 PM Bruce Momjian  wrote:
> >
> > On Wed, Nov 11, 2020 at 11:21:22AM -0500, Bruce Momjian wrote:
> > > On Wed, Nov 11, 2020 at 05:11:38PM +0100, Magnus Hagander wrote:
> > > > > In summary, I think the vacuumdb --analyze is now a one-line command,
> > > > > but the delete part can be complex and not easily typed.
> > > >
> > > > I definitely agree to that part -- my argument is that it's more
> > > > complex (or rather, more platform-specific) than can easily be put in
> > > > the script either. If it were to be replaced it wouldn't be with "a
> > > > command", it would be with instructions basically saying "delete the
> > > > old cluster". Platform specific wrappers (debian, redhat, windows or
> > > > whatever) knows more and could do a more complete job, but trying to
> > > > make all that generic is very complicated.
> > >
> > > Agreed, but I don't see OS-specific instruction on how to delete the
> > > tablespaces as reasonable --- that should be done by pg_upgrade, and we
> > > have had almost no complaints about that.
> >
> > Stepping back, I have always felt there was need for a wrapper program
> > to setup pg_upgrade, and clean up the old cluster on finish, though no
> > one has created such a tool yet.
> 
> Sure they have, it's just platform specific.
> 
> pg_upgradecluster on debian is a great example :)

Yes, true.  Ideally it would be nice to have a cross-platform wrapper
around pg_upgrade.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Proposition for autoname columns

2020-11-11 Thread David G. Johnston
On Wed, Nov 11, 2020 at 8:56 AM Bruce Momjian  wrote:

> > It would be useful if the name of column will be autoassigned based on
> > name of json key. Like at next query:
> >
> > tucha=# select info->>'suma' as suma, docn from document order by id
> desc limit 5;
> >   suma  | docn
> > +--
>
> > Would it be useful this auto assigned name for column from json?
>
> I think we could do it, but it would only work if the column was output
> as a single json value, and not a multi-key/value field.  I am afraid if
> we tried to do it, the result would be too inconsistent to be useful.
>

Doing it seems problematic given the nature of SQL and existing means to
assign names to columns.  If it can be done I don't see how the output
value would make any difference.  What is being asked for is the simple
textual value on the right side of the ->> (and other similar) operators to
be converted into a column name.  I could image doing this at rewrite time
by saying (in parse terms):

info->>'suma to' becomes info->>'suma to' AS "suma to" (specifically, add AS,
double-quote the literal and stick it after the AS).

If {AS "suma to"} isn't valid syntax for some value of "suma to" just drop
the attempt and move on.

I agree that this feature would be useful.

David J.


Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-11 Thread Magnus Hagander
On Wed, Nov 11, 2020 at 5:38 PM Bruce Momjian  wrote:
>
> On Wed, Nov 11, 2020 at 11:21:22AM -0500, Bruce Momjian wrote:
> > On Wed, Nov 11, 2020 at 05:11:38PM +0100, Magnus Hagander wrote:
> > > > In summary, I think the vacuumdb --analyze is now a one-line command,
> > > > but the delete part can be complex and not easily typed.
> > >
> > > I definitely agree to that part -- my argument is that it's more
> > > complex (or rather, more platform-specific) than can easily be put in
> > > the script either. If it were to be replaced it wouldn't be with "a
> > > command", it would be with instructions basically saying "delete the
> > > old cluster". Platform specific wrappers (debian, redhat, windows or
> > > whatever) knows more and could do a more complete job, but trying to
> > > make all that generic is very complicated.
> >
> > Agreed, but I don't see OS-specific instruction on how to delete the
> > tablespaces as reasonable --- that should be done by pg_upgrade, and we
> > have had almost no complaints about that.
>
> Stepping back, I have always felt there was need for a wrapper program
> to setup pg_upgrade, and clean up the old cluster on finish, though no
> one has created such a tool yet.

Sure they have, it's just platform specific.

pg_upgradecluster on debian is a great example :)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-11 Thread Bruce Momjian
On Wed, Nov 11, 2020 at 11:21:22AM -0500, Bruce Momjian wrote:
> On Wed, Nov 11, 2020 at 05:11:38PM +0100, Magnus Hagander wrote:
> > > In summary, I think the vacuumdb --analyze is now a one-line command,
> > > but the delete part can be complex and not easily typed.
> > 
> > I definitely agree to that part -- my argument is that it's more
> > complex (or rather, more platform-specific) than can easily be put in
> > the script either. If it were to be replaced it wouldn't be with "a
> > command", it would be with instructions basically saying "delete the
> > old cluster". Platform specific wrappers (debian, redhat, windows or
> > whatever) knows more and could do a more complete job, but trying to
> > make all that generic is very complicated.
> 
> Agreed, but I don't see OS-specific instruction on how to delete the
> tablespaces as reasonable --- that should be done by pg_upgrade, and we
> have had almost no complaints about that.

Stepping back, I have always felt there was need for a wrapper program
to setup pg_upgrade, and clean up the old cluster on finish, though no
one has created such a tool yet.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Supporting = operator in gin/gist_trgm_ops

2020-11-11 Thread Georgios Kokolatos
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

Hi,

this patch implements a useful and missing feature. Thank you.

It includes documentation, which to a non-native speaker like myself seems 
appropriate.
It includes comprehensive tests that cover the implemented cases.

In the thread Alexander has pointed out, quote:
"It would be more efficient to generate trigrams for equal operator
using generate_trgm() instead of generate_wildcard_trgm()"

I will echo the sentiment, though from a slightly different and possibly not
as important point of view. The method used to extract trigrams from the query
should match the method used to extract trigrams from the values when they
get added to the index. This is gin_extract_value_trgm() and is indeed using
generate_trgm().

I have no opinion over Alexander's second comment regarding costing.

I change the status to 'Waiting on Author', but please feel free to override
my opinion if you feel I am wrong and reset it to 'Needs review'.

Cheers,
//Georgios

The new status of this patch is: Waiting on Author


Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-11 Thread Bruce Momjian
On Wed, Nov 11, 2020 at 05:11:38PM +0100, Magnus Hagander wrote:
> > Uh, pg_upgrade does enumerate things like tablespaces in
> > create_script_for_old_cluster_deletion().  I think config file locations
> > are beyond the scope of what we want pg_upgrade to handle.
> 
> Ah, that's right. Forgot that it did actually handle tablespaces.
> 
> But how would config file locations be beyond the scope? Without that,
> it's incomplete for any user using that...

I would be worried the config file would somehow be shared between old
and new clusters.  Potentially you could use the same pg_hba.conf for
old and new, and then deleting one would stop both.  The cluster and
tablespace directories are clearly isolated to a single cluster.  I am
not sure that is 100% true of config files, or if deletion permission
would be possible for those.  It just seems too error-prone to attempt.

> And for the vast majority of users, it would for example also have to
> include the un-defining of a system service (systemd, sysvinit, or
> whatever) as well, which would be even more out of scope by that
> argument, yet at least as important in actually deleting the old
> cluster...

Well, the question is what the user can reasonably do.  Finding all the
tablespace subdirectories is a pain, but managing systemctl seems like
something they would already have had to deal with.  I am happy for
pg_upgrade to handle the mostly-internal parts and leave the user-facing
stuff to the user, which is what the pg_upgrade usage instructions do as
well.

> > In summary, I think the vacuumdb --analyze is now a one-line command,
> > but the delete part can be complex and not easily typed.
> 
> I definitely agree to that part -- my argument is that it's more
> complex (or rather, more platform-specific) than can easily be put in
> the script either. If it were to be replaced it wouldn't be with "a
> command", it would be with instructions basically saying "delete the
> old cluster". Platform specific wrappers (debian, redhat, windows or
> whatever) knows more and could do a more complete job, but trying to
> make all that generic is very complicated.

Agreed, but I don't see OS-specific instruction on how to delete the
tablespaces as reasonable --- that should be done by pg_upgrade, and we
have had almost no complaints about that.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-11 Thread Magnus Hagander
On Wed, Nov 11, 2020 at 4:53 PM Bruce Momjian  wrote:
>
> On Mon, Nov  2, 2020 at 02:23:35PM +0100, Magnus Hagander wrote:
> > On Tue, Oct 27, 2020 at 12:40 PM Bruce Momjian  wrote:
> > >
> > > On Tue, Oct 27, 2020 at 12:35:56PM +0100, Peter Eisentraut wrote:
> > > > On 2020-10-27 11:53, Bruce Momjian wrote:
> > > > > On Tue, Oct 27, 2020 at 11:35:25AM +0100, Peter Eisentraut wrote:
> > > > > > On 2020-10-06 12:26, Magnus Hagander wrote:
> > > > > > > I went with the name --no-instructions to have the same name for 
> > > > > > > both
> > > > > > > initdb and pg_upgrade. The downside is that "no-instructions" also
> > > > > > > causes the scripts not to be written in pg_upgrade, which 
> > > > > > > arguably is a
> > > > > > > different thing. We could go with "--no-instructions" and
> > > > > > > "--no-scripts", but that would leave the parameters different. I 
> > > > > > > also
> > > > > > > considered "--no-next-step", but that one didn't quite have the 
> > > > > > > right
> > > > > > > ring to me. I'm happy for other suggestions on the parameter 
> > > > > > > names.
> > > > > >
> > > > > > What scripts are left after we remove the analyze script, as 
> > > > > > discussed in a
> > > > > > different thread?
> > > > >
> > > > > There is still delete_old_cluster.sh.
> > > >
> > > > Well, that one can trivially be replaced by a printed instruction, too.
> > >
> > > True. On my system it is simply:
> > >
> > > rm -rf '/u/pgsql.old/data'
> > >
> > > The question is whether the user is going to record the vacuumdb and rm
> > > instructions that display at the end of the pg_upgrade run, or do we
> > > need to write it down for them in script files.
> >
> > That assumes for example that you've had no extra tablespaces defined
> > in it. And it assumes your config files are actually in the same
> > directory etc.
> >
> > Now, pg_upgrade *could* create a script that "actually works" for most
> > things, since it connects to the system and could then enumerate
> > things like tablespaces and config file locations, and generate a
> > script that actually uses that information.
>
> Uh, pg_upgrade does enumerate things like tablespaces in
> create_script_for_old_cluster_deletion().  I think config file locations
> are beyond the scope of what we want pg_upgrade to handle.

Ah, that's right. Forgot that it did actually handle tablespaces.

But how would config file locations be beyond the scope? Without that,
it's incomplete for any user using that...

And for the vast majority of users, it would for example also have to
include the un-defining of a system service (systemd, sysvinit, or
whatever) as well, which would be even more out of scope by that
argument, yet at least as important in actually deleting the old
cluster...


> In summary, I think the vacuumdb --analyze is now a one-line command,
> but the delete part can be complex and not easily typed.

I definitely agree to that part -- my argument is that it's more
complex (or rather, more platform-specific) than can easily be put in
the script either. If it were to be replaced it wouldn't be with "a
command", it would be with instructions basically saying "delete the
old cluster". Platform specific wrappers (debian, redhat, windows or
whatever) knows more and could do a more complete job, but trying to
make all that generic is very complicated.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Implementing Incremental View Maintenance

2020-11-11 Thread Konstantin Knizhnik




On 05.10.2020 12:16, Yugo NAGATA wrote:

Hi,

Attached is the rebased patch (v18) to add support for Incremental
Materialized View Maintenance (IVM). It is able to be applied to
current latest master branch.



Thank you very much for this work.
I consider incremental materialized views a "reincarnation" of OLAP 
hypercubes.

There are two approaches to making OLAP queries faster:
1. speed up query execution (using JIT, columnar store, vector 
operations and parallel execution)

2. precalculate requested data

Incremental materialized views make it possible to implement the second 
approach. But how competitive is it?
I do not know the current limitations of incremental materialized views, but 
I checked that basic OLAP functionality:

JOIN+GROUP_BY+AGGREGATION is supported.

The patch does not apply to the current master because the makeFuncCall 
prototype has changed.

I fixed it by adding COERCE_EXPLICIT_CALL.
Then I did the following simple test:

1. Create pgbench database with scale 100.
pgbench speed at my desktop is about 10k TPS:

pgbench -M prepared -N -c 10 -j 4 -T 30 -P 1 postgres
tps = 10194.951827 (including connections establishing)

2. Then I created incremental materialized view:

create incremental materialized view teller_sums as select 
t.tid,sum(abalance) from pgbench_accounts a join pgbench_tellers t on 
a.bid=t.bid group by t.tid;

SELECT 1000
Time: 20805.230 ms (00:20.805)

20 seconds is a reasonable time, comparable with the time of database 
initialization.


Then obviously we see advantages of precalculated aggregates:

postgres=# select * from teller_sums where tid=1;
 tid |  sum
-+
   1 | -96427
(1 row)

Time: 0.871 ms
postgres=# select t.tid,sum(abalance) from pgbench_accounts a join 
pgbench_tellers t on a.bid=t.bid group by t.tid having t.tid=1

;
 tid |  sum
-+
   1 | -96427
(1 row)

Time: 915.508 ms

Amazing. Almost 1000 times difference!

3. Run pgbench once again:

Ooops! Now TPS are much lower:

tps = 141.767347 (including connections establishing)

Speed of updates is reduced more than 70 times!
Looks like we lose parallelism, because I get almost the same result 
with just one connection.


4. Finally let's create one more view (it is reasonable to expect that 
analytics will run many different queries and so need multiple views).


create incremental materialized view teller_avgs as select 
t.tid,avg(abalance) from pgbench_accounts a join pgbench_tellers t on 
a.bid=t.bid group by t.tid;


It is great that not only simple aggregates like SUM are supported, but 
also AVG.

But insertion speed is now halved again - 72 TPS.

I tried some profiling but didn't see anything unusual:

  16.41%  postgres  postgres    [.] ExecInterpExpr
   8.78%  postgres  postgres    [.] slot_deform_heap_tuple
   3.23%  postgres  postgres    [.] ExecMaterial
   2.71%  postgres  postgres    [.] AllocSetCheck
   2.33%  postgres  postgres    [.] AllocSetAlloc
   2.29%  postgres  postgres    [.] slot_getsomeattrs_int
   2.26%  postgres  postgres    [.] ExecNestLoop
   2.11%  postgres  postgres    [.] MemoryContextReset
   1.98%  postgres  postgres    [.] tts_minimal_store_tuple
   1.87%  postgres  postgres    [.] heap_compute_data_size
   1.78%  postgres  postgres    [.] fill_val
   1.56%  postgres  postgres    [.] tuplestore_gettuple
   1.44%  postgres  postgres    [.] sentinel_ok
   1.35%  postgres  postgres    [.] heap_fill_tuple
   1.27%  postgres  postgres    [.] tuplestore_gettupleslot
   1.17%  postgres  postgres    [.] ExecQual
   1.14%  postgres  postgres    [.] tts_minimal_clear
   1.13%  postgres  postgres    [.] CheckOpSlotCompatibility
   1.10%  postgres  postgres    [.] base_yyparse
   1.10%  postgres  postgres    [.] heapgetpage
   1.04%  postgres  postgres    [.] heap_form_minimal_tuple
   1.00%  postgres  postgres    [.] slot_getsomeattrs

So the good news is that incremental materialized views really work.
And the bad news is that the maintenance overhead is too large, which 
significantly restricts the applicability of this approach.
Certainly, in the case of a predominantly read-only workload, such materialized 
views can significantly improve performance.
But unfortunately my dream that they would allow combining OLAP+OLTP is not 
currently realized.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





RE: POC: postgres_fdw insert batching

2020-11-11 Thread Tim . Colles

On Wed, 11 Nov 2020, tsunakawa.ta...@fujitsu.com wrote:



From: Tomas Vondra 

I see the patch builds the "bulk" query in execute_foreign_modify. IMO
that's something we should do earlier, when we're building the simple
query (for 1-row inserts). I'd understand if you were concerned about
overhead in case of 1-row inserts, trying to not plan the bulk query
until necessary, but I'm not sure this actually helps.

Or was the goal to build a query for every possible number of slots? I
don't think that's really useful, considering it requires deallocating
the old plan, preparing a new one, etc. IMO it should be sufficient to
have two queries - one for 1-row inserts, one for the full batch. The
last incomplete batch can be inserted using a loop of 1-row queries.

That's what my patch was doing, but I'm not insisting on that - it just
seems like a better approach to me. So feel free to argue why this is
better.


Don't be concerned, the processing is not changed for 1-row inserts: the INSERT 
query string is built in PlanForeignModify(), and the remote statement is 
prepared in execute_foreign_modify() during the first call to 
ExecForeignInsert() and it's reused for subsequent ExecForeignInsert() calls.

The re-creation of INSERT query string and its corresponding PREPARE happen 
when the number of tuples to be inserted is different from the previous call to 
ExecForeignInsert()/ExecForeignBulkInsert().  That's because we don't know how 
many tuples will be inserted during planning (PlanForeignModify) or execution 
(until the scan ends for SELECT).  For example, if we insert 10,030 rows with 
the bulk size 100, the flow is:

 PlanForeignModify():
   build the INSERT query string for 1 row
 ExecForeignBulkInsert(100):
   drop the INSERT query string and prepared statement for 1 row
   build the query string and prepare statement for 100 row INSERT
   execute it
 ExecForeignBulkInsert(100):
   reuse the prepared statement for 100 row INSERT and execute it
...
 ExecForeignBulkInsert(30):
   drop the INSERT query string and prepared statement for 100 row
   build the query string and prepare statement for 30 row INSERT
   execute it



I think it'd be good to have such GUC, even if only for testing and
development. We should probably have a way to disable the batching,
which the GUC could also do, I think. So +1 to have the GUC.


OK, I'll add it.  The name would be max_bulk_insert_tuples, because a) it might cover bulk insert for local 
relations in the future, and b) "tuple" is used in cpu_(index_)tuple_cost and parallel_tuple_cost, 
while "row" or "record" is not used in GUC (except for row_security).

The valid range would be between 1 and 1,000 (I said 10,000 previously, but I 
think it's an overreaction, and am a bit worried about unforeseen trouble too many 
tuples might cause.)  1 disables the bulk processing and uses the traditional 
ExecForeignInsert().  The default value is 100 (would 1 be sensible as a 
default value to avoid surprising users by increased memory usage?)



2. Should we accumulate records per relation in ResultRelInfo
instead? That is, when inserting into a partitioned table that has
foreign partitions, delay insertion until a certain number of input
records accumulate, and then insert accumulated records per relation
(e.g., 50 records to relation A, 30 records to relation B, and 20
records to relation C.)  If we do that,



I think there's a chunk of text missing here? If we do that, then what?


Sorry, the two bullets below there are what follows.  Perhaps I should have written ":" 
instead of ",".



Anyway, I don't see why accumulating the records in ResultRelInfo would
be better than what the patch does now. It seems to me like fairly
specific to FDWs, so keeping it int FDW state seems appropriate. What
would be the advantage of stashing it in ResultRelInfo?


I thought of distributing input records to their corresponding partitions' 
ResultRelInfos.  For example, input record for partition 1 comes, store it in 
the ResultRelInfo for partition 1, then input record for partition 2 comes, 
store it in the ResultRelInfo for partition 2.  When a ResultRelInfo 
accumulates some number of rows, insert the accumulated rows therein into the 
partition.  When the input ends, perform bulk inserts for ResultRelInfos that 
have accumulated rows.




I think that's OK for most use cases, and if it's not (e.g. when there's
something requiring the exact order of writes) then it's not possible to
use batching. That's one of the reasons why I think we should have a GUC
to disable the batching.


Agreed.



* Should the maximum count of accumulated records be applied per
relation or the query? When many foreign partitions belong to a
partitioned table, if the former is chosen, it may use much memory in
total.  If the latter i

Re: Proposition for autoname columns

2020-11-11 Thread Bruce Momjian
On Mon, Nov  2, 2020 at 05:05:29PM +0200, Eugen Konkov wrote:
> Hello Pgsql-hackers,
> 
> When selecting data from json column it named as '?column?'
> tucha=# select info->>'suma', docn from document order by id desc limit 5;
>  ?column? | docn 
> --+--
>  665.97   | 695
>  513.51   | 632
>  665.97   | 4804
>  492.12   | 4315
>  332.98   | 1302
> (5 rows)
> 
> It would be useful if the name of column will be autoassigned based on
> name of json key. Like at next query:
> 
> tucha=# select info->>'suma' as suma, docn from document order by id desc 
> limit 5;
>   suma  | docn 
> +--
>  665.97 | 695
>  513.51 | 632
>  665.97 | 4804
>  492.12 | 4315
>  332.98 | 1302
> (5 rows)
> 
> 
> Would it be useful this auto assigned name for column from json?

I think we could do it, but it would only work if the column was output
as a single json value, and not a multi-key/value field.  I am afraid if
we tried to do it, the result would be too inconsistent to be useful.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-11 Thread Bruce Momjian
On Mon, Nov  2, 2020 at 02:23:35PM +0100, Magnus Hagander wrote:
> On Tue, Oct 27, 2020 at 12:40 PM Bruce Momjian  wrote:
> >
> > On Tue, Oct 27, 2020 at 12:35:56PM +0100, Peter Eisentraut wrote:
> > > On 2020-10-27 11:53, Bruce Momjian wrote:
> > > > On Tue, Oct 27, 2020 at 11:35:25AM +0100, Peter Eisentraut wrote:
> > > > > On 2020-10-06 12:26, Magnus Hagander wrote:
> > > > > > I went with the name --no-instructions to have the same name for 
> > > > > > both
> > > > > > initdb and pg_upgrade. The downside is that "no-instructions" also
> > > > > > causes the scripts not to be written in pg_upgrade, which arguably 
> > > > > > is a
> > > > > > different thing. We could go with "--no-instructions" and
> > > > > > "--no-scripts", but that would leave the parameters different. I 
> > > > > > also
> > > > > > considered "--no-next-step", but that one didn't quite have the 
> > > > > > right
> > > > > > ring to me. I'm happy for other suggestions on the parameter names.
> > > > >
> > > > > What scripts are left after we remove the analyze script, as 
> > > > > discussed in a
> > > > > different thread?
> > > >
> > > > There is still delete_old_cluster.sh.
> > >
> > > Well, that one can trivially be replaced by a printed instruction, too.
> >
> > True. On my system it is simply:
> >
> > rm -rf '/u/pgsql.old/data'
> >
> > The question is whether the user is going to record the vacuumdb and rm
> > instructions that display at the end of the pg_upgrade run, or do we
> > need to write it down for them in script files.
> 
> That assumes for example that you've had no extra tablespaces defined
> in it. And it assumes your config files are actually in the same
> directory etc.
> 
> Now, pg_upgrade *could* create a script that "actually works" for most
> things, since it connects to the system and could then enumerate
> things like tablespaces and config file locations, and generate a
> script that actually uses that information.

Uh, pg_upgrade does enumerate things like tablespaces in
create_script_for_old_cluster_deletion().  I think config file locations
are beyond the scope of what we want pg_upgrade to handle.

In summary, I think the vacuumdb --analyze is now a one-line command,
but the delete part can be complex and not easily typed.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: pg_upgrade analyze script

2020-11-11 Thread Magnus Hagander
On Wed, Nov 11, 2020 at 2:21 AM Michael Paquier  wrote:
>
> On Tue, Nov 10, 2020 at 10:21:04AM +0900, Michael Paquier wrote:
> > On Mon, Nov 09, 2020 at 03:47:22PM +0100, Peter Eisentraut wrote:
> >> You should just remove those calls.  There is no need to replace them with
> >> vacuumdb calls.  The reason those calls were there is that they were 
> >> testing
> >> the generated script itself.  If the script is gone, there is no more need.
> >> There are already separate tests for testing vacuumdb.
> >
> > True, 102_vacuumdb_stages.pl already does all that.
>
> Let's fix that.  From what I can see it only involves the attached,
> and as a bonus it also reduces the number of extra characters showing
> on my terminal after running pg_upgrade tests.

Gah. For some reason, I did not spot this email until after I had
applied the other patch. And not only that, I also missed one line in
my patch that you included in yours :/

My apologies.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: pg_upgrade analyze script

2020-11-11 Thread Magnus Hagander
On Mon, Nov 9, 2020 at 3:47 PM Peter Eisentraut
 wrote:
>
> On 2020-11-09 11:22, Magnus Hagander wrote:
> >> I have spotted one small-ish thing.  This patch is missing to update
> >> the following code in vcregress.pl:
> >>  print "\nSetting up stats on new cluster\n\n";
> >>  system(".\\analyze_new_cluster.bat") == 0 or exit 1;
> > Ah, nice catch -- thanks! I guess this is unfortunately not a test
> > that's part of what cfbot tests.
> >
> > Untested on Windows, but following the patterns of the rows before it.
> > I will go ahead and push this version in a bit.
>
> You should just remove those calls.  There is no need to replace them
> with vacuumdb calls.  The reason those calls were there is that they
> were testing the generated script itself.  If the script is gone, there
> is no more need.  There are already separate tests for testing vacuumdb.
>

Good point. Done.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: proposal: possibility to read dumped table's name from file

2020-11-11 Thread Alvaro Herrera
On 2020-Nov-11, Pavel Stehule wrote:

> I think the proposed feature is very far to be the config file for pg_dump
> (it implements a option "--filter"). This is not the target. It is not
> designed for this. This is just an alternative for options like -t, -T, ...
> and I am sure so nobody will generate this file manually. Main target of
> this patch is eliminating problems with the max length of the command line.
> So it is really not designed to be the config file for pg_dump.

I agree that a replacement for existing command line arguments is a good
goal, but at the same time it's good to keep in mind the request that
more object types are supported as dumpable.  While it's not necessary
that this infrastructure supports all object types in the first cut,
it'd be good to have it support that.  I would propose that instead of a
single letter 't' etc we support keywords, maybe similar to those
returned by getObjectTypeDescription() (with additions -- for example
for "table data").  Then we can extend for other object types later
without struggling to find good letters for each.

Of course we could allow abbreviations for common cases, such that "t"
means "table".

For example: it'll be useful to support selective dumping of functions,
materialized views, foreign objects, etc.




Re: In-place persistance change of a relation

2020-11-11 Thread Stephen Frost
Greetings,

* Kyotaro Horiguchi (horikyota@gmail.com) wrote:
> At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost  wrote 
> in 
> > * Kyotaro Horiguchi (horikyota@gmail.com) wrote:
> > > For fuel(?) of the discussion, I tried a very-quick PoC for in-place
> > > ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
> > > trials of several ways, I drifted to the following way after poking
> > > several ways.
> > > 
> > > 1. Flip BM_PERMANENT of active buffers
> > > 2. adding/removing init fork
> > > 3. sync files,
> > > 4. Flip pg_class.relpersistence.
> > > 
> > > It always skips table copy in the SET UNLOGGED case, and only when
> > > wal_level=minimal in the SET LOGGED case.  Crash recovery seems
> > > working by some brief testing by hand.
> > 
> > Somehow missed that this patch more-or-less does what I was referring to
> > down-thread, but I did want to mention that it looks like it's missing a
> > necessary FlushRelationBuffers() call before the sync, otherwise there
> > could be dirty buffers for the relation that's being set to LOGGED (with
> > wal_level=minimal), which wouldn't be good.  See the comments above
> > smgrimmedsync().
> 
> Right. Thanks.  However, since SetRelFileNodeBuffersPersistence(),
> called just above, already scans shared buffers, I don't want to just call
> FlushRelationBuffers() separately.  Instead, I added buffer-flush to
> SetRelFileNodeBuffersPersistence().

Maybe I'm missing something, but it sure looks like in the patch that
SetRelFileNodeBuffersPersistence() is being called after the
smgrimmedsync() call, and I don't think you get to just switch the order
of those- the sync is telling the kernel to make sure it's written to
disk, while the FlushBuffer() is just writing it into the kernel but
doesn't provide any guarantee that the data has actually made it to
disk.  We have to FlushBuffer() first, and then call smgrimmedsync().
Perhaps there's a way to avoid having to go through shared buffers
twice, and I generally agreed it'd be good if we could avoid doing so,
but this approach doesn't look like it actually works.
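
To spell out that ordering (a sketch only -- the wrapper name is made up and
the patch's actual call sites differ; FlushRelationBuffers(),
RelationOpenSmgr() and smgrimmedsync() are the existing backend facilities):

/*
 * Sketch of the required ordering only, not the patch's code: dirty
 * shared buffers must reach the kernel before smgrimmedsync() asks the
 * kernel to force them out to stable storage.
 */
static void
flush_and_sync_relation(Relation rel)
{
    FlushRelationBuffers(rel);      /* shared buffers -> kernel page cache */
    RelationOpenSmgr(rel);          /* make sure rel->rd_smgr is set */
    smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);  /* fsync the main fork */
}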

> FWIW this is a revised version of the PoC, which has some known
> problems.
> 
> - Flipping of buffer persistence is not WAL-logged, nor can it even be
>   safely rolled back. (It might be better to drop buffers).

Not sure if it'd be better to drop buffers or not, but figuring out how
to deal with rollback seems pretty important.  How is the persistence
change in the catalog not WAL-logged though..?

> - This version handles indexes but does not yet handle toast relations.

Would need to be fixed, of course.

> - tableAMs are supposed to support this feature. (but I'm not sure
>   it's worth allowing them not to do so).

Seems like they should.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Disable WAL logging to speed up data loading

2020-11-11 Thread Stephen Frost
Greetings,

* tsunakawa.ta...@fujitsu.com (tsunakawa.ta...@fujitsu.com) wrote:
> * ALTER TABLE SET UNLOGGED/LOGGED without data copy
> Good:
> - Does not require server restart (if this feature can be used in all 
> wal_level settings).
> 
> Bad:
> - The user has to maintain and modify some scripts to use ALTER TABLE when 
> adding or removing the tables/partitions to load data into.  For example, if 
> the data loading job specifies a partitioned table, he may forget to add 
> ALTER TABLE for new partitions, resulting in slow data loading.

I'm not sure that I see this as really being much of an issue.  Perhaps
there are some things we can do, as I mentioned before, to make it
easier for users to have tables be created as unlogged from the start,
or to be able to ALTER TABLE a bunch of tables at once (using all in
tablespace, or maybe having an ALTER TABLE on a partitioned table
cascade to the partitions), but overall the risk here seems very low-
clearly whatever processing is running to load the data into a
particular table knows what the table is and adding an ALTER TABLE into
it would be trivial.

Specifically, for a partitioned table, I would think the load would go
something like:

CREATE UNLOGGED TABLE ...
load all of the data
ALTER TABLE ... SET LOGGED
ALTER TABLE ... ATTACH PARTITION

If the data load could all be done in a single transaction then you
wouldn't even need to create the table as UNLOGGED or issue the SET
LOGGED, with wal_level=minimal, you just need to create the table in the
same transaction that you do the data load in.

> * wal_level = none
> Good:
> - Easy to use.  The user does not have to be aware of what tables are loaded. 
>  This can come in handy when migrating from an older version or another DBMS, 
> building test databases, and consolidating databases.

A GUC that allowed users to set a default for newly created tables to be
unlogged would also address this.

> Bad:
> - Requires server restart.

Introducing yet another wal_level strikes me as a very large step, one
that the arguments presented here for why it'd be worth it don't come
anywhere near justifying that step.

> I expect both features will be able to meet our customer's needs.  The worst 
> scenario (I don't want to imagine!) is that neither feature fails to be 
> committed.  So, let us continue both features.  I'll join Horiguchi-san's new 
> thread, and please help us here too.  (I'll catch up with the recent 
> discussion in this thread and reply.)

While you're certainly welcome to spend your time where you wish to, if,
as you say, making these changes to how tables can be switched from
unlogged to logged with wal_level=minimal meets this use-case then that
strikes me as definitely the right approach and removes any
justification for adding another wal_level.

> > Couldn't we have something like the following?
> > 
> >  ALTER TABLE table1, table2, table3 SET UNLOGGED;
> > 
> > That is, multiple target object specification in ALTER TABLE sttatement.
> 
> Likewise, can't we do ALTER TABLE SET UNLOGGED/LOGGED against a partitioned 
> table?  Currently, the statement succeeds but neither the partitioned table 
> nor its partitions is set unlogged (pg_class.relpersistence remains 'p').  Is 
> this intended?  If it's a bug, I'm willing to fix it so that it reports an 
> error.  Of course, it's good to make all partitions unlogged at once.

I agree that this doesn't seem quite right and considering the way other
commands work like CREATE INDEX, I would think that doing such an ALTER
TABLE would recurse to the individual partitions (skipping over any
which are already set to the desired persistence).

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: avoid bitmapOR-ing indexes with scan condition inconsistent with partition constraint

2020-11-11 Thread Konstantin Knizhnik
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

I think that work on improving operator_predicate_proof should really be done 
in a separate patch.
And this minimal patch is doing its work well.

The new status of this patch is: Ready for Committer


Re: Skip ExecCheckRTPerms in CTAS with no data

2020-11-11 Thread Bharath Rupireddy
On Tue, Nov 10, 2020 at 1:18 AM Anastasia Lubennikova
 wrote:
>
> I took a look at the patch. It is quite simple, so no comments about the
> code. It would be good to add a test to select_into.sql to show that it
> only applies to 'WITH NO DATA' and that subsequent insertions will fail
> if permissions are not set.
>

Done.

>
> Maybe we should also mention it a documentation, but I haven't found any
> specific paragraph about permissions on CTAS.
>

Yes we do not have anything related to permissions mentioned in the
documentation. So, I'm not adding it now.

Apart from the above, I also noticed that we unnecessarily allocate
bulk insert state (16MB of memory) in the case of WITH NO DATA, just to free
it in intorel_shutdown() without actually using it. So, in the v2
patch I have made changes to not allocate bulk insert state.

Attaching v2 patch. Consider it for further review.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v2-Skip-Insert-Perm-Check-Bulk-Insert-State-alloc-in-WITH-NO-DATA.patch
Description: Binary data


Re: logical streaming of xacts via test_decoding is broken

2020-11-11 Thread Dilip Kumar
On Wed, Nov 11, 2020 at 6:59 PM Amit Kapila  wrote:
>
> On Wed, Nov 11, 2020 at 10:00 AM Dilip Kumar  wrote:
> >
> > On Tue, Nov 10, 2020 at 7:20 PM Amit Kapila  wrote:
> > >
> > > On Tue, Nov 10, 2020 at 2:25 PM Dilip Kumar  wrote:
> > > >
> > >
> > > You can probably add a comment as to why we are disallowing this case.
> > > I thought of considering 'stream-changes' parameter here because it
> > > won't make sense to give this parameter without it, however, it seems
> > > that is not necessary but maybe adding a comment
> > > here in that regard would be a good idea.
> >
> > Should we also consider the case that if the user just passed
> > skip_empty_streams to true then we should automatically set
> > skip_empty_xacts to true?
> >
>
> Is there any problem if we don't do this? Actually, adding
> dependencies on parameters is confusing so I want to avoid that unless
> it is really required.
>
> > And we will allow the 'skip_empty_streams' parameter only if
> > stream-changes' is true.

The reason behind this thought is that if the user doesn't pass any
value for 'skip_empty_xacts' then the default value will be false, and
if the user only passes 'skip_empty_streams' as true then we will error
out, assuming that skip_empty_xacts is false but skip_empty_streams is
true.  So it seems that instead of erroring out we can assume that
skip_empty_streams being true means skip_empty_xacts is also true if nothing
is passed for the latter.
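
In code form, the defaulting rule described above might look roughly like this
(a sketch only; skip_empty_streams is the option being proposed in this thread,
and the *_given flags are made-up bookkeeping, not actual test_decoding fields):

/*
 * Sketch only: let skip_empty_streams imply skip_empty_xacts when the
 * latter was not given explicitly, instead of erroring out.
 */
static void
resolve_skip_empty_defaults(TestDecodingData *data,
                            bool xacts_given, bool streams_given)
{
    if (streams_given && !xacts_given)
        data->skip_empty_xacts = data->skip_empty_streams;
}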

> Can't we simply ignore 'skip_empty_streams' if 'stream-changes' is not given?

Yeah, we can do that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical streaming of xacts via test_decoding is broken

2020-11-11 Thread Amit Kapila
On Wed, Nov 11, 2020 at 10:00 AM Dilip Kumar  wrote:
>
> On Tue, Nov 10, 2020 at 7:20 PM Amit Kapila  wrote:
> >
> > On Tue, Nov 10, 2020 at 2:25 PM Dilip Kumar  wrote:
> > >
> >
> > You can probably add a comment as to why we are disallowing this case.
> > I thought of considering 'stream-changes' parameter here because it
> > won't make sense to give this parameter without it, however, it seems
> > that is not necessary but maybe adding a comment
> > here in that regard would be a good idea.
>
> Should we also consider the case that if the user just passed
> skip_empty_streams to true then we should automatically set
> skip_empty_xacts to true?
>

Is there any problem if we don't do this? Actually, adding
dependencies on parameters is confusing so I want to avoid that unless
it is really required.

> And we will allow the 'skip_empty_streams' parameter only if
> stream-changes' is true.
>

Can't we simply ignore 'skip_empty_streams' if 'stream-changes' is not given?

-- 
With Regards,
Amit Kapila.




Re: cutting down the TODO list thread

2020-11-11 Thread John Naylor
On Tue, Nov 10, 2020 at 7:08 PM Bruce Momjian  wrote:

> On Tue, Nov  3, 2020 at 02:06:13PM -0400, John Naylor wrote:
> > I was thinking of not having the next updates during commitfest, but it
> could
> > also be argued this is a type of review, and the things here will be
> returned
> > with feedback or rejected, in a way. Ultimately, it comes down to "when
> time
> > permits".
>
> I don't understand what this is referring to.  Thanks for the rest of
> the work.
>

This was awkwardly phrased, but I was concerned future proposals for
removal would be easy to miss during commitfest. At this point, I'm
thinking it isn't an issue.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Clean up optional rules in grammar

2020-11-11 Thread John Naylor
On Wed, Nov 11, 2020 at 5:13 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> It's obviously confusing to have multiple different styles to do the
> same thing.  And these extra rules (including the empty ones) also end
> up in the output, so they create more work down the line.
>
> The attached patch cleans this up to make them all look like the first
> style.
>

+1 for standardizing in this area. It's worth noting that Bison 3.0
introduced %empty for this situation, which is self-documenting. Until we
get there, this is a good step.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Prefer TG_TABLE_NAME over TG_RELNAME in tests

2020-11-11 Thread Greg Sabino Mullane
Thanks so much: the Bucardo PR has been merged in.


Re: ModifyTable overheads in generic plans

2020-11-11 Thread Heikki Linnakangas
I'm still a bit confused and unhappy about the initialization of 
ResultRelInfos and the various fields in them. We've made progress in 
the previous patches, but it's still a bit messy.



/*
 * If transition tuples will be captured, initialize a map to convert
 * child tuples into the format of the table mentioned in the query
 * (root relation), because the transition tuple store can only store
 * tuples in the root table format.  However for INSERT, the map is
 * only initialized for a given partition when the partition itself is
 * first initialized by ExecFindPartition.  Also, this map is also
 * needed if an UPDATE ends up having to move tuples across
 * partitions, because in that case the child tuple to be moved first
 * needs to be converted into the root table's format.  In that case,
 * we use GetChildToRootMap() to either create one from scratch if
 * we didn't already create it here.
 *
 * Note: We cannot always initialize this map lazily, that is, use
 * GetChildToRootMap(), because AfterTriggerSaveEvent(), which needs
 * the map, doesn't have access to the "target" relation that is
 * needed to create the map.
 */
if (mtstate->mt_transition_capture && operation != CMD_INSERT)
{
    Relation    relation = resultRelInfo->ri_RelationDesc;
    Relation    targetRel = mtstate->rootResultRelInfo->ri_RelationDesc;

    resultRelInfo->ri_ChildToRootMap =
        convert_tuples_by_name(RelationGetDescr(relation),
                               RelationGetDescr(targetRel));
    /* First time creating the map for this result relation. */
    Assert(!resultRelInfo->ri_ChildToRootMapValid);
    resultRelInfo->ri_ChildToRootMapValid = true;
}


The comment explains that AfterTriggerSaveEvent() cannot use 
GetChildToRootMap(), because it doesn't have access to the root target 
relation. But there is a field for that in ResultRelInfo: 
ri_PartitionRoot. However, that's only set up when we do partition routing.


How about we rename ri_PartitionRoot to e.g ri_RootTarget, and set it 
always, even for non-partition inheritance? We have that information 
available when we initialize the ResultRelInfo, so might as well.


Some code currently checks ri_PartitionRoot, to determine if a tuple 
that's been inserted, has been routed. For example:



/*
 * Also check the tuple against the partition constraint, if there is
 * one; except that if we got here via tuple-routing, we don't need to
 * if there's no BR trigger defined on the partition.
 */
if (resultRelationDesc->rd_rel->relispartition &&
    (resultRelInfo->ri_PartitionRoot == NULL ||
     (resultRelInfo->ri_TrigDesc &&
      resultRelInfo->ri_TrigDesc->trig_insert_before_row)))
    ExecPartitionCheck(resultRelInfo, slot, estate, true);


So if we set ri_PartitionRoot always, we would need some other way to 
determine if the tuple at hand has actually been routed or not. But 
wouldn't that be a good thing anyway? Isn't it possible that the same 
ResultRelInfo is sometimes used for routed tuples, and sometimes for 
tuples that have been inserted/updated "directly"? 
ExecLookupUpdateResultRelByOid() sets that field lazily, so I think it 
would be possible to get here with ri_PartitionRoot either set or not, 
depending on whether an earlier cross-partition update was routed to the 
table.


The above check is just an optimization, to skip unnecessary 
ExecPartitionCheck() calls, but I think this snippet in 
ExecConstraints() needs to get this right:



/*
 * If the tuple has been routed, it's been converted to the
 * partition's rowtype, which might differ from the root
 * table's.  We must convert it back to the root table's
 * rowtype so that val_desc shown error message matches the
 * input tuple.
 */
if (resultRelInfo->ri_PartitionRoot)
{
    AttrMap    *map;

    rel = resultRelInfo->ri_PartitionRoot;
    tupdesc = RelationGetDescr(rel);

Re: Clean up optional rules in grammar

2020-11-11 Thread Peter Eisentraut

On 2020-11-11 12:43, Vik Fearing wrote:

No objections, but could we also take this opportunity to standardize
the comment itself?  Even in your patch there is a mix of spacing and
casing.

My preference is /* EMPTY */.  That is, uppercase with spaces, but
whatever gets chosen is fine with me.


Looks like /*EMPTY*/ is the most common right now.  I agree this should 
be straightened out.





Re: abstract Unix-domain sockets

2020-11-11 Thread Peter Eisentraut

On 2020-11-10 07:24, Michael Paquier wrote:

Can you sketch how you would structure this?  I realize it's not very
elegant, but I couldn't come up with a better way that didn't involve having
to duplicate some of the error messages into multiple branches.


I think that I would use a StringInfo to build each sentence of the
hint separately.  The first sentence, "Is another postmaster already
running on port %d?" is already known.  Then the second sentence could
be built depending on the two other conditions.


I'm not sure concatenating sentences like that is okay for translatability.


FWIW, I think that it
is confusing to mention in the hint to remove a socket file that
cannot be removed.


Thinking about it further, I think the hint in the Unix-domain socket 
case is bogus.  A socket in the file-system namespace never reports 
EADDRINUSE anyway, it just overwrites the file.  For sockets in the 
abstract namespace, you can get this error, but of course there is no 
file to remove.


Perhaps we should change the hint in both the Unix and the IP cases to:

"Is another postmaster already running at this address?"

(This also resolves the confusing reference to "port" in the Unix case.)
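
Roughly, that would look like this (a sketch only, written from memory of
the surrounding code in StreamServerPort(), not final wording):

    ereport(LOG,
            (errcode_for_socket_access(),
             errmsg("could not bind %s address \"%s\": %m",
                    familyDesc, addrDesc),
             saved_errno == EADDRINUSE ?
             errhint("Is another postmaster already running at this address?") : 0));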

Or we just drop the hint in the Unix case.  The primary error message is 
clear enough.


--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/




RE: pgbench: option delaying queries till connections establishment?

2020-11-11 Thread Fabien COELHO



Hello,


I can remove the line, but I strongly believe that reporting performance
figures when some client connection failed, and thus the bench could not
run as prescribed, is bad behavior. The good news is that it is probably
quite unlikely. So I'd prefer to keep it and possibly submit a patch to
change the behavior.


I agree such a situation is very bad, and I understand you have a plan to
submit patches to fix it. If so, leaving the lines as a TODO is OK.


Thanks.


Should be this one: https://commitfest.postgresql.org/30/2624/


This discussion is still ongoing, but I can see that the starting time
may be delayed by looking up all pgbench variables.


Yep, that's it.


(I think the status of this thread might be wrong. It should be
'Needs review', but it is now 'Waiting on Author'.)


I changed it to "Needs review".


This patch is mostly good and its review status can change soon;
however, I think it should wait for that related patch.


Hmmm.


Please discuss how to fix it with Tom,


I would not have the presumption to pressure Tom's agenda in any way!


and then this can be committed soon.


and this will wait till its time comes. In the mean time, I think that you 
should put the patch status as you see fit, independently of the other 
patch: it seems unlikely that they would be committed together, and I'll 
have to merge the remaining one anyway.


--
Fabien.




Re: making update/delete of inheritance trees scale better

2020-11-11 Thread Heikki Linnakangas

On 29/10/2020 15:03, Amit Langote wrote:

Rebased over the recent executor result relation related commits.


ModifyTablePath didn't get the memo that a ModifyTable can only have one 
subpath after these patches. Attached patch, on top of your v5 patches, 
cleans that up.


- Heikki
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fffe5bb7f0e..715eaece37e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2623,7 +2623,7 @@ static ModifyTable *
 create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 {
 	ModifyTable *plan;
-	Path	   *subpath = (Path *) linitial(best_path->subpaths);
+	Path	   *subpath = best_path->subpath;
 	Plan	   *subplan;
 
 	/* Build the plan. */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index a1c015435f9..2c8f8b7fc0c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1825,8 +1825,7 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 		rootRelation,
 		root->partColsUpdated,
 		resultRelations,
-		list_make1(path),
-		list_make1(root),
+		path,
 		updateTargetLists,
 		withCheckOptionLists,
 		returningLists,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 5581cd30088..bf4d7fc348b 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3521,8 +3521,7 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partColsUpdated' is true if any partitioning columns are being updated,
  *		either from the target relation or a descendent partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
- * 'subpaths' is a list of Path(s) producing source data (one per rel)
- * 'subroots' is a list of PlannerInfo structs (one per rel)
+ * 'subpath' is a Path producing source data
  * 'updateTargetLists' is a list of UPDATE targetlists.
  * 'withCheckOptionLists' is a list of WCO lists (one per rel)
  * 'returningLists' is a list of RETURNING tlists (one per rel)
@@ -3535,15 +3534,14 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 		CmdType operation, bool canSetTag,
 		Index nominalRelation, Index rootRelation,
 		bool partColsUpdated,
-		List *resultRelations, List *subpaths,
-		List *subroots, List *updateTargetLists,
+		List *resultRelations, Path *subpath,
+		List *updateTargetLists,
 		List *withCheckOptionLists, List *returningLists,
 		List *rowMarks, OnConflictExpr *onconflict,
 		int epqParam)
 {
 	ModifyTablePath *pathnode = makeNode(ModifyTablePath);
 	double		total_size;
-	ListCell   *lc;
 
 	Assert(withCheckOptionLists == NIL ||
 		   list_length(resultRelations) == list_length(withCheckOptionLists));
@@ -3575,18 +3573,13 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->path.total_cost = 0;
 	pathnode->path.rows = 0;
 	total_size = 0;
-	foreach(lc, subpaths)
-	{
-		Path	   *subpath = (Path *) lfirst(lc);
 
-		if (lc == list_head(subpaths))	/* first node? */
-			pathnode->path.startup_cost = subpath->startup_cost;
-		pathnode->path.total_cost += subpath->total_cost;
-		if (returningLists != NIL)
-		{
-			pathnode->path.rows += subpath->rows;
-			total_size += subpath->pathtarget->width * subpath->rows;
-		}
+	pathnode->path.startup_cost = subpath->startup_cost;
+	pathnode->path.total_cost += subpath->total_cost;
+	if (returningLists != NIL)
+	{
+		pathnode->path.rows += subpath->rows;
+		total_size += subpath->pathtarget->width * subpath->rows;
 	}
 
 	/*
@@ -3605,8 +3598,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->rootRelation = rootRelation;
 	pathnode->partColsUpdated = partColsUpdated;
 	pathnode->resultRelations = resultRelations;
-	pathnode->subpaths = subpaths;
-	pathnode->subroots = subroots;
+	pathnode->subpath = subpath;
 	pathnode->updateTargetLists = updateTargetLists;
 	pathnode->withCheckOptionLists = withCheckOptionLists;
 	pathnode->returningLists = returningLists;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 291c27ac503..6e8f2734f1d 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1824,8 +1824,7 @@ typedef struct ModifyTablePath
 	Index		rootRelation;	/* Root RT index, if target is partitioned */
 	bool		partColsUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
-	List	   *subpaths;		/* Path(s) producing source data */
-	List	   *subroots;		/* per-target-table PlannerInfos */
+	Path	   *subpath;		/* Path producing source data */
 	List	   *updateTargetLists; /* per-target-table UPDATE tlists */
 	List	   *withCheckOptionLists;	/* per-target-table WCO lists */
 	List	   *returningLists; /* per-target-table 

Re: Clean up optional rules in grammar

2020-11-11 Thread Vik Fearing
On 11/11/20 10:12 AM, Peter Eisentraut wrote:
> There are a number of rules like this in the grammar:
> 
> opt_foo: FOO
>     | /*EMPTY*/
> ;
> 
> And there are some like this:
> 
> opt_foo: FOO {}
>     | /*EMPTY*/  {}
> ;
> 
> and some even like this:
> 
> %type  opt_foo
> 
> opt_foo: FOO { $$ = NULL; }
>     | /*EMPTY*/  { $$ = NULL; }
> ;
> 
> (I mean here specifically those rules where FOO is a noise word and the
> actions are the same in each branch.)
> 
> It's obviously confusing to have multiple different styles to do the
> same thing.  And these extra rules (including the empty ones) also end
> up in the output, so they create more work down the line.
> 
> The attached patch cleans this up to make them all look like the first
> style.

No objections, but could we also take this opportunity to standardize
the comment itself?  Even in your patch there is a mix of spacing and
casing.

My preference is /* EMPTY */.  That is, uppercase with spaces, but
whatever gets chosen is fine with me.
-- 
Vik Fearing




RE: pgbench: option delaying queries till connections establishment?

2020-11-11 Thread kuroda.hay...@fujitsu.com
Dear Fabien,

> Usually I run many pgbench through scripts, so I'm probably not there to 
> check a lone stderr failure at the beginning if performance figures are
> actually reported.

> I can remove the line, but I strongly believe that reporting performance
> figures when some client connection failed, and thus the bench could not
> run as prescribed, is bad behavior. The good news is that it is probably
> quite unlikely. So I'd prefer to keep it and possibly submit a patch to
> change the behavior.

I agree such a situation is very bad, and I understand you have a plan to
submit patches to fix it. If so, leaving the lines as a TODO is OK.

> Should be this one: https://commitfest.postgresql.org/30/2624/

This discussion is still ongoing, but I can see that the starting time
may be delayed by looking up all pgbench variables.
(I think the status of this thread might be wrong. It should be
'Needs review', but it is now 'Waiting on Author'.)

This patch is mostly good and its review status can change soon;
however, I think it should wait for that related patch.
Please discuss how to fix it with Tom, and then this can be committed soon.

Hayato Kuroda
FUJITSU LIMITED





Re: [HACKERS] logical decoding of two-phase transactions

2020-11-11 Thread Ajin Cherian
I did some further tests on the problem I saw, and I see that it does not
have anything to do with this patch. I picked code from the top of HEAD.
If I have enough changes in a transaction to initiate streaming, then
it will also stream the same changes after a commit.

BEGIN;
INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM
generate_series(1,800) g(i);
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1', 'stream-changes', '1');
** see streamed output here **
END;
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1', 'stream-changes', '1');
** see the same streamed output here **

I think this is because, since the transaction has not been committed,
SnapBuildCommitTxn is not called, which is what moves
"builder->start_decoding_at", and as a result later calls to
pg_logical_slot_get_changes will start from the previous LSN. I did a
quick test in pgoutput using pub/sub and I don't see duplication of data
there, but I haven't verified exactly what happens.

regards,
Ajin Cherian
Fujitsu Australia




Re: Clean up optional rules in grammar

2020-11-11 Thread Heikki Linnakangas

On 11/11/2020 11:12, Peter Eisentraut wrote:

The attached patch cleans this up to make them all look like the first
style.


+1

- Heikki




Re: warn_unused_results

2020-11-11 Thread Peter Eisentraut

On 2020-11-10 04:34, Michael Paquier wrote:

I am not sure
about the addition of repalloc(), as it is quite obvious that one has
to use its result.  Lists are fine, these are proper to PG internals
and beginners tend to be easily confused in the way to use them.

realloc() is listed in the GCC documentation as the reason this option
exists, and glibc tags its realloc() with this attribute, so doing the same
for repalloc() seems sensible.

Good point.


committed
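
For anyone who wants to see the effect in isolation, here is a tiny
standalone example (the macro and function names are only illustrative,
not what the tree uses):

#include <stdio.h>
#include <stdlib.h>

/* Wrap the GCC attribute behind a macro, in the same spirit as the
 * tree's compiler-extension wrappers; the name is illustrative. */
#if defined(__GNUC__)
#define nodiscard_result __attribute__((warn_unused_result))
#else
#define nodiscard_result
#endif

/* A toy repalloc-alike: callers must use the returned pointer, since
 * the old one may have been freed and moved by realloc(). */
static nodiscard_result void *
toy_repalloc(void *pointer, size_t size)
{
    void       *ret = realloc(pointer, size);

    if (ret == NULL)
    {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }
    return ret;
}

int
main(void)
{
    char       *buf = malloc(8);

    if (buf == NULL)
        return 1;

    /* Uncommenting the next line makes gcc warn:
     *   ignoring return value of 'toy_repalloc' declared with
     *   attribute 'warn_unused_result'
     */
    /* toy_repalloc(buf, 16); */

    buf = toy_repalloc(buf, 16);    /* correct usage, no warning */
    free(buf);
    return 0;
}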

--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/




Strange GiST logic leading to uninitialized memory access in pg_trgm gist code

2020-11-11 Thread Andrew Gierth
(From a report by user "ftzdomino" on IRC of a segfault while loading
data with a pg_trgm gist index)

If gtrgm_picksplit is invoked on a vector of exactly 2 items (which I
think is rare, but it can happen if gistSplit recurses or I think in
cases of secondary splits), then it tries to access cache[2] without
ever having initialized it, causing hilarity to ensue.

What I don't entirely understand is why the code is insisting on
treating the last item as special: given N items, it tries to find seeds
from the first N-1 items only. This would make a vague sort of sense if
the N'th item was always the just-inserted one, but this doesn't appear
to be always the case (e.g. recursive calls in gistSplit, or secondary
splits), and even if it were it's not clear that it would be correct
logic.

What gtrgm_picksplit currently does, as I read it, is:

  take first N-1 items
    (note that entryvec->n is N+1, it sets maxoff = entryvec->n - 2)
  populate a cache of their signatures
  find the two furthest apart as seeds
  if we didn't choose two, then default to items 1 and 2 as seeds
    (note here that if N=2 then item 2 is not cached)
  make datums from the cache entries of the two seeds
    (explodes here when N=2)
  Increase maxoff and construct the cache entry for item N
  Split all N items using the two seeds

Now the obvious simple fix is just to reorder those last two operations,
and the original reporter verified that doing so fixed their problem
(patch attached). But I'd really like to understand the logic here and
whether there is any reason to have this special treatment at all - why
would it not be better to just cache all N items upfront and consider
them all as potential seeds?
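
Something along these lines is what I have in mind, just to sketch it
(not tested; it reuses fillcache() exactly as in the patch below):

    /* Hypothetical alternative: cache every entry up front, so seed
     * selection can consider all N items and the last one needs no
     * special-casing afterwards. */
    maxoff = entryvec->n - 1;
    for (k = FirstOffsetNumber; k <= maxoff; k = OffsetNumberNext(k))
        fillcache(&cache[k], GETENTRY(entryvec, k),
                  &cache_sign[siglen * k], siglen);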

Another issue I don't understand yet is that even though this code is
largely unchanged since 8.x, the original reporter could not reproduce
the crash on any version before 13.0.

Anyone have any ideas? (If not, I'll commit and backpatch something like
the attached patch at some suitable time.)

-- 
Andrew (irc:RhodiumToad)

diff --git a/contrib/pg_trgm/trgm_gist.c b/contrib/pg_trgm/trgm_gist.c
index 9937ef9253..c7f2873e13 100644
--- a/contrib/pg_trgm/trgm_gist.c
+++ b/contrib/pg_trgm/trgm_gist.c
@@ -847,15 +847,22 @@ gtrgm_picksplit(PG_FUNCTION_ARGS)
 	v->spl_nleft = 0;
 	v->spl_nright = 0;
 
+	/*
+	 * We ignored the last entry in the list above, and did not populate its
+	 * cache entry yet, but if we were called with exactly 2 items then seed_2
+	 * now points to it, so we must fill in the cache entry before trying to
+	 * build datums from the seeds.
+	 */
+	maxoff = OffsetNumberNext(maxoff);
+	fillcache(&cache[maxoff], GETENTRY(entryvec, maxoff),
+			  &cache_sign[siglen * maxoff], siglen);
+
 	/* form initial .. */
 	datum_l = gtrgm_alloc(cache[seed_1].allistrue, siglen, cache[seed_1].sign);
 	datum_r = gtrgm_alloc(cache[seed_2].allistrue, siglen, cache[seed_2].sign);
 
 	union_l = GETSIGN(datum_l);
 	union_r = GETSIGN(datum_r);
-	maxoff = OffsetNumberNext(maxoff);
-	fillcache(&cache[maxoff], GETENTRY(entryvec, maxoff),
-			  &cache_sign[siglen * maxoff], siglen);
 
 	/* sort before ... */
 	costvector = (SPLITCOST *) palloc(sizeof(SPLITCOST) * maxoff);

