Re: Global temporary tables

2019-08-16 Thread Konstantin Knizhnik



On 16.08.2019 9:25, Craig Ringer wrote:
On Tue, 13 Aug 2019 at 21:50, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:



As far as I understand relpages and reltuples are set only
when you perform "analyze" of the table.


Also autovacuum's autoanalyze.


When does it happen?
I created a normal table, populated it with some data and then
waited several hours, but pg_class was not updated for this table.



heap_vacuum_rel() in src/backend/access/heap/vacuumlazy.c below

     * Update statistics in pg_class.

which I'm pretty sure is common to explicit vacuum and autovacuum. I 
haven't run up a test to verify 100% but most DBs would never have 
relpages etc set if autovac didn't do it since most aren't explicitly 
VACUUMed at all.


Sorry, I already understood it myself.
But to make vacuum process the table it is necessary to remove or update 
some rows in it.
It seems to be yet another Postgres problem, which was noticed by 
Darafei Praliaskouski some time ago: append-only tables are never 
processed by autovacuum.





I thought it was done when autovac ran an analyze, but it looks like 
it's all autovac. Try setting very aggressive autovac thresholds and 
inserting + deleting a bunch of tuples maybe.
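
For example, something along these lines should trigger autovacuum quickly
(the per-table reloption values are just illustrative):

CREATE TABLE autovac_test(id int, payload text)
  WITH (autovacuum_vacuum_threshold = 10,
        autovacuum_vacuum_scale_factor = 0,
        autovacuum_analyze_threshold = 10,
        autovacuum_analyze_scale_factor = 0);

INSERT INTO autovac_test SELECT g, 'x' FROM generate_series(1, 1000) g;
DELETE FROM autovac_test WHERE id % 2 = 0;   -- produce dead tuples

-- once autovacuum has processed the table, relpages/reltuples should be set:
SELECT relpages, reltuples FROM pg_class WHERE relname = 'autovac_test';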


Attached to this mail are slightly refactored versions of these patches
with fixes for the issues reported in your review.


Thanks.

Did you have a chance to consider my questions too? I see a couple of 
things where there's no patch change, which is fine, but I'd be 
interested in your thoughts on the question/issue in those cases.



Sorry, maybe I didn't notice some of your questions. I have a feeling that
I have replied to all your comments/questions.

Right now I have reread the whole thread and see two open issues:

1. Statistics for global temporary tables (including number of tuples, 
pages and the all-visible flag).
My position is the following: while in most cases it should not be a 
problem, because users rarely create indexes on or analyze temporary 
tables,
there can be situations when differences in the data sets of global 
temporary tables in different backends can really be a problem.
Unfortunately I cannot propose a good solution for this problem. It is 
certainly possible to create some private (per-backend) cache for this 
metadata.

But it seems to require changes in many places.
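
(For reference, the per-table statistics in question are the pg_class
columns that all backends currently see identically; the table name below
is just illustrative:)

SELECT relname, relpages, reltuples, relallvisible
FROM pg_class
WHERE relname = 'my_global_temp_table';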

2. Your concerns about the performance penalty of global temp tables 
accessed through shared buffers compared with local temp tables accessed 
through local buffers.
I think that this concern is no longer relevant because there is an 
implementation of global temp tables using local buffers.
But my experiments do not show a significant difference in access speed 
between shared and local buffers. Since shared buffers are usually 
much larger than local buffers,
there are more chances to hold the whole temp relation in memory without 
spilling it to disk. In this case access to a global temp table will 
be much faster compared with access to a
local temp table. But the fact is that right now, in the most frequent 
scenario of temp table usage:


    SELECT ... INTO TempTable FROM PersistentTable WHERE ...;
    SELECT * FROM TempTable;

local temp tables are more efficient than global temp table access 
through shared buffers.

I think it is explained by caching and eviction policies.
If all the content of the temp table is pulled into memory (pg_prewarm), 
the global temp table with shared buffers becomes faster.
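
(A minimal sketch of the pg_prewarm case; the table name is illustrative:)

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('gt');   -- pull the whole global temp table into shared buffers
SELECT * FROM gt;          -- subsequent scans are then served from memory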



If I forgot or did not notice some of your questions, would you be so kind 
as to repeat them?



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: UNION ALL

2019-08-16 Thread Mark Pasterkamp
First of all, thank you for the replies.

I am using a base installation of postgres 10.10, with no modifications to
any of the system defaults.

I am trying to speedup a join between two tables: the title table and the
cast_info table.

The title table contains information about different movies.
It contains 4626969 records.
The table also has a foreign key index on the cast_info table, enabling the
planner to use a hash join.

The cast_info table contains the information about which actor
was cast in which movie and contains 62039343 records.

The database also contains a materialized view ci_t_15, defined as:
select * from cast_info join title on cast_info.movie_id = title.id
where title.production_year < 2015

I am comparing two queries, q1 and q2.
Query q1 is the original query and q2 is an attempt to reduce the cost of
execution by leveraging the materialized view ci_t_15.

Query q1 is defined as:
select * from cast_info join title on cast_info.movie_id = title.id

Query q2 is defined as
select * from cast_info join title on cast_info.movie_id = title.id
where title.production_year >= 2015
UNION ALL
select * from ci_t_15

Both queries are executed on a Dell XPS laptop with an i7-8750H processor
and 16 (2*8) GB RAM on an SSD, running Ubuntu 18.04.2 LTS.

Running explain analyze on both queries I get the following execution plans.
q1:
"Hash Join  (cost=199773.80..2561662.10 rows=62155656 width=103) (actual
time=855.063..25786.264 rows=62039343 loops=1)"
"  Hash Cond: (cast_info.ci_movie_id = title.t_id)"
"  ->  Seq Scan on cast_info  (cost=0.00..1056445.56 rows=62155656
width=42) (actual time=0.027..3837.722 rows=62039343 loops=1)"
"  ->  Hash  (cost=92232.69..92232.69 rows=4626969 width=61) (actual
time=854.548..854.548 rows=4626969 loops=1)"
"Buckets: 65536  Batches: 128  Memory Usage: 3431kB"
"->  Seq Scan on title  (cost=0.00..92232.69 rows=4626969 width=61)
(actual time=0.005..327.588 rows=4626969 loops=1)"
"Planning time: 5.097 ms"
"Execution time: 27236.088 ms"

q2:
"Append  (cost=123209.65..3713445.65 rows=61473488 width=105) (actual
time=442.207..29713.621 rows=60918189 loops=1)"
"  ->  Gather  (cost=123209.65..2412792.77 rows=10639784 width=103) (actual
time=442.206..14634.427 rows=10046633 loops=1)"
"Workers Planned: 2"
"Workers Launched: 2"
"->  Hash Join  (cost=122209.65..1347814.37 rows=4433243 width=103)
(actual time=471.969..12527.840 rows=3348878 loops=3)"
"  Hash Cond: (cast_info.ci_movie_id = title.t_id)"
"  ->  Parallel Seq Scan on cast_info  (cost=0.00..693870.90
rows=25898190 width=42) (actual time=0.006..7302.679 rows=20679781 loops=3)"
"  ->  Hash  (cost=103800.11..103800.11 rows=792043 width=61)
(actual time=471.351..471.351 rows=775098 loops=3)"
"Buckets: 65536  Batches: 32  Memory Usage: 2515kB"
"->  Seq Scan on title  (cost=0.00..103800.11
rows=792043 width=61) (actual time=0.009..376.127 rows=775098 loops=3)"
"  Filter: (t_production_year >= 2015)"
"  Rows Removed by Filter: 3851871"
"  ->  Seq Scan on ci_t_15  (cost=0.00..1194255.04 rows=50833704 width=105)
(actual time=1.143..11967.391 rows=50871556 loops=1)"
"Planning time: 0.268 ms"
"Execution time: 31379.854 ms"

By using the materialized view I can reduce the number of records going
into the hash join, lowering the time from 25786.264 msec to 12527.840 msec.
However, and this is where my question comes in, this reduction is completely
negated by the cost of appending both results in the UNION ALL command.
I was wondering if this is normal behaviour.
In my mind, I wouldn't expect appending 2 result sets to have such a
relatively huge cost associated with it.
This is also why I asked what exactly a UNION ALL does to achieve its
functionality, to perhaps gain some insight into its cost.


With kind regards,

Mark

On Thu, 15 Aug 2019 at 21:22, Ibrar Ahmed  wrote:

>
>
> On Fri, Aug 16, 2019 at 12:16 AM <066ce...@free.fr> wrote:
>
>> Generally speaking, when executing UNION, a DISTINCT is run afterward on
>> the result set.
>>
>> So, if you're sure that each part of the UNION cannot return a row returned
>> by another one, you may use UNION ALL; you'll cut the cost of the final
>> implicit DISTINCT.
>>
>>
>> - Original Message -
>> From: "Mark Pasterkamp" 
>> To: pgsql-hackers@lists.postgresql.org
>> Sent: Thursday, 15 August 2019 20:37:06
>> Subject: UNION ALL
>>
>>
>> Dear all,
>>
>>
>> I was wondering if someone could help me understand what a UNION ALL
>> actually does.
>>
>>
>> For my thesis I am using Apache Calcite to rewrite queries into using
>> materialized views which I then give to a Postgres database.
>> For some queries, this means that they will be rewritten in a UNION ALL
>> style query between an expression and a table scan of a materialized view.
>> However, contrary to what I expected, the UNION ALL query is actu

Re: Global temporary tables

2019-08-16 Thread Craig Ringer
On Fri, 16 Aug 2019 at 15:30, Konstantin Knizhnik 
wrote:


>
> 1. Statistics for global temporary tables (including number of tuples,
> pages and the all-visible flag).
> My position is the following: while in most cases it should not be a
> problem, because users rarely create indexes on or analyze temporary
> tables,
> there can be situations when differences in the data sets of global temporary
> tables in different backends can really be a problem.
> Unfortunately I cannot propose a good solution for this problem. It is
> certainly possible to create some private (per-backend) cache for this
> metadata.
> But it seems to require changes in many places.
>

Yeah. I don't really like just sharing them but it's not that bad either.


> 2. Your concerns about the performance penalty of global temp tables accessed
> through shared buffers compared with local temp tables accessed through
> local buffers.
> I think that this concern is no longer relevant because there is an
> implementation of global temp tables using local buffers.
> But my experiments do not show a significant difference in access speed
> between shared and local buffers. Since shared buffers are usually much
> larger than local buffers,
> there are more chances to hold the whole temp relation in memory without
> spilling it to disk. In this case access to a global temp table will be
> much faster compared with access to a
> local temp table.
>

You ignore the costs of evicting non-temporary data from shared_buffers,
i.e. contention for space. Also increased chance of backends being forced
to do direct write-out due to lack of s_b space for dirty buffers.

> If all the content of the temp table is pulled into memory (pg_prewarm),
> the global temp table with shared buffers becomes faster.

Who would ever do that?

> If I forgot or did not notice some of your questions, would you be so kind
> as to repeat them?
>



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Global temporary tables

2019-08-16 Thread Craig Ringer
>
>
> On Fri, 16 Aug 2019 at 15:30, Konstantin Knizhnik <
> k.knizh...@postgrespro.ru> wrote:
>
>
>> If I forgot or did not notice some of your questions, would you be so kind
>> as to repeat them?
>>
>
>

Sent early by accident.

Repeating questions:


Why do you need to do all this indirection with changing RelFileNode to
RelFileNodeBackend in the bufmgr, changing BufferGetTag etc? Similarly,
your changes of RelFileNodeBackendIsTemp to RelFileNodeBackendIsLocalTemp.
I'm guessing you did it the way you did to lay the groundwork for
cross-backend sharing, but if so it should IMO be in your second patch that
adds support for using shared_buffers for temp tables, not in the first
patch that adds a minimal global temp tables implementation. Maybe my
understanding of the existing temp table mechanics is just insufficient, as
I see RelFileNodeBackendIsTemp is already used in some aspects of existing
temp relation handling.

Did you look into my suggestion of extending the relmapper so that global
temp tables would have a relfilenode of 0 like pg_class etc, and use a
backend-local map of oid-to-relfilenode mappings?

Similarly, TruncateSessionRelations probably shouldn't need to exist in
this patch in its current form; there's no shared_buffers use to clean and
the same file cleanup mechanism should handle both session-temp and
local-temp relfilenodes.

Sequence initialization ignores sequence startval/firstval settings. Why?
+   value[SEQ_COL_LASTVAL-1] = Int64GetDatumFast(1); /* start
sequence with 1 */
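
For context, the implicit sequence of a serial column can be given a
non-default start and step, so hard-coding 1 here loses those settings;
e.g. (statements purely illustrative):

CREATE TABLE t(id serial, v text);
ALTER SEQUENCE t_id_seq INCREMENT BY 10 RESTART WITH 100;

-- or, with an identity column, directly in the table definition:
CREATE TABLE t2(id bigint GENERATED BY DEFAULT AS IDENTITY
                (START WITH 100 INCREMENT BY 10));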



Doesn't this change the test outcome for RELPERSISTENCE_UNLOGGED?:
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ else if (newrelpersistence != RELPERSISTENCE_TEMP)


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: POC: Cleaning up orphaned files using undo logs

2019-08-16 Thread Dilip Kumar
On Fri, Aug 16, 2019 at 10:56 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-16 09:44:25 +0530, Dilip Kumar wrote:
> > On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar  wrote:
> > >
> > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
> >
> > > >   I think that batch reading should just copy the underlying data into a
> > > >   char* buffer. Only the records that currently are being used by
> > > >   higher layers should get exploded into an unpacked record. That will
> > > >   reduce memory usage quite noticably (and I suspect it also drastically
> > > >   reduce the overhead due to a large context with a lot of small
> > > >   allocations that then get individually freed).
> > >
> > > Ok, I got your idea.  I will analyze it further and work on this if
> > > there is no problem.
> >
> > I think there is one problem: currently, while unpacking the undo
> > record, if the record is compressed (i.e. some of the fields do not
> > exist in the record) then we read those fields from the first record
> > on the page.  But, if we just memcpy the undo pages to the buffers and
> > delay the unpacking until it's needed, it seems that we would need to
> > know the page boundary and also the offset of the
> > first complete record on the page from where we can get that
> > information (which is currently in the undo page header).
>
> I don't understand why that's a problem?
Okay, I was assuming that we would be copying only the data part, not the
complete page including the page header.  If we copy the page header
as well, we might be able to unpack the compressed records too.

>
>
> > As of now even if we leave this issue apart I am not very clear what
> > benefit you are seeing in the way you are describing compared to the
> > way I am doing it now?
> >
> > a) Is it the multiple palloc? If so then we can allocate memory at
> > once and flatten the undo records in that.  Earlier, I was doing that
> > but we need to align each unpacked undo record so that we can access
> > them directly and based on Robert's suggestion I have modified it to
> > multiple palloc.
>
> Part of it.
>
> > b) Is it the memory size problem that the unpack undo record will take
> > more memory compared to the packed record?
>
> Part of it.
>
> > c) Do you think that we will not need to unpack all the records?  But,
> > I think eventually, at the higher level we will have to unpack all the
> > undo records ( I understand that it will be one at a time)
>
> Part of it. There's a *huge* difference between having a few hundred to
> thousand unpacked records, each consisting of several independent
> allocations, in memory and having one large block containing all
> packed records in a batch, and a few allocations for the few unpacked
> records that need to exist.
>
> There's also d) we don't need separate tiny memory copies while holding
> buffer locks etc.

Yeah, that too.  Yet another problem could be: how are we going to
process those records? For that we need to know all the undo
record pointers between start_urecptr and end_urecptr, right?  We
just have the big memory chunk and we have no idea how many undo
records are there and what their undo record pointers are.  And
without knowing that information, I am unable to imagine how we are
going to sort them based on block number.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: block-level incremental backup

2019-08-16 Thread Jeevan Chalke
On Fri, Aug 2, 2019 at 6:43 PM vignesh C  wrote:

> Some comments:
> 1) There will be some link files created for tablespace, we might
> require some special handling for it
>

Yep. I have that in my ToDo.
Will start working on that soon.


> 2)
> Retry functionality is handled only for copying of full files; should
> we handle retry for copying of partial files?
> 3)
> we can have some range for maxretries similar to sleeptime
>

I took help from the pg_standby code related to maxretries and sleeptime.

However, as we don't want to use a system() call now, I have
removed all this kludge and just used fread/fwrite as discussed.


> 4)
> Should we check for malloc failure
>

Used pg_malloc() instead. Same is also suggested by Ibrar.


>
> 5) Should we add display of progress as backup may take some time,
> this can be added as enhancement. We can get other's opinion on this.
>

Can be done afterward once we have the functionality in place.


>
> 6)
> If the backup count increases providing the input may be difficult,
> Shall user provide all the incremental backups from a parent folder
> and can we handle the ordering of incremental backup internally
>

I am not sure of this yet. We need to provide the tablespace mapping too.
But thanks for putting a point here. Will keep that in mind when I revisit
this.


>
> 7)
> Add verbose for copying whole file
>
Done


>
> 8) We can also check if approximate space is available in disk before
> starting combine backup, this can be added as enhancement. We can get
> other's opinion on this.
>

Hmm... will leave it for now. User will get an error anyway.


>
> 9)
> Combine backup into directory can be combine backup directory
>
Done


>
> 10)
> MAX_INCR_BK_COUNT can be increased a little; some applications use 1
> full backup at the beginning of the month and 30 incremental
> backups for the rest of the days in the month
>

Yeah, agree. But using any number here is debatable.
Let's see others' opinions too.


> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
>


Attached are new sets of patches with the refactoring done separately.
The incremental backup patch became small now and is hopefully more
readable than the first version.

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 93041c12a7d07bf17073c9cf4571bd3b5a8acc81 Mon Sep 17 00:00:00 2001
From: Jeevan Chalke 
Date: Fri, 16 Aug 2019 14:08:34 +0530
Subject: [PATCH 1/4] Add support for command line option to pass LSN.

This adds [ LSN 'lsn' ] to BASE_BACKUP command and --lsn=LSN to
the pg_basebackup binary.

Also, add small tests.
---
 src/backend/replication/basebackup.c | 20 
 src/backend/replication/repl_gram.y  |  6 ++
 src/backend/replication/repl_scanner.l   |  1 +
 src/bin/pg_basebackup/pg_basebackup.c| 15 +--
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 12 +++-
 5 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index c91f66d..74c954b 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -38,6 +38,7 @@
 #include "storage/ipc.h"
 #include "storage/reinit.h"
 #include "utils/builtins.h"
+#include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/relcache.h"
 #include "utils/timestamp.h"
@@ -52,6 +53,7 @@ typedef struct
 	bool		includewal;
 	uint32		maxrate;
 	bool		sendtblspcmapfile;
+	XLogRecPtr 	lsn;
 } basebackup_options;
 
 
@@ -638,6 +640,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
 	bool		o_maxrate = false;
 	bool		o_tablespace_map = false;
 	bool		o_noverify_checksums = false;
+	bool		o_lsn = false;
 
 	MemSet(opt, 0, sizeof(*opt));
 	foreach(lopt, options)
@@ -726,6 +729,23 @@ parse_basebackup_options(List *options, basebackup_options *opt)
 			noverify_checksums = true;
 			o_noverify_checksums = true;
 		}
+		else if (strcmp(defel->defname, "lsn") == 0)
+		{
+			bool		have_error = false;
+
+			if (o_lsn)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("duplicate option \"%s\"", defel->defname)));
+			o_lsn = true;
+
+			/* Validate given LSN and convert it into XLogRecPtr. */
+			opt->lsn = pg_lsn_in_internal(strVal(defel->arg), &have_error);
+			if (XLogRecPtrIsInvalid(opt->lsn))
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 errmsg("invalid value for LSN")));
+		}
 		else
 			elog(ERROR, "option \"%s\" not recognized",
  defel->defname);
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index c4e11cc..c24d319 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -87,6 +87,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
+%token K_LSN
 
 %type 	command
 %type 	base_backup start_replication start_

Re: block-level incremental backup

2019-08-16 Thread Jeevan Chalke
On Fri, Aug 9, 2019 at 6:07 AM Jeevan Ladhe 
wrote:

> Hi Jeevan,
>
> I have reviewed the backup part at code level and still looking into the
> restore(combine) and functional part of it. But, here are my comments so
> far:
>

Thank you Jeevan Ladhe for reviewing the changes.


>
> The patches need rebase.
>

Done.


> Maybe we should rename it to something like:
> "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP
> START LOCATION"
> to make it more intuitive?
>

As discussed, used "INCREMENTAL BACKUP REFERENCE WAL LOCATION".

> The file header structure is defined in both basebackup.c and
> pg_combinebackup.c. I think it is better to move this to
> replication/basebackup.h.
>

Yep. That was on my cleanup list. Done now.


> I think we can avoid having flag isrelfile in sendFile().
> Something like this:
>
> Also, having isrelfile as part of the following condition:
> is confusing, because even the relation files in full backup are going to
> be
> backed up by this loop only, but still, the condition reads '(!isrelfile
> &&...)'.
>

In the refactored patch I have moved the full backup code into a separate
function.
And now all the incremental backup code is also in its own function.
Hopefully, the code is now more readable.


>
> IMHO, while labels are not advisable in general, it may be better to use a
> label
> here rather than a while(1) loop, so that we can move to the label in case
> we
> want to retry once. I think here it opens doors for future bugs if someone
> happens to add code here, ending up adding some condition and then the
> break becomes conditional. That will leave us in an infinite loop.
>

I kept it as is as I don't see any correctness issue here.

Similar to structure partial_file_header, I think above macro can also be
> moved
> to basebackup.h instead of defining it twice.
>

Yes. Done.


> I think this is a huge memory request (1GB) and may fail on busy/loaded
> server at
> times. We should check for failures of malloc, maybe throw some error on
> getting ENOMEM as errno.
>

Agree. Done.


> Here, should we not expect statbuf->st_size < (RELSEG_SIZE * BLCKSZ), and
> it
> should be safe to read just statbuf->st_size always, I guess? But, I am ok
> with
> having this extra guard here.
>

Yes, we can do it this way. Added an Assert() before that and used just
statbuf->st_size.

In sendFile(), I am sorry if I am missing something, but I am not able to
> understand why 'cnt' and 'i' should have different values when they are
> being
> passed to verify_page_checksum(). I think passing only one of them should
> be
> sufficient.
>

As discussed offline, you meant to say i and blkno.
These two are different: i represents the current block offset within the read
buffer, whereas blkno is the block offset from the start of the file. For
incremental
backup they are the same, as we read the whole file, but they are different in
the case
of a regular full backup, where we read 4 blocks at a time. The value of i
there will be
between 0 and 3.


> Maybe we should just have a variable no_of_blocks to store a number of
> blocks,
> rather than calculating this say RELSEG_SIZE(i.e. 131072) times in the
> worst
> case.
>

OK. Done.


> Sorry if I am missing something, but, should not it be just:
>
> len = cnt;
>

Yeah. Done.


> As I said earlier in my previous email, we now do not need
> +decode_lsn_internal()
> as it is already taken care of by the introduction of the function
> pg_lsn_in_internal().
>

Yes. Done that and rebased on latest HEAD.


>
> Regards,
> Jeevan Ladhe
>

Patches attached in the previous reply.

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-08-16 Thread Ibrar Ahmed
On Fri, Aug 16, 2019 at 3:24 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
> On Fri, Aug 2, 2019 at 6:43 PM vignesh C  wrote:
>
>> Some comments:
>> 1) There will be some link files created for tablespace, we might
>> require some special handling for it
>>
>
> Yep. I have that in my ToDo.
> Will start working on that soon.
>
>
>> 2)
>> Retry functionality is hanlded only for copying of full files, should
>> we handle retry for copying of partial files
>> 3)
>> we can have some range for maxretries similar to sleeptime
>>
>
> I took help from pg_standby code related to maxentries and sleeptime.
>
> However, as we don't want to use system() call now, I have
> removed all this kludge and just used fread/fwrite as discussed.
>
>
>> 4)
>> Should we check for malloc failure
>>
>
> Used pg_malloc() instead. Same is also suggested by Ibrar.
>
>
>>
>> 5) Should we add display of progress as backup may take some time,
>> this can be added as enhancement. We can get other's opinion on this.
>>
>
> Can be done afterward once we have the functionality in place.
>
>
>>
>> 6)
>> If the backup count increases providing the input may be difficult,
>> Shall user provide all the incremental backups from a parent folder
>> and can we handle the ordering of incremental backup internally
>>
>
> I am not sure of this yet. We need to provide the tablespace mapping too.
> But thanks for putting a point here. Will keep that in mind when I revisit
> this.
>
>
>>
>> 7)
>> Add verbose for copying whole file
>>
> Done
>
>
>>
>> 8) We can also check if approximate space is available in disk before
>> starting combine backup, this can be added as enhancement. We can get
>> other's opinion on this.
>>
>
> Hmm... will leave it for now. User will get an error anyway.
>
>
>>
>> 9)
>> Combine backup into directory can be combine backup directory
>>
> Done
>
>
>>
>> 10)
>> MAX_INCR_BK_COUNT can be increased little, some applications use 1
>> full backup at the beginning of the month and use 30 incremental
>> backups rest of the days in the month
>>
>
> Yeah, agree. But using any number here is debatable.
> Let's see others opinion too.
>
Why not use a list?


>
>
>> Regards,
>> Vignesh
>> EnterpriseDB: http://www.enterprisedb.com
>>
>
>
> Attached new sets of patches with refactoring done separately.
> Incremental backup patch became small now and hopefully more
> readable than the first version.
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>

+   buf = (char *) malloc(statbuf->st_size);
+   if (buf == NULL)
+   ereport(ERROR,
+   (errcode(ERRCODE_OUT_OF_MEMORY),
+errmsg("out of memory")));

Why are you using malloc? You can use palloc here.




-- 
Ibrar Ahmed


Re: Global temporary tables

2019-08-16 Thread Konstantin Knizhnik



On 16.08.2019 11:37, Craig Ringer wrote:



On Fri, 16 Aug 2019 at 15:30, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

If I forgot or did not notice some of your questions, would you be
so kind as to repeat them?


Sent early by accident.

Repeating questions:


Sorry, but I have answered them (my e-mail from 13.08)!
It looks like you have looked at the wrong version of the patch:
global_shared_temp-1.patch instead of global_private_temp-1.patch, which 
implements global tables accessed through local buffers.





Why do you need to do all this indirection with changing RelFileNode 
to RelFileNodeBackend in the bufmgr, changing BufferGetTag etc? 
Similarly, your changes of RelFileNodeBackendIsTemp 
to RelFileNodeBackendIsLocalTemp . I'm guessing you did it the way you 
did instead to lay the groundwork for cross-backend sharing, but if so 
it should IMO be in your second patch that adds support for using 
shared_buffers for temp tables, not in the first patch that adds a 
minimal global temp tables implementation. Maybe my understanding of 
the existing temp table mechanics is just insufficient as I 
see RelFileNodeBackendIsTemp is already used in some aspects of 
existing temp relation handling.




Sorry, are you really speaking about global_private_temp-1.patch?
This patch doesn't change the bufmgr file at all.
Maybe you looked at the other patch, global_shared_temp-1.patch,
which accesses shared tables through shared buffers and so has to 
change the buffer tag to include the backend ID in it.



Did you look into my suggestion of extending the relmapper so that 
global temp tables would have a relfilenode of 0 like pg_class etc, 
and use a backend-local map of oid-to-relfilenode mappings?


Similarly, TruncateSessionRelations probably shouldn't need to exist 
in this patch in its current form; there's no shared_buffers use to 
clean and the same file cleanup mechanism should handle both 
session-temp and local-temp relfilenodes.


In global_private_temp-1.patch TruncateSessionRelations does nothing 
with shared buffers; it just deletes relation files.






Sequence initialization ignores sequence startval/firstval settings. Why?
+               value[SEQ_COL_LASTVAL-1] = Int64GetDatumFast(1); /* 
start sequence with 1 */






I am handling only the case of implicitly created sequences for 
SERIAL/BIGSERIAL columns.

Is it possible to explicitly specify the initial value and step for them?
If so, this place should definitely be rewritten.



Doesn't this change the test outcome for RELPERSISTENCE_UNLOGGED?:
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ else if (newrelpersistence != RELPERSISTENCE_TEMP)



The RELPERSISTENCE_UNLOGGED case is handled in the previous IF branch.
--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Global temporary tables

2019-08-16 Thread Konstantin Knizhnik



On 16.08.2019 11:32, Craig Ringer wrote:


You ignore the costs of evicting non-temporary data from 
shared_buffers, i.e. contention for space. Also increased chance of 
backends being forced to do direct write-out due to lack of s_b space 
for dirty buffers.
> In case of pulling all content of temp table in memory (pg_prewarm) 
global temp table with shared buffers becomes faster.


Who would ever do that?




I decided to redo my experiments and now get different results which 
illustrate the advantages of global temp tables with shared buffers.
I performed the following test on my desktop with an SSD and 16GB of RAM, 
and Postgres with the default configuration except shared_buffers increased 
to 1GB.
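
(The buffer_usage view used below is not built-in; a sketch of the kind of
pg_buffercache-based view assumed here, which may differ in detail from the
one actually used:)

CREATE EXTENSION IF NOT EXISTS pg_buffercache;
CREATE VIEW buffer_usage AS
SELECT c.relname,
       pg_size_pretty(count(*) * 8192) AS buffered,
       round(100.0 * count(*) /
             (SELECT setting::int FROM pg_settings
               WHERE name = 'shared_buffers'), 1) AS buffer_percent,
       round(100.0 * count(*) * 8192 /
             pg_relation_size(c.oid), 1) AS percent_of_relation
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
JOIN pg_database d ON b.reldatabase = d.oid
                  AND d.datname = current_database()
GROUP BY c.oid, c.relname
ORDER BY 3 DESC;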



postgres=# create table big(pk bigint primary key, val bigint);
CREATE TABLE
postgres=# insert into big values 
(generate_series(1,1),generate_series(1,1)/100);

INSERT 0 1
postgres=# select * from buffer_usage limit 3;
  relname   |  buffered  | buffer_percent | percent_of_relation
------------+------------+----------------+---------------------
 big        | 678 MB     |           66.2 |                16.1
 big_pkey   | 344 MB     |           33.6 |                16.1
 pg_am      | 8192 bytes |            0.0 |                20.0

postgres=# create temp table lt(key bigint, count bigint);
postgres=# \timing
Timing is on.
postgres=# insert into lt (select count(*),val as key from big group by 
val);

INSERT 0 101
Time: 43265.491 ms (00:43.265)
postgres=# select sum(count) from lt;
 sum
------
 5050
(1 row)

Time: 94.194 ms
postgres=# insert into gt (select count(*),val as key from big group by 
val);

INSERT 0 101
Time: 42952.671 ms (00:42.953)
postgres=# select sum(count) from gt;
 sum
------
 5050
(1 row)

Time: 35.906 ms
postgres=# select * from buffer_usage limit 3;
 relname  | buffered | buffer_percent | percent_of_relation
----------+----------+----------------+---------------------
 big      | 679 MB   |           66.3 |                16.1
 big_pkey | 300 MB   |           29.3 |                14.0
 gt       | 42 MB    |            4.1 |               100.0


So the time to store the result in the global temp table is slightly smaller 
than the time to store it in the local temp table, and the time to scan the 
global temp table is half as long!



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-16 Thread Ibrar Ahmed
On Thu, Aug 15, 2019 at 8:21 PM Bruce Momjian  wrote:

> On Thu, Aug 15, 2019 at 11:24:46AM +0200, Antonin Houska wrote:
> > > I think there are several directions we can go after all-cluster
> > > encryption,
> >
> > I think I misunderstood. What you summarize in
> >
> >
> https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#TODO_for_Full-Cluster_Encryption
> >
>
Do we have any status of the TODOs: what has been done and what is left? It
would be much better if we had a link to the discussion of each item.



> > does include
> >
> >
> https://www.postgresql.org/message-id/CAD21AoBjrbxvaMpTApX1cEsO=8N=nc2xVZPB0d9e-VjJ=ya...@mail.gmail.com
> >
> > i.e. per-tablespace keys, right? Then the collaboration should be easier
> than
> > I thought.
>
> No, there is a single tables/indexes key and a WAL key, plus keys for
> rotation.  I explained why per-tablespace keys don't add much value.
>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
> + As you are, so once was I.  As I am, so you will be. +
> +  Ancient Roman grave inscription +
>


-- 
Ibrar Ahmed


Re: [PATCH] Implement INSERT SET syntax

2019-08-16 Thread Ibrar Ahmed
On Fri, Aug 16, 2019 at 8:19 AM Amit Kapila  wrote:

> On Wed, Jul 17, 2019 at 10:00 AM Gareth Palmer 
> wrote:
> >
> > Hello,
> >
> > Attached is a patch that adds the option of using SET clause to specify
> > the columns and values in an INSERT statement in the same manner as that
> > of an UPDATE statement.
> >
> > A simple example that uses SET instead of a VALUES() clause:
> >
> > INSERT INTO t SET c1 = 'foo', c2 = 'bar', c3 = 'baz';
> >
> > Values may also be sourced from a CTE using a FROM clause:
> >
> > WITH x AS (
> >   SELECT 'foo' AS c1, 'bar' AS c2, 'baz' AS c3
> > )
> > INSERT INTO t SET c1 = x.c1, c2 = x.c2, c3 = x.c3 FROM x;
> >
> > The advantage of using the SET clause style is that the column and value
> > are kept together, which can make changing or removing a column or value
> from
> > a large list easier.
> >
> > Internally the grammar parser converts INSERT SET without a FROM clause
> into
> > the equivalent INSERT with a VALUES clause. When using a FROM clause it
> becomes
> > the equivalent of INSERT with a SELECT statement.
> >
> > There was a brief discussion regarding INSERT SET on pgsql-hackers in
> late
> > August 2009 [1].
> >
> > INSERT SET is not part of any SQL standard (that I am aware of), however
> this
> > syntax is also implemented by MySQL [2]. Their implementation does not
> support
> > specifying a FROM clause.
> >
>
> I think this can be a handy feature in some cases as pointed by you,
> but do we really want it for PostgreSQL?  In the last round of
> discussions as pointed by you, there doesn't seem to be a consensus
> that we want this feature.  I guess before spending too much time into
> reviewing this feature, we should first build a consensus on whether
> we need this.
>
>
I agree with you, Amit, that we need a consensus on whether we really need
this feature or not. In the previous discussion there was no resistance to
having it
in PostgreSQL, but there were some problems with the patch. The current patch
is very simple
and not invasive, but still, we need a consensus on it.

Along with users, I request some senior hackers/committers to also
> weigh in about the desirability of this feature.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>
>

-- 
Ibrar Ahmed


Re: Unexpected "shared memory block is still in use"

2019-08-16 Thread Peter Eisentraut
On 2019-08-14 01:22, Tom Lane wrote:
> Attached is a draft patch to change both shmem and sema key selection
> to be based on data directory inode rather than port.
> 
> I considered using "st_ino ^ st_dev", or some such, but decided that
> that would largely just make it harder to manually correlate IPC
> keys with running postmasters.  It's generally easy to find out the
> data directory inode number with "ls", but the extra work to find out
> and XOR in the device number is not so easy, and it's not clear what
> it'd buy us in typical scenarios.

For the POSIX APIs where the numbers are just converted to a string, why
not use both -- or forget about the inodes and use the actual data
directory string.

For the SYSV APIs, the scenario that came to my mind is if someone
starts a bunch of servers each on their own mount, it could happen that
the inodes of the data directories are very similar.

There is also the issue that AFAICT the key_t in the SYSV APIs is always
32-bit whereas inodes are 64-bit.  Probably not a big deal, but it might
prevent an exact one-to-one mapping.

Of course, ftok() is also available here as an existing solution.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2019-08-16 Thread Etsuro Fujita
On Tue, Jul 30, 2019 at 6:00 PM Etsuro Fujita  wrote:
> On Fri, Jul 19, 2019 at 10:44 PM Robert Haas  wrote:
> > On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita  
> > wrote:
> > > I.e., partition_bounds_merge() is performed for each pair of input
> > > partitioned relations for a join relation in try_partitionwise_join().
> > > Since partition_bounds_merge() would need a lot of CPU cycles, I don't
> > > think this is acceptable; ISTM that some redesign is needed to avoid
> > > this.  I'm wondering that once we successfully merged partition bounds
> > > from a pair of input partitioned relations for the join relation, by
> > > using the merged partition bounds, we could get the lists of matching
> > > to-be-joined partitions for subsequent pairs of input partitioned
> > > relations for the join relation in a more efficient way than by
> > > performing partition_bounds_merge() as proposed in the patch.
> >
> > I don't know whether partition_bounds_merge() is well-implemented; I
> > haven't looked.
>
> My concern about that is list partitioning.  In that case that
> function calls partition_list_bounds_merge(), which generates the
> partition bounds for a join relation between two given input
> relations, by performing merge join for a pair of the datums arrays
> from both the input relations.  Since the datums arrays contain all
> non-null list values across all partitions, if the numbers of the list
> values (ie, ndatums') are large, the merge join would require not a
> few cycles, so it would be much expensive to perform the merge join
> for each such pair when considering large N-way partitionwise joins of
> list-partitioned tables with large ndatums.  To see that, I did simple
> tests using a list-partitioned table pt created with the attached,
> which has 10 partitions, each with 1000 list values, so ndatums is
> 1.  (The tests below are performed with
> enable_partitionwise_join=on.)
>
> * 2-way self-join of pt: explain analyze select * from pt t0, pt t1
> where t0.a = t1.a;
>  - HEAD:
>  Planning Time: 1.731 ms
>  Execution Time: 15.159 ms
>  - Patched:
>  Planning Time: 1.884 ms
>  Execution Time: 15.127 ms
>
> * 4-way self-join of pt: explain analyze select * from pt t0, pt t1,
> pt t2, pt t3 where t0.a = t1.a and t1.a = t2.a and t2.a = t3.a;
>  - HEAD:
>  Planning Time: 28.787 ms
>  Execution Time: 34.313 ms
>  - Patched:
>  Planning Time: 40.263 ms
>  Execution Time: 35.019 ms
>
> * 8-way self-join of pt: explain analyze select * from pt t0, pt t1,
> pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a =
> t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a
> and t6.a = t7.a;
>  - HEAD:
>  Planning Time: 2279.653 ms
>  Execution Time: 63.303 ms
>  - Patched:
>  Planning Time: 3834.751 ms
>  Execution Time: 62.949 ms
>
> Actually, these joins would not need the partition-matching algorithm
> the patch adds; we could probably avoid this regression by modifying
> the patch to plan these joins the same way as before, but ISTM that
> these results imply that the cost of performing the merge join for
> each such pair would not be negligible when considering large N-way
> partitionwise joins mentioned above.  Maybe I'm missing something,
> though.
>
> > But in general I don't see an alternative to doing
> > some kind of merging on each pair of input relations. That's just how
> > planning works, and I don't see why it should need to be prohibitively
> > expensive.  I might be missing something, though; do you have an idea?
>
> Yes, I do; but I think I should think a little more about that.

I gave more thought to this.  My idea is to generate the list of
matching partitions to be joined from the partbound info after
creating it with partition_bounds_merge(), as I stated before.  Let me
explain using an example.  Suppose that R, S and T are partitioned
tables, that R=R(a,b,c) is partitioned on ranges of a into three
partitions R1, R2 and R3, that S=S(a,b,c) is partitioned on ranges of
a into three partitions S1, S2 and S3, and that T=T(a,b,c) is
partitioned on ranges of a into three partitions T1, T2 and T3.
Consider a 3-way join query: SELECT * FROM R, S, T WHERE R.a=S.a AND
S.a=T.a;  Suppose that when considering 2-way join R IJ S,
partition_bounds_merge() successfully merges the partition bounds for
R and S, and generates join pairs (R1, S1), (R2, S2) and (R3, S3), and
that when considering 3-way join (R IJ S) IJ T, that function
successfully merges the partition bounds for (R IJ S) and T, and
generates join pairs ((R1 IJ S1), T1) and ((R2 IJ S2), T2).  The
question here is: do we really need to perform
partition_bounds_merge() to generate the list of matching partitions
to be joined for 3-way join R IJ (S IJ T), for example?  I don't think
so; because 1) we see from the 3-way join pairs ((R1 IJ S1), T1) and
((R2 IJ S2), T2) that Ri, Si and Ti (i=1,2) have boundaries
overlapping with each other, which means that there would be (S1, T1)
and (S2, T2) as 2-way join pairs 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-16 Thread Bruce Momjian
On Thu, Aug 15, 2019 at 09:01:05PM -0400, Stephen Frost wrote:
> * Bruce Momjian (br...@momjian.us) wrote:
> > I assume you are talking about my option #1.  I can see if you only need
> > a few tables encrypted, e.g., credit card numbers, it can be excessive
> > to encrypt the entire cluster.  (I think you would need to encrypt
> > pg_statistic too.)
> 
> Or we would need a seperate encrypted pg_statistic, or a way to encrypt
> certain entries inside pg_statistic.

Yes.

> > The tricky part will be WAL --- if we encrypt all of WAL, the per-table
> > overhead might be minimal compared to the WAL encryption overhead.  The
> > better solution would be to add a flag to WAL records to indicate
> > encrypted entries, but you would then leak when an encryption change
> > happens and WAL record length.  (FYI, numeric values have different
> > lengths, as do character strings.)  I assume we would still use a single
> > key for all tables/indexes, and one for WAL, plus key rotation
> > requirements.
> 
> I don't think the fact that a change was done to an encrypted blob is an
> actual 'leak'- anyone can tell that by looking at the at the encrypted
> data before and after.  Further, the actual change would be encrypted,
> right?  Length of data is necessary to include in the vast majority of
> cases that the data is being dealt with and so I'm not sure that it
> makes sense for us to be worrying about that as a leak, unless you have
> a specific recommendation from a well known source discussing that
> concern..?

Yes, it is a minor negative, but we would need to see some performance
reason to have that minor negative, and I have already stated why I
think there might be no performance reason to do so.  Masahiko Sawada's
talk at PGCon 2019 supports that conclusion:

https://www.youtube.com/watch?v=TXKoo2SNMzk

> > I personally would like to see full cluster implemented first to find
> > out exactly what the overhead is.  As I stated earlier, the overhead of
> > determining which things to encrypt, both in code complexity, user
> > interface, and processing overhead, might not be worth it.
> 
> I disagree with this and feel that the overhead that's being discussed
> here (user interface, figuring out if we should encrypt it or not,
> processing overhead for those determinations) is along the lines of
> UNLOGGED tables, yet there wasn't any question about if that was a valid
> or useful feature to implement.  The biggest challenge here is really

We implemented UNLOGGED tables because there was a clear performance win
in doing so.  I have not seen any measurements for encryption,
particularly when WAL is considered.

> around key management and I agree that's difficult but it's also really
> important and something that we need to be thinking about- and thinking
> about how to work with multiple keys and not just one.  Building in an
> assumption that we will only ever work with one key would make this
> capability nothing more than DBA-managed filesystem-level encryption

Agreed, that's all it is.

> (though even there different tablespaces could have different keys...)
> and I worry would make later work to support multiple keys more
> difficult and less likely to actually happen.  It's also not clear to me
> why we aren't building in *some* mechanism to work with multiple keys
> from the start as part of the initial design.

Well, every time I look at multiple keys, I go over exactly what that
means and how it behaves, but get no feedback on how to address the
problems.

> > I can see why you would think that encrypting less would be easier than
> > encrypting more, but security boundaries are hard to construct, and
> > anything that requires a user API, even more so.
> 
> I'm not sure I'm following here- I'm pretty sure everyone understands
> that selective encryption will require more work to implement, in part
> because an API needs to be put in place and we need to deal with
> multiple keys, etc.  I don't think anyone thinks that'll be "easier".

Uh, I thought Masahiko Sawada stated that, but looking back, I don't see
it, so I must be wrong.

> > > > > At least it should be clear how [2] will retrieve the master key 
> > > > > because [1]
> > > > > should not do it in a differnt way. (The GUC 
> > > > > cluster_passphrase_command
> > > > > mentioned in [3] seems viable, although I think [1] uses approach 
> > > > > which is
> > > > > more convenient if the passphrase should be read from console.)
> > > 
> > > I think that we can also provide a way to pass encryption key directly
> > > to postmaster rather than using passphrase. Since it's common that
> > > user stores keys in KMS it's useful if we can do that.
> > 
> > Why would it not be simpler to have the cluster_passphrase_command run
> > whatever command-line program it wants?  If you don't want to use a
> > shell command, create an executable and call that.
> 
> Having direct integration with a KMS would certainly be valuable, and I
> don't see a reaso

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-16 Thread Bruce Momjian
On Fri, Aug 16, 2019 at 05:58:59PM +0500, Ibrar Ahmed wrote:
> 
> 
> On Thu, Aug 15, 2019 at 8:21 PM Bruce Momjian  wrote:
> 
> On Thu, Aug 15, 2019 at 11:24:46AM +0200, Antonin Houska wrote:
> > > I think there are several directions we can go after all-cluster
> > > encryption,
> >
> > I think I misunderstood. What you summarize in
> >
> > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#
> TODO_for_Full-Cluster_Encryption
> >
> 
> Do we have any status of TODO's, what has been done and what left? It's much
> better if we have a link of discussion of each item.

I think some are done and some are in process, but I don't have a good
handle on that yet.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Add test case for sslinfo

2019-08-16 Thread Peter Eisentraut
On 2019-07-08 10:18, Michael Paquier wrote:
> On Mon, Jul 08, 2019 at 02:11:34PM +0800, Hao Wu wrote:
>> Thank you for your quick response! I work on greenplum, and I didn't see
>> this folder(src/test/ssl/ssl) before.
>> I will add more certificates to test and resend again.
> 
> Not having duplicates would be nice.

I think sslinfo should be tested as an extension of src/test/ssl/
instead of its own test suite.  There are too many complications that we
would otherwise have to solve again.

You might want to review commit f60a0e96778854ed0b7fd4737488ba88022e47bd
and how it adds test cases.  You can't just hardcode a specific output
since different installations might report TLS 1.2 vs 1.3, different
ciphers etc.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Global temporary tables

2019-08-16 Thread Konstantin Knizhnik
I did further investigation of the performance of global temp tables with 
shared buffers vs. vanilla (local) temp tables.


1. Combination of persistent and temporary tables in the same query.

Preparation:
create table big(pk bigint primary key, val bigint);
insert into big values 
(generate_series(1,1),generate_series(1,1));

create temp table lt(key bigint, count bigint);
create global temp table gt(key bigint, count bigint);

The size of the table is about 6GB; I ran this test on a desktop with 16GB 
of RAM and Postgres with 1GB of shared buffers.

I ran two queries:

insert into T (select count(*),pk/P as key from big group by key);
select sum(count) from T;

where P is one of (100, 10, 1) and T is the name of the temp table (lt or gt).
The table below contains the times of both queries in msec (insert / select):

Percent of selected data |       1%      |      10%      |      100%
-------------------------+---------------+---------------+---------------
Local temp table         | 44610 /    90 | 47920 /   891 | 63414 / 21612
Global temp table        | 44669 /    35 | 47939 /   298 | 59159 / 26015


As you can see, the insertion time is almost the same for both kinds of
temporary table, and the scan time is about half for the 
global temp table
when it fits in RAM together with the persistent table, and slightly worse 
when it doesn't fit.




2. Temporary table only access.
The same system, but Postgres is configured with shared_buffers=10GB, 
max_parallel_workers = 4, max_parallel_workers_per_gather = 4


Local temp tables:
create temp table local_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, 
x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint);
insert into local_temp values 
(generate_series(1,1),0,0,0,0,0,0,0,0);

select sum(x1) from local_temp;

Global temp tables:
create global temporary table global_temp(x1 bigint, x2 bigint, x3 
bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint);
insert into global_temp values 
(generate_series(1,1),0,0,0,0,0,0,0,0);

select sum(x1) from global_temp;

Results (msec):

                  | Insert | Select
------------------+--------+--------
Local temp table  |  37489 |  48322
Global temp table |  44358 |   3003


So insertion into the local temp table is slightly faster, but the select 
is 16 times slower!


Conclusion:
Under the assumption that the temp table fits in memory, global temp tables 
with shared buffers provide better performance than local temp tables.
I didn't consider global temp tables with local buffers here because for 
them the results should be similar to local temp tables.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Unexpected "shared memory block is still in use"

2019-08-16 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-08-14 01:22, Tom Lane wrote:
>> Attached is a draft patch to change both shmem and sema key selection
>> to be based on data directory inode rather than port.

> For the POSIX APIs where the numbers are just converted to a string, why
> not use both -- or forget about the inodes and use the actual data
> directory string.

Considering that we still need an operation equivalent to "nextSemKey++"
(in case of a key collision), I'm not really sure how working with strings
rather than ints would make life better.

> For the SYSV APIs, the scenario that came to my mind is if someone
> starts a bunch of servers each on their own mount, it could happen that
> the inodes of the data directories are very similar.

Sure.  That's why I didn't throw away any of the duplicate-key-handling
logic, and why we're still checking for st_dev match when inspecting
particular shmem blocks.  (It also seems likely that somebody
who's doing that would be using similar pathnames on the different
mounts, so that string-based approaches wouldn't exactly be free of
collision problems either.)

> There is also the issue that AFAICT the key_t in the SYSV APIs is always
> 32-bit whereas inodes are 64-bit.  Probably not a big deal, but it might
> prevent an exact one-to-one mapping.

True, although the width of inode numbers is probably pretty platform-
and filesystem-dependent.  We could consider trying some more complicated
mapping like xor'ing high and low halves, but I don't entirely see what
it buys us.

> Of course, ftok() is also available here as an existing solution.

I looked at that briefly, but I don't really see what it'd buy us either,
except for opacity which doesn't seem useful.  The Linux man page pretty
much says in so many words that it's a wrapper for st_ino and st_dev;
and how does it help us if other platforms do it differently?

(Actually, if Linux does it the way the man page suggests, it'd really
be a net negative, because there'd only be 24 bits of key variation
not 32.)

regards, tom lane




Re: block-level incremental backup

2019-08-16 Thread Ibrar Ahmed
On Fri, Aug 16, 2019 at 4:12 PM Ibrar Ahmed  wrote:

>
>
>
>
> On Fri, Aug 16, 2019 at 3:24 PM Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote:
>
>>
>>
>> On Fri, Aug 2, 2019 at 6:43 PM vignesh C  wrote:
>>
>>> Some comments:
>>> 1) There will be some link files created for tablespace, we might
>>> require some special handling for it
>>>
>>
>> Yep. I have that in my ToDo.
>> Will start working on that soon.
>>
>>
>>> 2)
>>> Retry functionality is hanlded only for copying of full files, should
>>> we handle retry for copying of partial files
>>> 3)
>>> we can have some range for maxretries similar to sleeptime
>>>
>>
>> I took help from pg_standby code related to maxentries and sleeptime.
>>
>> However, as we don't want to use system() call now, I have
>> removed all this kludge and just used fread/fwrite as discussed.
>>
>>
>>> 4)
>>> Should we check for malloc failure
>>>
>>
>> Used pg_malloc() instead. Same is also suggested by Ibrar.
>>
>>
>>>
>>> 5) Should we add display of progress as backup may take some time,
>>> this can be added as enhancement. We can get other's opinion on this.
>>>
>>
>> Can be done afterward once we have the functionality in place.
>>
>>
>>>
>>> 6)
>>> If the backup count increases providing the input may be difficult,
>>> Shall user provide all the incremental backups from a parent folder
>>> and can we handle the ordering of incremental backup internally
>>>
>>
>> I am not sure of this yet. We need to provide the tablespace mapping too.
>> But thanks for putting a point here. Will keep that in mind when I
>> revisit this.
>>
>>
>>>
>>> 7)
>>> Add verbose for copying whole file
>>>
>> Done
>>
>>
>>>
>>> 8) We can also check if approximate space is available in disk before
>>> starting combine backup, this can be added as enhancement. We can get
>>> other's opinion on this.
>>>
>>
>> Hmm... will leave it for now. User will get an error anyway.
>>
>>
>>>
>>> 9)
>>> Combine backup into directory can be combine backup directory
>>>
>> Done
>>
>>
>>>
>>> 10)
>>> MAX_INCR_BK_COUNT can be increased little, some applications use 1
>>> full backup at the beginning of the month and use 30 incremental
>>> backups rest of the days in the month
>>>
>>
>> Yeah, agree. But using any number here is debatable.
>> Let's see others opinion too.
>>
> Why not use a list?
>
>
>>
>>
>>> Regards,
>>> Vignesh
>>> EnterpriseDB: http://www.enterprisedb.com
>>>
>>
>>
>> Attached new sets of patches with refactoring done separately.
>> Incremental backup patch became small now and hopefully more
>> readable than the first version.
>>
>> --
>> Jeevan Chalke
>> Technical Architect, Product Development
>> EnterpriseDB Corporation
>> The Enterprise PostgreSQL Company
>>
>>
>
> +   buf = (char *) malloc(statbuf->st_size);
>
> +   if (buf == NULL)
>
> +   ereport(ERROR,
>
> +   (errcode(ERRCODE_OUT_OF_MEMORY),
>
> +errmsg("out of memory")));
>
> Why are you using malloc, you can use palloc here.
>
>
>
> Hi, I gave another look at the patch and have some quick comments.


-
> char   *extptr = strstr(fn, ".partial");

I think there should be a better and stricter way to check the file
extension.
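
One stricter approach (just a sketch; the helper name is invented here) is to
require that ".partial" is the suffix of the name rather than merely a
substring somewhere inside it:

#include <stdbool.h>
#include <string.h>

/* True only for names ending in ".partial" (so not "x.partial.bak"). */
static bool
ends_with_partial(const char *fn)
{
    const char *suffix = ".partial";
    size_t      len = strlen(fn);
    size_t      suflen = strlen(suffix);

    return len > suflen && strcmp(fn + len - suflen, suffix) == 0;
}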

-
> +   extptr = strstr(outfn, ".partial");
> +   Assert (extptr != NULL);

Why are you checking that again? You just appended it in the statement
above.

-
> +   if (verbose && statbuf.st_size > (RELSEG_SIZE * BLCKSZ))
> +   pg_log_info("found big file \"%s\" (size: %.2lfGB): %m",
fromfn,
> +   (double) statbuf.st_size /
(RELSEG_SIZE * BLCKSZ));

This is not just a log message: finding a file bigger than that surely
indicates some problem.

-
> +* We do read entire 1GB file in memory while taking incremental
backup; so
> +* I don't see any reason why can't we do that here.  Also,
copying data in
> +* chunks is expensive.  However, for bigger files, we still
slice at 1GB
> +* border.


What do you mean by a bigger file, one greater than 1GB? In which case would
you get a file larger than 1GB?


-- 
Ibrar Ahmed


Re: UNION ALL

2019-08-16 Thread Tom Lane
Mark Pasterkamp  writes:
> I am comparing two queries, q1 and q2 respectively.
> Query q1 is the original query and q2 is an attempt to reduce the cost of
> execution via leveraging the materialized view ci_t_15.
> ...
> Running explain analyze on both queries I get the following execution plans.

Huh ... I wonder why the planner decided to try to parallelize Q2 (and not
Q1)?  That seems like a pretty stupid choice, because if I'm reading the
plan correctly (I might not be) each worker process has to read all of
the "title" table and build its own copy of the hash table.  That seems
likely to swamp whatever performance gain might come from parallelizing
the scan of cast_info --- which is likely to be not much, anyway, on a
laptop with probably-not-great disk I/O bandwidth.

In any case, whether that decision was good or bad, making it differently
renders the performance of Q1 and Q2 not very comparable.  It'd be worth
disabling parallelism (SET max_parallel_workers_per_gather = 0) and
retrying Q2 to get a more apples-to-apples comparison.

Another bit of advice is to increase work_mem, so the hashes don't
have to be split into quite so many batches.

I'm noting also that your queries aren't giving the same results ---
Q2 reports returning fewer rows overall.  Do you have rows where
title.production_year is null, perhaps?

regards, tom lane




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-08-16 Thread Anastasia Lubennikova

13.08.2019 18:45, Anastasia Lubennikova wrote:

  I also added a nearby FIXME comment to
_bt_insertonpg_in_posting() -- I don't think that the code for
splitting a posting list in two is currently crash-safe.


Good catch. It seems, that I need to rearrange the code.
I'll send updated patch this week.


Attached is v7.

In this version of the patch, I heavily refactored the code of insertion
into posting tuples. The bt_split logic is quite complex, so I omitted a
couple of optimizations. They are mentioned in TODO comments.

Now the algorithm is the following:

- If bt_findinsertloc() found out that the tuple belongs to an existing
posting tuple's TID interval, it sets the 'in_posting_offset' variable and
passes it to _bt_insertonpg()

- If 'in_posting_offset' is valid and origtup is valid,
merge our itup into origtup.

It can result in one tuple, neworigtup, which must replace origtup; or in
two tuples, neworigtup and newrighttup, if the result exceeds BTMaxItemSize.

- If the new tuple(s) fit into the old page, we're lucky:
call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to
atomically replace oldtup with the new tuple(s) and generate an xlog record.

- In case a page split is needed, pass both tuples to _bt_split().
 _bt_findsplitloc() is now aware of the upcoming replacement of origtup with
neworigtup, so it uses the correct item size where needed.

It seems that now all replace operations are crash-safe. The new patch passes
all regression tests, so I think it's ready for review again.

In the meantime, I'll run more stress-tests.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..504bca2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 	   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-			ItemPointerGetBlockNumber(&(itup->t_tid)),
-			ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			ItemPointerGetBlockNumber(tid),
+			ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 	(errcode(ERRCODE_INDEX_CORRUPTED),
 	 errmsg("could not find tuple using search from root page in index \"%s\"",
 			RelationGetRelationName(state->rel)),
-	 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+	 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 		itid, htid,
 		(uint32) (state->targetlsn >> 32),
 		(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+current = BTreeTupleGetPostingN(itup, i);
+
+if (ItemPointerCompare(current, &last) <= 0)
+{
+	char	   *itid,
+			   *htid;
+
+	itid = psprintf("(%u,%u)", state->targetblock, offset);
+	htid = psprintf("(%u,%u)",
+	ItemPointerGetBlockNumberNoCheck(current),
+	ItemPointerGetOffsetNumberNoCheck(current));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_INDEX_CORRUPTED),
+			 errmsg("posting list heap TIDs out of order in index \"%s\"",
+	RelationGetRelationName(state->rel)),
+			 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+itid, htid,
+(uint32) (state->targetlsn >> 32),
+(uint32) state->targetlsn)));
+}
+
+ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-			  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+IndexTuple	onetup;
+

REL_12_STABLE crashing with assertion failure in ExtractReplicaIdentity

2019-08-16 Thread Hadi Moshayedi
It seems that sometimes when DELETE cascades to referencing tables we fail
to acquire locks on replica identity index.

To reproduce, set wal_level to logical, and run 1.sql.

I can look into this, but I thought first I should send it here in case
someone who is more familiar with these related functions can solve it
quickly.

I get the following backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x7f301154b801 in __GI_abort () at abort.c:79
#2  0x55df8858a923 in ExceptionalCondition (
conditionName=conditionName@entry=0x55df885fd138
"!(CheckRelationLockedByMe(idx_rel, 1, 1))",
errorType=errorType@entry=0x55df885de8fd
"FailedAssertion",
fileName=fileName@entry=0x55df885fca32 "heapam.c",
lineNumber=lineNumber@entry=7646)
at assert.c:54
#3  0x55df88165e53 in ExtractReplicaIdentity
(relation=relation@entry=0x7f3012b54db0,

tp=tp@entry=0x7ffcf47d53f0, key_changed=key_changed@entry=true,
copy=copy@entry=0x7ffcf47d53d3) at heapam.c:7646
#4  0x55df8816c22b in heap_delete (relation=0x7f3012b54db0,
tid=,
cid=, crosscheck=0x0, wait=true, tmfd=0x7ffcf47d54b0,
changingPart=false)
at heapam.c:2676
#5  0x55df88318b62 in table_tuple_delete (changingPart=false,
tmfd=0x7ffcf47d54b0,
wait=true, crosscheck=, snapshot=,
cid=,
tid=0x7ffcf47d558a, rel=0x7f3012b54db0) at
../../../src/include/access/tableam.h:1216
#6  ExecDelete (mtstate=mtstate@entry=0x55df8a8196a0,
tupleid=0x7ffcf47d558a, oldtuple=0x0,
planSlot=planSlot@entry=0x55df8a81a8e8,
epqstate=epqstate@entry=0x55df8a819798,

estate=estate@entry=0x55df8a819058, processReturning=true,
canSetTag=true,
changingPart=false, tupleDeleted=0x0, epqreturnslot=0x0) at
nodeModifyTable.c:769
#7  0x55df8831aa25 in ExecModifyTable (pstate=0x55df8a8196a0) at
nodeModifyTable.c:2230
#8  0x55df882efa9a in ExecProcNode (node=0x55df8a8196a0)
at ../../../src/include/executor/executor.h:239
#9  ExecutePlan (execute_once=, dest=0x55df88a89a00
,
direction=, numberTuples=0, sendTuples=,
operation=CMD_DELETE, use_parallel_mode=,
planstate=0x55df8a8196a0,
estate=0x55df8a819058) at execMain.c:1648
#10 standard_ExecutorRun (queryDesc=0x55df8a7de4b0, direction=, count=0,
execute_once=) at execMain.c:365
#11 0x55df8832b90c in _SPI_pquery (tcount=0, fire_triggers=false,
queryDesc=0x55df8a7de4b0) at spi.c:2521
#12 _SPI_execute_plan (plan=plan@entry=0x55df8a812828, paramLI=,
snapshot=snapshot@entry=0x0,
crosscheck_snapshot=crosscheck_snapshot@entry=0x0,
read_only=read_only@entry=false, fire_triggers=fire_triggers@entry=false,

tcount=) at spi.c:2296
#13 0x55df8832c15c in SPI_execute_snapshot (plan=plan@entry=0x55df8a812828,

Values=Values@entry=0x7ffcf47d5820, Nulls=Nulls@entry=0x7ffcf47d5a20 "
",
snapshot=snapshot@entry=0x0,
crosscheck_snapshot=crosscheck_snapshot@entry=0x0,
read_only=read_only@entry=false, fire_triggers=false, tcount=0) at
spi.c:616
#14 0x55df88522f32 in ri_PerformCheck (riinfo=riinfo@entry=0x55df8a7f8050,

qkey=qkey@entry=0x7ffcf47d5b28, qplan=0x55df8a812828,
fk_rel=fk_rel@entry=0x7f3012b54db0, pk_rel=pk_rel@entry=0x7f3012b44a28,
oldslot=oldslot@entry=0x55df8a826f88, newslot=0x0, detectNewRows=true,
expect_OK=8)
at ri_triggers.c:2276
#15 0x55df88524653 in RI_FKey_cascade_del (fcinfo=) at
ri_triggers.c:819
#16 0x55df882c9996 in ExecCallTriggerFunc
(trigdata=trigdata@entry=0x7ffcf47d5ff0,

tgindx=tgindx@entry=0, finfo=finfo@entry=0x55df8a825710,
instr=instr@entry=0x0,
per_tuple_context=per_tuple_context@entry=0x55df8a812f10) at
trigger.c:2432
#17 0x55df882cb459 in AfterTriggerExecute (trigdesc=0x55df8a825530,
trigdesc=0x55df8a825530, trig_tuple_slot2=0x0, trig_tuple_slot1=0x0,
per_tuple_context=0x55df8a812f10, instr=0x0, finfo=0x55df8a825710,
relInfo=0x55df8a825418, event=0x55df8a81f0a8, estate=0x55df8a825188) at
trigger.c:4342
#18 afterTriggerInvokeEvents (events=events@entry=0x55df8a7c3e40,
firing_id=1,
estate=estate@entry=0x55df8a825188, delete_ok=delete_ok@entry=false) at
trigger.c:4539
#19 0x55df882d1408 in AfterTriggerEndQuery (estate=estate@entry
=0x55df8a825188)
at trigger.c:4850
#20 0x55df882efd99 in standard_ExecutorFinish (queryDesc=0x55df8a722ab8)
at execMain.c:440
#21 0x55df88464bdd in ProcessQuery (plan=,
sourceText=0x55df8a702f78 "DELETE FROM t1 RETURNING id;", params=0x0,
queryEnv=0x0,
dest=0x55df8a722a20, completionTag=0x7ffcf47d6180 "DELETE 11") at
pquery.c:203
#22 0x55df88464e0b in PortalRunMulti (portal=portal@entry=0x55df8a7692f8,

isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=true,

dest=dest@entry=0x55df8a722a20, altdest=0x55df88a81040 ,
completionTag=completionTag@entry=0x7ffcf47d6180 "DELETE 11") at
pquery.c:1283
#23 0x55df88465119 in FillPortalStore (portal=portal@entry=0x55df8a7692f8,

isTopLevel=isTopLevel@entry=true) at pquery.c:1030
#24 0x55df88465d1d in 

Re: Global temporary tables

2019-08-16 Thread Pavel Stehule
pá 16. 8. 2019 v 16:12 odesílatel Konstantin Knizhnik <
k.knizh...@postgrespro.ru> napsal:

> I did more investigations of performance of global temp tables with shared
> buffers vs. vanilla (local) temp tables.
>
> 1. Combination of persistent and temporary tables in the same query.
>
> Preparation:
> create table big(pk bigint primary key, val bigint);
> insert into big values
> (generate_series(1,1),generate_series(1,1));
> create temp table lt(key bigint, count bigint);
> create global temp table gt(key bigint, count bigint);
>
> Size of table is about 6Gb, I run this test on desktop with 16GB of RAM
> and postgres with 1Gb shared buffers.
> I run two queries:
>
> insert into T (select count(*),pk/P as key from big group by key);
> select sum(count) from T;
>
> where P is (100,10,1) and T is name of temp table (lt or gt).
> The table below contains times of both queries in msec:
>
> Percent of selected data     1%             10%            100%
> Local temp table             44610 / 90     47920 / 891    63414 / 21612
> Global temp table            44669 / 35     47939 / 298    59159 / 26015
> (each cell: insert time / select time, msec)
>
> As you can see, time of insertion in temporary table is almost the same
> and time of traversal of temporary table is about twice smaller for global
> temp table
> when it fits in RAM together with the persistent table and slightly worse
> when it doesn't fit.
>
>
>
> 2. Temporary table only access.
> The same system, but Postgres is configured with shared_buffers=10GB,
> max_parallel_workers = 4, max_parallel_workers_per_gather = 4
>
> Local temp tables:
> create temp table local_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint,
> x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint);
> insert into local_temp values
> (generate_series(1,1),0,0,0,0,0,0,0,0);
> select sum(x1) from local_temp;
>
> Global temp tables:
> create global temporary table global_temp(x1 bigint, x2 bigint, x3 bigint,
> x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint);
> insert into global_temp values
> (generate_series(1,1),0,0,0,0,0,0,0,0);
> select sum(x1) from global_temp;
>
> Results (msec):
>
>                      Insert    Select
> Local temp table     37489     48322
> Global temp table    44358     3003
>
> So insertion in local temp table is performed slightly faster but select
> is 16 times slower!
>
> Conclusion:
> In the assumption then temp table fits in memory, global temp tables with
> shared buffers provides better performance than local temp table.
> I didn't consider here global temp tables with local buffers because for
> them results should be similar with local temp tables.
>

Probably there is no reason why shared buffers should be slower than
local buffers when the system is under low load.

Access to shared memory is protected by spin locks (which are cheap for a few
processes), so tests with one or a few processes are not too important (or
they cover just one side of the space).

Another topic can be performance on MS Sys - there are stories about
less-than-perfect performance of shared memory there.

Regards

Pavel



>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-16 Thread Antonin Houska
Bruce Momjian  wrote:

> I have seen no one present a clear description of how anything beyond
> all-cluster encryption would work or be secure.  Wishing that were not
> the case doesn't change things.

Since this email thread has grown a lot and is difficult to follow, it might
help if we summarized various approaches on the wiki, with their pros and
cons, and included some links to the corresponding emails in the
archive. There might be people who would like to think about the problems but
don't have time to read the whole thread. Overview of the pending problems of
particular approaches might be useful for newcomers, but also for people who
followed only part of the discussion. I mean an overview of the storage
problems; the key management seems to be less controversial.

If you think it makes sense, I can spend some time next week on the
research. However I'd need at least an outline of the approaches proposed
because I also missed some parts of the thread.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: REL_12_STABLE crashing with assertion failure in ExtractReplicaIdentity

2019-08-16 Thread Andres Freund
Hi,

On 2019-08-16 09:44:15 -0700, Hadi Moshayedi wrote:
> It seems that sometimes when DELETE cascades to referencing tables we fail
> to acquire locks on replica identity index.
> 
> To reproduce, set wal_level to logical, and run 1.sql.
> 
> I can look into this, but I thought first I should send it here in case
> someone who is more familiar with these related functions can solve it
> quickly.

I suspect this "always" has been broken, it's just that we previously
didn't have checks in place that detect these cases. I don't think it's
likely to cause actual harm, due to the locking on the table itself when
dropping indexes etc.  But we still should fix it.

The relevant code is:

    /*
     * If there are indices on the result relation, open them and save
     * descriptors in the result relation info, so that we can add new
     * index entries for the tuples we add/update.  We need not do this
     * for a DELETE, however, since deletion doesn't affect indexes. Also,
     * inside an EvalPlanQual operation, the indexes might be open
     * already, since we share the resultrel state with the original
     * query.
     */
    if (resultRelInfo->ri_RelationDesc->rd_rel->relhasindex &&
        operation != CMD_DELETE &&
        resultRelInfo->ri_IndexRelationDescs == NULL)
        ExecOpenIndices(resultRelInfo,
                        node->onConflictAction != ONCONFLICT_NONE);


I'm not quite sure what the best way to fix this would be however. It
seems like a bad idea to make locking dependent on wal_level, but I'm
also not sure we want to incur the price of locking one more table for
every delete on a table with a primary key?


Greetings,

Andres Freund




Re: Add "password_protocol" connection parameter to libpq

2019-08-16 Thread Jonathan S. Katz
On 8/15/19 9:28 PM, Stephen Frost wrote:
> Greetings,
> 
> * Jeff Davis (pg...@j-davis.com) wrote:
>> On Wed, 2019-08-14 at 11:38 +0900, Michael Paquier wrote:
>>> What I got in mind was a comma-separated list of authorized protocols
>>> which can be specified as a connection parameter, which extends to
>>> all
>>> the types of AUTH_REQ requests that libpq can understand, plus an
>>> extra for channel binding.  I also liked the idea mentioned upthread
>>> of "any" to be an alias to authorize everything, which should be the
>>> default.  So you basically get at that:
>>> auth_protocol = {any,password,md5,scram-sha-256,scram-sha-256-
>>> plus,krb5,gss,sspi}
>>
>> What about something corresponding to the auth methods "trust" and
>> "cert", where no authentication request is sent? That's a funny case,
>> because the server trusts the client; but that doesn't imply that the
>> client trusts the server.
> 
> I agree that "trust" is odd.  If you want my 2c, we should have to
> explicitly *enable* that for psql, otherwise there's the potential that
> a MITM could perform a downgrade attack to "trust" and while that might
> not expose a user's password, it could expose other information that the
> client ends up sending (INSERTs, UPDATEs, etc).
> 
> When it comes to "cert"- there is certainly an authentication that
> happens and we already have options in psql/libpq to require validation
> of the server.  If users want that, they should enable it (I wish we
> could make it the default too but that's a different discussion...).
> 
>> This is another reason I don't really like the list. It's impossible to
>> make it cleanly map to the auth methods, and there are a few ways it
>> could be confusing to the users.
> 
> I agree with these concerns, just to be clear.

+1.

> 
>> Given that we all pretty much agree on the need for the separate
>> channel_binding param, the question is whether we want to (a) address
>> additional use cases with specific parameters that also justify
>> themselves; or (b) have a generic list that is supposed to solve many
>> future use cases.
>>
>> I vote (a). With (b), the generic list is likely to cause more
>> confusion, ugliness, and clients that break needlessly in the future.
> 
> Admittedly, one doesn't preclude the other, and so we could move forward
> with the channel binding param, and that's fine- but I seriously hope
> that someone finds time to work on further improving the ability for
> clients to control what happens during authentication as this, imv
> anyway, is an area that we are weak in and it'd be great to improve on
> it.

To be pedantic, +1 on the channel_binding param.

I do agree with option (a), but we should narrow down what that means
for this iteration.

I do see "password_protocol" making sense as a comma-separated list of
options e.g. {plaintext, md5, scram-sha-256}. I would ask if
scram-sha-256-plus makes the list if we have the channel_binding param?

If channel_binding = require, it would essentially ignore any non-plus
options in password_protocol and require scram-sha-256-plus. In a future
scram-sha-512 world, having scram-sha-256-plus or
scram-sha-512-plus options in "password_protocol" could then be
necessary based on what the user would prefer or require in their
application.

So if we do add a "password_protocol" parameter, then we likely need to
include the -plus's.

I think this is also fairly easy for a user to configure. Some scenarios
I can see for this are:

1. The user requiring channel binding, so only "channel_binding=require"
 is set.

2. A PostgreSQL cluster transitioning between SCRAM + MD5, and the user
setting password_protocol="scram-sha-256" to guarantee md5 auth does not
take place.

3. A user wanting to ensure a stronger method is used, with some
combination of the scram methods or md5, i.e. ensuring plaintext is not
used.
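
For scenario 1, the client side could stay as simple as this (a sketch that
assumes the proposed channel_binding parameter is adopted with that spelling;
the host/dbname/user values are placeholders, the rest is standard libpq):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* Refuse to proceed unless channel binding was actually performed. */
    PGconn *conn = PQconnectdb("host=db.example.com dbname=app user=alice "
                               "channel_binding=require");

    if (PQstatus(conn) != CONNECTION_OK)
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));

    PQfinish(conn);
    return 0;
}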

We would need to provide documentation around the types of password
validation methods that are used for the external validators (e.g. LDAP) so
users know what to expect if their server is using those methods.

Jonathan





allocation limit for encoding conversion

2019-08-16 Thread Alvaro Herrera
Somebody ran into issues when generating large XML output (upwards of
256 MB) and then sending via a connection with a different
client_encoding.  This occurs because we pessimistically allocate 4x as
much memory as the string needs, and we run into the 1GB palloc
limitation.  ISTM we can do better now by using huge allocations, as per
the preliminary attached patch (which probably needs an updated overflow
check rather than have it removed altogether); but at least it is able
to process this query, which it wasn't without the patch:

select query_to_xml(
'select a, cash_words(a::text::money) from generate_series(0, 200) a',
true, false, '');

-- 
Álvaro Herrera
From a718f299d72617c366a08735895e0439482ede2f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 8 Aug 2019 15:36:53 -0400
Subject: [PATCH v1] Use huge pallocs on encoding conversions

---
 src/backend/utils/mb/mbutils.c | 26 --
 1 file changed, 4 insertions(+), 22 deletions(-)

diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index bec54bb5cb..e2853fcf6e 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -348,17 +348,8 @@ pg_do_encoding_conversion(unsigned char *src, int len,
 		pg_encoding_to_char(src_encoding),
 		pg_encoding_to_char(dest_encoding;
 
-	/*
-	 * Allocate space for conversion result, being wary of integer overflow
-	 */
-	if ((Size) len >= (MaxAllocSize / (Size) MAX_CONVERSION_GROWTH))
-		ereport(ERROR,
-(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("out of memory"),
- errdetail("String of %d bytes is too long for encoding conversion.",
-		   len)));
-
-	result = palloc(len * MAX_CONVERSION_GROWTH + 1);
+	result = palloc_extended((Size) len * MAX_CONVERSION_GROWTH + 1,
+			 MCXT_ALLOC_HUGE);
 
 	OidFunctionCall5(proc,
 	 Int32GetDatum(src_encoding),
@@ -681,17 +672,8 @@ perform_default_encoding_conversion(const char *src, int len,
 	if (flinfo == NULL)
 		return unconstify(char *, src);
 
-	/*
-	 * Allocate space for conversion result, being wary of integer overflow
-	 */
-	if ((Size) len >= (MaxAllocSize / (Size) MAX_CONVERSION_GROWTH))
-		ereport(ERROR,
-(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("out of memory"),
- errdetail("String of %d bytes is too long for encoding conversion.",
-		   len)));
-
-	result = palloc(len * MAX_CONVERSION_GROWTH + 1);
+	result = palloc_extended((Size) len * MAX_CONVERSION_GROWTH + 1,
+			 MCXT_ALLOC_HUGE);
 
 	FunctionCall5(flinfo,
   Int32GetDatum(src_encoding),
-- 
2.17.1



Re: allocation limit for encoding conversion

2019-08-16 Thread Alvaro Herrera
On 2019-Aug-16, Alvaro Herrera wrote:

> Somebody ran into issues when generating large XML output (upwards of
> 256 MB) and then sending via a connection with a different
> client_encoding.

ref: https://postgr.es/m/43a889a1-45fb-1d60-31ae-21e607307...@gmail.com
(pgsql-es-ayuda)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: doc: Add some images

2019-08-16 Thread Alvaro Herrera
On 2019-Mar-27, Peter Eisentraut wrote:

> doc: Add some images
> 
> Add infrastructure for having images in the documentation, in SVG
> format.  Add two images to start with.  See the included README file
> for instructions.

> Author: Jürgen Purtz 
> Author: Peter Eisentraut 

Now when I test Jürgen's new proposed image genetic-algorithm I find
that this stuff doesn't work in VPATH builds, at least for PDF -- I
don't get a build failure, but instead I get just a section title that
doesn't precede any actual image.  (There's a very small warning hidden
in the tons of other fop output).  If I edit the .fo file by hand to
make the path to .svg absolute, the image appears correctly.

I don't see any way in the fop docs to specify the base path for images.

I'm not sure what's a good way to fix this problem in a general way.
Would some new rule in the xslt work?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: allocation limit for encoding conversion

2019-08-16 Thread Tom Lane
Alvaro Herrera  writes:
> Somebody ran into issues when generating large XML output (upwards of
> 256 MB) and then sending via a connection with a different
> client_encoding.  This occurs because we pessimistically allocate 4x as
> much memory as the string needs, and we run into the 1GB palloc
> limitation.  ISTM we can do better now by using huge allocations, as per
> the preliminary attached patch (which probably needs an updated overflow
> check rather than have it removed altogether); but at least it is able
> to process this query, which it wasn't without the patch:

> select query_to_xml(
> 'select a, cash_words(a::text::money) from generate_series(0, 200) a',
> true, false, '');

I fear that allowing pg_do_encoding_conversion to return strings longer
than 1GB is just going to create failure cases somewhere else.

However, it's certainly true that 4x growth is a pretty unlikely worst
case.  Maybe we could do something like

1. If string is short (say up to a few megabytes), continue to do it
like now.  This avoids adding overhead for typical cases.

2. Otherwise, run some lobotomized form of encoding conversion that
just computes the space required (as an int64, I guess) without saving
the result anywhere.

3. If space required > 1GB, fail.

4. Otherwise, allocate just the space required, and convert.
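
To illustrate step 2 outside the backend, here is a standalone sketch using
iconv() rather than our conversion procs (the function name and the 4kB
scratch size are invented for the illustration); step 4 would then allocate
exactly the returned number of bytes plus one and run the real conversion
into that buffer:

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Pass 1: run the conversion but only count output bytes, discarding them. */
static long long
converted_size(const char *from, const char *to, char *src, size_t srclen)
{
    iconv_t     cd = iconv_open(to, from);
    char        scratch[4096];
    size_t      inleft = srclen;
    long long   total = 0;

    if (cd == (iconv_t) -1)
        return -1;

    while (inleft > 0)
    {
        char       *outp = scratch;
        size_t      outleft = sizeof(scratch);

        /* E2BIG just means the scratch buffer filled up; keep going. */
        if (iconv(cd, &src, &inleft, &outp, &outleft) == (size_t) -1 &&
            errno != E2BIG)
        {
            iconv_close(cd);
            return -1;          /* invalid input or unsupported conversion */
        }
        total += (long long) (sizeof(scratch) - outleft);
    }

    iconv_close(cd);
    return total;
}

int
main(void)
{
    char        src[] = "measure first, allocate second";

    /* Roughly 2x growth here; the worst case assumed today is always 4x. */
    printf("UTF-8 -> UTF-16LE: %zu input bytes need %lld output bytes\n",
           strlen(src), converted_size("UTF-8", "UTF-16LE", src, strlen(src)));
    return 0;
}

The extra pass costs CPU, which is presumably why step 1 keeps the cheap
pessimistic allocation for short strings.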

regards, tom lane




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-16 Thread Bruce Momjian
On Fri, Aug 16, 2019 at 07:47:37PM +0200, Antonin Houska wrote:
> Bruce Momjian  wrote:
> 
> > I have seen no one present a clear description of how anything beyond
> > all-cluster encryption would work or be secure.  Wishing that were not
> > the case doesn't change things.
> 
> Since this email thread has grown a lot and is difficult to follow, it might
> help if we summarized various approaches on the wiki, with their pros and
> cons, and included some links to the corresponding emails in the
> archive. There might be people who would like think about the problems but
> don't have time to read the whole thread. Overview of the pending problems of
> particular approaches might be useful for newcomers, but also for people who
> followed only part of the discussion. I mean an overview of the storage
> problems; the key management seems to be less controversial.
> 
> If you think it makes sense, I can spend some time next week on the
> research. However I'd need at least an outline of the approaches proposed
> because I also missed some parts of the thread.

I suggest we schedule a voice call and I will go over all the issues and
explain why I came to the conclusions listed.  It is hard to know what
level of detail to explain that in an email, beyond what I have already
posted on this thread.  The only other option is to read all the emails
_I_ sent on the thread to get an idea.

I am able to do that for others as well.  

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: allocation limit for encoding conversion

2019-08-16 Thread Andres Freund
Hi,

On 2019-08-16 17:31:49 -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > Somebody ran into issues when generating large XML output (upwards of
> > 256 MB) and then sending via a connection with a different
> > client_encoding.  This occurs because we pessimistically allocate 4x as
> > much memory as the string needs, and we run into the 1GB palloc
> > limitation.  ISTM we can do better now by using huge allocations, as per
> > the preliminary attached patch (which probably needs an updated overflow
> > check rather than have it removed altogether); but at least it is able
> > to process this query, which it wasn't without the patch:
> 
> > select query_to_xml(
> > 'select a, cash_words(a::text::money) from generate_series(0, 200) 
> > a',
> > true, false, '');
> 
> I fear that allowing pg_do_encoding_conversion to return strings longer
> than 1GB is just going to create failure cases somewhere else.
> 
> However, it's certainly true that 4x growth is a pretty unlikely worst
> case.  Maybe we could do something like
> 
> 1. If string is short (say up to a few megabytes), continue to do it
> like now.  This avoids adding overhead for typical cases.
> 
> 2. Otherwise, run some lobotomized form of encoding conversion that
> just computes the space required (as an int64, I guess) without saving
> the result anywhere.
> 
> 3. If space required > 1GB, fail.
> 
> 4. Otherwise, allocate just the space required, and convert.

It's probably too big a hammer for this specific case, but I think at
some point we ought to stop using fixed size allocations for this kind
of work. Instead we should use something roughly like our StringInfo,
except that when exceeding the current size limit, the overflowing data
is stored in a separate allocation. And only once we actually need the
data in a consecutive form, we allocate memory that's large enough to
store all the separate allocations in their entirety.
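
A rough standalone sketch of that shape (all names are invented; fixed-size
chunks are used here instead of the grow-up-to-a-limit-then-chain behaviour
described above, just to keep it short):

#include <stdlib.h>
#include <string.h>

/* One fixed-size chunk in the chain; error handling omitted for brevity. */
typedef struct Chunk
{
    struct Chunk *next;
    size_t      used;
    char        data[8192];
} Chunk;

/* Zero-initialize a ChunkedBuf before use. */
typedef struct ChunkedBuf
{
    Chunk      *head;
    Chunk      *tail;
    size_t      total;
} ChunkedBuf;

/*
 * Append by filling the last chunk and chaining new ones as needed; data
 * that was already appended is never copied into a larger allocation.
 */
static void
chunkedbuf_append(ChunkedBuf *buf, const char *data, size_t len)
{
    while (len > 0)
    {
        Chunk      *c = buf->tail;
        size_t      n;

        if (c == NULL || c->used == sizeof(c->data))
        {
            c = calloc(1, sizeof(Chunk));
            if (buf->tail)
                buf->tail->next = c;
            else
                buf->head = c;
            buf->tail = c;
        }
        n = sizeof(c->data) - c->used;
        if (n > len)
            n = len;
        memcpy(c->data + c->used, data, n);
        c->used += n;
        buf->total += n;
        data += n;
        len -= n;
    }
}

/* Only when a consecutive form is needed do we pay for one big allocation. */
static char *
chunkedbuf_concat(const ChunkedBuf *buf)
{
    char       *result = malloc(buf->total + 1);
    char       *p = result;
    const Chunk *c;

    for (c = buf->head; c != NULL; c = c->next)
    {
        memcpy(p, c->data, c->used);
        p += c->used;
    }
    *p = '\0';
    return result;
}

The trade-off is one extra copy at the end, paid only by callers that really
need the result in consecutive memory.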

Greetings,

Andres Freund




Re: default_table_access_method is not in sample config file

2019-08-16 Thread Andres Freund
On 2019-08-13 15:03:13 +0900, Michael Paquier wrote:
> On Fri, Aug 09, 2019 at 11:34:05AM +0300, Heikki Linnakangas wrote:
> > On 11/04/2019 19:49, Andres Freund wrote:
> >> Hm, I think we should rather add it to sample. That's an oversight, not
> >> intentional.
> > 
> > I just noticed that this is still an issue. default_table_access_method is
> > not in the sample config file, and it's not marked with GUC_NOT_IN_SAMPLE.
> > I'll add this to the open items list so we don't forget.

Thanks!


> I think that we should give it the same visibility as default_tablespace,
> so adding it to the sample file sounds good to me.

> diff --git a/src/backend/utils/misc/postgresql.conf.sample 
> b/src/backend/utils/misc/postgresql.conf.sample
> index 65a6da18b3..39fc787851 100644
> --- a/src/backend/utils/misc/postgresql.conf.sample
> +++ b/src/backend/utils/misc/postgresql.conf.sample
> @@ -622,6 +622,7 @@
>  #default_tablespace = '' # a tablespace name, '' uses the default
>  #temp_tablespaces = ''   # a list of tablespace names, 
> '' uses
>   # only default tablespace
> +#default_table_access_method = 'heap'

Pushed, thanks.


>  #check_function_bodies = on
>  #default_transaction_isolation = 'read committed'
>  #default_transaction_read_only = off

Hm.  I find the current ordering there a bit weird. Unrelated to your
proposed change.  The header of the group is

#--
# CLIENT CONNECTION DEFAULTS
#--

# - Statement Behavior -

but I don't quite see GUCs like default_tablespace, search_path (due to
determining a created table's schema), temp_tablespace,
default_table_access_method fit reasonably well under that heading. They
all can affect persistent state. That seems pretty different from a
number of other settings (client_min_messages,
default_transaction_isolation, lock_timeout, ...) which only have
transient effects.

Should we perhaps split that group? Not that I have a good proposal for
better names.

Greetings,

Andres Freund




Re: Unused header file inclusion

2019-08-16 Thread Andres Freund
Hi,

On 2019-08-03 12:37:33 -0700, Andres Freund wrote:
> Think the first three are pretty clearly a good idea, I'm a bit less
> sanguine about the fourth:
> Headers like utils/timestamp.h are often included just because we need a
> TimestampTz type somewhere, or call GetCurrentTimestamp(). Approximately
> none of these need the PG_GETARG_* macros, which are the only reason for
> including fmgr.h in these headers.  As they're macros that's not
> actually needed, although I think normally good style. But I' think here
> avoiding exposing fmgr.h to more headers is a bigger win.

I still think the fourth is probably worthwhile, but I don't feel
confident enough to do it without somebody else +0.5'ing it...

I've pushed the other ones.

Greetings,

Andres Freund




Can't we give better table bloat stats easily?

2019-08-16 Thread Greg Stark
Everywhere I've worked I've seen people struggle with table bloat. It's
hard to even measure how much of it you have or where, let alone actually
fix it.

If you search online you'll find dozens of different queries estimating how
much empty space is in your tables and indexes based on pg_stats
statistics and suppositions about header lengths and padding and plugging
them into formulas of varying credibility.

But isn't this all just silliness these days? We could actually sum up the
space recorded in the fsm and get a much more trustworthy number in
milliseconds.

I rigged up a quick proof of concept and the code seems super simple and
quick. There's one or two tables where the number is a bit suspect and
there's no fsm if vacuum hasn't run but that seems pretty small potatoes
for such a huge help in reducing user pain.


Re: Can't we give better table bloat stats easily?

2019-08-16 Thread Andres Freund
Hi,

On 2019-08-16 20:39:21 -0400, Greg Stark wrote:
> But isn't this all just silliness these days? We could actually sum up the
> space recorded in the fsm and get a much more trustworthy number in
> milliseconds.

You mean like pgstattuple_approx()?

https://www.postgresql.org/docs/current/pgstattuple.html

Or something different?

> I rigged up a quick proof of concept and the code seems super simple and
> quick. There's one or two tables where the number is a bit suspect and
> there's no fsm if vacuum hasn't run but that seems pretty small potatoes
> for such a huge help in reducing user pain.

Hard to comment on what you propose, without more details. But note that
you can't just look at the FSM, because in a lot of workloads it is
often hugely out of date. And fairly obviously it doesn't provide you
with information about how much space is currently occupied by dead
tuples.  What pgstattuple_approx does is to use the FSM for blocks that
are all-visible, and look at the page otherwise.

Greetings,

Andres Freund




Re: REL_12_STABLE crashing with assertion failure in ExtractReplicaIdentity

2019-08-16 Thread Tom Lane
Andres Freund  writes:
> On 2019-08-16 09:44:15 -0700, Hadi Moshayedi wrote:
>> It seems that sometimes when DELETE cascades to referencing tables we fail
>> to acquire locks on replica identity index.

> I suspect this "always" has been broken, it's just that we previously
> didn't have checks in place that detect these cases. I don't think it's
> likely to cause actual harm, due to the locking on the table itself when
> dropping indexes etc.  But we still should fix it.

Yeah ... see the discussion leading up to 9c703c169,

https://www.postgresql.org/message-id/flat/19465.1541636036%40sss.pgh.pa.us

We didn't pull the trigger on removing the CMD_DELETE exception here,
but I think the handwriting has been on the wall for some time.

regards, tom lane




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-16 Thread Antonin Houska
Bruce Momjian  wrote:

> On Thu, Aug 15, 2019 at 09:01:05PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (br...@momjian.us) wrote:
> > > Why would it not be simpler to have the cluster_passphrase_command run
> > > whatever command-line program it wants?  If you don't want to use a
> > > shell command, create an executable and call that.
> > 
> > Having direct integration with a KMS would certainly be valuable, and I
> > don't see a reason to deny users that option if someone would like to
> > spend time implementing it- in addition to a simpler mechanism such as a
> > passphrase command, which I believe is what was being suggested here.
> 
> OK,  I am just trying to see why we would not use the
> cluster_passphrase_command-like interface to do that.

One problem that occurs to me is that PG may need to send some sort of
credentials to the KMS. If it runs a separate process to execute the command,
it needs to pass those credentials to it. Whether it does so via parameters or
environment variables, both can be seen by other users.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: UNION ALL

2019-08-16 Thread Mark Pasterkamp
First of all, thank you for the replies.

I am using a base installation of postgres 10.10, with no modifications to
any of the system defaults.

I am trying to speedup a join between two tables: the title table and the
cast_info table.

The title table is a table containing information about different movies.
It contains 4626969 records.
The table also has a foreign key index on the cast_info table, enabling the
planner to use a hash-join.

The cast_info table is a table containing the information of which actor
was casted in which movie and contains 62039343 records.

The database also contains a materialized view ci_t_15, defined as:
select * from cast_info join title on cast_info.movie_id = title.id
where title.production_year < 2015

I am comparing two queries, q1 and q2 respectively.
Query q1 is the original query and q2 is an attempt to reduce the cost of
execution via leveraging the materialized view ci_t_15.

Query q1 is defined as:
select * from cast_info join title on cast_info.movie_id = title.id

Query q2 is defined as
select * from cast_info join title on cast_info.movie_id = title.id
where title.production_year >= 2015
UNION ALL
select * from ci_t_15

Both queries are executed on a Dell xps laptop with an I7-8750H processor
and 16 (2*8) gb ram on an SSD running on ubuntu 18.04.2 LTS.

Running explain analyze on both queries I get the following execution plans.
q1:
"Hash Join  (cost=199773.80..2561662.10 rows=62155656 width=103) (actual
time=855.063..25786.264 rows=62039343 loops=1)"
"  Hash Cond: (cast_info.ci_movie_id = title.t_id)"
"  ->  Seq Scan on cast_info  (cost=0.00..1056445.56 rows=62155656
width=42) (actual time=0.027..3837.722 rows=62039343 loops=1)"
"  ->  Hash  (cost=92232.69..92232.69 rows=4626969 width=61) (actual
time=854.548..854.548 rows=4626969 loops=1)"
"Buckets: 65536  Batches: 128  Memory Usage: 3431kB"
"->  Seq Scan on title  (cost=0.00..92232.69 rows=4626969 width=61)
(actual time=0.005..327.588 rows=4626969 loops=1)"
"Planning time: 5.097 ms"
"Execution time: 27236.088 ms"

q2:
"Append  (cost=123209.65..3713445.65 rows=61473488 width=105) (actual
time=442.207..29713.621 rows=60918189 loops=1)"
"  ->  Gather  (cost=123209.65..2412792.77 rows=10639784 width=103) (actual
time=442.206..14634.427 rows=10046633 loops=1)"
"Workers Planned: 2"
"Workers Launched: 2"
"->  Hash Join  (cost=122209.65..1347814.37 rows=4433243 width=103)
(actual time=471.969..12527.840 rows=3348878 loops=3)"
"  Hash Cond: (cast_info.ci_movie_id = title.t_id)"
"  ->  Parallel Seq Scan on cast_info  (cost=0.00..693870.90
rows=25898190 width=42) (actual time=0.006..7302.679 rows=20679781 loops=3)"
"  ->  Hash  (cost=103800.11..103800.11 rows=792043 width=61)
(actual time=471.351..471.351 rows=775098 loops=3)"
"Buckets: 65536  Batches: 32  Memory Usage: 2515kB"
"->  Seq Scan on title  (cost=0.00..103800.11
rows=792043 width=61) (actual time=0.009..376.127 rows=775098 loops=3)"
"  Filter: (t_production_year >= 2015)"
"  Rows Removed by Filter: 3851871"
"  ->  Seq Scan on ci_t_15  (cost=0.00..1194255.04 rows=50833704 width=105)
(actual time=1.143..11967.391 rows=50871556 loops=1)"
"Planning time: 0.268 ms"
"Execution time: 31379.854 ms"

Due to using the materialized view I can reduce the number of records going
into the hash join, lowering the time from 25786.264 msec to 12527.840 msec.
However, this is where my question comes in: this reduction is completely
negated by the cost of appending both results in the UNION ALL command.
I was wondering if this is normal behaviour.
In my mind, I wouldn't expect appending 2 result sets to have such a
relatively huge cost associated with it.
This is also why I asked what exactly a UNION ALL does to achieve its
functionality, to perhaps gain some insight in its cost.


With kind regards,

Mark

On Thu, 15 Aug 2019 at 21:22, Ibrar Ahmed  wrote:

>
>
> On Fri, Aug 16, 2019 at 12:16 AM <066ce...@free.fr> wrote:
>
>> Generally speaking, when executing UNION ; a DISTINCT is run afterward on
>> the resultset.
>>
>> So, if you're sure that each part of UNION cannot return a line returned
>> by another one, you may use UNION ALL, you'll cut the cost of the final
>> implicit DISTINCT.
>>
>>
>> - Mail original -
>> De: "Mark Pasterkamp" 
>> À: pgsql-hackers@lists.postgresql.org
>> Envoyé: Jeudi 15 Août 2019 20:37:06
>> Objet: UNION ALL
>>
>>
>> Dear all,
>>
>>
>> I was wondering if someone could help me understands what a union all
>> actually does.
>>
>>
>> For my thesis I am using Apache Calcite to rewrite queries into using
>> materialized views which I then give to a Postgres database.
>> For some queries, this means that they will be rewritten in a UNION ALL
>> style query between an expression and a table scan of a materialized view.
>> However, contrary to what I expected, the UNION ALL query is actu

Re: Global temporary tables

2019-08-16 Thread Craig Ringer
On Fri, 16 Aug 2019 at 15:30, Konstantin Knizhnik 
wrote:


>
> 1. Statistic for global temporary tables (including number of tuples,
> pages and all visible flag).
> My position is the following: while in most cases it should not be a
> problem, because users rarely create indexes or do analyze for temporary
> tables,
> there can be situations when differences in data sets of global temporary
> tables in different backends can really be a problem.
> Unfortunately I can not propose good solution for this problem. It is
> certainly possible to create some private (per-backend) cache for this
> metadata.
> But it seems to requires changes in many places.
>

Yeah. I don't really like just sharing them but it's not that bad either.


> 2. Your concerns about performance penalty of global temp tables accessed
> through shared buffers comparing with local temp tables access through
> local buffers.
> I think that this concern is not  actual any more because there is
> implementation of global temp tables using local buffers.
> But my experiments doesn't show significant difference in access speed of
> shared and local buffers. As far as shared buffers are used to be much
> larger than local buffers,
> there are more chances to hold all temp relation in memory without
> spilling it to the disk. In this case access to global temp table will be
> much faster comparing with access to
> local temp tables.
>

You ignore the costs of evicting non-temporary data from shared_buffers,
i.e. contention for space. Also increased chance of backends being forced
to do direct write-out due to lack of s_b space for dirty buffers.

> In case of pulling all content of temp table in memory (pg_prewarm)
global temp table with shared buffers becomes faster.

Who would ever do that?

I forget or do not notice some of your questions, would you be so kind as
> to repeat them?
>



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Global temporary tables

2019-08-16 Thread Craig Ringer
>
>
> On Fri, 16 Aug 2019 at 15:30, Konstantin Knizhnik <
> k.knizh...@postgrespro.ru> wrote:
>
>
>> I forget or do not notice some of your questions, would you be so kind as
>> to repeat them?
>>
>
>

Sent early by accident.

Repeating questions:


Why do you need to do all this indirection with changing RelFileNode to
RelFileNodeBackend in the bufmgr, changing BufferGetTag etc? Similarly,
your changes of RelFileNodeBackendIsTemp to RelFileNodeBackendIsLocalTemp .
I'm guessing you did it the way you did instead to lay the groundwork for
cross-backend sharing, but if so it should IMO be in your second patch that
adds support for using shared_buffers for temp tables, not in the first
patch that adds a minimal global temp tables implementation. Maybe my
understanding of the existing temp table mechanics is just insufficient as
I see RelFileNodeBackendIsTemp is already used in some aspects of existing
temp relation handling.

Did you look into my suggestion of extending the relmapper so that global
temp tables would have a relfilenode of 0 like pg_class etc, and use a
backend-local map of oid-to-relfilenode mappings?

Similarly, TruncateSessionRelations probably shouldn't need to exist in
this patch in its current form; there's no shared_buffers use to clean and
the same file cleanup mechanism should handle both session-temp and
local-temp relfilenodes.

Sequence initialization ignores sequence startval/firstval settings. Why?
+   value[SEQ_COL_LASTVAL-1] = Int64GetDatumFast(1); /* start
sequence with 1 */



Doesn't this change the test outcome for RELPERSISTENCE_UNLOGGED?:
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ else if (newrelpersistence != RELPERSISTENCE_TEMP)


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: POC: Cleaning up orphaned files using undo logs

2019-08-16 Thread Dilip Kumar
On Fri, Aug 16, 2019 at 10:56 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-16 09:44:25 +0530, Dilip Kumar wrote:
> > On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar  wrote:
> > >
> > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
> >
> > > >   I think that batch reading should just copy the underlying data into a
> > > >   char* buffer. Only the records that currently are being used by
> > > >   higher layers should get exploded into an unpacked record. That will
> > > >   reduce memory usage quite noticably (and I suspect it also drastically
> > > >   reduce the overhead due to a large context with a lot of small
> > > >   allocations that then get individually freed).
> > >
> > > Ok, I got your idea.  I will analyze it further and work on this if
> > > there is no problem.
> >
> > I think there is one problem that currently while unpacking the undo
> > record if the record is compressed (i.e. some of the fields does not
> > exist in the record) then we read those fields from the first record
> > on the page.  But, if we just memcpy the undo pages to the buffers and
> > delay the unpacking whenever it's needed seems that we would need to
> > know the page boundary and also we need to know the offset of the
> > first complete record on the page from where we can get that
> > information (which is currently in undo page header).
>
> I don't understand why that's a problem?
Okay, I was assuming that we would be copying only the data part, not the
complete page including the page header.  If we copy the page header
as well, we might be able to unpack the compressed record as well.

>
>
> > As of now even if we leave this issue apart I am not very clear what
> > benefit you are seeing in the way you are describing compared to the
> > way I am doing it now?
> >
> > a) Is it the multiple palloc? If so then we can allocate memory at
> > once and flatten the undo records in that.  Earlier, I was doing that
> > but we need to align each unpacked undo record so that we can access
> > them directly and based on Robert's suggestion I have modified it to
> > multiple palloc.
>
> Part of it.
>
> > b) Is it the memory size problem that the unpack undo record will take
> > more memory compared to the packed record?
>
> Part of it.
>
> > c) Do you think that we will not need to unpack all the records?  But,
> > I think eventually, at the higher level we will have to unpack all the
> > undo records ( I understand that it will be one at a time)
>
> Part of it. There's a *huge* difference between having a few hundred to
> thousand unpacked records, each consisting of several independent
> allocations, in memory and having one large block containing all
> packed records in a batch, and a few allocations for the few unpacked
> records that need to exist.
>
> There's also d) we don't need separate tiny memory copies while holding
> buffer locks etc.

Yeah, that too.  Yet another problem could be: how are we going to
process those records? For that we need to know all the undo
record pointers between start_urecptr and end_urecptr, right?  We
just have the big memory chunk, and we have no idea how many undo
records are there or what their undo record pointers are.  And
without knowing that information, I am unable to imagine how we are
going to sort them based on block number.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: block-level incremental backup

2019-08-16 Thread Jeevan Chalke
On Fri, Aug 2, 2019 at 6:43 PM vignesh C  wrote:

> Some comments:
> 1) There will be some link files created for tablespace, we might
> require some special handling for it
>

Yep. I have that in my ToDo.
Will start working on that soon.


> 2)
> Retry functionality is handled only for copying of full files, should
> we handle retry for copying of partial files
> 3)
> we can have some range for maxretries similar to sleeptime
>

I took help from pg_standby code related to maxretries and sleeptime.

However, as we don't want to use system() call now, I have
removed all this kludge and just used fread/fwrite as discussed.


> 4)
> Should we check for malloc failure
>

Used pg_malloc() instead. Same is also suggested by Ibrar.


>
> 5) Should we add display of progress as backup may take some time,
> this can be added as enhancement. We can get other's opinion on this.
>

Can be done afterward once we have the functionality in place.


>
> 6)
> If the backup count increases providing the input may be difficult,
> Shall user provide all the incremental backups from a parent folder
> and can we handle the ordering of incremental backup internally
>

I am not sure of this yet. We need to provide the tablespace mapping too.
But thanks for putting a point here. Will keep that in mind when I revisit
this.


>
> 7)
> Add verbose for copying whole file
>
Done


>
> 8) We can also check if approximate space is available in disk before
> starting combine backup, this can be added as enhancement. We can get
> other's opinion on this.
>

Hmm... will leave it for now. User will get an error anyway.


>
> 9)
> Combine backup into directory can be combine backup directory
>
Done


>
> 10)
> MAX_INCR_BK_COUNT can be increased little, some applications use 1
> full backup at the beginning of the month and use 30 incremental
> backups rest of the days in the month
>

Yeah, agree. But using any number here is debatable.
Let's see others opinion too.


> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
>


Attached new sets of patches with refactoring done separately.
Incremental backup patch became small now and hopefully more
readable than the first version.

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 93041c12a7d07bf17073c9cf4571bd3b5a8acc81 Mon Sep 17 00:00:00 2001
From: Jeevan Chalke 
Date: Fri, 16 Aug 2019 14:08:34 +0530
Subject: [PATCH 1/4] Add support for command line option to pass LSN.

This adds [ LSN 'lsn' ] to BASE_BACKUP command and --lsn=LSN to
the pg_basebackup binary.

Also, add small tests.
---
 src/backend/replication/basebackup.c | 20 
 src/backend/replication/repl_gram.y  |  6 ++
 src/backend/replication/repl_scanner.l   |  1 +
 src/bin/pg_basebackup/pg_basebackup.c| 15 +--
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 12 +++-
 5 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index c91f66d..74c954b 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -38,6 +38,7 @@
 #include "storage/ipc.h"
 #include "storage/reinit.h"
 #include "utils/builtins.h"
+#include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/relcache.h"
 #include "utils/timestamp.h"
@@ -52,6 +53,7 @@ typedef struct
 	bool		includewal;
 	uint32		maxrate;
 	bool		sendtblspcmapfile;
+	XLogRecPtr 	lsn;
 } basebackup_options;
 
 
@@ -638,6 +640,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
 	bool		o_maxrate = false;
 	bool		o_tablespace_map = false;
 	bool		o_noverify_checksums = false;
+	bool		o_lsn = false;
 
 	MemSet(opt, 0, sizeof(*opt));
 	foreach(lopt, options)
@@ -726,6 +729,23 @@ parse_basebackup_options(List *options, basebackup_options *opt)
 			noverify_checksums = true;
 			o_noverify_checksums = true;
 		}
+		else if (strcmp(defel->defname, "lsn") == 0)
+		{
+			bool		have_error = false;
+
+			if (o_lsn)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("duplicate option \"%s\"", defel->defname)));
+			o_lsn = true;
+
+			/* Validate given LSN and convert it into XLogRecPtr. */
+			opt->lsn = pg_lsn_in_internal(strVal(defel->arg), &have_error);
+			if (XLogRecPtrIsInvalid(opt->lsn))
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 errmsg("invalid value for LSN")));
+		}
 		else
 			elog(ERROR, "option \"%s\" not recognized",
  defel->defname);
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index c4e11cc..c24d319 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -87,6 +87,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
+%token K_LSN
 
 %type 	command
 %type 	base_backup start_replication start_

Re: block-level incremental backup

2019-08-16 Thread Jeevan Chalke
On Fri, Aug 9, 2019 at 6:07 AM Jeevan Ladhe 
wrote:

> Hi Jeevan,
>
> I have reviewed the backup part at code level and still looking into the
> restore(combine) and functional part of it. But, here are my comments so
> far:
>

Thank you Jeevan Ladhe for reviewing the changes.


>
> The patches need rebase.
>

Done.


> May be we should rename to something like:
> "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP
> START LOCATION"
> to make it more intuitive?
>

As discussed, used "INCREMENTAL BACKUP REFERENCE WAL LOCATION".

> File header structure is defined in both the files basebackup.c and
> pg_combinebackup.c. I think it is better to move this to
> replication/basebackup.h.
>

Yep. Was that in my cleanup list. Done now.


> I think we can avoid having flag isrelfile in sendFile().
> Something like this:
>
> Also, having isrelfile as part of following condition:
> is confusing, because even the relation files in full backup are going to
> be
> backed up by this loop only, but still, the condition reads '(!isrelfile
> &&...)'.
>

In the refactored patch I have moved full backup code in a separate
function.
And now all incremental backup code is also done in its own function.
Hopefully, the code is now more readable.


>
> IMHO, while labels are not advisable in general, it may be better to use a
> label
> here rather than a while(1) loop, so that we can move to the label in case
> we
> want to retry once. I think here it opens doors for future bugs if someone
> happens to add code here, ending up adding some condition and then the
> break becomes conditional. That will leave us in an infinite loop.
>

I kept it as is as I don't see any correctness issue here.

> Similar to structure partial_file_header, I think above macro can also be
> moved
> to basebackup.h instead of defining it twice.
>

Yes. Done.


> I think this is a huge memory request (1GB) and may fail on busy/loaded
> server at
> times. We should check for failures of malloc, maybe throw some error on
> getting ENOMEM as errno.
>

Agree. Done.


> Here, should not we expect statbuf->st_size < (RELSEG_SIZE * BLCKSZ), and
> it
> should be safe to read just statbuf_st_size always I guess? But, I am ok
> with
> having this extra guard here.
>

Yes, we can do this way. Added an Assert() before that and used just
statbuf->st_size.

> In sendFile(), I am sorry if I am missing something, but I am not able to
> understand why 'cnt' and 'i' should have different values when they are
> being
> passed to verify_page_checksum(). I think passing only one of them should
> be
> sufficient.
>

As discussed offline, you meant to say i and blkno.
These two are different: i represents the current block offset within the
read buffer, whereas blkno is the block offset from the start of the file.
For incremental backup they are the same, as we read the whole file, but
they differ in case of a regular full backup where we read 4 blocks at a
time; the i value there will be between 0 and 3.


> Maybe we should just have a variable no_of_blocks to store a number of
> blocks,
> rather than calculating this say RELSEG_SIZE(i.e. 131072) times in the
> worst
> case.
>

OK. Done.


> Sorry if I am missing something, but, should not it be just:
>
> len = cnt;
>

Yeah. Done.


> As I said earlier in my previous email, we now do not need
> +decode_lsn_internal()
> as it is already taken care by the introduction of function
> pg_lsn_in_internal().
>

Yes. Done that and rebased on latest HEAD.


>
> Regards,
> Jeevan Ladhe
>

Patches attached in the previous reply.

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-08-16 Thread Ibrar Ahmed
On Fri, Aug 16, 2019 at 3:24 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
> On Fri, Aug 2, 2019 at 6:43 PM vignesh C  wrote:
>
>> Some comments:
>> 1) There will be some link files created for tablespace, we might
>> require some special handling for it
>>
>
> Yep. I have that in my ToDo.
> Will start working on that soon.
>
>
>> 2)
>> Retry functionality is hanlded only for copying of full files, should
>> we handle retry for copying of partial files
>> 3)
>> we can have some range for maxretries similar to sleeptime
>>
>
> I took help from pg_standby code related to maxentries and sleeptime.
>
> However, as we don't want to use system() call now, I have
> removed all this kludge and just used fread/fwrite as discussed.
>
>
>> 4)
>> Should we check for malloc failure
>>
>
> Used pg_malloc() instead. Same is also suggested by Ibrar.
>
>
>>
>> 5) Should we add display of progress as backup may take some time,
>> this can be added as enhancement. We can get other's opinion on this.
>>
>
> Can be done afterward once we have the functionality in place.
>
>
>>
>> 6)
>> If the backup count increases providing the input may be difficult,
>> Shall user provide all the incremental backups from a parent folder
>> and can we handle the ordering of incremental backup internally
>>
>
> I am not sure of this yet. We need to provide the tablespace mapping too.
> But thanks for putting a point here. Will keep that in mind when I revisit
> this.
>
>
>>
>> 7)
>> Add verbose for copying whole file
>>
> Done
>
>
>>
>> 8) We can also check if approximate space is available in disk before
>> starting combine backup, this can be added as enhancement. We can get
>> other's opinion on this.
>>
>
> Hmm... will leave it for now. User will get an error anyway.
>
>
>>
>> 9)
>> Combine backup into directory can be combine backup directory
>>
> Done
>
>
>>
>> 10)
>> MAX_INCR_BK_COUNT can be increased little, some applications use 1
>> full backup at the beginning of the month and use 30 incremental
>> backups rest of the days in the month
>>
>
> Yeah, agree. But using any number here is debatable.
> Let's see others opinion too.
>
Why not use a list?


>
>
>> Regards,
>> Vignesh
>> EnterpriseDB: http://www.enterprisedb.com
>>
>
>
> Attached new sets of patches with refactoring done separately.
> Incremental backup patch became small now and hopefully more
> readable than the first version.
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>

+   buf = (char *) malloc(statbuf->st_size);
+   if (buf == NULL)
+       ereport(ERROR,
+               (errcode(ERRCODE_OUT_OF_MEMORY),
+                errmsg("out of memory")));

Why are you using malloc?  You can use palloc here.




-- 
Ibrar Ahmed


Re: Global temporary tables

2019-08-16 Thread Konstantin Knizhnik



On 16.08.2019 11:37, Craig Ringer wrote:



On Fri, 16 Aug 2019 at 15:30, Konstantin Knizhnik
mailto:k.knizh...@postgrespro.ru>> wrote:

I forget or do not notice some of your questions, would you be
so kind as to repeat them?


Sent early by accident.

Repeating questions:


Sorry, but I have answered them (my e-mail from 13.08)!
Looks like you have looked at the wrong version of the patch:
global_shared_temp-1.patch instead of global_private_temp-1.patch which 
implements global tables accessed through local buffers.





Why do you need to do all this indirection with changing RelFileNode 
to RelFileNodeBackend in the bufmgr, changing BufferGetTag etc? 
Similarly, your changes of RelFileNodeBackendIsTemp 
to RelFileNodeBackendIsLocalTemp . I'm guessing you did it the way you 
did instead to lay the groundwork for cross-backend sharing, but if so 
it should IMO be in your second patch that adds support for using 
shared_buffers for temp tables, not in the first patch that adds a 
minimal global temp tables implementation. Maybe my understanding of 
the existing temp table mechanics is just insufficient as I 
see RelFileNodeBackendIsTemp is already used in some aspects of 
existing temp relation handling.




Sorry, are you really speaking about global_private_temp-1.patch?
This patch doesn't change bufmgr file at all.
May be you looked at another patch - global_shared_temp-1.patch
which is accessing shared tables through shared buffers and so has to
change the buffer tag to include the backend ID in it.



Did you look into my suggestion of extending the relmapper so that 
global temp tables would have a relfilenode of 0 like pg_class etc, 
and use a backend-local map of oid-to-relfilenode mappings?


Similarly, TruncateSessionRelations probably shouldn't need to exist 
in this patch in its current form; there's no shared_buffers use to 
clean and the same file cleanup mechanism should handle both 
session-temp and local-temp relfilenodes.


In global_private_temp-1.patch TruncateSessionRelations does nothing
with shared buffers, it just deletes relation files.






Sequence initialization ignores sequence startval/firstval settings. Why?
+               value[SEQ_COL_LASTVAL-1] = Int64GetDatumFast(1); /* 
start sequence with 1 */






I am handling only the case of implicitly created sequences for
SERIAL/BIGSERIAL columns.

Is it possible to explicitly specify initial value and step for them?
If so, this place should definitely be rewritten.
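
(A quick sketch for reference, since the answer affects this code path:
in stock PostgreSQL the start value and step of such implicitly created
sequences can indeed be set explicitly, either after the fact or via an
identity column; table and sequence names below are only illustrative:

    create table t1(id serial);
    alter sequence t1_id_seq increment by 10 restart with 100;

    create table t2(id bigint generated by default as identity
                    (start with 100 increment by 10));

So, at least for ordinary tables, non-default start/increment settings
are possible and would need to be handled here.)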



Doesn't this change the test outcome for RELPERSISTENCE_UNLOGGED?:
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ else if (newrelpersistence != RELPERSISTENCE_TEMP)



The RELPERSISTENCE_UNLOGGED case is handled in the previous IF branch.
-

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Global temporary tables

2019-08-16 Thread Konstantin Knizhnik



On 16.08.2019 11:32, Craig Ringer wrote:


You ignore the costs of evicting non-temporary data from 
shared_buffers, i.e. contention for space. Also increased chance of 
backends being forced to do direct write-out due to lack of s_b space 
for dirty buffers.
> In case of pulling all content of temp table in memory (pg_prewarm) 
global temp table with shared buffers becomes faster.


Who would ever do that?




I decided to redo my experiments and now get different results, which
illustrate the advantages of global temp tables with shared buffers.
I performed the following test on my desktop with an SSD and 16GB of RAM,
running Postgres with the default configuration except shared_buffers
increased to 1GB.
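
The buffer_usage view used in the session below is not a built-in view;
here is a minimal sketch of how such a view can be defined on top of the
pg_buffercache extension (an assumption about the exact definition used):

    create extension if not exists pg_buffercache;

    create view buffer_usage as
    select c.relname,
           pg_size_pretty(count(*) * 8192) as buffered,  -- assumes 8kB blocks
           round(100.0 * count(*) /
                 (select setting::int from pg_settings
                   where name = 'shared_buffers'), 1) as buffer_percent,
           round(100.0 * count(*) * 8192 /
                 pg_relation_size(c.oid), 1) as percent_of_relation
      from pg_buffercache b
           join pg_class c on b.relfilenode = pg_relation_filenode(c.oid)
           join pg_database d on b.reldatabase = d.oid
                              and d.datname = current_database()
     group by c.oid, c.relname
     order by 3 desc;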



postgres=# create table big(pk bigint primary key, val bigint);
CREATE TABLE
postgres=# insert into big values 
(generate_series(1,1),generate_series(1,1)/100);

INSERT 0 1
postgres=# select * from buffer_usage limit 3;
    relname |  buffered  | buffer_percent | percent_of_relation
+++-
 big    | 678 MB |   66.2 |    16.1
 big_pkey   | 344 MB |   33.6 |    16.1
 pg_am  | 8192 bytes |    0.0 |    20.0

postgres=# create temp table lt(key bigint, count bigint);
postgres=# \timing
Timing is on.
postgres=# insert into lt (select count(*),val as key from big group by 
val);

INSERT 0 101
Time: 43265.491 ms (00:43.265)
postgres=# select sum(count) from lt;
 sum
--
 5050
(1 row)

Time: 94.194 ms
postgres=# insert into gt (select count(*),val as key from big group by 
val);

INSERT 0 101
Time: 42952.671 ms (00:42.953)
postgres=# select sum(count) from gt;
 sum
--
 5050
(1 row)

Time: 35.906 ms
postgres=# select * from buffer_usage limit 3;
 relname  | buffered | buffer_percent | percent_of_relation
--+--++-
 big  | 679 MB   |   66.3 |    16.1
 big_pkey | 300 MB   |   29.3 |    14.0
 gt   | 42 MB    |    4.1 |   100.0


So the time of storing the result in the global temp table is slightly
smaller than the time of storing it in the local temp table, and the time
of scanning the global temp table is less than half!



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-16 Thread Ibrar Ahmed
On Thu, Aug 15, 2019 at 8:21 PM Bruce Momjian  wrote:

> On Thu, Aug 15, 2019 at 11:24:46AM +0200, Antonin Houska wrote:
> > > I think there are several directions we can go after all-cluster
> > > encryption,
> >
> > I think I misunderstood. What you summarize in
> >
> >
> https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#TODO_for_Full-Cluster_Encryption
> >
>
Do we have any status of the TODOs: what has been done and what is left?
It would be much better if we had a link to the discussion of each item.



> > does include
> >
> >
> https://www.postgresql.org/message-id/CAD21AoBjrbxvaMpTApX1cEsO=8N=nc2xVZPB0d9e-VjJ=ya...@mail.gmail.com
> >
> > i.e. per-tablespace keys, right? Then the collaboration should be easier
> than
> > I thought.
>
> No, there is a single tables/indexes key and a WAL key, plus keys for
> rotation.  I explained why per-tablespace keys don't add much value.
>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
> + As you are, so once was I.  As I am, so you will be. +
> +  Ancient Roman grave inscription +
>


-- 
Ibrar Ahmed


Re: [PATCH] Implement INSERT SET syntax

2019-08-16 Thread Ibrar Ahmed
On Fri, Aug 16, 2019 at 8:19 AM Amit Kapila  wrote:

> On Wed, Jul 17, 2019 at 10:00 AM Gareth Palmer 
> wrote:
> >
> > Hello,
> >
> > Attached is a patch that adds the option of using SET clause to specify
> > the columns and values in an INSERT statement in the same manner as that
> > of an UPDATE statement.
> >
> > A simple example that uses SET instead of a VALUES() clause:
> >
> > INSERT INTO t SET c1 = 'foo', c2 = 'bar', c3 = 'baz';
> >
> > Values may also be sourced from a CTE using a FROM clause:
> >
> > WITH x AS (
> >   SELECT 'foo' AS c1, 'bar' AS c2, 'baz' AS c3
> > )
> > INSERT INTO t SET c1 = x.c1, c2 = x.c2, c3 = x.c3 FROM x;
> >
> > The advantage of using the SET clause style is that the column and value
> > are kept together, which can make changing or removing a column or value
> from
> > a large list easier.
> >
> > Internally the grammar parser converts INSERT SET without a FROM clause
> into
> > the equivalent INSERT with a VALUES clause. When using a FROM clause it
> becomes
> > the equivalent of INSERT with a SELECT statement.
> >
> > There was a brief discussion regarding INSERT SET on pgsql-hackers in
> late
> > August 2009 [1].
> >
> > INSERT SET is not part of any SQL standard (that I am aware of), however
> this
> > syntax is also implemented by MySQL [2]. Their implementation does not
> support
> > specifying a FROM clause.
> >
>
> I think this can be a handy feature in some cases as pointed by you,
> but do we really want it for PostgreSQL?  In the last round of
> discussions as pointed by you, there doesn't seem to be a consensus
> that we want this feature.  I guess before spending too much time into
> reviewing this feature, we should first build a consensus on whether
> we need this.
>
>
I agree with you, Amit, that we need a consensus on whether we really
need this feature or not. In the previous discussion there was no
resistance to having it in PostgreSQL, only some problems with the patch.
The current patch is very simple and not invasive, but still, we need a
consensus on it.
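
For readers skimming the thread, a quick sketch of the equivalence being
proposed (using the same illustrative table and columns as above):

    -- with the patch, this:
    INSERT INTO t SET c1 = 'foo', c2 = 'bar', c3 = 'baz';
    -- is turned into the equivalent of:
    INSERT INTO t (c1, c2, c3) VALUES ('foo', 'bar', 'baz');

    -- and with a FROM clause, this:
    INSERT INTO t SET c1 = x.c1, c2 = x.c2, c3 = x.c3 FROM x;
    -- becomes the equivalent of:
    INSERT INTO t (c1, c2, c3) SELECT x.c1, x.c2, x.c3 FROM x;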

> Along with users, I request some senior hackers/committers to also
> weigh in about the desirability of this feature.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>
>

-- 
Ibrar Ahmed


Re: Unexpected "shared memory block is still in use"

2019-08-16 Thread Peter Eisentraut
On 2019-08-14 01:22, Tom Lane wrote:
> Attached is a draft patch to change both shmem and sema key selection
> to be based on data directory inode rather than port.
> 
> I considered using "st_ino ^ st_dev", or some such, but decided that
> that would largely just make it harder to manually correlate IPC
> keys with running postmasters.  It's generally easy to find out the
> data directory inode number with "ls", but the extra work to find out
> and XOR in the device number is not so easy, and it's not clear what
> it'd buy us in typical scenarios.

For the POSIX APIs where the numbers are just converted to a string, why
not use both -- or forget about the inodes and use the actual data
directory string.

For the SYSV APIs, the scenario that came to my mind is if someone
starts a bunch of servers each on their own mount, it could happen that
the inodes of the data directories are very similar.

There is also the issue that AFAICT the key_t in the SYSV APIs is always
32-bit whereas inodes are 64-bit.  Probably not a big deal, but it might
prevent an exact one-to-one mapping.

Of course, ftok() is also available here as an existing solution.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2019-08-16 Thread Etsuro Fujita
On Tue, Jul 30, 2019 at 6:00 PM Etsuro Fujita  wrote:
> On Fri, Jul 19, 2019 at 10:44 PM Robert Haas  wrote:
> > On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita  
> > wrote:
> > > I.e., partition_bounds_merge() is performed for each pair of input
> > > partitioned relations for a join relation in try_partitionwise_join().
> > > Since partition_bounds_merge() would need a lot of CPU cycles, I don't
> > > think this is acceptable; ISTM that some redesign is needed to avoid
> > > this.  I'm wondering that once we successfully merged partition bounds
> > > from a pair of input partitioned relations for the join relation, by
> > > using the merged partition bounds, we could get the lists of matching
> > > to-be-joined partitions for subsequent pairs of input partitioned
> > > relations for the join relation in a more efficient way than by
> > > performing partition_bounds_merge() as proposed in the patch.
> >
> > I don't know whether partition_bounds_merge() is well-implemented; I
> > haven't looked.
>
> My concern about that is list partitioning.  In that case that
> function calls partition_list_bounds_merge(), which generates the
> partition bounds for a join relation between two given input
> relations, by performing merge join for a pair of the datums arrays
> from both the input relations.  Since the datums arrays contain all
> non-null list values across all partitions, if the numbers of the list
> values (ie, ndatums') are large, the merge join would require not a
> few cycles, so it would be much expensive to perform the merge join
> for each such pair when considering large N-way partitionwise joins of
> list-partitioned tables with large ndatums.  To see that, I did simple
> tests using a list-partitioned table pt created with the attached,
> which has 10 partitions, each with 1000 list values, so ndatums is
> 1.  (The tests below are performed with
> enable_partitionwise_join=on.)
>
> * 2-way self-join of pt: explain analyze select * from pt t0, pt t1
> where t0.a = t1.a;
>  - HEAD:
>  Planning Time: 1.731 ms
>  Execution Time: 15.159 ms
>  - Patched:
>  Planning Time: 1.884 ms
>  Execution Time: 15.127 ms
>
> * 4-way self-join of pt: explain analyze select * from pt t0, pt t1,
> pt t2, pt t3 where t0.a = t1.a and t1.a = t2.a and t2.a = t3.a;
>  - HEAD:
>  Planning Time: 28.787 ms
>  Execution Time: 34.313 ms
>  - Patched:
>  Planning Time: 40.263 ms
>  Execution Time: 35.019 ms
>
> * 8-way self-join of pt: explain analyze select * from pt t0, pt t1,
> pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a =
> t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a
> and t6.a = t7.a;
>  - HEAD:
>  Planning Time: 2279.653 ms
>  Execution Time: 63.303 ms
>  - Patched:
>  Planning Time: 3834.751 ms
>  Execution Time: 62.949 ms
>
> Actually, these joins would not need the partition-matching algorithm
> the patch adds; we could probably avoid this regression by modifying
> the patch to plan these joins the same way as before, but ISTM that
> these results imply that the cost of performing the merge join for
> each such pair would not be negligible when considering large N-way
> partitionwise joins mentioned above.  Maybe I'm missing something,
> though.
>
> > But in general I don't see an alternative to doing
> > some kind of merging on each pair of input relations. That's just how
> > planning works, and I don't see why it should need to be prohibitively
> > expensive.  I might be missing something, though; do you have an idea?
>
> Yes, I do; but I think I should think a little more about that.

I gave more thought to this.  My idea is to generate the list of
matching partitions to be joined from the partbound info after
creating it with partition_bounds_merge(), as I stated before.  Let me
explain using an example.  Suppose that R, S and T are partitioned
tables, that R=R(a,b,c) is partitioned on ranges of a into three
partitions R1, R2 and R3, that S=S(a,b,c) is partitioned on ranges of
a into three partitions S1, S2 and S3, and that T=T(a,b,c) is
partitioned on ranges of a into three partitions T1, T2 and T3.
Consider a 3-way join query: SELECT * FROM R, S, T WHERE R.a=S.a AND
S.a=T.a;  Suppose that when considering 2-way join R IJ S,
partition_bounds_merge() successfully merges the partition bounds for
R and S, and generates join pairs (R1, S1), (R2, S2) and (R3, S3), and
that when considering 3-way join (R IJ S) IJ T, that function
successfully merges the partition bounds for (R IJ S) and T, and
generates join pairs ((R1 IJ S1), T1) and ((R2 IJ S2), T2).  The
question here is: do we really need to perform
partition_bounds_merge() to generate the list of matching partitions
to be joined for 3-way join R IJ (S IJ T), for example?  I don't think
so; because 1) we see from the 3-way join pairs ((R1 IJ S1), T1) and
((R2 IJ S2), T2) that Ri, Si and Ti (i=1,2) have boundaries
overlapping with each other, which means that there would be (S1, T1)
and (S2, T2) as 2-way join pairs 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-16 Thread Bruce Momjian
On Thu, Aug 15, 2019 at 09:01:05PM -0400, Stephen Frost wrote:
> * Bruce Momjian (br...@momjian.us) wrote:
> > I assume you are talking about my option #1.  I can see if you only need
> > a few tables encrypted, e.g., credit card numbers, it can be excessive
> > to encrypt the entire cluster.  (I think you would need to encrypt
> > pg_statistic too.)
> 
> Or we would need a seperate encrypted pg_statistic, or a way to encrypt
> certain entries inside pg_statistic.

Yes.

> > The tricky part will be WAL --- if we encrypt all of WAL, the per-table
> > overhead might be minimal compared to the WAL encryption overhead.  The
> > better solution would be to add a flag to WAL records to indicate
> > encrypted entries, but you would then leak when an encryption change
> > happens and WAL record length.  (FYI, numeric values have different
> > lengths, as do character strings.)  I assume we would still use a single
> > key for all tables/indexes, and one for WAL, plus key rotation
> > requirements.
> 
> I don't think the fact that a change was done to an encrypted blob is an
> actual 'leak'- anyone can tell that by looking at the at the encrypted
> data before and after.  Further, the actual change would be encrypted,
> right?  Length of data is necessary to include in the vast majority of
> cases that the data is being dealt with and so I'm not sure that it
> makes sense for us to be worrying about that as a leak, unless you have
> a specific recommendation from a well known source discussing that
> concern..?

Yes, it is a minor negative, but we would need to see some performance
reason to have that minor negative, and I have already stated why I
think there might be no performance reason to do so.  Masahiko Sawada's
talk at PGCon 2019 supports that conclusion:

https://www.youtube.com/watch?v=TXKoo2SNMzk

> > I personally would like to see full cluster implemented first to find
> > out exactly what the overhead is.  As I stated earlier, the overhead of
> > determining which things to encrypt, both in code complexity, user
> > interface, and processing overhead, might not be worth it.
> 
> I disagree with this and feel that the overhead that's being discussed
> here (user interface, figuring out if we should encrypt it or not,
> processing overhead for those determinations) is along the lines of
> UNLOGGED tables, yet there wasn't any question about if that was a valid
> or useful feature to implement.  The biggest challenge here is really

We implemented UNLOGGED tables because there was a clear performance win
to doing so.  I have not seen any measurements for encryption,
particularly when WAL is considered.

> around key management and I agree that's difficult but it's also really
> important and something that we need to be thinking about- and thinking
> about how to work with multiple keys and not just one.  Building in an
> assumption that we will only ever work with one key would make this
> capability nothing more than DBA-managed filesystem-level encryption

Agreed, that's all it is.

> (though even there different tablespaces could have different keys...)
> and I worry would make later work to support multiple keys more
> difficult and less likely to actually happen.  It's also not clear to me
> why we aren't building in *some* mechanism to work with multiple keys
> from the start as part of the initial design.

Well, every time I look at multiple keys, I go over exactly what that
means and how it behaves, but get no feedback on how to address the
problems.

> > I can see why you would think that encrypting less would be easier than
> > encrypting more, but security boundaries are hard to construct, and
> > anything that requires a user API, even more so.
> 
> I'm not sure I'm follwing here- I'm pretty sure everyone understands
> that selective encryption will require more work to implement, in part
> because an API needs to be put in place and we need to deal with
> multiple keys, etc.  I don't think anyone thinks that'll be "easier".

Uh, I thought Masahiko Sawada stated that, but looking back, I don't see
it, so I must be wrong.

> > > > > At least it should be clear how [2] will retrieve the master key 
> > > > > because [1]
> > > > > should not do it in a differnt way. (The GUC 
> > > > > cluster_passphrase_command
> > > > > mentioned in [3] seems viable, although I think [1] uses approach 
> > > > > which is
> > > > > more convenient if the passphrase should be read from console.)
> > > 
> > > I think that we can also provide a way to pass encryption key directly
> > > to postmaster rather than using passphrase. Since it's common that
> > > user stores keys in KMS it's useful if we can do that.
> > 
> > Why would it not be simpler to have the cluster_passphrase_command run
> > whatever command-line program it wants?  If you don't want to use a
> > shell command, create an executable and call that.
> 
> Having direct integration with a KMS would certainly be valuable, and I
> don't see a reaso

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-16 Thread Bruce Momjian
On Fri, Aug 16, 2019 at 05:58:59PM +0500, Ibrar Ahmed wrote:
> 
> 
> On Thu, Aug 15, 2019 at 8:21 PM Bruce Momjian  wrote:
> 
> On Thu, Aug 15, 2019 at 11:24:46AM +0200, Antonin Houska wrote:
> > > I think there are several directions we can go after all-cluster
> > > encryption,
> >
> > I think I misunderstood. What you summarize in
> >
> > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#
> TODO_for_Full-Cluster_Encryption
> >
> 
> Do we have any status of TODO's, what has been done and what left? It's much
> better if we have a link of discussion of each item.

I think some are done and some are in process, but I don't have a good
handle on that yet.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Add test case for sslinfo

2019-08-16 Thread Peter Eisentraut
On 2019-07-08 10:18, Michael Paquier wrote:
> On Mon, Jul 08, 2019 at 02:11:34PM +0800, Hao Wu wrote:
>> Thank you for your quick response! I work on greenplum, and I didn't see
>> this folder(src/test/ssl/ssl) before.
>> I will add more certificates to test and resend again.
> 
> Not having duplicates would be nice.

I think sslinfo should be tested as an extension of src/test/ssl/
instead of its own test suite.  There are too many complications that we
would otherwise have to solve again.

You might want to review commit f60a0e96778854ed0b7fd4737488ba88022e47bd
and how it adds test cases.  You can't just hardcode a specific output
since different installations might report TLS 1.2 vs 1.3, different
ciphers etc.
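
For example, the sslinfo queries in the test can be written so that their
output is stable across installations; a sketch (not the exact test):

    SELECT ssl_is_used();
    SELECT ssl_version() IN ('TLSv1.2', 'TLSv1.3') AS modern_tls;
    SELECT ssl_cipher() IS NOT NULL AS have_cipher;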

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Global temporary tables

2019-08-16 Thread Konstantin Knizhnik
I did more investigations of performance of global temp tables with 
shared buffers vs. vanilla (local) temp tables.


1. Combination of persistent and temporary tables in the same query.

Preparation:
create table big(pk bigint primary key, val bigint);
insert into big values 
(generate_series(1,1),generate_series(1,1));

create temp table lt(key bigint, count bigint);
create global temp table gt(key bigint, count bigint);

Size of table is about 6Gb, I run this test on desktop with 16GB of RAM 
and postgres with 1Gb shared buffers.

I run two queries:

insert into T (select count(*),pk/P as key from big group by key);
select sum(count) from T;

where P is (100,10,1) and T is name of temp table (lt or gt).
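
For example, with P = 100 and T = lt the concrete pair of statements is:

    insert into lt (select count(*), pk/100 as key from big group by key);
    select sum(count) from lt;
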
The table below contains times of both queries in msec:

Percent of selected data    1%              10%             100%
                            insert  select  insert  select  insert  select
Local temp table            44610   90      47920   891     63414   21612
Global temp table           44669   35      47939   298     59159   26015


As you can see, the time of insertion into the temporary table is almost
the same, while the time of traversal of the temporary table is about
half for the global temp table when it fits in RAM together with the
persistent table, and slightly worse when it doesn't fit.




2. Temporary table only access.
The same system, but Postgres is configured with shared_buffers=10GB, 
max_parallel_workers = 4, max_parallel_workers_per_gather = 4


Local temp tables:
create temp table local_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint, 
x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint);
insert into local_temp values 
(generate_series(1,1),0,0,0,0,0,0,0,0);

select sum(x1) from local_temp;

Global temp tables:
create global temporary table global_temp(x1 bigint, x2 bigint, x3 
bigint, x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint);
insert into global_temp values 
(generate_series(1,1),0,0,0,0,0,0,0,0);

select sum(x1) from global_temp;

Results (msec):

                    Insert    Select
Local temp table    37489     48322
Global temp table   44358     3003


So insertion in local temp table is performed slightly faster but select 
is 16 times slower!


Conclusion:
Assuming the temp table fits in memory, global temp tables with shared
buffers provide better performance than local temp tables.
I didn't consider global temp tables with local buffers here because for
them the results should be similar to local temp tables.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Unexpected "shared memory block is still in use"

2019-08-16 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-08-14 01:22, Tom Lane wrote:
>> Attached is a draft patch to change both shmem and sema key selection
>> to be based on data directory inode rather than port.

> For the POSIX APIs where the numbers are just converted to a string, why
> not use both -- or forget about the inodes and use the actual data
> directory string.

Considering that we still need an operation equivalent to "nextSemKey++"
(in case of a key collision), I'm not really sure how working with strings
rather than ints would make life better.

> For the SYSV APIs, the scenario that came to my mind is if someone
> starts a bunch of servers each on their own mount, it could happen that
> the inodes of the data directories are very similar.

Sure.  That's why I didn't throw away any of the duplicate-key-handling
logic, and why we're still checking for st_dev match when inspecting
particular shmem blocks.  (It also seems likely that somebody
who's doing that would be using similar pathnames on the different
mounts, so that string-based approaches wouldn't exactly be free of
collision problems either.)

> There is also the issue that AFAICT the key_t in the SYSV APIs is always
> 32-bit whereas inodes are 64-bit.  Probably not a big deal, but it might
> prevent an exact one-to-one mapping.

True, although the width of inode numbers is probably pretty platform-
and filesystem-dependent.  We could consider trying some more complicated
mapping like xor'ing high and low halves, but I don't entirely see what
it buys us.

> Of course, ftok() is also available here as an existing solution.

I looked at that briefly, but I don't really see what it'd buy us either,
except for opacity which doesn't seem useful.  The Linux man page pretty
much says in so many words that it's a wrapper for st_ino and st_dev;
and how does it help us if other platforms do it differently?

(Actually, if Linux does it the way the man page suggests, it'd really
be a net negative, because there'd only be 24 bits of key variation
not 32.)

regards, tom lane




Re: block-level incremental backup

2019-08-16 Thread Ibrar Ahmed
On Fri, Aug 16, 2019 at 4:12 PM Ibrar Ahmed  wrote:

>
>
>
>
> On Fri, Aug 16, 2019 at 3:24 PM Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote:
>
>>
>>
>> On Fri, Aug 2, 2019 at 6:43 PM vignesh C  wrote:
>>
>>> Some comments:
>>> 1) There will be some link files created for tablespace, we might
>>> require some special handling for it
>>>
>>
>> Yep. I have that in my ToDo.
>> Will start working on that soon.
>>
>>
>>> 2)
>>> Retry functionality is hanlded only for copying of full files, should
>>> we handle retry for copying of partial files
>>> 3)
>>> we can have some range for maxretries similar to sleeptime
>>>
>>
>> I took help from pg_standby code related to maxentries and sleeptime.
>>
>> However, as we don't want to use system() call now, I have
>> removed all this kludge and just used fread/fwrite as discussed.
>>
>>
>>> 4)
>>> Should we check for malloc failure
>>>
>>
>> Used pg_malloc() instead. Same is also suggested by Ibrar.
>>
>>
>>>
>>> 5) Should we add display of progress as backup may take some time,
>>> this can be added as enhancement. We can get other's opinion on this.
>>>
>>
>> Can be done afterward once we have the functionality in place.
>>
>>
>>>
>>> 6)
>>> If the backup count increases providing the input may be difficult,
>>> Shall user provide all the incremental backups from a parent folder
>>> and can we handle the ordering of incremental backup internally
>>>
>>
>> I am not sure of this yet. We need to provide the tablespace mapping too.
>> But thanks for putting a point here. Will keep that in mind when I
>> revisit this.
>>
>>
>>>
>>> 7)
>>> Add verbose for copying whole file
>>>
>> Done
>>
>>
>>>
>>> 8) We can also check if approximate space is available in disk before
>>> starting combine backup, this can be added as enhancement. We can get
>>> other's opinion on this.
>>>
>>
>> Hmm... will leave it for now. User will get an error anyway.
>>
>>
>>>
>>> 9)
>>> Combine backup into directory can be combine backup directory
>>>
>> Done
>>
>>
>>>
>>> 10)
>>> MAX_INCR_BK_COUNT can be increased little, some applications use 1
>>> full backup at the beginning of the month and use 30 incremental
>>> backups rest of the days in the month
>>>
>>
>> Yeah, agree. But using any number here is debatable.
>> Let's see others opinion too.
>>
> Why not use a list?
>
>
>>
>>
>>> Regards,
>>> Vignesh
>>> EnterpriseDB: http://www.enterprisedb.com
>>>
>>
>>
>> Attached new sets of patches with refactoring done separately.
>> Incremental backup patch became small now and hopefully more
>> readable than the first version.
>>
>> --
>> Jeevan Chalke
>> Technical Architect, Product Development
>> EnterpriseDB Corporation
>> The Enterprise PostgreSQL Company
>>
>>
>
> +   buf = (char *) malloc(statbuf->st_size);
> +   if (buf == NULL)
> +       ereport(ERROR,
> +               (errcode(ERRCODE_OUT_OF_MEMORY),
> +                errmsg("out of memory")));
>
> Why are you using malloc, you can use palloc here.
>
>
>
> Hi, I gave another look at the patch and have some quick comments.


-
> char   *extptr = strstr(fn, ".partial");

I think there should be a better and stricter way to check the file
extension.

-
> +   extptr = strstr(outfn, ".partial");
> +   Assert (extptr != NULL);

Why are you checking that again, you just appended that in the above
statement?

-
> +   if (verbose && statbuf.st_size > (RELSEG_SIZE * BLCKSZ))
> +   pg_log_info("found big file \"%s\" (size: %.2lfGB): %m", fromfn,
> +   (double) statbuf.st_size / (RELSEG_SIZE * BLCKSZ));

This is not just a log message; you found a file which is bigger than
expected, which surely indicates some problem.

-
> +* We do read entire 1GB file in memory while taking incremental backup; so
> +* I don't see any reason why can't we do that here.  Also, copying data in
> +* chunks is expensive.  However, for bigger files, we still slice at 1GB
> +* border.


What do you mean by a bigger file, a file greater than 1GB? In which case
do you get a file > 1GB?


-- 
Ibrar Ahmed


Re: UNION ALL

2019-08-16 Thread Tom Lane
Mark Pasterkamp  writes:
> I am comparing two queries, q1 and q2 respectively.
> Query q1 is the original query and q2 is an attempt to reduce the cost of
> execution via leveraging the materialized view ci_t_15.
> ...
> Running explain analyze on both queries I get the following execution plans.

Huh ... I wonder why the planner decided to try to parallelize Q2 (and not
Q1)?  That seems like a pretty stupid choice, because if I'm reading the
plan correctly (I might not be) each worker process has to read all of
the "title" table and build its own copy of the hash table.  That seems
likely to swamp whatever performance gain might come from parallelizing
the scan of cast_info --- which is likely to be not much, anyway, on a
laptop with probably-not-great disk I/O bandwidth.

In any case, whether that decision was good or bad, making it differently
renders the performance of Q1 and Q2 not very comparable.  It'd be worth
disabling parallelism (SET max_parallel_workers_per_gather = 0) and
retrying Q2 to get a more apples-to-apples comparison.

Another bit of advice is to increase work_mem, so the hashes don't
have to be split into quite so many batches.

I'm noting also that your queries aren't giving the same results ---
Q2 reports returning fewer rows overall.  Do you have rows where
title.production_year is null, perhaps?
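
Concretely, something along these lines (the work_mem value is just an
illustrative choice):

    SET max_parallel_workers_per_gather = 0;
    SET work_mem = '256MB';   -- example value; size it to the hash batches
    -- then re-run EXPLAIN ANALYZE on Q2

    SELECT count(*) FROM title WHERE production_year IS NULL;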

regards, tom lane




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-08-16 Thread Anastasia Lubennikova

13.08.2019 18:45, Anastasia Lubennikova wrote:

  I also added a nearby FIXME comment to
_bt_insertonpg_in_posting() -- I don't think that the code for
splitting a posting list in two is currently crash-safe.


Good catch. It seems that I need to rearrange the code.
I'll send updated patch this week.


Attached is v7.

In this version of the patch, I heavily refactored the code of insertion
into a posting tuple. The bt_split logic is quite complex, so I omitted a
couple of optimizations. They are mentioned in TODO comments.

Now the algorithm is the following:

- If bt_findinsertloc() found out that the tuple belongs to an existing
posting tuple's TID interval, it sets the 'in_posting_offset' variable
and passes it to _bt_insertonpg().

- If 'in_posting_offset' is valid and origtup is valid, merge our itup
into origtup. It can result in one tuple, neworigtup, that must replace
origtup; or in two tuples, neworigtup and newrighttup, if the result
exceeds BTMaxItemSize.

- If the new tuple(s) fit into the old page, we're lucky: call
_bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to
atomically replace oldtup with the new tuple(s) and generate an xlog
record.

- In case a page split is needed, pass both tuples to _bt_split().
_bt_findsplitloc() is now aware of the upcoming replacement of origtup
with neworigtup, so it uses the correct item size where needed.

It seems that now all replace operations are crash-safe. The new patch
passes all regression tests, so I think it's ready for review again.

In the meantime, I'll run more stress-tests.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..504bca2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 	   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-			ItemPointerGetBlockNumber(&(itup->t_tid)),
-			ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			ItemPointerGetBlockNumber(tid),
+			ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 	(errcode(ERRCODE_INDEX_CORRUPTED),
 	 errmsg("could not find tuple using search from root page in index \"%s\"",
 			RelationGetRelationName(state->rel)),
-	 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+	 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 		itid, htid,
 		(uint32) (state->targetlsn >> 32),
 		(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+current = BTreeTupleGetPostingN(itup, i);
+
+if (ItemPointerCompare(current, &last) <= 0)
+{
+	char	   *itid,
+			   *htid;
+
+	itid = psprintf("(%u,%u)", state->targetblock, offset);
+	htid = psprintf("(%u,%u)",
+	ItemPointerGetBlockNumberNoCheck(current),
+	ItemPointerGetOffsetNumberNoCheck(current));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_INDEX_CORRUPTED),
+			 errmsg("posting list heap TIDs out of order in index \"%s\"",
+	RelationGetRelationName(state->rel)),
+			 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+itid, htid,
+(uint32) (state->targetlsn >> 32),
+(uint32) state->targetlsn)));
+}
+
+ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-			  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+IndexTuple	onetup;
+

REL_12_STABLE crashing with assertion failure in ExtractReplicaIdentity

2019-08-16 Thread Hadi Moshayedi
It seems that sometimes when DELETE cascades to referencing tables we fail
to acquire locks on replica identity index.

To reproduce, set wal_level to logical, and run 1.sql.

I can look into this, but I thought first I should send it here in case
someone who is more familiar with these related functions can solve it
quickly.
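
The essential shape of the reproducer, judging by the backtrace, is a
table with a primary key and a referencing table with ON DELETE CASCADE;
a sketch (the actual 1.sql may differ in details):

    -- with wal_level = logical
    CREATE TABLE t1 (id int PRIMARY KEY);
    CREATE TABLE t2 (id int PRIMARY KEY REFERENCES t1 ON DELETE CASCADE);
    INSERT INTO t1 SELECT generate_series(1, 11);
    INSERT INTO t2 SELECT generate_series(1, 11);
    DELETE FROM t1 RETURNING id;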

I get the following backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x7f301154b801 in __GI_abort () at abort.c:79
#2  0x55df8858a923 in ExceptionalCondition (
conditionName=conditionName@entry=0x55df885fd138
"!(CheckRelationLockedByMe(idx_rel, 1, 1))",
errorType=errorType@entry=0x55df885de8fd
"FailedAssertion",
fileName=fileName@entry=0x55df885fca32 "heapam.c",
lineNumber=lineNumber@entry=7646)
at assert.c:54
#3  0x55df88165e53 in ExtractReplicaIdentity
(relation=relation@entry=0x7f3012b54db0,

tp=tp@entry=0x7ffcf47d53f0, key_changed=key_changed@entry=true,
copy=copy@entry=0x7ffcf47d53d3) at heapam.c:7646
#4  0x55df8816c22b in heap_delete (relation=0x7f3012b54db0,
tid=,
cid=, crosscheck=0x0, wait=true, tmfd=0x7ffcf47d54b0,
changingPart=false)
at heapam.c:2676
#5  0x55df88318b62 in table_tuple_delete (changingPart=false,
tmfd=0x7ffcf47d54b0,
wait=true, crosscheck=, snapshot=,
cid=,
tid=0x7ffcf47d558a, rel=0x7f3012b54db0) at
../../../src/include/access/tableam.h:1216
#6  ExecDelete (mtstate=mtstate@entry=0x55df8a8196a0,
tupleid=0x7ffcf47d558a, oldtuple=0x0,
planSlot=planSlot@entry=0x55df8a81a8e8,
epqstate=epqstate@entry=0x55df8a819798,

estate=estate@entry=0x55df8a819058, processReturning=true,
canSetTag=true,
changingPart=false, tupleDeleted=0x0, epqreturnslot=0x0) at
nodeModifyTable.c:769
#7  0x55df8831aa25 in ExecModifyTable (pstate=0x55df8a8196a0) at
nodeModifyTable.c:2230
#8  0x55df882efa9a in ExecProcNode (node=0x55df8a8196a0)
at ../../../src/include/executor/executor.h:239
#9  ExecutePlan (execute_once=, dest=0x55df88a89a00
,
direction=, numberTuples=0, sendTuples=,
operation=CMD_DELETE, use_parallel_mode=,
planstate=0x55df8a8196a0,
estate=0x55df8a819058) at execMain.c:1648
#10 standard_ExecutorRun (queryDesc=0x55df8a7de4b0, direction=, count=0,
execute_once=) at execMain.c:365
#11 0x55df8832b90c in _SPI_pquery (tcount=0, fire_triggers=false,
queryDesc=0x55df8a7de4b0) at spi.c:2521
#12 _SPI_execute_plan (plan=plan@entry=0x55df8a812828, paramLI=,
snapshot=snapshot@entry=0x0,
crosscheck_snapshot=crosscheck_snapshot@entry=0x0,
read_only=read_only@entry=false, fire_triggers=fire_triggers@entry=false,

tcount=) at spi.c:2296
#13 0x55df8832c15c in SPI_execute_snapshot (plan=plan@entry=0x55df8a812828,

Values=Values@entry=0x7ffcf47d5820, Nulls=Nulls@entry=0x7ffcf47d5a20 "
",
snapshot=snapshot@entry=0x0,
crosscheck_snapshot=crosscheck_snapshot@entry=0x0,
read_only=read_only@entry=false, fire_triggers=false, tcount=0) at
spi.c:616
#14 0x55df88522f32 in ri_PerformCheck (riinfo=riinfo@entry=0x55df8a7f8050,

qkey=qkey@entry=0x7ffcf47d5b28, qplan=0x55df8a812828,
fk_rel=fk_rel@entry=0x7f3012b54db0, pk_rel=pk_rel@entry=0x7f3012b44a28,
oldslot=oldslot@entry=0x55df8a826f88, newslot=0x0, detectNewRows=true,
expect_OK=8)
at ri_triggers.c:2276
#15 0x55df88524653 in RI_FKey_cascade_del (fcinfo=) at
ri_triggers.c:819
#16 0x55df882c9996 in ExecCallTriggerFunc
(trigdata=trigdata@entry=0x7ffcf47d5ff0,

tgindx=tgindx@entry=0, finfo=finfo@entry=0x55df8a825710,
instr=instr@entry=0x0,
per_tuple_context=per_tuple_context@entry=0x55df8a812f10) at
trigger.c:2432
#17 0x55df882cb459 in AfterTriggerExecute (trigdesc=0x55df8a825530,
trigdesc=0x55df8a825530, trig_tuple_slot2=0x0, trig_tuple_slot1=0x0,
per_tuple_context=0x55df8a812f10, instr=0x0, finfo=0x55df8a825710,
relInfo=0x55df8a825418, event=0x55df8a81f0a8, estate=0x55df8a825188) at
trigger.c:4342
#18 afterTriggerInvokeEvents (events=events@entry=0x55df8a7c3e40,
firing_id=1,
estate=estate@entry=0x55df8a825188, delete_ok=delete_ok@entry=false) at
trigger.c:4539
#19 0x55df882d1408 in AfterTriggerEndQuery (estate=estate@entry
=0x55df8a825188)
at trigger.c:4850
#20 0x55df882efd99 in standard_ExecutorFinish (queryDesc=0x55df8a722ab8)
at execMain.c:440
#21 0x55df88464bdd in ProcessQuery (plan=,
sourceText=0x55df8a702f78 "DELETE FROM t1 RETURNING id;", params=0x0,
queryEnv=0x0,
dest=0x55df8a722a20, completionTag=0x7ffcf47d6180 "DELETE 11") at
pquery.c:203
#22 0x55df88464e0b in PortalRunMulti (portal=portal@entry=0x55df8a7692f8,

isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=true,

dest=dest@entry=0x55df8a722a20, altdest=0x55df88a81040 ,
completionTag=completionTag@entry=0x7ffcf47d6180 "DELETE 11") at
pquery.c:1283
#23 0x55df88465119 in FillPortalStore (portal=portal@entry=0x55df8a7692f8,

isTopLevel=isTopLevel@entry=true) at pquery.c:1030
#24 0x55df88465d1d in 

Re: Global temporary tables

2019-08-16 Thread Pavel Stehule
pá 16. 8. 2019 v 16:12 odesílatel Konstantin Knizhnik <
k.knizh...@postgrespro.ru> napsal:

> I did more investigations of performance of global temp tables with shared
> buffers vs. vanilla (local) temp tables.
>
> 1. Combination of persistent and temporary tables in the same query.
>
> Preparation:
> create table big(pk bigint primary key, val bigint);
> insert into big values
> (generate_series(1,1),generate_series(1,1));
> create temp table lt(key bigint, count bigint);
> create global temp table gt(key bigint, count bigint);
>
> Size of table is about 6Gb, I run this test on desktop with 16GB of RAM
> and postgres with 1Gb shared buffers.
> I run two queries:
>
> insert into T (select count(*),pk/P as key from big group by key);
> select sum(count) from T;
>
> where P is (100,10,1) and T is name of temp table (lt or gt).
> The table below contains times of both queries in msec:
>
> Percent of selected data    1%              10%             100%
>                             insert  select  insert  select  insert  select
> Local temp table            44610   90      47920   891     63414   21612
> Global temp table           44669   35      47939   298     59159   26015
>
> As you can see, time of insertion in temporary table is almost the same
> and time of traversal of temporary table is about twice smaller for global
> temp table
> when it fits in RAM together with persistent table and slightly worser
> when it doesn't fit.
>
>
>
> 2. Temporary table only access.
> The same system, but Postgres is configured with shared_buffers=10GB,
> max_parallel_workers = 4, max_parallel_workers_per_gather = 4
>
> Local temp tables:
> create temp table local_temp(x1 bigint, x2 bigint, x3 bigint, x4 bigint,
> x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint);
> insert into local_temp values
> (generate_series(1,1),0,0,0,0,0,0,0,0);
> select sum(x1) from local_temp;
>
> Global temp tables:
> create global temporary table global_temp(x1 bigint, x2 bigint, x3 bigint,
> x4 bigint, x5 bigint, x6 bigint, x7 bigint, x8 bigint, x9 bigint);
> insert into global_temp values
> (generate_series(1,1),0,0,0,0,0,0,0,0);
> select sum(x1) from global_temp;
>
> Results (msec):
>
>                     Insert    Select
> Local temp table    37489     48322
> Global temp table   44358     3003
>
> So insertion in local temp table is performed slightly faster but select
> is 16 times slower!
>
> Conclusion:
> In the assumption then temp table fits in memory, global temp tables with
> shared buffers provides better performance than local temp table.
> I didn't consider here global temp tables with local buffers because for
> them results should be similar with local temp tables.
>

Probably there is no reason why shared buffers should be slower than
local buffers when the system is under low load.

Access to shared memory is protected by spinlocks (which are cheap for a
few processes), so tests with one or a few processes are not too
important (or they show just one side of the picture).

Another topic can be performance on MS systems - there are stories about
not-perfect performance of shared memory there.

Regards

Pavel



>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-16 Thread Antonin Houska
Bruce Momjian  wrote:

> I have seen no one present a clear description of how anything beyond
> all-cluster encryption would work or be secure.  Wishing that were not
> the case doesn't change things.

Since this email thread has grown a lot and is difficult to follow, it might
help if we summarized various approaches on the wiki, with their pros and
cons, and included some links to the corresponding emails in the
archive. There might be people who would like to think about the problems but
don't have time to read the whole thread. Overview of the pending problems of
particular approaches might be useful for newcomers, but also for people who
followed only part of the discussion. I mean an overview of the storage
problems; the key management seems to be less controversial.

If you think it makes sense, I can spend some time next week on the
research. However I'd need at least an outline of the approaches proposed
because I also missed some parts of the thread.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: REL_12_STABLE crashing with assertion failure in ExtractReplicaIdentity

2019-08-16 Thread Andres Freund
Hi,

On 2019-08-16 09:44:15 -0700, Hadi Moshayedi wrote:
> It seems that sometimes when DELETE cascades to referencing tables we fail
> to acquire locks on replica identity index.
> 
> To reproduce, set wal_level to logical, and run 1.sql.
> 
> I can look into this, but I thought first I should send it here in case
> someone who is more familiar with these related functions can solve it
> quickly.

I suspect this "always" has been broken, it's just that we previously
didn't have checks in place that detect these cases. I don't think it's
likely to cause actual harm, due to the locking on the table itself when
dropping indexes etc.  But we still should fix it.

The relevant code is:

	/*
	 * If there are indices on the result relation, open them and save
	 * descriptors in the result relation info, so that we can add new
	 * index entries for the tuples we add/update.  We need not do this
	 * for a DELETE, however, since deletion doesn't affect indexes. Also,
	 * inside an EvalPlanQual operation, the indexes might be open
	 * already, since we share the resultrel state with the original
	 * query.
	 */
	if (resultRelInfo->ri_RelationDesc->rd_rel->relhasindex &&
		operation != CMD_DELETE &&
		resultRelInfo->ri_IndexRelationDescs == NULL)
		ExecOpenIndices(resultRelInfo,
						node->onConflictAction != ONCONFLICT_NONE);


I'm not quite sure what the best way to fix this would be, however. It
seems like a bad idea to make locking dependent on wal_level, but I'm
also not sure we want to incur the price of locking one more relation for
every delete on a table with a primary key?


Greetings,

Andres Freund




Re: Add "password_protocol" connection parameter to libpq

2019-08-16 Thread Jonathan S. Katz
On 8/15/19 9:28 PM, Stephen Frost wrote:
> Greetings,
> 
> * Jeff Davis (pg...@j-davis.com) wrote:
>> On Wed, 2019-08-14 at 11:38 +0900, Michael Paquier wrote:
>>> What I got in mind was a comma-separated list of authorized protocols
>>> which can be specified as a connection parameter, which extends to all
>>> the types of AUTH_REQ requests that libpq can understand, plus an
>>> extra for channel binding.  I also liked the idea mentioned upthread
>>> of "any" to be an alias to authorize everything, which should be the
>>> default.  So you basically get at that:
>>> auth_protocol = {any,password,md5,scram-sha-256,scram-sha-256-plus,
>>> krb5,gss,sspi}
>>
>> What about something corresponding to the auth methods "trust" and
>> "cert", where no authentication request is sent? That's a funny case,
>> because the server trusts the client; but that doesn't imply that the
>> client trusts the server.
> 
> I agree that "trust" is odd.  If you want my 2c, we should have to
> explicitly *enable* that for psql, otherwise there's the potential that
> a MITM could perform a downgrade attack to "trust" and while that might
> not expose a user's password, it could expose other information that the
> client ends up sending (INSERTs, UPDATEs, etc).
> 
> When it comes to "cert"- there is certainly an authentication that
> happens and we already have options in psql/libpq to require validation
> of the server.  If users want that, they should enable it (I wish we
> could make it the default too but that's a different discussion...).
> 
>> This is another reason I don't really like the list. It's impossible to
>> make it cleanly map to the auth methods, and there are a few ways it
>> could be confusing to the users.
> 
> I agree with these concerns, just to be clear.

+1.

> 
>> Given that we all pretty much agree on the need for the separate
>> channel_binding param, the question is whether we want to (a) address
>> additional use cases with specific parameters that also justify
>> themselves; or (b) have a generic list that is supposed to solve many
>> future use cases.
>>
>> I vote (a). With (b), the generic list is likely to cause more
>> confusion, ugliness, and clients that break needlessly in the future.
> 
> Admittedly, one doesn't preclude the other, and so we could move forward
> with the channel binding param, and that's fine- but I seriously hope
> that someone finds time to work on further improving the ability for
> clients to control what happens during authentication as this, imv
> anyway, is an area that we are weak in and it'd be great to improve on
> it.

To be pedantic, +1 on the channel_binding param.

I do agree with option (a), but we should narrow down what that means
for this iteration.

I do see "password_protocol" making sense as a comma-separated list of
options e.g. {plaintext, md5, scram-sha-256}. I would ask if
scram-sha-256-plus makes the list if we have the channel_binding param?

If channel_binding = require, it would essentially ignore any non-plus
options in password_protocol and require scram-sha-256-plus. In a future
scram-sha-512 world, having scram-sha-256-plus or scram-sha-512-plus
options in "password_protocol" could then be necessary, based on what the
user would prefer or require in their application.

So if we do add a "password_protocol" parameter, then we likely need to
include the -plus's.

I think this is also fairly easy for a user to configure. Some scenarios
I can see for this are (a rough sketch with example connection strings
follows the list):

1. The user requiring channel binding, so only "channel_binding=require"
 is set.

2. A PostgreSQL cluster transitioning between SCRAM + MD5, and the user
setting password_protocol="scram-sha-256" to guarantee md5 auth does not
take place.

3. A user wanting to ensure a stronger method is used, with some
combination of the scram methods or md5, i.e. ensuring plaintext is not
used.
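
To make the scenarios concrete, here is a rough libpq sketch. Note that
neither channel_binding nor password_protocol exists in libpq today; the
parameter names and values are the ones proposed in this thread, and the
host/database/user names are made up.

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/*
	 * Scenario 1: only require channel binding; any mechanism that
	 * supports it (scram-sha-256-plus today) would be acceptable.
	 */
	PGconn	   *conn = PQconnectdb("host=db.example.com dbname=app user=alice "
								   "channel_binding=require");

	/*
	 * Scenario 2, during a SCRAM migration: list only the protocols the
	 * client will accept, so md5 and plaintext can never be used.
	 *
	 * conn = PQconnectdb("host=db.example.com dbname=app user=alice "
	 *                    "password_protocol=scram-sha-256,scram-sha-256-plus");
	 */

	if (PQstatus(conn) != CONNECTION_OK)
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));

	PQfinish(conn);
	return 0;
}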

We would need to provide documentation around the types of password
validation methods that are used for the external validators (e.g. LDAP) so
that users know what to expect if their server is using those methods.

Jonathan



signature.asc
Description: OpenPGP digital signature


allocation limit for encoding conversion

2019-08-16 Thread Alvaro Herrera
Somebody ran into issues when generating large XML output (upwards of
256 MB) and then sending via a connection with a different
client_encoding.  This occurs because we pessimistically allocate 4x as
much memory as the string needs, and we run into the 1GB palloc
limitation.  ISTM we can do better now by using huge allocations, as per
the preliminary attached patch (which probably needs an updated overflow
check rather than have it removed altogether); but at least it is able
to process this query, which it wasn't without the patch:

select query_to_xml(
'select a, cash_words(a::text::money) from generate_series(0, 200) a',
true, false, '');

-- 
Álvaro Herrera
>From a718f299d72617c366a08735895e0439482ede2f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 8 Aug 2019 15:36:53 -0400
Subject: [PATCH v1] Use huge pallocs on encoding conversions

---
 src/backend/utils/mb/mbutils.c | 26 --
 1 file changed, 4 insertions(+), 22 deletions(-)

diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index bec54bb5cb..e2853fcf6e 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -348,17 +348,8 @@ pg_do_encoding_conversion(unsigned char *src, int len,
 		pg_encoding_to_char(src_encoding),
 		pg_encoding_to_char(dest_encoding))));
 
-	/*
-	 * Allocate space for conversion result, being wary of integer overflow
-	 */
-	if ((Size) len >= (MaxAllocSize / (Size) MAX_CONVERSION_GROWTH))
-		ereport(ERROR,
-(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("out of memory"),
- errdetail("String of %d bytes is too long for encoding conversion.",
-		   len)));
-
-	result = palloc(len * MAX_CONVERSION_GROWTH + 1);
+	result = palloc_extended((Size) len * MAX_CONVERSION_GROWTH + 1,
+			 MCXT_ALLOC_HUGE);
 
 	OidFunctionCall5(proc,
 	 Int32GetDatum(src_encoding),
@@ -681,17 +672,8 @@ perform_default_encoding_conversion(const char *src, int len,
 	if (flinfo == NULL)
 		return unconstify(char *, src);
 
-	/*
-	 * Allocate space for conversion result, being wary of integer overflow
-	 */
-	if ((Size) len >= (MaxAllocSize / (Size) MAX_CONVERSION_GROWTH))
-		ereport(ERROR,
-(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("out of memory"),
- errdetail("String of %d bytes is too long for encoding conversion.",
-		   len)));
-
-	result = palloc(len * MAX_CONVERSION_GROWTH + 1);
+	result = palloc_extended((Size) len * MAX_CONVERSION_GROWTH + 1,
+			 MCXT_ALLOC_HUGE);
 
 	FunctionCall5(flinfo,
   Int32GetDatum(src_encoding),
-- 
2.17.1



Re: allocation limit for encoding conversion

2019-08-16 Thread Alvaro Herrera
On 2019-Aug-16, Alvaro Herrera wrote:

> Somebody ran into issues when generating large XML output (upwards of
> 256 MB) and then sending via a connection with a different
> client_encoding.

ref: https://postgr.es/m/43a889a1-45fb-1d60-31ae-21e607307...@gmail.com
(pgsql-es-ayuda)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: doc: Add some images

2019-08-16 Thread Alvaro Herrera
On 2019-Mar-27, Peter Eisentraut wrote:

> doc: Add some images
> 
> Add infrastructure for having images in the documentation, in SVG
> format.  Add two images to start with.  See the included README file
> for instructions.

> Author: Jürgen Purtz 
> Author: Peter Eisentraut 

Now when I test Jürgen's new proposed image genetic-algorithm I find
that this stuff doesn't work in VPATH builds, at least for PDF -- I
don't get a build failure, but instead I get just a section title that
doesn't precede any actual image.  (There's a very small warning hidden
in the tons of other fop output).  If I edit the .fo file by hand to
make the path to .svg absolute, the image appears correctly.

I don't see any way in the fop docs to specify the base path for images.

I'm not sure what's a good way to fix this problem in a general way.
Would some new rule in the XSLT work?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: allocation limit for encoding conversion

2019-08-16 Thread Tom Lane
Alvaro Herrera  writes:
> Somebody ran into issues when generating large XML output (upwards of
> 256 MB) and then sending via a connection with a different
> client_encoding.  This occurs because we pessimistically allocate 4x as
> much memory as the string needs, and we run into the 1GB palloc
> limitation.  ISTM we can do better now by using huge allocations, as per
> the preliminary attached patch (which probably needs an updated overflow
> check rather than have it removed altogether); but at least it is able
> to process this query, which it wasn't without the patch:

> select query_to_xml(
> 'select a, cash_words(a::text::money) from generate_series(0, 200) a',
> true, false, '');

I fear that allowing pg_do_encoding_conversion to return strings longer
than 1GB is just going to create failure cases somewhere else.

However, it's certainly true that 4x growth is a pretty unlikely worst
case.  Maybe we could do something like

1. If string is short (say up to a few megabytes), continue to do it
like now.  This avoids adding overhead for typical cases.

2. Otherwise, run some lobotomized form of encoding conversion that
just computes the space required (as an int64, I guess) without saving
the result anywhere.

3. If space required > 1GB, fail.

4. Otherwise, allocate just the space required, and convert.
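
For illustration, a rough sketch of how those four steps could slot into
pg_do_encoding_conversion(); pg_encoding_conversion_length() is an invented
name for the measuring-only pass (no such helper exists today), and the
"few megabytes" threshold is arbitrary:

#define SHORT_CONVERSION_LIMIT	(4 * 1024 * 1024)	/* "a few megabytes" */

	if ((Size) len < SHORT_CONVERSION_LIMIT)
	{
		/* 1. Short input: keep today's pessimistic 4x allocation. */
		result = palloc((Size) len * MAX_CONVERSION_GROWTH + 1);
	}
	else
	{
		/* 2. Dry run that only computes the converted length. */
		int64		needed = pg_encoding_conversion_length(src, len,
														   src_encoding,
														   dest_encoding);

		/* 3. Still refuse anything beyond the 1GB palloc limit. */
		if (needed >= (int64) MaxAllocSize)
			ereport(ERROR,
					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
					 errmsg("out of memory"),
					 errdetail("String of %d bytes is too long for encoding conversion.",
							   len)));

		/* 4. Allocate exactly what is required, then convert into it. */
		result = palloc((Size) needed + 1);
	}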

regards, tom lane




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-16 Thread Bruce Momjian
On Fri, Aug 16, 2019 at 07:47:37PM +0200, Antonin Houska wrote:
> Bruce Momjian  wrote:
> 
> > I have seen no one present a clear description of how anything beyond
> > all-cluster encryption would work or be secure.  Wishing that were not
> > the case doesn't change things.
> 
> Since this email thread has grown a lot and is difficult to follow, it might
> help if we summarized various approaches on the wiki, with their pros and
> cons, and included some links to the corresponding emails in the
> archive. There might be people who would like to think about the problems but
> don't have time to read the whole thread. An overview of the pending problems of
> particular approaches might be useful for newcomers, but also for people who
> followed only part of the discussion. I mean an overview of the storage
> problems; the key management seems to be less controversial.
> 
> If you think it makes sense, I can spend some time next week on the
> research. However I'd need at least an outline of the approaches proposed
> because I also missed some parts of the thread.

I suggest we schedule a voice call and I will go over all the issues and
explain why I came to the conclusions listed.  It is hard to know what
level of detail to explain that in an email, beyond what I have already
posted on this thread.  The only other option is to read all the emails
_I_ sent on the thread to get an idea.

I am able to do that for others as well.  

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: allocation limit for encoding conversion

2019-08-16 Thread Andres Freund
Hi,

On 2019-08-16 17:31:49 -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > Somebody ran into issues when generating large XML output (upwards of
> > 256 MB) and then sending via a connection with a different
> > client_encoding.  This occurs because we pessimistically allocate 4x as
> > much memory as the string needs, and we run into the 1GB palloc
> > limitation.  ISTM we can do better now by using huge allocations, as per
> > the preliminary attached patch (which probably needs an updated overflow
> > check rather than have it removed altogether); but at least it is able
> > to process this query, which it wasn't without the patch:
> 
> > select query_to_xml(
> > 'select a, cash_words(a::text::money) from generate_series(0, 200) 
> > a',
> > true, false, '');
> 
> I fear that allowing pg_do_encoding_conversion to return strings longer
> than 1GB is just going to create failure cases somewhere else.
> 
> However, it's certainly true that 4x growth is a pretty unlikely worst
> case.  Maybe we could do something like
> 
> 1. If string is short (say up to a few megabytes), continue to do it
> like now.  This avoids adding overhead for typical cases.
> 
> 2. Otherwise, run some lobotomized form of encoding conversion that
> just computes the space required (as an int64, I guess) without saving
> the result anywhere.
> 
> 3. If space required > 1GB, fail.
> 
> 4. Otherwise, allocate just the space required, and convert.

It's probably too big a hammer for this specific case, but I think at
some point we ought to stop using fixed size allocations for this kind
of work. Instead we should use something roughly like our StringInfo,
except that when exceeding the current size limit, the overflowing data
is stored in a separate allocation. And only once we actually need the
data in a consecutive form, we allocate memory that's large enough to
store all the separate allocations in their entirety.
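
A minimal sketch of that kind of chunked buffer; the type and function names
are invented for illustration and are not existing PostgreSQL APIs:

#include "postgres.h"
#include "utils/memutils.h"

/* One chunk of converted output; chunks are concatenated only on demand. */
typedef struct ConversionChunk
{
	struct ConversionChunk *next;	/* next chunk in append order */
	Size		used;				/* bytes filled in data[] */
	Size		allocated;			/* size of data[] */
	char		data[FLEXIBLE_ARRAY_MEMBER];
} ConversionChunk;

typedef struct ChunkedBuffer
{
	ConversionChunk *head;			/* first chunk */
	ConversionChunk *tail;			/* chunk currently being appended to */
	uint64		total;				/* total bytes across all chunks */
} ChunkedBuffer;

/*
 * Flatten into a single consecutive allocation only once the caller
 * actually needs the data in contiguous form.
 */
static char *
chunked_buffer_flatten(ChunkedBuffer *buf)
{
	char	   *result = palloc_extended((Size) buf->total + 1, MCXT_ALLOC_HUGE);
	char	   *ptr = result;
	ConversionChunk *c;

	for (c = buf->head; c != NULL; c = c->next)
	{
		memcpy(ptr, c->data, c->used);
		ptr += c->used;
	}
	*ptr = '\0';
	return result;
}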

Greetings,

Andres Freund




Re: default_table_access_method is not in sample config file

2019-08-16 Thread Andres Freund
On 2019-08-13 15:03:13 +0900, Michael Paquier wrote:
> On Fri, Aug 09, 2019 at 11:34:05AM +0300, Heikki Linnakangas wrote:
> > On 11/04/2019 19:49, Andres Freund wrote:
> >> Hm, I think we should rather add it to sample. That's an oversight, not
> >> intentional.
> > 
> > I just noticed that this is still an issue. default_table_access_method is
> > not in the sample config file, and it's not marked with GUC_NOT_IN_SAMPLE.
> > I'll add this to the open items list so we don't forget.

Thanks!


> I think that we should give it the same visibility as default_tablespace,
> so adding it to the sample file sounds good to me.

> diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
> index 65a6da18b3..39fc787851 100644
> --- a/src/backend/utils/misc/postgresql.conf.sample
> +++ b/src/backend/utils/misc/postgresql.conf.sample
> @@ -622,6 +622,7 @@
>  #default_tablespace = ''		# a tablespace name, '' uses the default
>  #temp_tablespaces = ''			# a list of tablespace names, '' uses
>  					# only default tablespace
> +#default_table_access_method = 'heap'

Pushed, thanks.


>  #check_function_bodies = on
>  #default_transaction_isolation = 'read committed'
>  #default_transaction_read_only = off

Hm.  I find the current ordering there a bit weird. Unrelated to your
proposed change.  The header of the group is

#--
# CLIENT CONNECTION DEFAULTS
#--

# - Statement Behavior -

but I don't quite see GUCs like default_tablespace, search_path (due to
determining a created table's schema), temp_tablespace,
default_table_access_method fit reasonably well under that heading. They
all can affect persistent state. That seems pretty different from a
number of other settings (client_min_messages,
default_transaction_isolation, lock_timeout, ...) which only have
transient effects.

Should we perhaps split that group? Not that I have a good proposal for
better names.

Greetings,

Andres Freund




Re: Unused header file inclusion

2019-08-16 Thread Andres Freund
Hi,

On 2019-08-03 12:37:33 -0700, Andres Freund wrote:
> Think the first three are pretty clearly a good idea, I'm a bit less
> sanguine about the fourth:
> Headers like utils/timestamp.h are often included just because we need a
> TimestampTz type somewhere, or call GetCurrentTimestamp(). Approximately
> none of these need the PG_GETARG_* macros, which are the only reason for
> including fmgr.h in these headers.  As they're macros that's not
> actually needed, although I think it's normally good style. But I think here
> avoiding exposing fmgr.h to more headers is a bigger win.

I still think the fourth is probably worthwhile, but I don't feel
confident enough to do it without somebody else +0.5'ing it...

I've pushed the other ones.

Greetings,

Andres Freund




Can't we give better table bloat stats easily?

2019-08-16 Thread Greg Stark
Everywhere I've worked I've seen people struggle with table bloat. It's
hard to even measure how much of it you have or where, let alone actually
fix it.

If you search online you'll find dozens of different queries estimating how
much empty space is in your tables and indexes, based on pg_stats
statistics and suppositions about header lengths and padding, plugging
them into formulas of varying credibility.

But isn't this all just silliness these days? We could actually sum up the
space recorded in the FSM and get a much more trustworthy number in
milliseconds.

I rigged up a quick proof of concept and the code seems super simple and
quick. There are one or two tables where the number is a bit suspect, and
there's no FSM if vacuum hasn't run, but that seems pretty small potatoes
for such a huge help in reducing user pain.
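
For reference, a minimal sketch of what such a proof of concept could look
like in the backend; GetRecordedFreeSpace() and RelationGetNumberOfBlocks()
are existing APIs, while the wrapper function itself is invented for
illustration and is not the exact code I rigged up:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/freespace.h"
#include "utils/rel.h"

/*
 * Sum the free space recorded in the FSM for every block of a relation.
 * Blocks that have never been vacuumed (no FSM entry yet) contribute zero.
 */
static uint64
sum_recorded_free_space(Relation rel)
{
	BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
	uint64		total = 0;
	BlockNumber blkno;

	for (blkno = 0; blkno < nblocks; blkno++)
		total += GetRecordedFreeSpace(rel, blkno);

	return total;
}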


Re: Can't we give better table bloat stats easily?

2019-08-16 Thread Andres Freund
Hi,

On 2019-08-16 20:39:21 -0400, Greg Stark wrote:
> But isn't this all just silliness these days? We could actually sum up the
> space recorded in the FSM and get a much more trustworthy number in
> milliseconds.

You mean like pgstattuple_approx()?

https://www.postgresql.org/docs/current/pgstattuple.html

Or something different?

> I rigged up a quick proof of concept and the code seems super simple and
> quick. There's one or two tables where the number is a bit suspect and
> there's no fsm if vacuum hasn't run but that seems pretty small potatoes
> for such a huge help in reducing user pain.

Hard to comment on what you propose, without more details. But note that
you can't just look at the FSM, because in a lot of workloads it is
often hugely out of date. And fairly obviously it doesn't provide you
with information about how much space is currently occupied by dead
tuples.  What pgstattuple_approx does is to use the FSM for blocks that
are all-visible, and look at the page otherwise.

Greetings,

Andres Freund




Re: REL_12_STABLE crashing with assertion failure in ExtractReplicaIdentity

2019-08-16 Thread Tom Lane
Andres Freund  writes:
> On 2019-08-16 09:44:15 -0700, Hadi Moshayedi wrote:
>> It seems that sometimes when DELETE cascades to referencing tables we fail
>> to acquire locks on replica identity index.

> I suspect this "always" has been broken, it's just that we previously
> didn't have checks in place that detect these cases. I don't think it's
> likely to cause actual harm, due to the locking on the table itself when
> dropping indexes etc.  But we still should fix it.

Yeah ... see the discussion leading up to 9c703c169,

https://www.postgresql.org/message-id/flat/19465.1541636036%40sss.pgh.pa.us

We didn't pull the trigger on removing the CMD_DELETE exception here,
but I think the handwriting has been on the wall for some time.
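
For concreteness, a minimal sketch of what removing that exception in the
snippet quoted upthread could look like (illustration only, not a committed
fix): simply drop the operation != CMD_DELETE shortcut so DELETE opens and
locks the indexes too.

	/*
	 * Open the indexes for DELETE as well, so that extracting the replica
	 * identity runs against a properly locked index.
	 */
	if (resultRelInfo->ri_RelationDesc->rd_rel->relhasindex &&
		resultRelInfo->ri_IndexRelationDescs == NULL)
		ExecOpenIndices(resultRelInfo,
						node->onConflictAction != ONCONFLICT_NONE);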

regards, tom lane




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-16 Thread Antonin Houska
Bruce Momjian  wrote:

> On Thu, Aug 15, 2019 at 09:01:05PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (br...@momjian.us) wrote:
> > > Why would it not be simpler to have the cluster_passphrase_command run
> > > whatever command-line program it wants?  If you don't want to use a
> > > shell command, create an executable and call that.
> > 
> > Having direct integration with a KMS would certainly be valuable, and I
> > don't see a reason to deny users that option if someone would like to
> > spend time implementing it- in addition to a simpler mechanism such as a
> > passphrase command, which I believe is what was being suggested here.
> 
> OK,  I am just trying to see why we would not use the
> cluster_passphrase_command-like interface to do that.

One problem that occurs to me is that PG may need to send some sort of
credentials to the KMS. If it runs a separate process to execute the command,
it needs to pass those credentials to it. Whether it does so via command-line
parameters or environment variables, either can be seen by other users.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com