Re: Rework manipulation and structure of attribute mappings

2019-11-21 Thread Michael Paquier
On Fri, Nov 22, 2019 at 02:21:41PM +0900, Amit Langote wrote:
> Thanks for working on this.  I guess this is a follow up to:
> https://www.postgresql.org/message-id/20191102052001.GB1614%40paquier.xyz

Exactly.  It had been on my mind for a couple of days, so I gave it a
shot, and the result turned out rather nicely.  And here we are.

> On Thu, Nov 21, 2019 at 1:26 PM Michael Paquier  wrote:
> The refactoring to use AttrMap looks good, though attmap.c as a
> separate module contains too little functionality (just palloc and
> pfree) to be a separate file, IMHO.  If we are to build a separate
> module, why not move a bit more functionality into it from
> tupconvert.c?  How about moving all the code that actually creates the
> map to attmap.c?  The entry points would be
> convert_tuples_by_name_map() and convert_tuples_by_name_map_if_req(),
> which return an AttrMap, rather than simply make_attrmap(int len),
> which can be a static routine.

Yeah, the current patch is a bit thin in that respect.  Moving
convert_tuples_by_name_map() and its _if_req variant to attmap.c makes
sense.

> Actually, we should also refactor
> convert_tuples_by_position() to carve out the code that builds the
> AttrMap into a separate function and move it to attmap.c.

Not sure how to name that.  One version matches by attribute name and
the other by position (with type checks).  So the one based on the
attribute name could be something like build_attrmap_by_name(), and the
second one attrmap_build_by_position()?  Or we could use a different
convention, like AttrMapBuildByPosition.  Any suggestions for names are
welcome.  Please note that I still have a commit fest to run and
finish, so I am unlikely to come back to this before December.  Let's
continue tuning the function names, though.
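
For illustration, a rough sketch of what such an attmap.h could end up
looking like, using the candidate names above and the attnums/maplen
fields that come up later in this thread; the exact names and
signatures are still open, so treat this as an assumption rather than
the patch's API:

#include "access/attnum.h"
#include "access/tupdesc.h"

/*
 * Sketch only: a possible attmap.h interface.  make_attrmap(int len)
 * could stay a static helper inside attmap.c.
 */
typedef struct AttrMap
{
	AttrNumber *attnums;	/* parent-to-child attribute numbers */
	int			maplen;		/* number of entries in attnums */
} AttrMap;

/* Build a map by matching attribute names between the two descriptors. */
extern AttrMap *build_attrmap_by_name(TupleDesc indesc, TupleDesc outdesc);

/* Build a map positionally, checking that column types are compatible. */
extern AttrMap *build_attrmap_by_position(TupleDesc indesc, TupleDesc outdesc);

extern void free_attrmap(AttrMap *map);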

> To be honest, "convert_tuples_" part in those names might start
> sounding a bit outdated in the future, so we should really consider
> inventing a new interface map_attributes(TupleDesc indesc, TupleDesc
> outdesc), because most call sites that fetch the AttrMap directly
> don't really use it for "converting tuples", but to convert
> expressions or to map key arrays.
>
> After all the movement, tupconvert.c will only retain the
> functionality to build a TupleConversionMap (wrapping the AttrMap) and
> to convert HeapTuples, that is, execute_attr_map_tuple() and
> execute_attr_map_slot(), which makes sense.

Agreed.  Let's design that carefully.

> Actually, the patch can make addFkRecurseReferencing() crash, because
> the fkattnum[] array doesn't really contain attmap->maplen elements:
> 
> -for (int j = 0; j < numfks; j++)
> -mapped_fkattnum[j] = attmap[fkattnum[j] - 1];
> +for (int j = 0; j < attmap->maplen; j++)
> +mapped_fkattnum[j] = attmap->attnums[fkattnum[j] - 1];
> 
> You failed to notice that j is really used as an index into
> fkattnum[], not into the map array returned by
> convert_tuples_by_name().  So, I think the original coding is fine
> here.

Ouch, yes.  The regression tests did not complain about this one,
which means that we could improve the coverage.  The second one,
though...  I need to check it more closely.
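
For reference, a minimal sketch of the intended coding (not taken from
the patch): j keeps indexing the foreign key's own columns, and only
the lookup goes through the new struct:

/* sketch: the loop bound stays numfks, the map is only used for the lookup */
for (int j = 0; j < numfks; j++)
	mapped_fkattnum[j] = attmap->attnums[fkattnum[j] - 1];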
--
Michael




Re: [PATCH] Do not use StdRdOptions in Access Methods

2019-11-21 Thread Michael Paquier
On Fri, Nov 22, 2019 at 11:01:54AM +0900, Amit Langote wrote:
> The latest patch looks good to me, except, maybe the comment of
> StdRdOptions should be updated:
> 
>  * StdRdOptions
>  *  Standard contents of rd_options for heaps and generic indexes.
> 
> IIUC, StdRdOptions no longer applies to indexes, right?

Noted, thanks.  I'll sit on this thing for a couple of days, and will
likely look at it again on Monday in order to commit it.
--
Michael




Re: [PATCH] Do not use StdRdOptions in Access Methods

2019-11-21 Thread Michael Paquier
On Thu, Nov 21, 2019 at 09:39:53PM +0300, Nikolay Shaplov wrote:
> BRIN_AM_OID and friends are defined in catalog/pg_am_d.h, so for core
> indexes we can do a relation->rd_rel->relam == BRIN_AM_OID check.  But
> for contrib indexes we can't do such a thing.
> The bloom index does not need such a check, as it uses its options
> only when the index is created, and at that point you cannot choose
> the wrong relation.  But if in the future we have some contrib index
> that uses options when data is inserted (as is done with fillfactor in
> core indexes), then the index author will not be able to do such a
> relam check.  I would not call it a big problem, but it is something
> to think about, for sure...

I don't think that you actually need that for custom index AMs anyway,
as all the code paths leading to the lookup of their reloption values
are within the module they are defined in.
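
To make that concrete, a small sketch (the options struct is
hypothetical, standing in for something like bloom's): a core AM can
assert its relam against the generated OIDs from catalog/pg_am_d.h,
while an extension AM just casts rd_options to its own struct inside
its own code paths:

#include "postgres.h"
#include "catalog/pg_am_d.h"
#include "utils/rel.h"

/* hypothetical options struct for an access method */
typedef struct MyAmOptions
{
	int32		vl_len_;		/* varlena header */
	int			some_option;
} MyAmOptions;

static int
my_am_get_option(Relation rel)
{
	/* a core index AM could also Assert(rel->rd_rel->relam == BRIN_AM_OID) */
	MyAmOptions *opts = (MyAmOptions *) rel->rd_options;

	return opts ? opts->some_option : 0;	/* fall back to a default */
}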

> Thanks for joining this work, and sorry for the late replies.  These
> days I rarely have time for Postgres :-(

We all have a life, don't worry.  I am glad to see you around.
--
Michael




Re: Why overhead of SPI is so large?

2019-11-21 Thread Konstantin Knizhnik



On 22.11.2019 10:08, Pavel Stehule wrote:

> I tested it, and there is already a problem.  We don't raise an
> exception, but the result is wrong:
>
> create table foo(a int);
>
> create or replace function f1(int)
> returns void as $$
> begin
>   insert into foo values($1);
> end;
> $$ language plpgsql;
>
> create or replace function f2(int)
> returns void as $$declare r record;
> begin
>   perform f1(); for r in select * from foo loop raise notice '%', r; end loop;
> end;
> $$ language plpgsql immutable; -- or stable with the same behavior
>
> So the current state is:
>
> a) we allow volatile functions that actually write to be called from
> non-volatile (stable, immutable) functions;
> b) but the change is not visible in the parent non-volatile function;
> it is visible only in volatile functions.
>
> Is this the expected behavior?


I think that, in theory, it is definitely not correct to call a
volatile function from a non-volatile one.

But there are two questions:
1. Are we able to check it?  Please take into account that:
 - at the moment "create function f2()" is executed, the called
function f1() may not be defined yet;
 - instead of "perform f1()" it could do "execute 'select f1()'", which
is not possible to check at compile time.
2. Is it the programmer's responsibility to specify function properties
correctly, or should it be done by the compiler?
  If we follow the YWIYGI rule, then your definition of f2() is not
correct, and that is why you get a wrong result in this case.
  If we want to rely completely on the compiler, then... we do not need
the volatile/immutable/stable qualifiers at all!  The compiler should
deduce this information itself.

  But that would be non-trivial, if possible at all - take point 1 into
account.

In principle it is possible to add checks that would produce a warning
when a volatile function is called, or dynamic SQL is executed, from a
non-volatile function.
If such extra checking is considered useful, I can propose a patch
doing it.
But IMHO the optimizer should rely on the function qualifiers provided
by the programmer, and it is acceptable to produce a wrong result if
this information is incorrect.




> So the issues described above already exist.  And limiting this to
> just immutable functions is not enough - these functions would have to
> be built-in immutable functions.
>
> The core of these problems is that the function flags are tied to
> planner optimizations.
>
> Maybe additional flags - WRITE | READ | PURE - could help.
>
> Right now we don't know whether a volatile function writes to the
> database or not - surely random() doesn't.  We could implement a new
> set of flags that would reduce the snapshot overhead.
>
> The function random() could be VOLATILE PURE - then we would know that
> its result is not stable, but that it doesn't touch the data engine.
>
> When we don't have these flags, the current logic is used; when we do
> have them, the new logic is used.  These flags can be more strict:
>
> we should not allow any WRITE from a READ function, nor READ from a
> PURE function.
>
> Notes, comments?

I think that even the current model with "volatile", "immutable" and
"stable" is complex enough.

Adding more qualifiers would make it even more obscure and error-prone.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Attempt to consolidate reading of XLOG page

2019-11-21 Thread Antonin Houska
Michael Paquier  wrote:

> On Thu, Nov 21, 2019 at 05:05:50PM +0900, Michael Paquier wrote:
> > And with WAL segments at 1MB, I was seeing quite a slowdown with the
> > patch...  Then I have done an extra test with pg_waldump with the
> > segments generated previously with the output redirected to /dev/null.
> > Going through 512 segments takes 15.730s with HEAD (average of 3 runs)
> > and 15.851s with the patch.
> 
> Here are more tests with pg_waldump and 1MB/1GB segment sizes with
> records generated from pgbench, (7 runs, eliminated the two highest
> and two lowest, these are the remaining 3 runs as real time):
> 1) 1MB segment size, 512 segments:
> time pg_waldump 000100010C00 000100010F00 > /dev/null
> - HEAD: 0m4.512s, 0m4.446s, 0m4.501s
> - Patch + system's pg_read: 0m4.495s, 0m4.502s, 0m4.486s
> - Patch + fallback pg_read: 0m4.505s, 0m4.527s, 0m4.495s
> 2) 1GB segment size, 3 segments:
> time pg_waldump 000100020001 000100020003 > /dev/null
> - HEAD: 0m11.802s, 0m11.834s, 0m11.846s
> - Patch + system's pg_read: 0m11.939s, 0m11.991s, 0m11.966s
> - Patch + fallback pg_read: 0m12.054s, 0m12.066s, 0m12.159s
> So there is a tendency for a small slowdown here.  Still it is not
> that much, so I withdraw my concerns.

Thanks for the testing!

I thought that in [1] you were trying to discourage me from using
pg_pread(), but now it seems to be the opposite.  Ideally I'd like my
patch to add no overhead at all, but code simplicity should matter too.

As a clue, we can perhaps consider the fact that commit c24dcd0c removed
explicit lseek() also from XLogWrite(), but I'm not sure how much we can
compare XLOG writing and reading (I'd expect writing to be a bit less
sequential than reading because XLogWrite() may need to write the last page
more than once.)
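
To make the comparison concrete, here is a plain-POSIX sketch (not
PostgreSQL's actual wrappers; the guard symbol is only illustrative) of
what a pg_pread-style call saves over explicit repositioning:

#include <sys/types.h>
#include <unistd.h>

static ssize_t
read_at(int fd, void *buf, size_t nbytes, off_t offset)
{
#ifdef HAVE_PREAD				/* illustrative guard, not the real configure test */
	/* one syscall, and the kernel's file position is left untouched */
	return pread(fd, buf, nbytes, offset);
#else
	/* fallback: reposition explicitly, then read */
	if (lseek(fd, offset, SEEK_SET) < 0)
		return -1;
	return read(fd, buf, nbytes);
#endif
}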

Let's wait for Alvaro's judgement.

> Another thing:
> +void   WALReadRaiseError(WALReadError *errinfo);
> This is missing an "extern" declaration.

I'll fix this as well as the other problem reported in [1] as soon as I know
whether pg_pread() should be used or not.

[1] https://www.postgresql.org/message-id/20191121080550.GG153437%40paquier.xyz

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Attempt to consolidate reading of XLOG page

2019-11-21 Thread Michael Paquier
On Fri, Nov 22, 2019 at 12:49:33AM -0300, Alvaro Herrera wrote:
> Yes :-)  hopefully next week.  Thanks for reviewing.

Thanks, I am switching the entry to ready for committer then.  Please
note that the latest patch series has a conflict at the top of
walsender.c, easy enough to resolve, and that the function declaration
in xlogutils.h misses an "extern".  I personally find the last sentence
of the new comment block in xlogreader.h, which describes the new
segment-open callback in terms of BasicOpenFile() and open(),
unnecessary, because one could also use a transient file opened in the
backend, but I'll be fine with whatever you think fits best.  That's a
minor point.

Thanks Antonin for doing the refactoring effort.
--
Michael




Re: Why overhead of SPI is so large?

2019-11-21 Thread Pavel Stehule
On Fri, Nov 22, 2019 at 7:33 Kyotaro Horiguchi wrote:

> At Fri, 22 Nov 2019 06:15:25 +0100, Pavel Stehule 
> wrote in
> > On Thu, Nov 21, 2019 at 20:44 Tom Lane wrote:
> >
> > > Pavel Stehule  writes:
> > > On Thu, Nov 21, 2019 at 10:31 Konstantin Knizhnik
> > > <k.knizh...@postgrespro.ru> wrote:
> > > >> With contain_mutable_functions the patch becomes trivial.
> > >
> > > Stable functions don't need their own snapshot either, so it is not
> > > fully correct, but it is on the safe side.
> > >
> > > No, I doubt that.  A stable function is allowed to inspect database
> state,
> > > and if it's being called by a volatile function, it has every right to
> > > expect that it'd see updates-so-far made by the volatile function.
> >
> > for this I need a new snapshot?
>
> It depends on what we regard as a "query" or "command" here.  It seems
> to me that every line in a plpgsql function is regarded as a "query"
> for a stable function, even if the function is called from another
> "query".
>
> In short, we need it, I think.
>

OK, then limiting this to only immutable functions is wrong.

I tested it, and there is already a problem.  We don't raise an
exception, but the result is wrong:


create table foo(a int);

create or replace function f1(int)
returns void as $$
begin
  insert into foo values($1);
end;
$$ language plpgsql;

create or replace function f2(int)
returns void as $$declare r record;
begin
  perform f1(); for r in select * from foo loop raise notice '%', r; end loop;
end;
$$ language plpgsql immutable; -- or stable with the same behavior

So the current state is:

a) we allow volatile functions that actually write to be called from
non-volatile (stable, immutable) functions;
b) but the change is not visible in the parent non-volatile function;
it is visible only in volatile functions.

Is this the expected behavior?

So the issues described above already exist.  And limiting this to
just immutable functions is not enough - these functions would have to
be built-in immutable functions.

The core of these problems is that the function flags are tied to
planner optimizations.

Maybe additional flags - WRITE | READ | PURE - could help.

Right now we don't know whether a volatile function writes to the
database or not - surely random() doesn't.  We could implement a new
set of flags that would reduce the snapshot overhead.

The function random() could be VOLATILE PURE - then we would know that
its result is not stable, but that it doesn't touch the data engine.

When we don't have these flags, the current logic is used; when we do
have them, the new logic is used.  These flags can be more strict:

we should not allow any WRITE from a READ function, nor READ from a
PURE function.

Notes, comments?

Pavel


This test should

>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


Re: Why overhead of SPI is so large?

2019-11-21 Thread Kyotaro Horiguchi
At Fri, 22 Nov 2019 06:15:25 +0100, Pavel Stehule  
wrote in 
> On Thu, Nov 21, 2019 at 20:44 Tom Lane wrote:
> 
> > Pavel Stehule  writes:
> > On Thu, Nov 21, 2019 at 10:31 Konstantin Knizhnik
> > <k.knizh...@postgrespro.ru> wrote:
> > >> With contain_mutable_functions the patch becomes trivial.
> >
> > > Stable functions don't need their own snapshot either, so it is not
> > > fully correct, but it is on the safe side.
> >
> > No, I doubt that.  A stable function is allowed to inspect database state,
> > and if it's being called by a volatile function, it has every right to
> > expect that it'd see updates-so-far made by the volatile function.
> 
> for this I need a new snapshot?

It depends on what we regard as a "query" or "command" here.  It seems
to me that every line in a plpgsql function is regarded as a "query"
for a stable function, even if the function is called from another
"query".

In short, we need it, I think.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Implementing Incremental View Maintenance

2019-11-21 Thread Tatsuo Ishii
Attached is the latest patch to add support for Incremental
Materialized View Maintenance (IVM).  IVM allows modifications made to
base tables to be reflected immediately in the target materialized
views.

Up to now, IVM supports materialized views using:

- Inner joins
- Some aggregate functions (count, sum, min, max, avg)
- GROUP BY
- Self joins

With the latest patch, IVM now supports subqueries in addition to the
above.

Known limitations are listed here:

https://github.com/sraoss/pgsql-ivm/issues

See more details at:
https://wiki.postgresql.org/wiki/Incremental_View_Maintenance

About subquery support:

The patch supports simple subqueries using EXISTS:

CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_exists_subquery AS SELECT
a.i, a.j FROM mv_base_a a WHERE EXISTS(SELECT 1 FROM mv_base_b b WHERE
a.i = b.i);

and subqueries in the FROM clause:

CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_subquery AS SELECT a.i,a.j
FROM mv_base_a a,( SELECT * FROM mv_base_b) b WHERE a.i = b.i;

Other forms of subqueries, such as the ones below, are not supported:

-- WHERE IN .. (subquery) is not supported
CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm03 AS SELECT i,j FROM
mv_base_a WHERE i IN (SELECT i FROM mv_base_b WHERE k < 103 );

-- subqueries in the target list are not supported
CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm05 AS SELECT i,j, (SELECT k
FROM mv_base_b b WHERE a.i = b.i) FROM mv_base_a a;

-- nested EXISTS subqueries are not supported
CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm11 AS SELECT a.i,a.j FROM
mv_base_a a WHERE EXISTS(SELECT 1 FROM mv_base_b b WHERE EXISTS(SELECT
1 FROM mv_base_b c WHERE b.i = c.i));

-- EXISTS subqueries with aggregate functions are not supported
CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_exists AS SELECT COUNT(*)
FROM mv_base_a a WHERE EXISTS(SELECT 1 FROM mv_base_b b WHERE a.i =
b.i) OR a.i > 5;

-- EXISTS subqueries with conditions other than AND are not supported
CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm10 AS SELECT a.i,a.j FROM
mv_base_a a WHERE EXISTS(SELECT 1 FROM mv_base_b b WHERE a.i = b.i) OR
a.i > 5;

This work has been done by Yugo Nagata (nag...@sraoss.co.jp) and Takuma
Hoshiai (hosh...@sraoss.co.jp).  The support for the EXISTS clause was
added by Takuma.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


IVM_v7.patch.gz
Description: Binary data


Re: adding partitioned tables to publications

2019-11-21 Thread Amit Langote
Hi Peter,

On Wed, Nov 20, 2019 at 4:55 PM Peter Eisentraut
 wrote:
> On 2019-11-18 09:53, Amit Langote wrote:
> > I have spent some time hacking on this.  With the attached updated
> > patch, adding a partitioned table to publication results in publishing
> > the inserts, updates, deletes of the table's leaf partitions as
> > inserts, updates, deletes of the table itself (it all happens inside
> > pgoutput).  So, the replication target table doesn't necessarily have
> > to be a partitioned table and even if it is partitioned its partitions
> > don't have to match one-to-one.
> >
> > One restriction remains though: partitioned tables on a subscriber
> > can't accept updates and deletes, because we'd need to map those to
> > updates and deletes of their partitions, including handling a tuple
> > possibly moving from one partition to another during an update.
>
> Right.  Without that second part, the first part isn't really that
> useful yet, is it?

I would say yes.

> I'm not sure what your intent with this patch is now.  I thought the
> previous behavior -- add a partitioned table to a publication and its
> leaf tables appear in the replication output -- was pretty welcome.  Do
> we not want that anymore?

Hmm, I thought it would be more desirable to not expose a published
partitioned table's leaf partitions to a subscriber, because it allows
the target table to be defined more flexibly.

> There should probably be an option to pick the behavior, like we do in
> pg_dump.

I don't understand which existing behavior you mean.  Can you clarify?

Regarding allowing users to choose between publishing partitioned
table changes using the leaf tables' schema versus the partitioned
table's own schema, I tend to agree that there would be value in that.
Users who choose the former will have to ensure that the target leaf
partitions match exactly.  Users who want flexibility in how the target
table is defined can use the latter.

> What happens when you add a leaf table directly to a publication?  Is it
> replicated under its own identity or under its ancestor partitioned
> table?  (What if both the leaf table and a partitioned table are
> publication members?)

If both a leaf partition and an ancestor belong to the same
publication, then leaf partition changes are replicated using the
ancestor's schema.  For a leaf partition to be replicated using its
own schema it must be published via a separate publication that
doesn't contain the ancestor.  At least that's what the current patch
does.
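
For readers following along, a rough sketch of the conversion being
described, reusing the tupconvert.c entry points mentioned elsewhere in
this digest (this is not the patch's code, and the exact signatures are
an assumption):

#include "access/htup.h"
#include "access/tupconvert.h"
#include "utils/rel.h"

/* Remap a leaf partition's tuple to the published ancestor's row type. */
static HeapTuple
remap_to_ancestor(Relation leafrel, Relation ancestorrel, HeapTuple tuple)
{
	TupleConversionMap *map;

	map = convert_tuples_by_name(RelationGetDescr(leafrel),
								 RelationGetDescr(ancestorrel));
	if (map != NULL)
		tuple = execute_attr_map_tuple(tuple, map);

	return tuple;
}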

Thanks,
Amit




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-21 Thread Amit Khandekar
On Fri, 22 Nov 2019 at 09:08, Amit Kapila  wrote:
> Have you tried before that fix , if not, can you once try by
> temporarily reverting that fix in your environment and share the
> output of each step?  After you get the error due to EOF, check that
> you have .spill files in pg_replslot// and then again try
> to get changes by pg_logical_slot_get_changes().   If you want, you
> can use the test provided in Amit Khandekar's patch.

On my Linux machine, I added an elog() in ReorderBufferRestoreChanges(),
just after FileRead() returns 0.  This results in an error.  But the
thing is, in ReorderBufferCommit() the error is already handled using
PG_CATCH:

PG_CATCH();
{
.
   AbortCurrentTransaction();
...
   if (using_subtxn)
  RollbackAndReleaseCurrentSubTransaction();


   /* remove potential on-disk data, and deallocate */
   ReorderBufferCleanupTXN(rb, txn);
}

So ReorderBufferCleanupTXN() removes all the .spill files using unlink().

And on Windows, what should happen is: unlink() should succeed
because the file is opened using FILE_SHARE_DELETE.  But the files
should still remain there because they are still open; a file is just
marked for deletion until no one has it open anymore.  That is my
conclusion from running the attached sample program test.c.
But what you are seeing is "Permission denied" errors, so I am not sure
why unlink() is failing.

The thing that is still a problem is: on Windows, if the file remains
open, then even when the unlink() later succeeds, the file will be left
there until it is closed.  So subsequent operations will open the same
old file.  I am not sure what happens if we open a file that is marked
for deletion.

-

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <io.h>					/* for _unlink() */

void pgwin32_open(const char *fileName, int fileFlags,...)
{

	HANDLE		h;
	SECURITY_ATTRIBUTES sa;

	sa.nLength = sizeof(sa);
	sa.bInheritHandle = TRUE;
	sa.lpSecurityDescriptor = NULL;
	
	h = CreateFile(fileName,
		   GENERIC_READ,
	/* These flags allow concurrent rename/unlink */
		   (FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE),
		   &sa,
		   OPEN_EXISTING,
		   FILE_ATTRIBUTE_NORMAL ,
		   NULL);

	if (h == INVALID_HANDLE_VALUE)
	{
		printf("File could not be opened. Error %d\n", GetLastError());
		exit(1);
	}

	/* CloseHandle(h); */


	if (_unlink(fileName) != 0)
	{
		printf("File could not be deleted. Error %d\n", GetLastError());
		exit(1);
	}
}

int main(void)
{
	pgwin32_open("c:/temp/amit/file", 0 /* O_RDONLY | PG_BINARY */);
}


Re: Rework manipulation and structure of attribute mappings

2019-11-21 Thread Amit Langote
Hi Michael,

Thanks for working on this.  I guess this is a follow up to:
https://www.postgresql.org/message-id/20191102052001.GB1614%40paquier.xyz

On Thu, Nov 21, 2019 at 1:26 PM Michael Paquier  wrote:
>
> Hi all,
>
> After working on dc816e58, I have noticed that what we are doing with
> attribute mappings is not that good.  In a couple of code paths of the
> rewriter, the executor, and more particularly ALTER TABLE, when
> working on the creation of inherited relations or partitions an
> attribute mapping gets used to make sure that the new cloned elements
> (indexes, fk, etc.) have correct definitions linked correctly from the
> parent to the child's attributes.
>
> Sometimes things can go wrong, because the attribute array is just an
> AttrNumber pointer and it is up to the caller building the map to
> guess which length it has.  Existing callers do that fine, but this
> can lead to errors as recent history has proved.
>
> Attached is a patch to refactor all that, which simply stores the
> attribute mapping length directly with the attribute list.  The nice
> effect of the refactoring is that callers that want an attribute map
> no longer need to think about which length it should have, and it
> becomes possible to check the expected number of attributes in the
> map, particularly in the executor part.  A couple of structures also
> have their logic simplified.

The refactoring to use AttrMap looks good, though attmap.c as a
separate module contains too little functionality (just palloc and
pfree) to be a separate file, IMHO.  If we are to build a separate
module, why not move a bit more functionality into it from
tupconvert.c?  How about moving all the code that actually creates the
map to attmap.c?  The entry points would be
convert_tuples_by_name_map() and convert_tuples_by_name_map_if_req(),
which return an AttrMap, rather than simply make_attrmap(int len),
which can be a static routine.  Actually, we should also refactor
convert_tuples_by_position() to carve out the code that builds the
AttrMap into a separate function and move it to attmap.c.

To be honest, "convert_tuples_" part in those names might start
sounding a bit outdated in the future, so we should really consider
inventing a new interface map_attributes(TupleDesc indesc, TupleDesc
outdesc), because most call sites that fetch the AttrMap directly
don't really use it for "converting tuples", but to convert
expressions or to map key arrays.

After all the movement, tupconvert.c will only retain the
functionality to build a TupleConversionMap (wrapping the AttrMap) and
to convert HeapTuples, that is, execute_attr_map_tuple() and
execute_attr_map_slot(), which makes sense.

Thoughts?

> On top of that, I have spotted two fishy attribute mapping calculations
> in addFkRecurseReferencing() when adding a foreign key for partitions
> when there are dropped columns, and in CloneFkReferencing().  The
> mapping was using the number of attributes from the foreign key, which
> can be lower than the mapping of the parent if there are dropped
> columns in-between.  I am pretty sure that if some attributes of the
> parent are dropped (aka the mapping is set to 0 in the middle of its
> array) then we could finish with an incorrect attribute mapping, and I
> suspect that this could lead to failures similar to what was fixed in
> dc816e58, but I have not spent much time yet on that part.

Actually, the patch can make addFkRecurseReferencing() crash, because
the fkattnum[] array doesn't really contain attmap->maplen elements:

-for (int j = 0; j < numfks; j++)
-mapped_fkattnum[j] = attmap[fkattnum[j] - 1];
+for (int j = 0; j < attmap->maplen; j++)
+mapped_fkattnum[j] = attmap->attnums[fkattnum[j] - 1];

You failed to notice that j is really used as an index into fkattnum[],
not into the map array returned by convert_tuples_by_name().  So, I
think the original coding is fine here.

Thanks,
Amit




Re: range_agg

2019-11-21 Thread Pavel Stehule
On Thu, Nov 21, 2019 at 21:15 Paul Jungwirth
<p...@illuminatedcomputing.com> wrote:

> On 11/21/19 1:06 AM, Pavel Stehule wrote:
> > 2. I don't like the introduction of "safe" operators - now the basic
> > operators are doubled, and nobody will use the @* operators without
> > reading the documentation.
> >
> > It is not intuitive.  I think it is better to map this functionality
> > onto the basic operators +, -, * and implement it just for the pairs
> > (Multirange, Multirange) and (Multirange, Range), if that is possible.
> >
> > It's the same relationship as with numeric and integer.  No new
> > operators should be introduced.  If somebody needs it for ranges, they
> > can cast to multirange and continue.
> > [snip]
> > 3. The casts are not provided -
> >
> > postgres=# select int8range(10,15)::int8multirange;
> > ERROR:  cannot cast type int8range to int8multirange
> > LINE 1: select int8range(10,15)::int8multirange;
> > ^
> > There should be either a) a fully generic solution, or b) the
> > possibility to build an implicit cast when a multirange type is
> > created.
>
> Okay, I like the idea of just having `range + range` and `multirange +
> multirange`, then letting you cast between ranges and multiranges. The
> analogy to int/numeric seems strong. I guess if you cast a multirange
> with more than one element to a range it will raise an error. That will
> let me clean up the docs a lot too.
>

I thought about it, and I think the cast from multirange to range is
useless; at a minimum it should be explicit.

On the other hand, the cast from range to multirange should be implicit.

The original patch did:

1. MR @x MR = MR
2. R @x R = MR
3. MR @x R = MR

I think #1 and #3 make sense, but without introducing a special
operator.  #2 is bad and can be handled by casting one of the operands.

Pavel


> Thanks!
>
> --
> Paul  ~{:-)
> p...@illuminatedcomputing.com
>


Re: ssl passphrase callback

2019-11-21 Thread Craig Ringer
On Fri, 15 Nov 2019 at 00:34, Andrew Dunstan 
wrote:

>
> On 11/14/19 11:07 AM, Bruce Momjian wrote:
> > On Thu, Nov 14, 2019 at 11:42:05AM +0100, Magnus Hagander wrote:
> >> On Wed, Nov 13, 2019 at 9:23 PM Tomas Vondra <
> tomas.von...@2ndquadrant.com>
> >> I think it would be beneficial to explain why shared object is more
> >> secure than an OS command. Perhaps it's common knowledge, but it's
> not
> >> quite obvious to me.
> >>
> >>
> >> Yeah, that probably wouldn't hurt. It's also securely passing from more
> than
> >> one perspective -- both from the "cannot be eavesdropped" (like putting
> the
> >> password on the commandline for example) and the requirement for
> escaping.
> > I think a bigger issue is that if you want to give people the option of
> > using a shell command or a shared object, and if you use two commands to
> > control it, it isn't clear what happens if both are defined.  By using
> > some character prefix to control if a shared object is used, you can use
> > a single variable and there is no confusion over having two variables
> > and their conflicting behavior.
> >
>
>
> I'm  not sure how that would work in the present instance. The shared
> preloaded module installs a function and defines the params it wants. If
> we somehow unify the params with ssl_passphrase_command that could look
> icky, and the module would have to parse the settings string. That's not
> a problem for the sample module which only needs one param, but it will
> be for other more complex implementations.
>
> I'm quite open to suggestions, but I want things to be tolerably clean.


If someone wants a shell command wrapper, they can load a contrib that
delegates the hook to a shell command. So we can just ship a contrib, which
acts both as test coverage for the feature, and a shell-command-support
wrapper for anyone who desires that.
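
A very rough sketch of what that contrib's callback could look like;
the callback shape follows OpenSSL's pem_password_cb, the GUC is
hypothetical, and how the module registers the callback with the server
is left out, since that is exactly what the patch under discussion
defines:

#include "postgres.h"

#include <stdio.h>
#include <string.h>

/* value of a hypothetical shell_passphrase.command GUC, set up in _PG_init() */
static char *passphrase_command = NULL;

/* pem_password_cb-shaped callback: run the command and return its first line */
static int
shell_passphrase_cb(char *buf, int size, int rwflag, void *userdata)
{
	FILE	   *fh;
	int			len = 0;

	if (passphrase_command == NULL)
		return 0;
	fh = popen(passphrase_command, "r");
	if (fh == NULL)
		return 0;
	if (fgets(buf, size, fh) != NULL)
	{
		/* strip the trailing newline the command most likely printed */
		len = (int) strlen(buf);
		while (len > 0 && (buf[len - 1] == '\n' || buf[len - 1] == '\r'))
			buf[--len] = '\0';
	}
	pclose(fh);
	return len;					/* number of passphrase bytes written */
}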


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: checkpointer: PANIC: could not fsync file: No such file or directory

2019-11-21 Thread Craig Ringer
On Thu, 21 Nov 2019 at 09:07, Justin Pryzby  wrote:

> On Tue, Nov 19, 2019 at 07:22:26PM -0600, Justin Pryzby wrote:
> > I was trying to reproduce what was happening:
> > set -x; psql postgres -txc "DROP TABLE IF EXISTS t" -c "CREATE TABLE t(i
> int unique); INSERT INTO t SELECT generate_series(1,99)"; echo
> "begin;SELECT pg_export_snapshot(); SELECT pg_sleep(9)" |psql postgres -At
> >/tmp/snapshot& sleep 3; snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`;
> PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t
> ALTER i TYPE bigint" -c CHECKPOINT; pg_dump -d postgres -t t --snap="$snap"
> |head -44;
> >
> > Under v12, with or without the CHECKPOINT command, it fails:
> > |pg_dump: error: query failed: ERROR:  cache lookup failed for index 0
> > But under v9.5.2 (which I found quickly), without CHECKPOINT, it instead
> fails like:
> > |pg_dump: [archiver (db)] query failed: ERROR:  cache lookup failed for
> index 16391
> > With the CHECKPOINT command, 9.5.2 works, but I don't see why it should
> be
> > needed, or why it would behave differently (or if it's related to this
> crash).
>
> Actually, I think that's at least related to documented behavior:
>
> https://www.postgresql.org/docs/12/mvcc-caveats.html
> |Some DDL commands, currently only TRUNCATE and the table-rewriting forms
> of ALTER TABLE, are not MVCC-safe. This means that after the truncation or
> rewrite commits, the table will appear empty to concurrent transactions, if
> they are using a snapshot taken before the DDL command committed.
>
> I don't know why CHECKPOINT allows it to work under 9.5, or if it's even
> related to the PANIC ..


The PANIC is a defense against potential corruptions that can be caused by
some kinds of disk errors. It's likely that we used to just ERROR and
retry, then the retry would succeed without getting upset.

fsync_fname() is supposed to ignore errors for files that cannot be opened.
But that same message may be emitted by a number of other parts of the
code, and it looks like you didn't have log_error_verbosity = verbose so we
don't have file/line info.

The only other place I see that emits that error where a relation path
could be a valid argument is in rewriteheap.c
in logical_end_heap_rewrite(). That calls the vfd layer's FileSync() and
assumes that any failure is a fsync() syscall failure. But FileSync() can
return failure if we fail to reopen the underlying file managed by the vfd
too, per FileAccess().

Would there be a legitimate case where a logical rewrite mapping file
could vanish without that being a problem?  If so, we should probably
be more tolerant there.
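
As a generic illustration of the distinction being asked about (plain
POSIX, not fd.c or rewriteheap.c): report a vanished file separately
from a genuine sync failure instead of folding both into one fatal
error:

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

typedef enum {SYNC_OK, SYNC_MISSING, SYNC_FAILED} SyncResult;

static SyncResult
sync_if_present(const char *path)
{
	int			fd = open(path, O_RDWR);

	if (fd < 0)
		return (errno == ENOENT) ? SYNC_MISSING : SYNC_FAILED;
	if (fsync(fd) != 0)
	{
		close(fd);
		return SYNC_FAILED;		/* a real fsync failure should stay fatal */
	}
	close(fd);
	return SYNC_OK;
}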


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Why overhead of SPI is so large?

2019-11-21 Thread Pavel Stehule
On Thu, Nov 21, 2019 at 20:44 Tom Lane wrote:

> Pavel Stehule  writes:
> > On Thu, Nov 21, 2019 at 10:31 Konstantin Knizhnik
> > <k.knizh...@postgrespro.ru> wrote:
> >> With contain_mutable_functions the patch becomes trivial.
>
> > Stable functions don't need their own snapshot either, so it is not
> > fully correct, but it is on the safe side.
>
> No, I doubt that.  A stable function is allowed to inspect database state,
> and if it's being called by a volatile function, it has every right to
> expect that it'd see updates-so-far made by the volatile function.
>

for this I need a new snapshot?


> regards, tom lane
>


Re: obsolete example

2019-11-21 Thread Pavel Stehule
On Fri, Nov 22, 2019 at 1:13 Michael Paquier wrote:

> On Thu, Nov 21, 2019 at 08:19:56PM -0300, Euler Taveira wrote:
> > On Thu, Nov 21, 2019 at 15:59, Pavel Stehule
> >  wrote:
> >>
> >> isn't src/tutorial/funcs.c obsolete?  There is no benefit for users.
> >
> > version-0 calling conventions were removed in v10.  It seems to have
> > been an oversight in commit 5ded4bd2140.  The tutorial needs some care
> > (I'm not volunteering to improve it).  I suggest unbreaking the funcs
> > module with 'mv funcs_new.c funcs.c'.
>
> No objections from here, let's get rid of it.  The docs actually make
> use of the V1 versions, and funcs_new.c is not even built (though it
> does compile).  Any objections to the attached?  On top of moving the
> file, there is one comment to update and a sentence to remove.  Some
> progress is always better than no progress.
>

+1

Pavel

> --
> Michael
>


Re: an OID >= 8000 in master

2019-11-21 Thread Kyotaro Horiguchi
At Thu, 21 Nov 2019 10:35:25 -0500, Tom Lane  wrote in 
> > If we don't intend what Peter pointed out (arranging low OIDs at
> > feature freeze), it can be done by moving OIDs to lower values at
> > commit.  (I don't mean committers should do that; it may be bothersome.)
> 
> Yes, that's exactly the point: when we discussed this new policy,
> it was agreed that making committers deal with the issue in each
> commit was an unreasonable burden.  Aside from just being more
> work, there's the risk that two committers working on different
> patches concurrently would choose to map the development OIDs to
> the same "final" OIDs.  It seems better to deal with the problem
> once at feature freeze.
> 
> Anyway, we've only had this policy in place for a few months.
> I'm not eager to redesign it until we've had more experience.

Thanks for the explanation.  Understood and agreed.

> >>> By the way even if we work this way, developers tend to pick up low
> >>> range OIDs since it is printed at the beginning of the output. I think
> >>> we should hide the whole list of unused oids defaultly and just
> >>> suggest random one.
> 
> >> -1, that pretty much destroys the point of the unused_oids script.
> 
> > Is the "point" is what the name suggests? The tool is, for developers,
> > a means of finding OIDs *usable for their project*. It doesn't seem
> > appropriate to show OIDs that developers are supposed to refrain from
> > using. In my proposal the tool still shows all unused OIDs as the name
> > suggests when some option specified.
> 
> The existing output seems perfectly clear to me.  What you propose
> just adds complication and reduces usefulness.

As for clarity, it looks perfectly clear to me as well.  I won't insist
on the change since no supporters have come forward.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pause recovery if pitr target not reached

2019-11-21 Thread Kyotaro Horiguchi
Hello Leif, Peter.

At Thu, 21 Nov 2019 12:50:18 +, "Leif Gunnar Erlandsen"  
wrote in 
> Adding another patch which is not only for recovery_target_time but also for 
> xid, name and lsn.
> 
> > After studying this a bit more, I think the current behavior is totally 
> > bogus and needs a serious
> > rethink.
> > 
> > If you specify a recovery target and it is reached, recovery pauses 
> > (depending on
> > recovery_target_action).
> > 
> > If you specify a recovery target and it is not reached when the end of the 
> > archive is reached
> > (i.e., restore_command fails), then recovery ends and the server is 
> > promoted, without any further
> > information. This is clearly wrong in multiple ways.
> 
> Yes, that is why I have created the patch.

This seems to be premised on repeated trial-and-error recovery
performed by experienced operators.  In that usage, I think the target
gradually moves backwards through the repetitions, so we need to start
from a clean backup for each attempt anyway; an unintended promotion
does no harm in that case.

From this perspective, I don't think the current behavior is totally
wrong, but issuing a FATAL at end-of-WAL before the target is reached
seems like a good thing to do.

> > I think what we should do is if we specify a recovery target and we don't 
> > reach it, we should
> > ereport(FATAL). Somewhere around
> > 
> Whether recovery pauses or a FATAL error is reported is not important,
> as long as it is possible to get some more WAL and continue recovery.
> Pausing has the benefit that tables in the database can be inspected.
> 
> > in StartupXLOG(), where we already check for other conditions that are 
> > undesirable at the end of
> > recovery. Then a user can make fixes either by getting more WAL files to 
> > restore and adjusting the
> > recovery target and starting again. I don't think pausing is the right 
> > behavior, but perhaps an
> > argument could be made to offer it as a nondefault behavior.
> 
> Pausing was chosen in the patch because pause is the expected behavior
> when the target is reached.
> 
> And the patch does not interfere with any other functionality as far as I 
> know.

With the current behavior, if the server promotes without stopping as
instructed by the target_action settings, that is a sign that something
is wrong.  But if the server pauses before reaching the target,
operators may overlook the message if they are not aware of that
behavior.  And if the server pauses in that case, I think there is
nothing useful they can do.

So +1 for FATAL.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Ordering of header file inclusion

2019-11-21 Thread vignesh C
On Thu, Nov 21, 2019 at 2:11 PM Amit Kapila  wrote:
>
> On Sat, Nov 16, 2019 at 7:01 AM vignesh C  wrote:
> >
> > On Tue, Nov 12, 2019 at 11:19 AM Amit Kapila  
> > wrote:
> > >
> > > On Tue, Nov 12, 2019 at 6:33 AM vignesh C  wrote:
> > > >
> > > >
> > > > Thanks Amit for your comments. Please find the updated patch which
> > > > does not include the changes mentioned above.
> > > >
> > >
> > > Thanks for working on this.  I have pushed your latest patch.
> > >
> >
> > Thanks Amit for pushing the patch.  I have re-verified and found that
> > changes need to be done in a few more places.  The main changes are
> > in the header files and the plpython source files.  The attached patch
> > handles them.  I have verified make check and make check-world,
> > including --with-python & --with-perl, on the following:
> > CentOS Linux release 7.7.1908
> > Red Hat Enterprise Linux Server release 7.1
> >
> > I have also verified the build with --llvm on CentOS Linux release 7.7.1908.
> >
>
> Thanks for finding the remaining places, the patch looks good to me.
> I hope this covers the entire code.  BTW, are you using some script to
> find this or is this a result of manual inspection of code?  I have
> modified the commit message in the attached patch.  I will commit this
> early next week unless someone else wants to review it.
>

I used a script to verify that the inclusions are sorted.  There are a
few files which I intentionally did not modify; they are mainly of the
type below, as in uuid-ossp.c:
#include "postgres.h"

#include "fmgr.h"
#include "port/pg_bswap.h"
#include "utils/builtins.h"
#include "utils/uuid.h"

/*
 * It's possible that there's more than one uuid.h header file present.
 * We expect configure to set the HAVE_ symbol for only the one we want.
 *
 * BSD includes a uuid_hash() function that conflicts with the one in
 * builtins.h; we #define it out of the way.
 */
#define uuid_hash bsd_uuid_hash

#if defined(HAVE_UUID_H)
#include <uuid.h>
#elif defined(HAVE_OSSP_UUID_H)
#include <ossp/uuid.h>
#elif defined(HAVE_UUID_UUID_H)
#include <uuid/uuid.h>


After those inclusions, these files have a #define and further includes
based on #if defined.  In a few cases I saw the include happen at the
end of the file, as in regcomp.c, where changing it might have an
impact.  I felt it is better not to change these files.  Let me know
your thoughts.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Assertion failing in master, predicate.c

2019-11-21 Thread Tom Lane
Mark Dilger  writes:
> I have winnowed down the test a bit further.  The attached
> smaller patch still triggers the same assertion as the prior
> patch did.

FWIW, I can reproduce the assertion failure with your first test,
but not with this simplified one.

I also confirm that it only happens in HEAD, not v12.  I've not
actually bisected, but a look at the git history for predicate.c
sure makes it look like db2687d1f ("Optimize PredicateLockTuple")
must be to blame.

regards, tom lane




Re: Attempt to consolidate reading of XLOG page

2019-11-21 Thread Alvaro Herrera
On 2019-Nov-22, Michael Paquier wrote:

> Alvaro, you are marked as a committer of this CF entry.  Are you
> planning to look at it again?  Sorry for the delay from my side.

Yes :-)  hopefully next week.  Thanks for reviewing.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-21 Thread Amit Kapila
On Thu, Nov 21, 2019 at 8:32 PM Juan José Santamaría Flecha
 wrote:
>
> On Thu, Nov 21, 2019 at 5:02 AM Amit Kapila  wrote:
>>
>> On Wed, Nov 20, 2019 at 5:41 PM Juan José Santamaría Flecha
>>  wrote:
>> >
>> > On Wed, Nov 20, 2019 at 9:48 AM Amit Khandekar  
>> > wrote:
>> >>
>> >> On Wed, 20 Nov 2019 at 13:10, Amit Kapila  wrote:
>> >> > See comment in pgunlink() "We need to loop because even though
>> >> > PostgreSQL uses flags that allow unlink while the file is open, other
>> >> > applications might have the file
>> >> > open without those flags.".  Can you once see if there is any flag
>> >> > that you have missed to pass to allow this?
>> >>
>> >> > If there is nothing we
>> >> > can do about it, then we might need to use some different API or maybe
>> >> > define a new API that can handle this.
>> >>
>> >> There were objections against modifying the vfd api only for this
>> >> replication-related use-case. Having a new API will require all the
>> >> changes required to enable the virtual FDs feature that we need from
>> >> vfd. If nothing works out from the FILE_SHARE_DELETE thing, I am
>> >> thinking, we can use VFD, plus we can keep track of per-subtransaction
>> >> vfd handles, and do something similar to AtEOSubXact_Files().
>> >>
>> >
>> > The comment about "other applications might have the file open without 
>> > those flags." is surely due to systems working with an antivirus touching 
>> > Postgres files.
>> >
>> > I was not able to reproduce the Permission denied error with current HEAD,
>> >
>>
>> I am not sure what exactly you tried.  Can you share the steps and
>> your environment details?
>>
>
> Sure, I was trying to reproduce the Permission denied error after the 
> ERROR_HANDLE_EOF fix.
>

Have you tried before that fix?  If not, can you try once by
temporarily reverting that fix in your environment and share the
output of each step?  After you get the error due to EOF, check that
you have .spill files in pg_replslot// and then again try
to get changes with pg_logical_slot_get_changes().  If you want, you
can use the test provided in Amit Khandekar's patch.

>
> [1] Win10 (1903) MSVC 19.22.27905
>

I have tested this on Windows 7.  I am not sure if it is due to a
different version of Windows, but I think we can't rule out that
possibility.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: fix "Success" error messages

2019-11-21 Thread TAKATSUKA Haruka


On Thu, 21 Nov 2019 10:40:36 +0100
Peter Eisentraut  wrote:

> On 2019-11-21 02:42, TAKATSUKA Haruka wrote:
> >   FATAL: could not access status of transaction ..
> >   DETAIL: Could not read from file (pg_clog/ or pg_xact/) ...: 
> > Success.
> > 
> > This error caused the server to fail to start in recovery.
> > I got a report that it happened repeatedly on a newly created
> > standby cluster.  I advised them to check the low-level server
> > environment.
> > 
> > However, in addition to improving the message, should we retry
> > reading the rest of the data when too few bytes were read?
> > What about a limited number of retries instead of a complete loop?
> 
> If we thought that would help, there are probably hundreds or more other 
> places where we read files that would need to be fixed up in the same 
> way.  That doesn't seem reasonable.
> 
> Also, it is my understanding that short reads can in practice only 
> happen if the underlying storage is having a serious problem, so 
> retrying wouldn't actually help much.

OK, I understand.
In our case, the standby cluster's storage is on DRBD.
I will report the exact conditions under which it occurs if I can find
them.
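
For reference, the reporting pattern this thread is converging on can
be sketched in plain POSIX like this (not the actual slru.c code): a
short read leaves errno untouched, so the message should state how many
bytes arrived instead of printing %m, which is what produced the
misleading "Success":

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

static int
read_exact(int fd, char *buf, size_t expected)
{
	ssize_t		n;

	errno = 0;
	n = read(fd, buf, expected);
	if (n < 0)
	{
		perror("read failed");	/* errno is meaningful for a real failure */
		return -1;
	}
	if ((size_t) n != expected)
	{
		/* short read: errno is NOT set, so report the byte counts instead */
		fprintf(stderr, "read only %zd of %zu bytes\n", n, expected);
		return -1;
	}
	return 0;
}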

thanks,
Haruka Takatsuka





Re: Assertion failing in master, predicate.c

2019-11-21 Thread Mark Dilger



On 11/21/19 6:20 PM, Mark Dilger wrote:

Hackers,

I stumbled upon an assertion while testing master for possible
bugs.  I am reporting it here in the hope that this report will
be useful.  The attached small regression test patch consistently
triggers an assert in predicate.c:

   TRAP: FailedAssertion("!isCommit || 
SxactIsPrepared(MySerializableXact)", File: "predicate.c", Line: 3372)


I have winnowed down the test a bit further.  The attached
smaller patch still triggers the same assertion as the prior
patch did.

--
Mark Dilger
diff --git a/src/test/regress/expected/nonsense.out 
b/src/test/regress/expected/nonsense.out
new file mode 100644
index 00..10ce3496e8
--- /dev/null
+++ b/src/test/regress/expected/nonsense.out
@@ -0,0 +1,3 @@
+SET SESSION CHARACTERISTICS AS TRANSACTION DEFERRABLE , ISOLATION LEVEL 
SERIALIZABLE;
+SET SESSION CHARACTERISTICS AS TRANSACTION NOT DEFERRABLE;
+LISTEN mouse;
diff --git a/src/test/regress/parallel_schedule 
b/src/test/regress/parallel_schedule
index d33a4e143d..0a47f299bf 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -121,3 +121,5 @@ test: fast_default
 
 # run stats by itself because its delay may be insufficient under heavy load
 test: stats
+
+test: nonsense
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index f86f5c5682..3cfebd7842 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -196,3 +196,4 @@ test: tuplesort
 test: event_trigger
 test: fast_default
 test: stats
+test: nonsense
diff --git a/src/test/regress/sql/nonsense.sql 
b/src/test/regress/sql/nonsense.sql
new file mode 100644
index 00..10ce3496e8
--- /dev/null
+++ b/src/test/regress/sql/nonsense.sql
@@ -0,0 +1,3 @@
+SET SESSION CHARACTERISTICS AS TRANSACTION DEFERRABLE , ISOLATION LEVEL 
SERIALIZABLE;
+SET SESSION CHARACTERISTICS AS TRANSACTION NOT DEFERRABLE;
+LISTEN mouse;


Assertion failing in master, predicate.c

2019-11-21 Thread Mark Dilger

Hackers,

I stumbled upon an assertion while testing master for possible
bugs.  I am reporting it here in the hope that this report will
be useful.  The attached small regression test patch consistently
triggers an assert in predicate.c:

  TRAP: FailedAssertion("!isCommit || 
SxactIsPrepared(MySerializableXact)", File: "predicate.c", Line: 3372)


I originally hit this from sources with less than recent
code checked out, but the error is the same in a recent,
fresh `git clone` (4a0aab14dcb35550b55e623a3c194442c5666084)
The problem does not reproduce for me in REL_12_STABLE, though the
same assertion does exist in that branch.

I built on my laptop:

  Linux 4.19.0-5-amd64 #1 SMP Debian 4.19.37-3 (2019-05-15) x86_64 
GNU/Linux


I built using

  `./configure --enable-cassert --enable-tap-tests --with-perl 
--with-python --with-tcl`


The perl, python, and tcl options don't appear to matter, as nothing
changes using

  `./configure --enable-cassert && make -j4 && make check-world`

--
Mark Dilger
diff --git a/src/test/regress/expected/nonsense.out 
b/src/test/regress/expected/nonsense.out
new file mode 100644
index 00..5537473a0b
--- /dev/null
+++ b/src/test/regress/expected/nonsense.out
@@ -0,0 +1,4 @@
+SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE NOT DEFERRABLE , 
DEFERRABLE , DEFERRABLE READ WRITE , READ ONLY , ISOLATION LEVEL SERIALIZABLE;
+SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE READ WRITE , NOT 
DEFERRABLE , DEFERRABLE READ WRITE , DEFERRABLE;
+NOTIFY pig ;
+LISTEN mouse;
diff --git a/src/test/regress/parallel_schedule 
b/src/test/regress/parallel_schedule
index d33a4e143d..0a47f299bf 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -121,3 +121,5 @@ test: fast_default
 
 # run stats by itself because its delay may be insufficient under heavy load
 test: stats
+
+test: nonsense
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index f86f5c5682..3cfebd7842 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -196,3 +196,4 @@ test: tuplesort
 test: event_trigger
 test: fast_default
 test: stats
+test: nonsense
diff --git a/src/test/regress/sql/nonsense.sql 
b/src/test/regress/sql/nonsense.sql
new file mode 100644
index 00..5537473a0b
--- /dev/null
+++ b/src/test/regress/sql/nonsense.sql
@@ -0,0 +1,4 @@
+SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE NOT DEFERRABLE , 
DEFERRABLE , DEFERRABLE READ WRITE , READ ONLY , ISOLATION LEVEL SERIALIZABLE;
+SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE READ WRITE , NOT 
DEFERRABLE , DEFERRABLE READ WRITE , DEFERRABLE;
+NOTIFY pig ;
+LISTEN mouse;


Re: [PATCH] Do not use StdRdOptions in Access Methods

2019-11-21 Thread Amit Langote
Hello,

On Thu, Nov 21, 2019 at 2:22 PM Michael Paquier  wrote:
> Any opinions or objections to share regarding the recent
> progress done?

The latest patch looks good to me, except, maybe the comment of
StdRdOptions should be updated:

 * StdRdOptions
 *  Standard contents of rd_options for heaps and generic indexes.

IIUC, StdRdOptions no longer applies to indexes, right?

Thanks,
Amit




Re: Attempt to consolidate reading of XLOG page

2019-11-21 Thread Michael Paquier
On Thu, Nov 21, 2019 at 05:05:50PM +0900, Michael Paquier wrote:
> And with WAL segments at 1MB, I was seeing quite a slowdown with the
> patch...  Then I have done an extra test with pg_waldump with the
> segments generated previously with the output redirected to /dev/null.
> Going through 512 segments takes 15.730s with HEAD (average of 3 runs)
> and 15.851s with the patch.

Here are more tests with pg_waldump and 1MB/1GB segment sizes with
records generated from pgbench, (7 runs, eliminated the two highest
and two lowest, these are the remaining 3 runs as real time):
1) 1MB segment size, 512 segments:
time pg_waldump 000100010C00 000100010F00 > /dev/null
- HEAD: 0m4.512s, 0m4.446s, 0m4.501s
- Patch + system's pg_read: 0m4.495s, 0m4.502s, 0m4.486s
- Patch + fallback pg_read: 0m4.505s, 0m4.527s, 0m4.495s
2) 1GB segment size, 3 segments:
time pg_waldump 000100020001 000100020003 > /dev/null
- HEAD: 0m11.802s, 0m11.834s, 0m11.846s
- Patch + system's pg_read: 0m11.939s, 0m11.991s, 0m11.966s
- Patch + fallback pg_read: 0m12.054s, 0m12.066s, 0m12.159s
So there is a tendency for a small slowdown here.  Still it is not
that much, so I withdraw my concerns.

Another thing:
+void   WALReadRaiseError(WALReadError *errinfo);
This is missing an "extern" declaration.

Alvaro, you are marked as a committer of this CF entry.  Are you
planning to look at it again?  Sorry for the delay from my side.
--
Michael




Re: obsolete example

2019-11-21 Thread Tom Lane
Michael Paquier  writes:
> No objections from here, let's get rid of it.  The docs actually make
> use of the V1 versions, and funcs_new.c is not even compiled (it does
> compile).  Any objections to the attached?  On top of moving the file,
> there is one comment to update and a sentence to remove.  Some
> progress is always better than no progress.

+1

regards, tom lane




Re: obsolete example

2019-11-21 Thread Michael Paquier
On Thu, Nov 21, 2019 at 08:19:56PM -0300, Euler Taveira wrote:
> On Thu, Nov 21, 2019 at 15:59, Pavel Stehule
>  wrote:
>>
>> isn't src/tutorial/funcs.c obsolete?  There is no benefit for users.
>
> version-0 calling conventions were removed in v10.  It seems to have
> been an oversight in commit 5ded4bd2140.  The tutorial needs some care
> (I'm not volunteering to improve it).  I suggest unbreaking the funcs
> module with 'mv funcs_new.c funcs.c'.

No objections from here, let's get rid of it.  The docs actually make
use of the V1 versions, and funcs_new.c is not even built (though it
does compile).  Any objections to the attached?  On top of moving the
file, there is one comment to update and a sentence to remove.  Some
progress is always better than no progress.
--
Michael
diff --git a/src/tutorial/funcs.c b/src/tutorial/funcs.c
index 0bc90d18de..cdd155ebbd 100644
--- a/src/tutorial/funcs.c
+++ b/src/tutorial/funcs.c
@@ -6,9 +6,6 @@
 
   The calling format for these functions is defined by the CREATE FUNCTION
   SQL statement that binds them to the backend.
-
-  NOTE: this file shows examples of "old style" function call conventions.
-  See funcs_new.c for examples of "new style".
 */
 
 #include "postgres.h"			/* general Postgres declarations */
@@ -18,94 +15,112 @@
 
 PG_MODULE_MAGIC;
 
-/* These prototypes just prevent possible warnings from gcc. */
-
-int			add_one(int arg);
-float8	   *add_one_float8(float8 *arg);
-Point	   *makepoint(Point *pointx, Point *pointy);
-text	   *copytext(text *t);
-text	   *concat_text(text *arg1, text *arg2);
-bool		c_overpaid(HeapTupleHeader t,	/* the current instance of EMP */
-	   int32 limit);
-
 
 /* By Value */
 
-int
-add_one(int arg)
+PG_FUNCTION_INFO_V1(add_one);
+
+Datum
+add_one(PG_FUNCTION_ARGS)
 {
-	return arg + 1;
+	int32		arg = PG_GETARG_INT32(0);
+
+	PG_RETURN_INT32(arg + 1);
 }
 
 /* By Reference, Fixed Length */
 
-float8 *
-add_one_float8(float8 *arg)
+PG_FUNCTION_INFO_V1(add_one_float8);
+
+Datum
+add_one_float8(PG_FUNCTION_ARGS)
 {
-	float8	   *result = (float8 *) palloc(sizeof(float8));
+	/* The macros for FLOAT8 hide its pass-by-reference nature */
+	float8		arg = PG_GETARG_FLOAT8(0);
 
-	*result = *arg + 1.0;
-
-	return result;
+	PG_RETURN_FLOAT8(arg + 1.0);
 }
 
-Point *
-makepoint(Point *pointx, Point *pointy)
+PG_FUNCTION_INFO_V1(makepoint);
+
+Datum
+makepoint(PG_FUNCTION_ARGS)
 {
+	Point	   *pointx = PG_GETARG_POINT_P(0);
+	Point	   *pointy = PG_GETARG_POINT_P(1);
 	Point	   *new_point = (Point *) palloc(sizeof(Point));
 
 	new_point->x = pointx->x;
 	new_point->y = pointy->y;
 
-	return new_point;
+	PG_RETURN_POINT_P(new_point);
 }
 
 /* By Reference, Variable Length */
 
-text *
-copytext(text *t)
+PG_FUNCTION_INFO_V1(copytext);
+
+Datum
+copytext(PG_FUNCTION_ARGS)
 {
+	text	   *t = PG_GETARG_TEXT_PP(0);
+
 	/*
-	 * VARSIZE is the total size of the struct in bytes.
+	 * VARSIZE_ANY_EXHDR is the size of the struct in bytes, minus the
+	 * VARHDRSZ or VARHDRSZ_SHORT of its header.  Construct the copy with a
+	 * full-length header.
 	 */
-	text	   *new_t = (text *) palloc(VARSIZE(t));
+	text	   *new_t = (text *) palloc(VARSIZE_ANY_EXHDR(t) + VARHDRSZ);
 
-	SET_VARSIZE(new_t, VARSIZE(t));
+	SET_VARSIZE(new_t, VARSIZE_ANY_EXHDR(t) + VARHDRSZ);
 
 	/*
-	 * VARDATA is a pointer to the data region of the struct.
+	 * VARDATA is a pointer to the data region of the new struct.  The source
+	 * could be a short datum, so retrieve its data through VARDATA_ANY.
 	 */
 	memcpy((void *) VARDATA(new_t), /* destination */
-		   (void *) VARDATA(t), /* source */
-		   VARSIZE(t) - VARHDRSZ);	/* how many bytes */
-	return new_t;
+		   (void *) VARDATA_ANY(t), /* source */
+		   VARSIZE_ANY_EXHDR(t));	/* how many bytes */
+	PG_RETURN_TEXT_P(new_t);
 }
 
-text *
-concat_text(text *arg1, text *arg2)
+PG_FUNCTION_INFO_V1(concat_text);
+
+Datum
+concat_text(PG_FUNCTION_ARGS)
 {
-	int32		arg1_size = VARSIZE(arg1) - VARHDRSZ;
-	int32		arg2_size = VARSIZE(arg2) - VARHDRSZ;
+	text	   *arg1 = PG_GETARG_TEXT_PP(0);
+	text	   *arg2 = PG_GETARG_TEXT_PP(1);
+	int32		arg1_size = VARSIZE_ANY_EXHDR(arg1);
+	int32		arg2_size = VARSIZE_ANY_EXHDR(arg2);
 	int32		new_text_size = arg1_size + arg2_size + VARHDRSZ;
 	text	   *new_text = (text *) palloc(new_text_size);
 
 	SET_VARSIZE(new_text, new_text_size);
-	memcpy(VARDATA(new_text), VARDATA(arg1), arg1_size);
-	memcpy(VARDATA(new_text) + arg1_size, VARDATA(arg2), arg2_size);
-	return new_text;
+	memcpy(VARDATA(new_text), VARDATA_ANY(arg1), arg1_size);
+	memcpy(VARDATA(new_text) + arg1_size, VARDATA_ANY(arg2), arg2_size);
+	PG_RETURN_TEXT_P(new_text);
 }
 
 /* Composite types */
 
-bool
-c_overpaid(HeapTupleHeader t,	/* the current instance of EMP */
-		   int32 limit)
+PG_FUNCTION_INFO_V1(c_overpaid);
+
+Datum
+c_overpaid(PG_FUNCTION_ARGS)
 {
+	HeapTupleHeader t = PG_GETARG_HEAPTUPLEHEADER(0);
+	int32		limit = PG_GETARG_INT32(1);
 	bool		isnull;
 	int32		salary

Re: [PATCH] Implement INSERT SET syntax

2019-11-21 Thread Gareth Palmer

> On 19/11/2019, at 5:05 PM, Gareth Palmer  wrote:
>> 
>> Since nobody has objected to this, I'm supposing that there's general
>> consensus that that design sketch is OK, and we can move on to critiquing
>> implementation details.  I took a look, and didn't like much of what I saw.

Attached is an updated patch with for_locking_clause added; the test cases
re-use existing tables, and the comments and documentation have been
expanded.

>> I'm setting this back to Waiting on Author.



insert-set-v4.patch
Description: Binary data


Re: obsolete example

2019-11-21 Thread Euler Taveira
On Thu, Nov 21, 2019 at 3:59 PM Pavel Stehule wrote:
>
> Isn't src/tutorial/funcs.c obsolete? There is no benefit for users.
>
Version-0 calling conventions were removed in v10. It seems to be an
oversight in commit 5ded4bd2140. The tutorial needs some care (I'm not
volunteering to improve it). I suggest unbreaking the funcs module with
'mv funcs_new.c funcs.c'.


-- 
   Euler Taveira   Timbira -
http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento




Re: Copyright information in source files

2019-11-21 Thread Tom Lane
Thomas Munro  writes:
> I'd like to get rid of those IDENTIFICATION lines completely (they are
> left over from the time when the project used CVS, and that section
> had a $Header$ "ident" tag, but in the git era, those ident tags are
> no longer in fashion).

I'm not for that.  Arguments about CVS vs git are irrelevant: the
usefulness of those lines comes up when you've got a file that's
not in your source tree but somewhere else.  It's particularly
useful for the Makefiles, which are otherwise often same-y and
hard to identify.

> There are other inconsistencies in the copyright messages, like
> whether we say "Portions" or not for PGDG, and whether we use 1996- or
> the year the file was created, and whether the Berkeley copyright is
> there or not (different people seem to have different ideas about
> whether that's needed for a post-Berkeley file).

Yeah, it'd be nice to have some greater consistency there.  My own
thought about it is that it's rare to have a file that's *completely*
de novo code, and can be guaranteed to stay that way --- more usually
there is some amount of copying&pasting, and then you have to wonder
how much of that material could be traced back to Berkeley.  So I
prefer to err on the side of including their copyright.  That line of
argument basically leads to the conclusion that all the copyright tags
should be identical, which doesn't seem like an unreasonable rule.

regards, tom lane




Re: Copyright information in source files

2019-11-21 Thread Thomas Munro
On Sun, Nov 17, 2019 at 6:36 AM vignesh C  wrote:
> I noticed that some of the source files do not include the copyright
> information. Most of the files include it, but a few do not. I felt it
> should be included. The attached patch contains the fix for including
> the copyright information in those source files. Let me know your
> thoughts on the same.

I'd like to get rid of those IDENTIFICATION lines completely (they are
left over from the time when the project used CVS, and that section
had a $Header$ "ident" tag, but in the git era, those ident tags are
no longer in fashion).

There are other inconsistencies in the copyright messages, like
whether we say "Portions" or not for PGDG, and whether we use 1996- or
the year the file was created, and whether the Berkeley copyright is
there or not (different people seem to have different ideas about
whether that's needed for a post-Berkeley file).




Re: range_agg

2019-11-21 Thread Paul Jungwirth

On 11/21/19 1:06 AM, Pavel Stehule wrote:
2. I don't like the introduction of "safe" operators - now the basic
operators are doubled, and nobody will use the @* operators without
reading the documentation.


It is not intuitive. I think it is better to map this functionality to
the basic operators + - * and implement it just for the pairs
(Multirange, Multirange) and (Multirange, Range) if that is possible.


It is the same relationship as numeric X integer. No new operators
should be introduced. If somebody needs it for ranges, they can cast to
multirange and continue.

> [snip]

3. There are no prepared casts -

postgres=# select int8range(10,15)::int8multirange;
ERROR:  cannot cast type int8range to int8multirange
LINE 1: select int8range(10,15)::int8multirange;
                                ^
There should be either a) a fully generic solution, or b) the possibility to
build an implicit cast when any multirange type is created.


Okay, I like the idea of just having `range + range` and `multirange + 
multirange`, then letting you cast between ranges and multiranges. The 
analogy to int/numeric seems strong. I guess if you cast a multirange 
with more than one element to a range it will raise an error. That will 
let me clean up the docs a lot too.


Thanks!

--
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: Why overhead of SPI is so large?

2019-11-21 Thread Tom Lane
Pavel Stehule  writes:
> On Thu, Nov 21, 2019 at 10:31 AM Konstantin Knizhnik <
> k.knizh...@postgrespro.ru> wrote:
>> With contain_mutable_functions the patch becomes trivial.

> Stable functions don't need their own snapshot either, so it is not fully
> correct, but it is on the safe side.

No, I doubt that.  A stable function is allowed to inspect database state,
and if it's being called by a volatile function, it has every right to
expect that it'd see updates-so-far made by the volatile function.

regards, tom lane




Re: TAP tests aren't using the magic words for Windows file access

2019-11-21 Thread Juan José Santamaría Flecha
On Thu, Nov 21, 2019 at 3:35 PM Andrew Dunstan <
andrew.duns...@2ndquadrant.com> wrote:

>
> On 11/20/19 3:40 PM, Thomas Munro wrote:
> > On Fri, Nov 8, 2019 at 3:41 AM Andrew Dunstan
> >  wrote:
> >> On 11/7/19 9:12 AM, Alvaro Herrera wrote:
>  The patch says:
> 
>  +require Win32::API;
>  +Win32::API->import;
> >>> Oh, you're right, it does.  I wonder why, though:
> >>>
> >> On further inspection I think those lines are unnecessary. The remainder
> >> of the patch doesn't use this at all, AFAICT.
> > So does that mean we're back on, we can use a patch like Juan Jose's?
> > I'd love to get rid of these intermittent buildfarm failures, like
> > this one just now:
> >
> >
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2019-11-20%2010%3A00%3A10
> >
> > Here you can see:
> >
> > could not read
> "C:/prog/bf/root/HEAD/pgsql.build/src/bin/pg_ctl/tmp_check/t_004_logrotate_primary_data/pgdata/current_logfiles":
> > Permission denied at
> > C:/prog/bf/root/HEAD/pgsql.build/src/test/perl/TestLib.pm line 397.
> >
> > That line is in the subroutine slurp_file, and says open(my $in, '<',
> > $filename).  Using various clues from this thread, it seems like we
> > could, on Windows only, add code to TestLib.pm's INIT to rebind
> > *CORE::GLOBAL::open to a wrapper function that would just do
> > CreateFile(..., PLEASE_BE_MORE_LIKE_UNIX, ...).
>
>
> Possibly. I will do some testing on drongo in the next week or so.
>
>
I think Perl's open() is a bad candidate for an overload, so I will update
the previous patch that only touches slurp_file().

This version addresses the issues with the required libraries and uses
functions that expose less of the Windows API.
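
For reference, a rough standalone C sketch of the Windows sharing-mode idea
under discussion (this is not taken from the attached patch; the file name,
buffer handling and exact flag combination are assumptions):

#include <windows.h>
#include <stdio.h>

int
main(void)
{
	/*
	 * Open with all three FILE_SHARE_* flags so that another process that
	 * still holds the file (for example the server rotating
	 * current_logfiles) does not make this open fail with a sharing
	 * violation, which Perl reports as "Permission denied".
	 */
	HANDLE		h = CreateFileA("current_logfiles",
								GENERIC_READ,
								FILE_SHARE_READ | FILE_SHARE_WRITE |
								FILE_SHARE_DELETE,
								NULL,
								OPEN_EXISTING,
								FILE_ATTRIBUTE_NORMAL,
								NULL);
	char		buf[4096];
	DWORD		nread;

	if (h == INVALID_HANDLE_VALUE)
	{
		fprintf(stderr, "CreateFileA failed: %lu\n", GetLastError());
		return 1;
	}
	if (ReadFile(h, buf, sizeof(buf) - 1, &nread, NULL))
	{
		buf[nread] = '\0';
		printf("%s", buf);
	}
	CloseHandle(h);
	return 0;
}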

Regards,

Juan José Santamaría Flecha


0001-tap-file_share-windows-file-access-v2.patch
Description: Binary data


obsolete example

2019-11-21 Thread Pavel Stehule
Hi

Isn't src/tutorial/funcs.c obsolete? There is no benefit for users.

Regards

Pavel


Re: Why overhead of SPI is so large?

2019-11-21 Thread Pavel Stehule
On Thu, Nov 21, 2019 at 10:31 AM Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
> > I've set the CF entry to "Waiting on Author" pending a new patch
> > that does it like that.
>
> With contain_mutable_functions the patch becomes trivial.
>

Stable functions don't need their own snapshot either, so it is not fully
correct, but it is on the safe side.

Pavel


> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: dropdb --force

2019-11-21 Thread Pavel Stehule
On Thu, Nov 21, 2019 at 6:33 AM vignesh C wrote:

> On Mon, Nov 18, 2019 at 6:30 PM Pavel Stehule 
> wrote:
> >>
> >> I'll send this test today
> >
> >
> > here is it
> >
>
> Thanks for adding the test.
> Few comments:
> This function is same as in test/recovery/t/013_crash_restart.pl, we
> can add to a common file and use in both places:
> +# Pump until string is matched, or timeout occurs
> +sub pump_until
> +{
> +my ($proc, $stream, $untl) = @_;
> +$proc->pump_nb();
> +while (1)
> +{
> +last if $$stream =~ /$untl/;
> +if ($psql_timeout->is_expired)
> +{
>

Yes, I know - it could probably be moved to TestLib.pm. Unfortunately it is
Perl, and I am not able to do this correctly. Moreover, it uses a global
object - the timer.

This can be "Tests drop database with force option":
>

done

+#
> +# Tests killing session connected to dropped database
> +#
>
> This can be changed to check database foobar1 does not exist.
>

done

+# and there is not a database with this name
> +is($node->safe_psql('postgres', qq[SELECT EXISTS(SELECT * FROM
> pg_database WHERE datname='foobar1');]),
> +'f', 'database foobar1 was removed');
>
> This can be changed to check the connections on foobar1 database
> +
> +# and it is connected to foobar1 database
> +is($node->safe_psql('postgres', qq[SELECT pid FROM pg_stat_activity
> WHERE datname='foobar1' AND pid = $pid;]),
> +$pid, 'database foobar1 is used');
>

done


> This can be changed to restart psql as the previous connection will be
> terminated by previous drop database.
> +
>

done

new patch attached

Regards

Pavel


+# restart psql processes, now that the crash cycle finished
> +($killme_stdin, $killme_stdout, $killme_stderr) = ('', '', '');
> +$killme->run();
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
>


dropdb-force.pl-patch
Description: Binary data


Re: [PATCH] Do not use StdRdOptions in Access Methods

2019-11-21 Thread Nikolay Shaplov
In a message of Wednesday, November 20, 2019 16:44:18 MSK, Michael Paquier
wrote:

> > It seems to me that if the plan is to have one option structure for
> > each index AM, which has actually the advantage to reduce the bloat of
> > each relcache entry currently relying on StdRdOptions, then we could
> > have those extra assertion checks in the same patch, because the new
> > macros are introduced.
> I have looked at this patch, and did not like much having the
> calculations of the page free space around, so I have moved that into
> each AM's dedicated header.
Sounds like a good idea. I try not to touch such things, following the rule
"if it is not broken, do not fix it", but this way it is definitely better.
Thanks!

> > There is rd_rel->relam.  You can for example refer to pgstatindex.c
> > which has AM-related checks to make sure that the correct index AM is
> > being used.
> 
> We can do something similar for GIN and BRIN on top of the rest, so
> updated the patch with that.
That is what I've been trying to say when talking about code consistency. But
OK, this way is also good.

> Nikolay, I would be fine to commit this patch as-is.
Yeah. I've read the patch. I like it; actually, I started doing the same thing
myself, but you were faster. I only get the opportunity to pay attention to
Postgres once a week these days...

I like the patch, and also agree that it should be committed as is.

Though I have a notion to think about.

BRIN_AM_OID and friends are defined in catalog/pg_am_d.h, so for core indexes
we can do a relation->rd_rel->relam == BRIN_AM_OID check. But for contrib
indexes we can't do such a thing.
The bloom index does not need such a check, as it uses options only when the
index is created, and at that point you cannot choose the wrong relation. But
if in the future we have some contrib index that uses options when data is
inserted (as is done with fillfactor in core indexes), then the index author
will not be able to do such a relam check. I would not call it a big problem,
but it is something to think about, for sure...

> Thanks for your patience on this stuff.
Thanks for joining this work, and sorry for the late replies. These days I
quite rarely have time for Postgres :-(


-- 
Software Developer: https://www.upwork.com/freelancers/~014a87e140ff02c0da
Body-oriented Therapist: https://vk.com/nataraj_rebalancing  (Russian)




Re: Remove configure --disable-float4-byval and --disable-float8-byval

2019-11-21 Thread Peter Eisentraut

On 2019-11-02 08:39, Peter Eisentraut wrote:

On 2019-10-31 14:36, Tom Lane wrote:

Peter Eisentraut  writes:

float4 is now always pass-by-value; the pass-by-reference code path is
completely removed.


I think this is OK.


OK, here is a patch for just this part, and we can continue the
discussion on the rest in the meantime.


I have committed this part.

I will rebase and continue developing the rest of the patches based on 
the discussion so far.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: why doesn't optimizer can pull up where a > ( ... )

2019-11-21 Thread Tomas Vondra

On Thu, Nov 21, 2019 at 11:57:22PM +0800, Andy Fan wrote:

On Thu, Nov 21, 2019 at 6:12 PM Tomas Vondra 
wrote:


On Thu, Nov 21, 2019 at 08:30:51AM +0800, Andy Fan wrote:
>>
>>
>> Hm.  That actually raises the stakes a great deal, because if that's
>> what you're expecting, it would require planning out both the
transformed
>> and untransformed versions of the query before you could make a cost
>> comparison.
>
>
>I don't know an official name,  let's call it as "bloom filter push down
>(BFPD)" for reference.  this algorithm  may be helpful on this case with
>some extra effort.
>
>First, Take . "select ... from t1,  t2 where t1.a = t2.a and t1.b = 100"
>for example,  and assume t1 is scanned before t2 scanning, like hash
>join/sort merge and take t1's result as inner table.
>
>1.  it first scan t1  with filter t1.b = 100;
>2.  during the above scan,  it build a bloom filter *based on the join key
>(t1.a) for the "selected" rows.*
>3.  during scan t2.a,  it filters t2.a with the bloom filter.
>4.  probe the the hash table with the filtered rows from the above step.
>

So essentially just a hash join with a bloom filter?



Yes, the idea is exactly the same, but we treat its value differently (both
views are valid, and your point is the more common one). In my opinion, in
some environments like Oracle Exadata it is much more powerful, since it
transfers much less data from the data nodes to the compute nodes.

Of course, the benefit is not always there, but it is a good beginning toward
making it smarter.



Yes, it certainly depends on the workload. As was discussed in the
other thread, to get the most benefit we'd have to push the bloom filter
down the other side of the join as far as possible, ideally to the scan
nodes. But no one tried to do that.
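
For illustration, here is a minimal self-contained C sketch of the idea of a
hash join with a Bloom filter, as described in the steps quoted above (the
keys, hash functions and filter size are made up; this is not planner or
executor code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOOM_BITS 1024

typedef struct BloomFilter
{
	uint8_t		bits[BLOOM_BITS / 8];
} BloomFilter;

static uint32_t
hash1(int32_t key)
{
	return ((uint32_t) key * 2654435761u) % BLOOM_BITS;
}

static uint32_t
hash2(int32_t key)
{
	return ((uint32_t) key * 40503u + 7) % BLOOM_BITS;
}

static void
bloom_add(BloomFilter *bf, int32_t key)
{
	bf->bits[hash1(key) / 8] |= 1 << (hash1(key) % 8);
	bf->bits[hash2(key) / 8] |= 1 << (hash2(key) % 8);
}

static bool
bloom_maybe_contains(const BloomFilter *bf, int32_t key)
{
	return (bf->bits[hash1(key) / 8] & (1 << (hash1(key) % 8))) != 0 &&
		(bf->bits[hash2(key) / 8] & (1 << (hash2(key) % 8))) != 0;
}

int
main(void)
{
	/* inner side: t1.a values that survived the filter t1.b = 100 */
	int32_t		t1_keys[] = {3, 17, 42};
	/* outer side: t2.a values, mostly non-matching */
	int32_t		t2_keys[] = {1, 2, 3, 4, 17, 99};
	BloomFilter bf;
	int			i;

	memset(&bf, 0, sizeof(bf));

	/* steps 1-2: scan t1 and build the Bloom filter on the join key */
	for (i = 0; i < 3; i++)
		bloom_add(&bf, t1_keys[i]);

	/* step 3: filter t2 rows through the Bloom filter */
	for (i = 0; i < 6; i++)
	{
		if (!bloom_maybe_contains(&bf, t2_keys[i]))
			continue;			/* dropped before probing the hash table */
		/* step 4: only these rows would probe the real hash table */
		printf("t2.a = %d survives the Bloom filter\n", (int) t2_keys[i]);
	}
	return 0;
}

A false positive here only costs a wasted probe of the hash table, never a
wrong result, which is why the filter can be applied safely before the join.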




That doesn't seem very relevant to this thread (at least I don't see any
obvious link),



The original problem, "group by p_brand" over all the rows, may not be a
good idea all the time, and if we can do some filtering before the group by,
the result would be better.



Well, I think the vast majority of optimizations depend on the data. The
reason why I think these two optimizations are quite different is that
one (bloom filter with hash joins) is kinda localized and does not change
the general plan shape - you simply make the decision at the hash join
level, and that's it (although it's true it does affect row counts on
one side of the join).

The optimization discussed here is very different because it requires
transformation of the query very early, before we actually can judge if
it's a good idea or not.


And in some
cases building a bloom filter did result in nice speedups, but in other
cases it was just an extra overhead. But it does not require change of
plan shape, unlike the optimization discussed here.



I thought we could add a step named "build the filter" and another step named
"apply the filter". If so, the plan shape is changed. Anyway, I don't think
this is a key point.



Not sure. Perhaps there are similarities, but I don't see them.





Ultimately there were discussions about pushing the bloom filter much
deeper on the non-hash side, but that was never implemented.



Do you still have any plans for this feature, since I see you raised the
idea and it was very welcomed?



I'm not working on it, and I don't think I'll get to do that any time
soon. So feel free to look into the problem if you wish.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SQL/JSON: JSON_TABLE

2019-11-21 Thread Pavel Stehule
On Thu, Nov 21, 2019 at 5:31 PM Nikita Glukhov wrote:

> On 17.11.2019 13:35, Pavel Stehule wrote:
>
> Hi
>
> On Tue, Nov 12, 2019 at 10:51 PM Nikita Glukhov wrote:
>
>> On 12.11.2019 20:54, Pavel Stehule wrote:
>>
>> > Hi
>> >
>> > please, can you rebase 0001-SQL-JSON-functions-v40.patch. I have a
>> > problem with patching
>> >
>> > Pavel
>>
>> Attached 41th version of the patches rebased onto current master.
>>
>
> I am testing the functionality - randomly testing some examples that I
> found on the internet.
>
> Thank you for testing JSON_TABLE.
>
> I found:
>
> a) Oracle & MySQL (Oracle) support the EXISTS clause; this implementation
> does not. I think it would be useful to support this clause too.
>
> SELECT * FROM JSON_TABLE('...', '...' COLUMNS x INT EXISTS PATH ...
>
>
> EXISTS PATH clause can be emulated with jsonpath EXISTS() predicate:
>
> =# SELECT *
>FROM JSON_TABLE('{"a": 1}', '$'
>COLUMNS (
>  a bool PATH 'exists($.a)',
>  b bool PATH 'exists($.b)'
>));
>  a | b
> ---+---
>  t | f
> (1 row)
>
> But this works as expected only in lax mode.  In strict mode EXISTS() returns
> Unknown, which is transformed into SQL NULL:
>
> =# SELECT *
>FROM JSON_TABLE('{"a": 1}', '$'
>COLUMNS (
>  a bool PATH 'strict exists($.a)',
>  b bool PATH 'strict exists($.b)'
>));
>  a | b
> ---+---
>  t |
> (1 row)
>
> There is no easy way to return false without external COALESCE(),
> DEFAULT false ON ERROR also does not help.
>
> So, I think it's worth adding an EXISTS PATH clause to our implementation.
>
>
>
> There is a question how to map boolean result to other data types.
>
> Now, boolean result can be used in JSON_TABLE columns of bool, int4, text,
> json[b], and other types which have CAST from bool:
>
> SELECT *
> FROM JSON_TABLE('{"a": 1}', '$'
> COLUMNS (
>   a int PATH 'exists($.a)',
>   b text PATH 'exists($.b)'
> ));
>  a |   b
> ---+---
>  1 | false
> (1 row)
>
>
> b) When the searched value is not a scalar, it returns null. This behavior
> can be suppressed by the FORMAT JSON clause. I found a different behavior,
> and maybe I found a bug.  On MySQL this clause is the default for JSON
> values (which makes sense).
>
> SELECT *
>  FROM
>   JSON_TABLE(
> '[{"a":[1,2]}]',
> '$[*]'
> COLUMNS(
>  aj JSON PATH '$.a' DEFAULT '{"x": 333}' ON EMPTY
> )
>   ) AS tt;
>
> It returns null, although it should return [1,2].
>
> Yes, regular (non-formatted) JSON_TABLE columns can accept only scalar values.
> Otherwise an error is thrown, which can be caught by ON ERROR clause. This
> behavior is specified by the standard.
>
> FORMAT JSON is not implicitly added for json[b] columns now. The current SQL
> standard does not have any json data types, so I think we can add implicit
> FORMAT JSON for json[b] typed-columns.  But I'm a bit afraid that different
> behavior can be standardized after introduction of json data types in SQL.
>
>
> There is maybe another bug. Although there is a DEFAULT clause, it returns NULL.
>
> ON ERROR should be used if "not a scalar" error needs to be caught:
>
> SELECT *
> FROM
> JSON_TABLE(
> '[{"a":[1,2]}]',
> '$[*]'
> COLUMNS(
> aj JSON PATH '$.a' DEFAULT '{"x": 333}' ON ERROR
> )
> ) AS tt;
>
>  aj
> 
>  {"x": 333}
> (1 row)
>
>
> ON EMPTY catches only empty-result case (for example, non-existent path in
> lax mode):
>
> SELECT *
> FROM
> JSON_TABLE(
> '[{"a":[1,2]}]',
> '$[*]'
> COLUMNS(
> aj JSON PATH '$.foo' DEFAULT '{"x": 333}' ON EMPTY
> )
> ) AS tt;
>  aj
> 
>  {"x": 333}
> (1 row)
>
>
>
> I got the correct result when I used the FORMAT JSON clause.
> I think it should be the default behavior for json and jsonb columns.
>
> I agree that FORMAT JSON could be implicit for json[b] columns.  But I think
> there could be one minor problem if we want to verify that returned value is
> scalar.
>
> Without FORMAT JSON this is verified by the underlying JSON_VALUE expression:
>
> SELECT *
> FROM
> JSON_TABLE(
> '[{"a":[1,2]}]',
> '$[*]'
> COLUMNS (
> aj JSON PATH 'lax $.a' ERROR ON ERROR
> )
> ) AS tt;
> ERROR:  JSON path expression in JSON_VALUE should return singleton scalar item
>
> (This error message with the reference to implicit JSON_VALUE needs to be 
> fixed.)
>
>
> But with FORMAT JSON we need to construct complex jsonpath with a filter and
> override ON EMPTY behavior:
>
> SELECT *
> FROM
> JSON_TABLE(
> '[{"a":[1,2]}]',
> '$[*]'
> COLUMNS (
> aj JSON FORMAT JSON
> -- strict mode is mandatory to prevent array unwrapping
> PATH 'strict $.a ? (@.type() != "array" && @.type() != "object")'
> 

Re: SQL/JSON: JSON_TABLE

2019-11-21 Thread Nikita Glukhov

On 17.11.2019 13:35, Pavel Stehule wrote:


Hi

On Tue, Nov 12, 2019 at 10:51 PM Nikita Glukhov
<n.glu...@postgrespro.ru> wrote:


On 12.11.2019 20:54, Pavel Stehule wrote:

> Hi
>
> please, can you rebase 0001-SQL-JSON-functions-v40.patch. I have a
> problem with patching
>
> Pavel

Attached 41th version of the patches rebased onto current master.


I am testing the functionality - randomly testing some examples that I found
on the internet.



Thank you for testing JSON_TABLE.


I found:
a) Oracle & MySQL (Oracle) support the EXISTS clause; this implementation does
not. I think it would be useful to support this clause too.
SELECT * FROM JSON_TABLE('...', '...' COLUMNS x INT EXISTS PATH ...


EXISTS PATH clause can be emulated with jsonpath EXISTS() predicate:

=# SELECT *
   FROM JSON_TABLE('{"a": 1}', '$'
   COLUMNS (
 a bool PATH 'exists($.a)',
 b bool PATH 'exists($.b)'
   ));
 a | b
---+---
 t | f
(1 row)

But this works as expected only in lax mode.  In strict mode EXISTS() returns
Unknown, which is transformed into SQL NULL:

=# SELECT *
   FROM JSON_TABLE('{"a": 1}', '$'
   COLUMNS (
 a bool PATH 'strict exists($.a)',
 b bool PATH 'strict exists($.b)'
   ));
 a | b
---+---
 t |
(1 row)

There is no easy way to return false without external COALESCE(),
DEFAULT false ON ERROR also does not help.

So, I think it's worth adding an EXISTS PATH clause to our implementation.



There is a question how to map boolean result to other data types.


Now, boolean result can be used in JSON_TABLE columns of bool, int4, text,
json[b], and other types which have CAST from bool:

SELECT *
FROM JSON_TABLE('{"a": 1}', '$'
COLUMNS (
  a int PATH 'exists($.a)',
  b text PATH 'exists($.b)'
));
 a |   b
---+---
 1 | false
(1 row)


b) When the searched value is not a scalar, it returns null. This behavior can
be suppressed by the FORMAT JSON clause. I found a different behavior, and maybe
I found a bug.  On MySQL this clause is the default for JSON values (which makes
sense).
SELECT *
  FROM
       JSON_TABLE(
         '[{"a":[1,2]}]',
         '$[*]'
         COLUMNS(
          aj JSON PATH '$.a' DEFAULT '{"x": 333}' ON EMPTY
         )
       ) AS tt;
It returns null, although it should return [1,2].


Yes, regular (non-formatted) JSON_TABLE columns can accept only scalar values.
Otherwise an error is thrown, which can be caught by ON ERROR clause. This
behavior is specified by the standard.

FORMAT JSON is not implicitly added for json[b] columns now. The current SQL
standard does not have any json data types, so I think we can add implicit
FORMAT JSON for json[b] typed-columns.  But I'm a bit afraid that different
behavior can be standardized after introduction of json data types in SQL.


There is maybe another bug. Although there is a DEFAULT clause, it returns NULL.


ON ERROR should be used if "not a scalar" error needs to be caught:

SELECT *
FROM
JSON_TABLE(
'[{"a":[1,2]}]',
'$[*]'
COLUMNS(
aj JSON PATH '$.a' DEFAULT '{"x": 333}' ON ERROR
)
) AS tt;

 aj

 {"x": 333}
(1 row)


ON EMPTY catches only empty-result case (for example, non-existent path in
lax mode):

SELECT *
FROM
JSON_TABLE(
'[{"a":[1,2]}]',
'$[*]'
COLUMNS(
aj JSON PATH '$.foo' DEFAULT '{"x": 333}' ON EMPTY
)
) AS tt;
 aj

 {"x": 333}
(1 row)



I got the correct result when I used the FORMAT JSON clause.
I think it should be the default behavior for json and jsonb columns.


I agree that FORMAT JSON could be implicit for json[b] columns.  But I think
there could be one minor problem if we want to verify that returned value is
scalar.

Without FORMAT JSON this is verified by the underlying JSON_VALUE expression:

SELECT *
FROM
JSON_TABLE(
'[{"a":[1,2]}]',
'$[*]'
COLUMNS (
aj JSON PATH 'lax $.a' ERROR ON ERROR
)
) AS tt;
ERROR:  JSON path expression in JSON_VALUE should return singleton scalar item

(This error message with the reference to implicit JSON_VALUE needs to be 
fixed.)


But with FORMAT JSON we need to construct complex jsonpath with a filter and
override ON EMPTY behavior:

SELECT *
FROM
JSON_TABLE(
'[{"a":[1,2]}]',
'$[*]'
COLUMNS (
aj JSON FORMAT JSON
-- strict mode is mandatory to prevent array unwrapping
PATH 'strict $.a ? (@.type() != "array" && @.type() != "object")'
ERROR ON EMPTY ERROR ON ERROR
)
) AS tt;
ERROR:  no SQL/JSON item


Another question - when I used the FORMAT JSON clause, I got a syntax error
on the DEFAULT keyword... Is that correct?

Why can't I use the FORMAT JSON and DEFAULT clauses together?


JSON_TABLE columns with FORMAT JSON, like JSON_QUERY, 

Re: why doesn't optimizer can pull up where a > ( ... )

2019-11-21 Thread Andy Fan
On Thu, Nov 21, 2019 at 6:12 PM Tomas Vondra 
wrote:

> On Thu, Nov 21, 2019 at 08:30:51AM +0800, Andy Fan wrote:
> >>
> >>
> >> Hm.  That actually raises the stakes a great deal, because if that's
> >> what you're expecting, it would require planning out both the
> transformed
> >> and untransformed versions of the query before you could make a cost
> >> comparison.
> >
> >
> >I don't know an official name,  let's call it as "bloom filter push down
> >(BFPD)" for reference.  this algorithm  may be helpful on this case with
> >some extra effort.
> >
> >First, Take . "select ... from t1,  t2 where t1.a = t2.a and t1.b = 100"
> >for example,  and assume t1 is scanned before t2 scanning, like hash
> >join/sort merge and take t1's result as inner table.
> >
> >1.  it first scan t1  with filter t1.b = 100;
> >2.  during the above scan,  it build a bloom filter *based on the join key
> >(t1.a) for the "selected" rows.*
> >3.  during scan t2.a,  it filters t2.a with the bloom filter.
> >4.  probe the the hash table with the filtered rows from the above step.
> >
>
> So essentially just a hash join with a bloom filter?


Yes, the idea is exactly the same, but we treat its value differently (both
views are valid, and your point is the more common one). In my opinion, in
some environments like Oracle Exadata it is much more powerful, since it
transfers much less data from the data nodes to the compute nodes.

Of course, the benefit is not always there, but it is a good beginning toward
making it smarter.


> That doesn't seem very relevant to this thread (at least I don't see any
> obvious link),
>

The original problem, "group by p_brand" over all the rows, may not be a
good idea all the time, and if we can do some filtering before the group by,
the result would be better.

And in some
> cases building a bloom filter did result in nice speedups, but in other
> cases it was just an extra overhead. But it does not require change of
> plan shape, unlike the optimization discussed here.
>

I thought we could add a step named "build the filter" and another step named
"apply the filter". If so, the plan shape is changed. Anyway, I don't think
this is a key point.


>
> Ultimately there were discussions about pushing the bloom filter much
> deeper on the non-hash side, but that was never implemented.


Do you still have any plans for this feature, since I see you raised the
idea and it was very welcomed?

>Back to this problem,  if we have a chance to get the p_brand we are
> >interested,  we can use the same logic to only group by the p_brand.
> >
> >Another option may be we just keep the N versions, and search them
> >differently and compare their cost at last.
> >
>
> Maybe. I think the problem is going to be that with multiple such
> correlated queries you may significantly increase the number of plan
> variants to consider - each subquery may be transformed or not, so the
> space splits into 2. With 6 such subqueries you suddenly have 64x the
> number of plan variants you have to consider (I don't think you can just
> elimiate those early on).
>
> >>  The Greenplum page mentions they also added "join-aggregates
> >reordering", in addition to subquery unnesting.
> >Thanks,  I will search more about this.
> >
> >>Having said that, the best form of criticism is a patch.  If somebody
> >>actually wrote the code to do something like this, we could look at how
> >>much time it wasted in which unsuccessful cases and then have an
> >>informed discussion about whether it was worth adopting.
> >>
> >
> >I would try to see how far I can get.
>
> +1
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: an OID >= 8000 in master

2019-11-21 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Wed, 20 Nov 2019 23:45:21 -0500, Tom Lane  wrote in 
>> I do not think there is any easy solution that guarantees that.
>> We could imagine having some sort of pre-registration mechanism,
>> maybe, but it seems like more trouble than benefit.

> If we don't intend what Peter pointed out (arrangement of low OIDs at
> feature freeze), it can be done by moving OIDs to lower values at
> commit time. (I don't mean that committers should do that; it may be
> bothersome.)

Yes, that's exactly the point: when we discussed this new policy,
it was agreed that making committers deal with the issue in each
commit was an unreasonable burden.  Aside from just being more
work, there's the risk that two committers working on different
patches concurrently would choose to map the development OIDs to
the same "final" OIDs.  It seems better to deal with the problem
once at feature freeze.

Anyway, we've only had this policy in place for a few months.
I'm not eager to redesign it until we've had more experience.

>>> By the way, even if we work this way, developers tend to pick up
>>> low-range OIDs since they are printed at the beginning of the output.
>>> I think we should hide the whole list of unused OIDs by default and
>>> just suggest a random one.

>> -1, that pretty much destroys the point of the unused_oids script.

> Is the "point" is what the name suggests? The tool is, for developers,
> a means of finding OIDs *usable for their project*. It doesn't seem
> appropriate to show OIDs that developers are supposed to refrain from
> using. In my proposal the tool still shows all unused OIDs as the name
> suggests when some option specified.

The existing output seems perfectly clear to me.  What you propose
just adds complication and reduces usefulness.

regards, tom lane




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-21 Thread Juan José Santamaría Flecha
On Thu, Nov 21, 2019 at 5:02 AM Amit Kapila  wrote:

> On Wed, Nov 20, 2019 at 5:41 PM Juan José Santamaría Flecha
>  wrote:
> >
> > On Wed, Nov 20, 2019 at 9:48 AM Amit Khandekar 
> wrote:
> >>
> >> On Wed, 20 Nov 2019 at 13:10, Amit Kapila 
> wrote:
> >> > See comment in pgunlink() "We need to loop because even though
> >> > PostgreSQL uses flags that allow unlink while the file is open, other
> >> > applications might have the file
> >> > open without those flags.".  Can you once see if there is any flag
> >> > that you have missed to pass to allow this?
> >>
> >> > If there is nothing we
> >> > can do about it, then we might need to use some different API or maybe
> >> > define a new API that can handle this.
> >>
> >> There were objections against modifying the vfd api only for this
> >> replication-related use-case. Having a new API will require all the
> >> changes required to enable the virtual FDs feature that we need from
> >> vfd. If nothing works out from the FILE_SHARE_DELETE thing, I am
> >> thinking, we can use VFD, plus we can keep track of per-subtransaction
> >> vfd handles, and do something similar to AtEOSubXact_Files().
> >>
> >
> > The comment about "other applications might have the file open without
> those flags" is surely due to systems where an antivirus touches
> Postgres files.
> >
> > I was not able to reproduce the Permission denied error with current
> HEAD,
> >
>
> I am not sure what exactly you tried.  Can you share the steps and
> your environment details?
>
>
Sure, I was trying to reproduce the Permission denied error after
the ERROR_HANDLE_EOF fix.

1. Using a clean environment [1] the spill.sql script produces the expected
output.
2. I manually injected a negative value for readBytes after @@ -2611,10
+2627,11 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN
*txn. In doing so pg_logical_slot_get_changes() failed, but following
executions did not run into Permission denied.
3. During the cleanup of some of the tests, pg_drop_replication_slot()
failed because the "pg_replslot/regression_slot" folder was in use.

[1] Win10 (1903) MSVC 19.22.27905

Regards,

Juan José Santamaría Flecha


Re: TAP tests aren't using the magic words for Windows file access

2019-11-21 Thread Andrew Dunstan


On 11/20/19 3:40 PM, Thomas Munro wrote:
> On Fri, Nov 8, 2019 at 3:41 AM Andrew Dunstan
>  wrote:
>> On 11/7/19 9:12 AM, Alvaro Herrera wrote:
 The patch says:

 +require Win32::API;
 +Win32::API->import;
>>> Oh, you're right, it does.  I wonder why, though:
>>>
>> On further inspection I think those lines are unnecessary. The remainder
>> of the patch doesn't use this at all, AFAICT.
> So does that mean we're back on, we can use a patch like Juan Jose's?
> I'd love to get rid of these intermittent buildfarm failures, like
> this one just now:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2019-11-20%2010%3A00%3A10
>
> Here you can see:
>
> could not read 
> "C:/prog/bf/root/HEAD/pgsql.build/src/bin/pg_ctl/tmp_check/t_004_logrotate_primary_data/pgdata/current_logfiles":
> Permission denied at
> C:/prog/bf/root/HEAD/pgsql.build/src/test/perl/TestLib.pm line 397.
>
> That line is in the subroutine slurp_file, and says open(my $in, '<',
> $filename).  Using various clues from this thread, it seems like we
> could, on Windows only, add code to TestLib.pm's INIT to rebind
> *CORE::GLOBAL::open to a wrapper function that would just do
> CreateFile(..., PLEASE_BE_MORE_LIKE_UNIX, ...).


Possibly. I will do some testing on drongo in the next week or so.


cheers


andrew



-- 
Andrew Dunstan            https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Conflict handling for COPY FROM

2019-11-21 Thread Alexey Kondratov

On 18.11.2019 9:42, Surafel Temesgen wrote:
On Fri, Nov 15, 2019 at 6:24 PM Alexey Kondratov 
<a.kondra...@postgrespro.ru> wrote:



1) Maybe it is fine, but now I do not like this part:

+    portal = GetPortalByName("");
+    dest = CreateDestReceiver(DestRemote);
+    SetRemoteDestReceiverParams(dest, portal);
+    dest->rStartup(dest, (int) CMD_SELECT, tupDesc);

Here you implicitly use the fact that portal with a blank name is
always
created in exec_simple_query before we get to this point. Next, you
create new DestReceiver and set it to this portal, but it is also
already created and set in the exec_simple_query.

Would it be better if you just explicitly pass ready DestReceiver to
DoCopy (similarly to how it is done for T_ExecuteStmt /
ExecuteQuery),


Good idea .Thank you


Now the whole patch works exactly as expected for me and I cannot find 
any new technical flaws. However, the doc is rather vague, especially 
these places:


+  specifying it to -1 returns all error record.

Actually, we return only rows with constraint violations, but malformed 
rows are ignored with a warning. I guess that we simply cannot return 
malformed rows back to the caller in the same way as with constraint 
violations, since we cannot figure out (in general) which column 
corresponds to which type if there are extra or missing columns.


+  and same record formatting error is ignored.

I can get it, but it definitely should be reworded.

What about something like this?

+   
+ ERROR_LIMIT
+    
+ 
+  Enables ignoring of errored out rows up to limit_number. If limit_number is set
+  to -1, then all errors will be ignored.
+ 
+
+ 
+  Currently, only unique or exclusion constraint violations
+  and row formatting errors are ignored. Malformed
+  rows will raise warnings, while constraint-violating rows
+  will be returned back to the caller.
+ 
+
+    
+   

Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company





Re: pause recovery if pitr target not reached

2019-11-21 Thread Leif Gunnar Erlandsen
Adding another patch which is not only for recovery_target_time but also for 
xid, name and lsn.

> After studying this a bit more, I think the current behavior is totally bogus 
> and needs a serious
> rethink.
> 
> If you specify a recovery target and it is reached, recovery pauses 
> (depending on
> recovery_target_action).
> 
> If you specify a recovery target and it is not reached when the end of the 
> archive is reached
> (i.e., restore_command fails), then recovery ends and the server is promoted, 
> without any further
> information. This is clearly wrong in multiple ways.

Yes, that is why I have created the patch.

> 
> I think what we should do is if we specify a recovery target and we don't 
> reach it, we should
> ereport(FATAL). Somewhere around
> 
Whether recovery pauses or a FATAL error is reported is not important, as long
as it is possible to get some more WAL and continue recovery. Pausing has the
benefit of making it possible to inspect tables in the database.

> in StartupXLOG(), where we already check for other conditions that are 
> undesirable at the end of
> recovery. Then a user can make fixes either by getting more WAL files to 
> restore and adjusting the
> recovery target and starting again. I don't think pausing is the right 
> behavior, but perhaps an
> argument could be made to offer it as a nondefault behavior.

Pausing was chosen in the patch, as pausing is the expected behavior if the
target is reached.

And the patch does not interfere with any other functionality as far as I know.

--
Leif Gunnar Erlandsen


0002-pause-recovery-if-pitr-target-not-reached.patch
Description: Binary data


Re: pause recovery if pitr target not reached

2019-11-21 Thread Peter Eisentraut
After studying this a bit more, I think the current behavior is totally 
bogus and needs a serious rethink.


If you specify a recovery target and it is reached, recovery pauses 
(depending on recovery_target_action).


If you specify a recovery target and it is not reached when the end of 
the archive is reached (i.e., restore_command fails), then recovery ends 
and the server is promoted, without any further information.  This is 
clearly wrong in multiple ways.


I think what we should do is if we specify a recovery target and we 
don't reach it, we should ereport(FATAL).  Somewhere around


/*
 * end of main redo apply loop
 */

in StartupXLOG(), where we already check for other conditions that are 
undesirable at the end of recovery.  Then a user can make fixes either 
by getting more WAL files to restore and adjusting the recovery target 
and starting again.  I don't think pausing is the right behavior, but 
perhaps an argument could be made to offer it as a nondefault behavior.


There is an interesting overlap with the other thread that wants to make 
"end of archive" and explicitly settable recovery target.  The current 
behavior, however, is more like "recovery time (say) or end of archive, 
whichever happens first", which is not a behavior that is currently 
selectable or intended with other methods of recovery target 
specification.  Also, if you want the end of the archive as your 
recovery target, that currently does not respect the 
recovery_target_action setting, but perhaps it should.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Recovery performance of DROP DATABASE with many tablespaces

2019-11-21 Thread Fujii Masao
On Tue, Nov 19, 2019 at 3:39 PM k.jami...@fujitsu.com
 wrote:
>
> On Wed, Nov 13, 2019 5:34PM (GMT+9), Fujii Masao wrote:
> > On Wed, Nov 13, 2019 at 3:57 PM k.jami...@fujitsu.com 
> > 
> > wrote:
> > >
> > > On Wed, Oct. 2, 2019 5:40 PM, Fujii Masao wrote:
> > > > On Tue, Jul 10, 2018 at 3:04 PM Michael Paquier 
> > wrote:
> > > > >
> > > > > On Thu, Jul 05, 2018 at 01:42:20AM +0900, Fujii Masao wrote:
> > > > > > TBH, I have no numbers measured by the test.
> > > > > > One question about your test is; how did you measure the
> > > > > > *recovery
> > > > > > time* of DROP DATABASE? Since it's *recovery* performance,
> > > > > > basically it's not easy to measure that.
> > > > >
> > > > > It would be simple to measure the time it takes to replay this
> > > > > single DROP DATABASE record by putting two gettimeofday() calls or
> > > > > such things and then take the time difference.  There are many
> > > > > methods that you could use here, and I suppose that with a shared
> > > > > buffer setting of a couple of GBs of shared buffers you would see
> > > > > a measurable difference with a dozen of tablespaces or so.  You
> > > > > could also take a base backup after creating all the tablespaces,
> > > > > connect the standby and then drop the database on the primary to
> > > > > see the actual time it takes.  Your patch looks logically correct
> > > > > to me because DropDatabaseBuffers is a
> > > > > *bottleneck* with large shared_buffers, and it would be nice to
> > > > > see numbers.
> > > >
> > > > Thanks for the comment!
> > > >
> > > > I measured how long it takes to replay DROP DATABASE with 1000
> > > > tablespaces, in master and patched version. shared_buffers was set to
> > 16GB.
> > > >
> > > > [master]
> > > > It took 8 seconds to replay DROP DATABASE with 1000 tablespaces, as 
> > > > follows.
> > > > In this case, 16GB shared_buffers was fully scanned 1000 times.
> > > >
> > > > 2019-10-02 16:50:14 JST LOG:  redo starts at 0/228
> > > > 2019-10-02 16:50:22 JST LOG:  redo done at 0/300A298
> > > >
> > > > [patched]
> > > > It took less than 1 second to replay DROP DATABASE with 1000
> > > > tablespaces, as follows. In this case, 16GB shared_buffers was scanned
> > only one time.
> > > >
> > > > 2019-10-02 16:47:03 JST LOG:  redo starts at 0/228
> > > > 2019-10-02 16:47:03 JST LOG:  redo done at 0/3001588
> > > >
> > >
> > > Hi Fujii-san,
> > >
> > > It's been a while, so I checked the patch once again.
> > > It's fairly straightforward and I saw no problems nor bug in the code.
> >
> > Thanks for the review!
> >
> > > > [patched]
> > > > It took less than 1 second to replay DROP DATABASE with 1000
> > > > tablespaces,
> > > The results are good.
> > > I want to replicate the performance to confirm the results as well.
> > > Could you share how you measured the recovery replay?
> >
> > I forgot the actual steps that I used for the measurement.
> > But I think they are something like
> >
> > 1. create database "hoge"
> > 2. create 1,000 tablespaces
> > 3. create 1,000 tables on the database "hoge".
> > each table should be placed in different tablespace.
> > 4. take a base backup
> > 5. drop database "hoge"
> > 6. shutdown the server with immediate mode 7. start an archive recovery from
> > the backup taken at #4 8. measure how long it takes to apply DROP DATABASE
> > record by
> > checking the timestamp at REDO start and REDO end.
> >
> > I think that I performed the above steps on the master and the patched 
> > version.
> >
> > > Did you actually execute a failover?
> >
> > No.
>
> I'm sorry for the late reply, and thanks for the guide above.
> I replicated the same recovery test above on a standalone server
> and have confirmed with the logs that the patch made the recovery faster.
>
> [MASTER/UNPATCHED] ~10 seconds
> 2019-11-19 15:25:23.891 JST [23042] LOG:  redo starts at 0/180006A0
> ...
> 2019-11-19 15:25:34.492 JST [23042] LOG:  redo done at 0/1800A478
>
> [PATCHED]  ~less than 1 sec
> 2019-11-19 15:31:59.415 JST [17625] LOG:  redo starts at 0/40005B8
> ...
> 2019-11-19 15:32:00.159 JST [17625] CONTEXT:  WAL redo at 0/4000668 for 
> Database/DROP: dir 1663/16384 16385/16384...//further details omitted//...
> ...
> 2019-11-19 15:32:00.159 JST [17625] LOG:  redo done at 0/4001638
>
> I believe there are no problems, so I am marking this patch now
> as "Ready for Committer".

Thanks for the review! Committed.

Regards,

-- 
Fujii Masao




Re: Allow CREATE OR REPLACE VIEW to rename the columns

2019-11-21 Thread Fujii Masao
On Wed, Nov 20, 2019 at 1:11 PM btkimurayuzk
 wrote:
>
> > Barring any objection, I'm thinking to commit this patch.
> >
> > Regards,
>
> The build and all tests have passed.
> Looks good to me.

Thanks for reviewing the patch!
I committed the following two patches.

- Allow ALTER VIEW command to rename the column in the view.
- Improve tab-completion for ALTER MATERIALIZED VIEW.

Regards,

-- 
Fujii Masao




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-21 Thread Kyotaro Horiguchi
At Thu, 21 Nov 2019 16:01:07 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> > For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
> > smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
> > sophisticated about optimizing the shared buffers scan.  Commit 279628a
> > introduced that, in 2013.  I think smgrDoPendingSyncs() should do likewise, 
> > to
> Seems reasonable. Please wait a minute.

This is the first cut of that. It makes the function
FlushRelationBuffersWithoutRelcache, which was introduced in this work, useless.
The first patch reverts it, then the second patch adds the bulk sync feature.

The new function FlushRelFileNodesAllBuffers, unlike
DropRelFileNodesAllBuffers, takes SMgrRelation, which is required by
FlushBuffer(). So it takes a somewhat tricky approach, using the type
SMgrSortArray, a pointer to which is compatible with RelFileNode.
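
For illustration only, a tiny generic C sketch of the single-pass pattern
being reused here: scan all buffer tags once and binary-search a sorted array
of target file nodes, instead of scanning the whole pool once per relation.
The structures and names below are made up; this is not the actual bufmgr code:

#include <stdio.h>
#include <stdlib.h>

typedef struct FileKey
{
	unsigned	spc;			/* tablespace */
	unsigned	db;				/* database */
	unsigned	rel;			/* relfilenode */
} FileKey;

static int
filekey_cmp(const void *a, const void *b)
{
	const FileKey *x = a;
	const FileKey *y = b;

	if (x->spc != y->spc)
		return x->spc < y->spc ? -1 : 1;
	if (x->db != y->db)
		return x->db < y->db ? -1 : 1;
	if (x->rel != y->rel)
		return x->rel < y->rel ? -1 : 1;
	return 0;
}

/*
 * One pass over every buffer tag; each buffer does an O(log n) binary search
 * of the sorted targets, instead of the pool being scanned once per target
 * relation.
 */
static void
flush_targets(const FileKey *buf_tags, size_t nbuffers,
			  FileKey *targets, size_t ntargets)
{
	size_t		i;

	qsort(targets, ntargets, sizeof(FileKey), filekey_cmp);
	for (i = 0; i < nbuffers; i++)
	{
		if (bsearch(&buf_tags[i], targets, ntargets,
					sizeof(FileKey), filekey_cmp))
			printf("would flush buffer %zu\n", i);
	}
}

int
main(void)
{
	FileKey		pool[] = {{1663, 1, 100}, {1663, 1, 200}, {1664, 1, 100}};
	FileKey		want[] = {{1664, 1, 100}, {1663, 1, 200}};

	flush_targets(pool, 3, want, 2);
	return 0;
}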

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From c51b44734d88fb19b568c4c0240848c8be2b7cf4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 19:28:35 +0900
Subject: [PATCH 1/2] Revert FlushRelationBuffersWithoutRelcache.

Succeeding patch makes the function useless and the function is no
longer useful globally. Revert it.
---
 src/backend/storage/buffer/bufmgr.c | 27 ++-
 src/include/storage/bufmgr.h|  2 --
 2 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 746ce477fc..67bbb26cae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3203,27 +3203,20 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-	RelationOpenSmgr(rel);
-
-	FlushRelationBuffersWithoutRelcache(rel->rd_smgr,
-		RelationUsesLocalBuffers(rel));
-}
-
-void
-FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
-{
-	RelFileNode rnode = smgr->smgr_rnode.node;
-	int i;
+	int			i;
 	BufferDesc *bufHdr;
 
-	if (islocal)
+	/* Open rel at the smgr level if not already done */
+	RelationOpenSmgr(rel);
+
+	if (RelationUsesLocalBuffers(rel))
 	{
 		for (i = 0; i < NLocBuffer; i++)
 		{
 			uint32		buf_state;
 
 			bufHdr = GetLocalBufferDescriptor(i);
-			if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+			if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
 			{
@@ -3240,7 +3233,7 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
 
 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-smgrwrite(smgr,
+smgrwrite(rel->rd_smgr,
 		  bufHdr->tag.forkNum,
 		  bufHdr->tag.blockNum,
 		  localpage,
@@ -3270,18 +3263,18 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
 		 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
 		 * and saves some cycles.
 		 */
-		if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
+		if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
 			continue;
 
 		ReservePrivateRefCountEntry();
 
 		buf_state = LockBufHdr(bufHdr);
-		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+		if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
 			(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
 		{
 			PinBuffer_Locked(bufHdr);
 			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-			FlushBuffer(bufHdr, smgr);
+			FlushBuffer(bufHdr, rel->rd_smgr);
 			LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
 			UnpinBuffer(bufHdr, true);
 		}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 8097d5ab22..8cd1cf25d9 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -192,8 +192,6 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
-extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
-bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
    int nforks, BlockNumber *firstDelBlock);
-- 
2.23.0

>From 882731fcf063269d0bf85c57f23c83b9570e5df5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 19:33:18 +0900
Subject: [PATCH 2/2] Improve the performance of relation syncs.

We can improve performance of syncing multiple files at once in the
same way as b41669118. This reduces the number of scans on the whole
shared_buffers from the number of synced relations to one.
---
 src/backend/catalog/storage.c   |  28 +--
 src/backend/storage/buffer/bufmgr.c | 113 
 src/backend/storage/smgr/smgr.c |  38 +-
 src/include/storage/bufmgr.h|   1 +
 src/include/storage

Re: [proposal] recovery_target "latest"

2019-11-21 Thread Peter Eisentraut

On 2019-11-08 05:00, Grigory Smolkin wrote:

Attached is a new patch revision; now the end of available WAL is defined as
the fact of missing required WAL.
In the case of a standby, the end of WAL is defined as 2 consecutive switches
of WAL source that didn't provide the requested record.
In the case of a streaming standby, each switch of WAL source is forced after
3 failed attempts to get new data from the walreceiver.

All constants are taken off the top of my head and serve as a proof of
concept.


Well, this is now changing the meaning of the patch quite a bit.  I'm on 
board with making the existing default behavior explicit.  (This is 
similar to how we added recovery_target_timeline = 'current' in PG12.) 
Still not a fan of the name yet, but that's trivial.


If, however, you want to change the default behavior or introduce a new 
behavior, as you are suggesting here, that should be a separate discussion.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-11-21 Thread Amit Kapila
On Tue, Nov 19, 2019 at 5:25 PM Amit Kapila  wrote:
>
> On Sat, Nov 16, 2019 at 6:44 PM Amit Kapila  wrote:
> >
> > On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila  wrote:
> > >
> > > Some notes before commit:
> > > --
> > > 1.
> > > Commit message need to be changed for the first patch
> > > -
> > > A.
> > > > The memory limit is defined by a new logical_decoding_work_mem GUC, so 
> > > > for example we can do this
> > >
> > > SET logical_decoding_work_mem = '128kB'
> > >
> > > > to trigger very aggressive streaming. The minimum value is 64kB.
> > >
> > > I think this patch doesn't contain streaming, so we either need to
> > > reword it or remove it.
> > >
> > > B.
> > > > The logical_decoding_work_mem may be set either in postgresql.conf, in 
> > > > which case it serves as the default for all publishers on that 
> > > > instance, or when creating the
> > > > subscription, using a work_mem paramemter in the WITH clause (specifies 
> > > > number of kilobytes).
> > >
> > > We need to reword this as we have decided to remove the setting from
> > > the subscription side as of now.
> > >
> > > 2. I think we can change the message level in UpdateSpillStats() to 
> > > DEBUG2.
> > >
> >
> > I have made these modifications and additionally ran pgindent.
> >
> > > 4. I think we can combine both patches and commit as one patch, but it
> > > is okay to commit them separately as well.
> > >
> >
> > I am not sure if this is a good idea, so still kept them as separate.
> >
>
> I have committed the first patch.  I will commit the second one
> related to stats of spilled xacts on Thursday.  The second patch needs
> catalog version bump as well because we are modifying the catalog
> contents in that patch.
>

Committed the second one as well.  Now, we can move to a review of
patches for "streaming of in-progress transactions".

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Block level parallel vacuum

2019-11-21 Thread Amit Kapila
On Thu, Nov 21, 2019 at 3:25 PM Dilip Kumar  wrote:
>
> On Thu, 21 Nov 2019, 14:15 Masahiko Sawada,  
> wrote:
>>
>> On Thu, 21 Nov 2019 at 14:32, Dilip Kumar  wrote:
>> >
>> >
>> > In v33-0001-Add-index-AM-field-and-callback-for-parallel-ind patch,  I
>> > am a bit doubtful about this kind of arrangement, where the code in
>> > the "if" is always unreachable with the current AMs.  I am not sure
>> > what is the best way to handle this, should we just drop the
>> > amestimateparallelvacuum altogether?
>>
>> IIUC the motivation of amestimateparallelvacuum is third-party index
>> AMs. If such an AM allocates more memory than IndexBulkDeleteResult,
>> like the current gist indexes do (although we'll change that), it will
>> break the index statistics of other indexes or can even cause a crash.
>> I'm not sure there are such third-party index AMs, and it's true that
>> none of the index AMs in the postgres code will use this callback, as
>> you mentioned, but I think we need to take care of it because such
>> usage is still possible.
>>
>> > Because currently we are just providing a size estimate function
>> > without a copy function, even if in the future some AM gives an
>> > estimate of the size of its stats, we cannot directly memcpy the stats
>> > from local memory to shared memory; we might then also need a copy
>> > function from the AM so that it can flatten the stats and store them
>> > in the proper format?
>>
>> I might be missing something but why can't we copy the stats from the
>> local memory to the DSM without the callback for copying stats? The
>> lazy vacuum code will get the pointer of the stats that are allocated
>> by index AM and the code can know the size of it. So I think we can
>> just memcpy to DSM.
>
>
> Oh sure.  But what I meant is that an AM may keep pointers in its stats, as 
> GistBulkDeleteResult does, so we might not be able to copy it directly 
> outside the AM.  So I thought that if we have a callback for the copy, then 
> the AM can flatten the stats such that IndexBulkDeleteResult is followed by 
> the AM-specific stats.  Yeah, but someone may argue that we might force the 
> AM to return the stats in a form that can be memcpy'd directly.  So I think 
> I am fine with the way it is.
>

I think we have discussed this point earlier as well and the
conclusion was to provide an API if there is a need for the same.
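
To make the memcpy idea discussed above concrete, here is a minimal
hypothetical sketch (assuming the AM's stats are a flat, pointer-free
IndexBulkDeleteResult of known size; the function name is illustrative only):

/*
 * Hypothetical sketch only: copy an index's bulk-delete stats into its
 * slot in the DSM segment.  This works only while the stats are a flat
 * structure with no pointers to other palloc'd chunks (as noted above,
 * the current GistBulkDeleteResult would not qualify).
 */
static void
copy_index_stats_to_dsm(IndexBulkDeleteResult *local_stats,
                        IndexBulkDeleteResult *shared_slot)
{
    if (local_stats == NULL)
        return;                 /* the index AM had nothing to report */

    memcpy(shared_slot, local_stats, sizeof(IndexBulkDeleteResult));
}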

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: tableam vs. TOAST

2019-11-21 Thread Peter Eisentraut

On 2019-11-08 17:59, Robert Haas wrote:

OK. Could you see what you think of the attached patches? 0001 does
some refactoring of toast_fetch_datum() and toast_fetch_datum_slice()
to make them look more like each other and clean up a bunch of stuff
that I thought was annoying, and 0002 then pulls out the common logic
into a heap-specific function. If you like this direction, we could
then push the heap-specific function below tableam, but I haven't done
that yet.


Partial review: The 0001 patch seems very sensible.  Some minor comments 
on that:


Perhaps rename the residx variable (in both functions).  You have gotten 
rid of all the res* variables except that one.  That name as it is right 
now isn't very helpful at all.


You have collapsed the error messages for "chunk %d of %d" and "final 
chunk %d" and replaced it with just "chunk %d".  I think it might be 
better to keep the "chunk %d of %d" wording, for more context, or was 
there a reason why you wanted to remove the total count from the message?


I believe this assertion

+   Assert(endchunk <= totalchunks);

should be < (strictly less).
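
For reference, the arithmetic behind that (a sketch only; the variable names
are approximate, following the slice-fetching code):

    startchunk = sliceoffset / TOAST_MAX_CHUNK_SIZE;
    endchunk = (sliceoffset + slicelength - 1) / TOAST_MAX_CHUNK_SIZE;

    /* chunk numbers run from 0 to totalchunks - 1, hence strictly less */
    Assert(endchunk < totalchunks);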

In the commit message you state that this assertion replaces a run-time 
check, but I couldn't quite make out which one you are referring to 
because all the existing run-time checks are kept, with slightly 
refactored conditions.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: why doesn't optimizer can pull up where a > ( ... )

2019-11-21 Thread Tomas Vondra

On Thu, Nov 21, 2019 at 08:30:51AM +0800, Andy Fan wrote:



Hm.  That actually raises the stakes a great deal, because if that's
what you're expecting, it would require planning out both the transformed
and untransformed versions of the query before you could make a cost
comparison.



I don't know an official name, so let's call it "bloom filter push down
(BFPD)" for reference.  This algorithm may be helpful in this case with
some extra effort.

First, take "select ... from t1, t2 where t1.a = t2.a and t1.b = 100"
as an example, and assume t1 is scanned before t2, as in a hash
join/merge join, with t1's result as the inner table.

1.  it first scans t1 with the filter t1.b = 100;
2.  during the above scan, it builds a bloom filter *based on the join key
(t1.a) for the "selected" rows;*
3.  while scanning t2, it filters t2.a with the bloom filter;
4.  it probes the hash table with the filtered rows from the above step.
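
A rough sketch of that shape (entirely hypothetical helper functions, not
existing PostgreSQL code; they only show where the filter sits):

/* steps 1 and 2: scan t1 with t1.b = 100, build hash table and bloom filter */
while ((inner = next_qualifying_t1_row()) != NULL)
{
    uint32      h = hash_join_key(inner->a);

    hash_table_insert(hashtable, inner);
    bloom_add(filter, h);
}

/* steps 3 and 4: scan t2, skip rows the filter rules out, probe the rest */
while ((outer = next_t2_row()) != NULL)
{
    uint32      h = hash_join_key(outer->a);

    if (bloom_might_contain(filter, h))     /* false means definitely no match */
        hash_table_probe(hashtable, outer);
}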



So essentially just a hash join with a bloom filter? That doesn't seem
very relevant to this thread (at least I don't see any obvious link),
but note that this has been discussed in the past - see [1]. And in some
cases building a bloom filter did result in nice speedups, but in other
cases it was just an extra overhead. But it does not require change of
plan shape, unlike the optimization discussed here.

[1] 
https://www.postgresql.org/message-id/flat/5670946E.8070705%402ndquadrant.com

Ultimately there were discussions about pushing the bloom filter much
deeper on the non-hash side, but that was never implemented.


Back to this problem, if we have a chance to get the p_brand we are
interested in, we can use the same logic to group by only that p_brand.

Another option may be that we just keep the N versions, search them
differently, and compare their costs at the end.



Maybe. I think the problem is going to be that with multiple such
correlated queries you may significantly increase the number of plan
variants to consider - each subquery may be transformed or not, so the
space splits into 2. With 6 such subqueries you suddenly have 64x the
number of plan variants you have to consider (I don't think you can just
eliminate those early on).


 The Greenplum page mentions they also added "join-aggregates
reordering", in addition to subquery unnesting.

Thanks,  I will search more about this.


Having said that, the best form of criticism is a patch.  If somebody
actually wrote the code to do something like this, we could look at how
much time it wasted in which unsuccessful cases and then have an
informed discussion about whether it was worth adopting.



I would try to see how far I can get.


+1

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [HACKERS] Block level parallel vacuum

2019-11-21 Thread Dilip Kumar
On Thu, 21 Nov 2019, 14:15 Masahiko Sawada, 
wrote:

> On Thu, 21 Nov 2019 at 14:32, Dilip Kumar  wrote:
> >
> > On Thu, Nov 21, 2019 at 10:46 AM Dilip Kumar 
> wrote:
> > >
> > > On Wed, Nov 20, 2019 at 11:01 AM Masahiko Sawada
> > >  wrote:
> > > >
> > > > On Mon, 18 Nov 2019 at 15:38, Masahiko Sawada
> > > >  wrote:
> > > > >
> > > > > On Mon, 18 Nov 2019 at 15:34, Amit Kapila 
> wrote:
> > > > > >
> > > > > > On Mon, Nov 18, 2019 at 11:37 AM Masahiko Sawada
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Wed, 13 Nov 2019 at 14:31, Amit Kapila <
> amit.kapil...@gmail.com> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > Based on these needs, we came up with a way to allow users
> to specify
> > > > > > > > this information for IndexAm's. Basically, Indexam will
> expose a
> > > > > > > > variable amparallelvacuumoptions which can have below options
> > > > > > > >
> > > > > > > > VACUUM_OPTION_NO_PARALLEL   1 << 0 # vacuum (neither
> bulkdelete nor
> > > > > > > > vacuumcleanup) can't be performed in parallel
> > > > > > >
> > > > > > > I think VACUUM_OPTION_NO_PARALLEL can be 0 so that index AMs
> who don't
> > > > > > > want to support parallel vacuum don't have to set anything.
> > > > > > >
> > > > > >
> > > > > > make sense.
> > > > > >
> > > > > > > > VACUUM_OPTION_PARALLEL_BULKDEL   1 << 1 # bulkdelete can be
> done in
> > > > > > > > parallel (Indexes nbtree, hash, gin, gist, spgist, bloom
> will set this
> > > > > > > > flag)
> > > > > > > > VACUUM_OPTION_PARALLEL_COND_CLEANUP  1 << 2 # vacuumcleanup
> can be
> > > > > > > > done in parallel if bulkdelete is not performed (Indexes
> nbtree, brin,
> > > > > > > > gin, gist,
> > > > > > > > spgist, bloom will set this flag)
> > > > > > > > VACUUM_OPTION_PARALLEL_CLEANUP  1 << 3 # vacuumcleanup can
> be done in
> > > > > > > > parallel even if bulkdelete is already performed (Indexes
> gin, brin,
> > > > > > > > and bloom will set this flag)
> > > > > > >
> > > > > > > I think gin and bloom don't need to set both but should set
> only
> > > > > > > VACUUM_OPTION_PARALLEL_CLEANUP.
> > > > > > >
> > > > > > > And I'm going to disallow index AMs to set both
> > > > > > > VACUUM_OPTION_PARALLEL_COND_CLEANUP and
> VACUUM_OPTION_PARALLEL_CLEANUP
> > > > > > > by assertions, is that okay?
> > > > > > >
> > > > > >
> > > > > > Sounds reasonable to me.
> > > > > >
> > > > > > Are you planning to include the changes related to I/O throttling
> > > > > > based on the discussion in the nearby thread [1]?  I think you
> can do
> > > > > > that if you agree with the conclusion in the last email[1],
> otherwise,
> > > > > > we can explore it separately.
> > > > >
> > > > > Yes I agreed. I'm going to include that changes in the next version
> > > > > patches. And I think we will be able to do more discussion based on
> > > > > the patch.
> > > > >
> > > >
> > > > I've attached the latest version patch set. The patch set includes
> all
> > > > discussed points regarding index AM options as well as shared cost
> > > > balance. Also I added some test cases used all types of index AM.
> > > >
> > > > During developments I had one concern about the number of parallel
> > > > workers to launch. In current design each index AMs can choose the
> > > > participation of parallel bulk-deletion and parallel cleanup. That
> > > > also means the number of parallel worker to launch might be different
> > > > for each time of parallel bulk-deletion and parallel cleanup. In
> > > > current patch the leader will always launch the number of indexes
> that
> > > > support either one but it would not be efficient in some cases. For
> > > > example, if we have 3 indexes supporting only parallel bulk-deletion
> > > > and 2 indexes supporting only parallel index cleanup, we would launch
> > > > 5 workers for each execution but some workers will do nothing at all.
> > > > To deal with this problem, I wonder if we can improve the parallel
> > > > query so that the leader process creates a parallel context with the
> > > > maximum number of indexes and can launch a part of workers instead of
> > > > all of them.
> > > >
> > > +
> > > + /* compute new balance by adding the local value */
> > > + shared_balance = pg_atomic_read_u32(VacuumSharedCostBalance);
> > > + new_balance = shared_balance + VacuumCostBalance;
> > >
> > > + /* also compute the total local balance */
> > > + local_balance = VacuumCostBalanceLocal + VacuumCostBalance;
> > > +
> > > + if ((new_balance >= VacuumCostLimit) &&
> > > + (local_balance > 0.5 * (VacuumCostLimit / nworkers)))
> > > + {
> > > + /* compute sleep time based on the local cost balance */
> > > + msec = VacuumCostDelay * VacuumCostBalanceLocal / VacuumCostLimit;
> > > + new_balance = shared_balance - VacuumCostBalanceLocal;
> > > + VacuumCostBalanceLocal = 0;
> > > + }
> > > +
> > > + if (pg_atomic_compare_exchange_u32(VacuumSharedCostBalance,
> > > +&shared_balance,
> > > +new_balance))
> > > + {
> > > + /* Updated successfully, break *

Re: CVE-2017-7484-induced bugs, or, btree cmp functions are not leakproof?

2019-11-21 Thread Amit Langote
Thanks for the review.

On Thu, Nov 21, 2019 at 6:34 AM Tom Lane  wrote:
>
> Amit Langote  writes:
> > [ v6-0001-Use-root-parent-s-permissions-when-reading-child-.patch ]
>
> I started to review this, and discovered that the new regression test
> passes just fine without applying any of the rest of the patch.
> Usually we try to design regression test additions so that they
> demonstrate that the new code does something different, so this seems
> a bit odd.  Can't we set up the test to fail with unpatched code?

Hmm, the test case *used to* fail without the code fix back in
September.  That is no longer the case, in short because text_ge() got
marked leakproof since then.

Anyway, I have modified the test case such that it now fails without
the code fix, but I don't have enough faith that it's robust enough.
:(

> Also, the test case contains no expression index, so I can't see how
> it'd provide any code coverage for the code added in examine_variable.

Added a test case involving an expression index, which helped spot a
problem with the code added in examine_variable, which I fixed too.

> The comment for inh_root_relid seems rather inadequate, since it
> fails to mention the special case for UNION ALL subqueries.
> But do we even need that special case?  It looks to me like the
> walk-up-to-parent code is defending against such cases by checking
> relkind, so maybe we don't need to throw away info for UNION ALL.
> In general, if we're going to add inh_root_relid, I'd like its
> definition to be as simple and consistent as possible, because
> I'm sure there will be other uses for it.  If it could be something
> like "baserel that this otherrel is a child of", full stop,
> I think that'd be good.

If inh_root_relid meant that, it would no longer be useful to
examine_variable.  In examine_variable, we need to map a child table's
relid to the relid of its root parent table.  If the root parent
itself is under a UNION ALL subquery parent, then inh_root_relid of
all relations in that ancestry chain would point to the UNION ALL
subquery parent, which is not what examine_variable would want to use,
because it's really looking for the root "table".

In examine_simple_variable however, we need to not just map the child
relid to root table relid, but also convert the Var, so we have to
traverse the ancestry chain via AppendRelInfos.  So, we don't really
need inh_root_relid there, because we can calculate that as we're
traversing the AppendRelInfo chain.

I have expanded the comment for inh_root_relid a bit.

> I don't especially like the logic in examine_simple_variable,
> because it walks back up the AppendRelInfo chain but then proceeds
> to use
> rte = planner_rt_fetch(rel->inh_root_relid, root);
> without any sort of cross-check that it's stopped at that relation
> and not some other one.  It'd be better to keep track of the top
> parent_relid while walking up, and use that.  Or else make the
> loop stop condition be reaching the matching relid.

I've added an Assert to cross-check that the AppendRelInfo traversal
loop stops once it has computed a Var matching inh_root_relid.

Attached updated patch.

Thanks,
Amit


v7-0001-Use-root-parent-s-permissions-when-reading-child-.patch
Description: Binary data


Re: fix "Success" error messages

2019-11-21 Thread Peter Eisentraut

On 2019-11-21 02:42, TAKATSUKA Haruka wrote:

  FATAL: could not access status of transaction ..
  DETAIL: Could not read from file (pg_clog/ or pg_xact/) ...: Success.

This error caused the server to fail to start during recovery.
I got a report that it happened repeatedly on a newly generated
standby cluster. I advised them to check the low-level server
environment.

However, in addition to improving the message, should we retry reading
the rest of the data in the case where too few bytes were read?
What about a limited number of retries instead of a complete loop?
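
Something like the following bounded retry is what I have in mind (a
hypothetical sketch only, not an actual patch):

/*
 * Hypothetical sketch: retry a short read a limited number of times.
 * Note that a short read does not set errno, which is exactly why the
 * current message ends up saying "Success".
 */
static bool
read_fully(int fd, char *buf, size_t len, off_t offset)
{
    int         i;

    for (i = 0; i < 3; i++)
    {
        ssize_t     r = pread(fd, buf, len, offset);

        if (r < 0)
            return false;           /* real error, errno is set */
        if ((size_t) r == len)
            return true;            /* complete read */
        /* short read: try again a limited number of times */
    }

    return false;                   /* still short after the retries */
}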


If we thought that would help, there are probably hundreds or more other 
places where we read files that would need to be fixed up in the same 
way.  That doesn't seem reasonable.


Also, it is my understanding that short reads can in practice only 
happen if the underlying storage is having a serious problem, so 
retrying wouldn't actually help much.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Why overhead of SPI is so large?

2019-11-21 Thread Konstantin Knizhnik



I've set the CF entry to "Waiting on Author" pending a new patch
that does it like that.


With contain_mutable_functions the patch becomes trivial.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index a4697dc..55ed878 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -30,6 +30,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/clauses.h"
 #include "optimizer/optimizer.h"
 #include "parser/parse_coerce.h"
 #include "parser/parse_type.h"
@@ -6157,7 +6158,7 @@ exec_eval_simple_expr(PLpgSQL_execstate *estate,
 	 * updates made so far by our own function.
 	 */
 	oldcontext = MemoryContextSwitchTo(get_eval_mcontext(estate));
-	if (!estate->readonly_func)
+	if (!estate->readonly_func && expr->expr_needs_snapshot)
 	{
 		CommandCounterIncrement();
 		PushActiveSnapshot(GetTransactionSnapshot());
@@ -6182,7 +6183,7 @@ exec_eval_simple_expr(PLpgSQL_execstate *estate,
 
 	estate->paramLI->parserSetupArg = save_setup_arg;
 
-	if (!estate->readonly_func)
+	if (!estate->readonly_func && expr->expr_needs_snapshot)
 		PopActiveSnapshot();
 
 	MemoryContextSwitchTo(oldcontext);
@@ -8046,6 +8047,7 @@ exec_save_simple_expr(PLpgSQL_expr *expr, CachedPlan *cplan)
 	 * current transaction".
 	 */
 	expr->expr_simple_expr = tle_expr;
+	expr->expr_needs_snapshot = contain_mutable_functions((Node*)tle_expr);
 	expr->expr_simple_generation = cplan->generation;
 	expr->expr_simple_state = NULL;
 	expr->expr_simple_in_use = false;
diff --git a/src/pl/plpgsql/src/plpgsql.h b/src/pl/plpgsql/src/plpgsql.h
index f66b2ba..454131f 100644
--- a/src/pl/plpgsql/src/plpgsql.h
+++ b/src/pl/plpgsql/src/plpgsql.h
@@ -243,6 +243,7 @@ typedef struct PLpgSQL_expr
 	 */
 	ExprState  *expr_simple_state;	/* eval tree for expr_simple_expr */
 	bool		expr_simple_in_use; /* true if eval tree is active */
+	bool		expr_needs_snapshot; /* true if simple expression calls non-immutable functions or performs subqueries */
 	LocalTransactionId expr_simple_lxid;
 } PLpgSQL_expr;
 


Re: backup manifests

2019-11-21 Thread Jeevan Chalke
On Wed, Nov 20, 2019 at 11:05 AM Suraj Kharage <
suraj.khar...@enterprisedb.com> wrote:

> Hi,
>
> Since now we are generating the backup manifest file with each backup, it
> provides us an option to validate the given backup.
> Let's say, we have taken a backup and after a few days, we want to check
> whether that backup is validated or corruption-free without restarting the
> server.
>
> Please find attached POC patch for same which will be based on the latest
> backup manifest patch from Rushabh. With this functionality, we add new
> option to pg_basebackup, something like --verify-backup.
> So, the syntax would be:
> ./bin/pg_basebackup --verify-backup -D 
>
> Basically, we read the backup_manifest file line by line from the given
> directory path and build the hash table, then scan the directory and
> compare each file with the hash entry.
>
> Thoughts/suggestions?
>


I like the idea of verifying the backup once we have backup_manifest with us.
Periodically verifying the already taken backup with this simple tool
becomes easy now.

I have reviewed this patch and here are my comments:

1.
@@ -30,7 +30,9 @@
 #include "common/file_perm.h"
 #include "common/file_utils.h"
 #include "common/logging.h"
+#include "common/sha2.h"
 #include "common/string.h"
+#include "fe_utils/simple_list.h"
 #include "fe_utils/recovery_gen.h"
 #include "fe_utils/string_utils.h"
 #include "getopt_long.h"
@@ -38,12 +40,19 @@
 #include "pgtar.h"
 #include "pgtime.h"
 #include "pqexpbuffer.h"
+#include "pgrhash.h"
 #include "receivelog.h"
 #include "replication/basebackup.h"
 #include "streamutil.h"

Please add new files in order.

2.
Can hash related file names be renamed to backuphash.c and backuphash.h?

3.
Need indentation adjustments at various places.

4.
+    char        buf[100];  // 1MB chunk

It would be good to use a multiple of the block/page size (or at least a
power-of-2 number).

5.
+typedef struct pgrhash_entry
+{
+    struct pgrhash_entry *next; /* link to next entry in same bucket */
+    DataDirectoryFileInfo *record;
+} pgrhash_entry;
+
+struct pgrhash
+{
+    unsigned        nbuckets;   /* number of buckets */
+    pgrhash_entry **bucket;     /* pointer to hash entries */
+};
+
+typedef struct pgrhash pgrhash;

These two can be moved to the .h file instead of being redefined over there.

6.
+/*
+ * TODO: this function is not necessary, can be removed.
+ * Test whether the given row number is match for the supplied keys.
+ */
+static bool
+pgrhash_compare(char *bt_filename, char *filename)

Yeah, it can be removed by doing strcmp() at the required places rather than
doing it in a separate function.

7.
mdate is not compared anywhere. I understand that it can't be compared
between the file in the backup directory and its entry in the manifest, as
the manifest entry records the mtime of the file on the server, whereas the
same file in the backup will have a different mtime. But adding a few
comments there would be good.

8.
+    char        mdate[24];

should be mtime instead?


Thanks

-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] Block level parallel vacuum

2019-11-21 Thread Dilip Kumar
On Thu, 21 Nov 2019, 13:52 Masahiko Sawada, 
wrote:

> On Thu, 21 Nov 2019 at 14:16, Dilip Kumar  wrote:
> >
> > On Wed, Nov 20, 2019 at 11:01 AM Masahiko Sawada
> >  wrote:
> > >
> > > On Mon, 18 Nov 2019 at 15:38, Masahiko Sawada
> > >  wrote:
> > > >
> > > > On Mon, 18 Nov 2019 at 15:34, Amit Kapila 
> wrote:
> > > > >
> > > > > On Mon, Nov 18, 2019 at 11:37 AM Masahiko Sawada
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, 13 Nov 2019 at 14:31, Amit Kapila <
> amit.kapil...@gmail.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > > Based on these needs, we came up with a way to allow users to
> specify
> > > > > > > this information for IndexAm's. Basically, Indexam will expose
> a
> > > > > > > variable amparallelvacuumoptions which can have below options
> > > > > > >
> > > > > > > VACUUM_OPTION_NO_PARALLEL   1 << 0 # vacuum (neither
> bulkdelete nor
> > > > > > > vacuumcleanup) can't be performed in parallel
> > > > > >
> > > > > > I think VACUUM_OPTION_NO_PARALLEL can be 0 so that index AMs who
> don't
> > > > > > want to support parallel vacuum don't have to set anything.
> > > > > >
> > > > >
> > > > > make sense.
> > > > >
> > > > > > > VACUUM_OPTION_PARALLEL_BULKDEL   1 << 1 # bulkdelete can be
> done in
> > > > > > > parallel (Indexes nbtree, hash, gin, gist, spgist, bloom will
> set this
> > > > > > > flag)
> > > > > > > VACUUM_OPTION_PARALLEL_COND_CLEANUP  1 << 2 # vacuumcleanup
> can be
> > > > > > > done in parallel if bulkdelete is not performed (Indexes
> nbtree, brin,
> > > > > > > gin, gist,
> > > > > > > spgist, bloom will set this flag)
> > > > > > > VACUUM_OPTION_PARALLEL_CLEANUP  1 << 3 # vacuumcleanup can be
> done in
> > > > > > > parallel even if bulkdelete is already performed (Indexes gin,
> brin,
> > > > > > > and bloom will set this flag)
> > > > > >
> > > > > > I think gin and bloom don't need to set both but should set only
> > > > > > VACUUM_OPTION_PARALLEL_CLEANUP.
> > > > > >
> > > > > > And I'm going to disallow index AMs to set both
> > > > > > VACUUM_OPTION_PARALLEL_COND_CLEANUP and
> VACUUM_OPTION_PARALLEL_CLEANUP
> > > > > > by assertions, is that okay?
> > > > > >
> > > > >
> > > > > Sounds reasonable to me.
> > > > >
> > > > > Are you planning to include the changes related to I/O throttling
> > > > > based on the discussion in the nearby thread [1]?  I think you can
> do
> > > > > that if you agree with the conclusion in the last email[1],
> otherwise,
> > > > > we can explore it separately.
> > > >
> > > > Yes I agreed. I'm going to include that changes in the next version
> > > > patches. And I think we will be able to do more discussion based on
> > > > the patch.
> > > >
> > >
> > > I've attached the latest version patch set. The patch set includes all
> > > discussed points regarding index AM options as well as shared cost
> > > balance. Also I added some test cases used all types of index AM.
> > >
> > > During developments I had one concern about the number of parallel
> > > workers to launch. In current design each index AMs can choose the
> > > participation of parallel bulk-deletion and parallel cleanup. That
> > > also means the number of parallel worker to launch might be different
> > > for each time of parallel bulk-deletion and parallel cleanup. In
> > > current patch the leader will always launch the number of indexes that
> > > support either one but it would not be efficient in some cases. For
> > > example, if we have 3 indexes supporting only parallel bulk-deletion
> > > and 2 indexes supporting only parallel index cleanup, we would launch
> > > 5 workers for each execution but some workers will do nothing at all.
> > > To deal with this problem, I wonder if we can improve the parallel
> > > query so that the leader process creates a parallel context with the
> > > maximum number of indexes and can launch a part of workers instead of
> > > all of them.
> > >
> > +
> > + /* compute new balance by adding the local value */
> > + shared_balance = pg_atomic_read_u32(VacuumSharedCostBalance);
> > + new_balance = shared_balance + VacuumCostBalance;
> >
> > + /* also compute the total local balance */
> > + local_balance = VacuumCostBalanceLocal + VacuumCostBalance;
> > +
> > + if ((new_balance >= VacuumCostLimit) &&
> > + (local_balance > 0.5 * (VacuumCostLimit / nworkers)))
> > + {
> > + /* compute sleep time based on the local cost balance */
> > + msec = VacuumCostDelay * VacuumCostBalanceLocal / VacuumCostLimit;
> > + new_balance = shared_balance - VacuumCostBalanceLocal;
> > + VacuumCostBalanceLocal = 0;
> > + }
> > +
> > + if (pg_atomic_compare_exchange_u32(VacuumSharedCostBalance,
> > +&shared_balance,
> > +new_balance))
> > + {
> > + /* Updated successfully, break */
> > + break;
> > + }
> > While looking at the shared costing delay part, I have noticed that
> > while checking the delay condition, we are considering local_balance
> > which is VacuumCostBalanceLocal + VacuumCostBalance, but while
> > computing the new balance we o

Re: range_agg

2019-11-21 Thread Pavel Stehule
st 20. 11. 2019 v 20:32 odesílatel Paul A Jungwirth <
p...@illuminatedcomputing.com> napsal:

> On Tue, Nov 19, 2019 at 9:49 PM Paul A Jungwirth
>  wrote:
> >
> > On Tue, Nov 19, 2019 at 1:17 AM Pavel Stehule 
> wrote:
> > > Hi
> > > I tested last patches. I found some issues
> >
> > Thank you for the review!
>
> Here is an updated patch series fixing the problems from that last
> review. I would still like some direction about the doc formatting:
>
>
yes, these bugs are fixed

there are not compilation's issues
tests passed
doc is ok

I have notes

1. the chapter should be renamed from "Range Functions and Operators" to
"Range and Multirange Functions and Operators"

But right now the doc is not very readable - it is not clear which functions
are for the range type, which are for multirange, and which are for both.

2. I don't like the introduction of "safe" operators - now the basic
operators are doubled, and nobody will use the @* operators without reading
the documentation.

It is not intuitive. I think it is better to map this functionality to the
basic operators +, -, * and implement it just for the pairs (Multirange,
Multirange) and (Multirange, Range) if that is possible.

It's the same relationship as numeric vs. integer. New operators should not
be introduced. If somebody needs it for ranges, they can cast to multirange
and continue.

The "safe" operators can be implemented in user space - but they should not
be the default solution.

3. There are no prepared casts -

postgres=# select int8range(10,15)::int8multirange;
ERROR:  cannot cast type int8range to int8multirange
LINE 1: select int8range(10,15)::int8multirange;
   ^
There should be either a) a fully generic solution, or b) the possibility to
build an implicit cast when any multirange type is created.

Regards

Pavel




> > > I am not sure how much it is correct to use  class="monospaced"> in the doc. It is used for ranges and multiranges, but not
> in other places
> >
> > I could use some advice here. Many operators seem best presented in
> > groups of four, where only their parameter types change, for example:
> >
> > int8range < int8range
> > int8range < int8multirange
> > int8multirange < int8range
> > int8multirange < int8multirange
> >
> > All I really want is to show those separated by line breaks. I
> > couldn't find any other examples of that happening inside a table cell
> > though (i.e. inside ). What is the best way
> > to do that?
>

Personally, I think it should be cleaned up, mainly if there are no visible
differences. But the range-related doc uses it, so it is consistent with
that. So this is not a big issue.



> Thanks,
> Paul
>


Re: backup manifests

2019-11-21 Thread Jeevan Chalke
On Tue, Nov 19, 2019 at 3:30 PM Rushabh Lathia 
wrote:

>
>
> My colleague Suraj did testing and noticed the performance impact
> of the checksums.  On further testing, he found that SHA specifically
> has a larger performance impact.
>
> Please find below statistics:
>
> (timings are real/user/sys; % is the performance overhead vs. no checksum)
>
> no of tables: 10 (100 MB in each table)
>     without checksum:  real 0m10.957s  user 0m0.367s   sys 0m2.275s
>     SHA-256 checksum:  real 0m16.816s  user 0m0.210s   sys 0m2.067s   (53% overhead)
>     md5 checksum:      real 0m11.895s  user 0m0.174s   sys 0m1.725s   (8% overhead)
>     CRC checksum:      real 0m11.136s  user 0m0.365s   sys 0m2.298s   (2% overhead)
>
> no of tables: 20 (100 MB in each table)
>     without checksum:  real 0m20.610s  user 0m0.484s   sys 0m3.198s
>     SHA-256 checksum:  real 0m31.745s  user 0m0.569s   sys 0m4.089s   (54% overhead)
>     md5 checksum:      real 0m22.717s  user 0m0.638s   sys 0m4.026s   (10% overhead)
>     CRC checksum:      real 0m21.075s  user 0m0.538s   sys 0m3.417s   (2% overhead)
>
> no of tables: 50 (100 MB in each table)
>     without checksum:  real 0m49.143s  user 0m1.646s   sys 0m8.499s
>     SHA-256 checksum:  real 1m13.683s  user 0m1.305s   sys 0m10.541s  (50% overhead)
>     md5 checksum:      real 0m51.856s  user 0m0.932s   sys 0m7.702s   (6% overhead)
>     CRC checksum:      real 0m49.689s  user 0m1.028s   sys 0m6.921s   (1% overhead)
>
> no of tables: 100 (100 MB in each table)
>     without checksum:  real 1m34.308s  user 0m2.265s   sys 0m14.717s
>     SHA-256 checksum:  real 2m22.403s  user 0m2.613s   sys 0m20.776s  (51% overhead)
>     md5 checksum:      real 1m41.524s  user 0m2.158s   sys 0m15.949s  (8% overhead)
>     CRC checksum:      real 1m35.045s  user 0m2.061s   sys 0m16.308s  (1% overhead)
>
> no of tables: 100 (1 GB in each table)
>     without checksum:  real 17m18.336s  user 0m20.222s  sys 3m12.960s
>     SHA-256 checksum:  real 24m45.942s  user 0m26.911s  sys 3m33.501s  (43% overhead)
>     md5 checksum:      real 17m41.670s  user 0m26.506s  sys 3m18.402s  (2% overhead)
>     CRC checksum:      real 17m22.296s  user 0m26.811s  sys 3m56.653s  (approx. 0.5% overhead;
>                        sometimes this test completes within the same time as without checksum)
>
>
> Considering the above results, I modified Robert's earlier patch and added
> a "manifest_with_checksums" option to pg_basebackup.  With the new patch,
> checksums are disabled by default and are only enabled when the
> "manifest_with_checksums" option is provided.  Also re-based the whole
> patch set.
>

Review comments on 0004:

1.
I don't think we need the o_manifest_with_checksums variable;
manifest_with_checksums can be used instead.

2.
We need to document this new option for pg_basebackup and basebackup.

3.
Also, instead of keeping manifest_with_checksums as a global variable, we
should pass it to the required functions. Patch 0002 already modified the
signatures of all relevant functions anyway, so we just need to add one more
bool parameter there.

4.
Why do we need "File" at the start of each entry when we are adding files
only?  I wonder if we also need to provide a tablespace name and a directory
marker, so that we have "Tablespace" and "Dir" at the start.

5.
Even if I don't provide the manifest-with-checksums option, I see that a
checksum is calculated for the backup_manifest file itself. Is that
intentional, or an oversight?  I think we should omit that too if this
option is not provided.

6.
Is it possible to get only the backup manifest from the server?  A client
like pg_basebackup could then read it and use it to fetch the files.

Thanks


>
>
>
> Regards,
>
> --
> Rushabh Lathia
> www.EnterpriseDB.com
>
> On Tue, Oct 1, 2019 at 5:43 PM Robert Haas  wrote:
>
>> On Mon, Sep 30, 2019 at 5:31 AM Jeevan Chalke
>>  wrote:
>> > Entry for directory is not added in manifest. So it might be difficult
>> > at client to get to know about the directories. Will it be good to add
>> > an entry for each directory too? May be like:
>> > Dir 
>>
>> Well, what kind of corruption would this allow us to detect that we
>> can't detect as things stand? I think the only case is an empty
>> directory. If it's not empty, we'd have some entries for the files in
>> that directory, and those files won't be able to exist unless the
>> directory does. But, how would we end up backing up an empty
>> directory, anyway?
>>
>> I don't really *mind* adding directories into the manifest, but I'm
>> not sure how much it helps.
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>>
>>
>
> --
> Rushabh Lathia
>


-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: pg_upgrade fails with non-standard ACL

2019-11-21 Thread Michael Paquier
On Fri, Nov 15, 2019 at 11:30:02AM +0300, Grigory Smolkin wrote:
> On 11/9/19 5:26 AM, Michael Paquier wrote:
>> Another question I have: do we need to care more about other extra
>> ACLs applied to other object types?  For example a subset of columns
>> on a table with a column being renamed could be an issue.  Procedure
>> renamed in core are not that common still we did it.
> 
> I think that all objects must be supported.

The unfortunate part about the current approach is that it is not
really scalable from the point of view of the client.  What the patch
does is to compare the initdb-time ACLs and the ones stored in
pg_proc.  In order to support all object types we would need to have
more client-side logic to join properly with all the catalogs holding
the ACLs of the objects to be compared.  I am wondering if it would be
more simple to invent a backend function which uses an input similar
to pg_describe_object() with (classid, objid and objsubid) that
returns the ACL of the corresponding object after looking at the
catalog defined by classid.  This would simplify the client part to
just scan pg_init_privs...
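
To sketch the shape of such a function (hypothetical name and code, just to
illustrate the idea; column ACLs with objsubid != 0 would need extra handling
through pg_attribute):

Datum
pg_get_object_acl(PG_FUNCTION_ARGS)
{
    Oid         classid = PG_GETARG_OID(0);
    Oid         objid = PG_GETARG_OID(1);
    /* objsubid (arg 2) ignored in this sketch */
    AttrNumber  aclattnum = get_object_attnum_acl(classid);
    Relation    catalog;
    HeapTuple   tup;
    Datum       acl = (Datum) 0;
    bool        isnull = true;

    if (aclattnum == InvalidAttrNumber)
        PG_RETURN_NULL();       /* this object type has no ACL column */

    catalog = table_open(classid, AccessShareLock);
    tup = get_catalog_object_by_oid(catalog,
                                    get_object_attnum_oid(classid), objid);
    if (HeapTupleIsValid(tup))
        acl = heap_getattr(tup, aclattnum, RelationGetDescr(catalog), &isnull);
    if (!isnull)
        acl = PointerGetDatum(DatumGetAclPCopy(acl));   /* copy before closing */
    table_close(catalog, AccessShareLock);

    if (isnull)
        PG_RETURN_NULL();
    PG_RETURN_DATUM(acl);
}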

>>   So I think that it would be better
>> for now to get to a point where we can warn about the function
>> signatures involved in a given database, without the generation of the
>> script with those REVOKE queries.  Or would folks prefer keep that in
>> the first version?  My take would be to handle both separately, and to
>> warn about everything so as there is no need to do pg_upgrade --check
>> more than once.
> 
> I would prefer to warn about every function (so he can more quickly assess
> the situation) AND generate script. It is good to save some user time,
> because he is going to need that script anyway.

Not arguing against the fact that it is useful, but I'd think that it
is a two-step process, where we need to understand what logic needs to
be in the backend or some frontend:
1) Warn properly about the objects involved, where the object
description returned by pg_describe_object would be fine enough to
understand what's broken in a given database.
2) Generate a script which may be used by the end-user.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] Block level parallel vacuum

2019-11-21 Thread Masahiko Sawada
On Thu, 21 Nov 2019 at 14:32, Dilip Kumar  wrote:
>
> On Thu, Nov 21, 2019 at 10:46 AM Dilip Kumar  wrote:
> >
> > On Wed, Nov 20, 2019 at 11:01 AM Masahiko Sawada
> >  wrote:
> > >
> > > On Mon, 18 Nov 2019 at 15:38, Masahiko Sawada
> > >  wrote:
> > > >
> > > > On Mon, 18 Nov 2019 at 15:34, Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Mon, Nov 18, 2019 at 11:37 AM Masahiko Sawada
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, 13 Nov 2019 at 14:31, Amit Kapila  
> > > > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > Based on these needs, we came up with a way to allow users to 
> > > > > > > specify
> > > > > > > this information for IndexAm's. Basically, Indexam will expose a
> > > > > > > variable amparallelvacuumoptions which can have below options
> > > > > > >
> > > > > > > VACUUM_OPTION_NO_PARALLEL   1 << 0 # vacuum (neither bulkdelete 
> > > > > > > nor
> > > > > > > vacuumcleanup) can't be performed in parallel
> > > > > >
> > > > > > I think VACUUM_OPTION_NO_PARALLEL can be 0 so that index AMs who 
> > > > > > don't
> > > > > > want to support parallel vacuum don't have to set anything.
> > > > > >
> > > > >
> > > > > make sense.
> > > > >
> > > > > > > VACUUM_OPTION_PARALLEL_BULKDEL   1 << 1 # bulkdelete can be done 
> > > > > > > in
> > > > > > > parallel (Indexes nbtree, hash, gin, gist, spgist, bloom will set 
> > > > > > > this
> > > > > > > flag)
> > > > > > > VACUUM_OPTION_PARALLEL_COND_CLEANUP  1 << 2 # vacuumcleanup can be
> > > > > > > done in parallel if bulkdelete is not performed (Indexes nbtree, 
> > > > > > > brin,
> > > > > > > gin, gist,
> > > > > > > spgist, bloom will set this flag)
> > > > > > > VACUUM_OPTION_PARALLEL_CLEANUP  1 << 3 # vacuumcleanup can be 
> > > > > > > done in
> > > > > > > parallel even if bulkdelete is already performed (Indexes gin, 
> > > > > > > brin,
> > > > > > > and bloom will set this flag)
> > > > > >
> > > > > > I think gin and bloom don't need to set both but should set only
> > > > > > VACUUM_OPTION_PARALLEL_CLEANUP.
> > > > > >
> > > > > > And I'm going to disallow index AMs to set both
> > > > > > VACUUM_OPTION_PARALLEL_COND_CLEANUP and 
> > > > > > VACUUM_OPTION_PARALLEL_CLEANUP
> > > > > > by assertions, is that okay?
> > > > > >
> > > > >
> > > > > Sounds reasonable to me.
> > > > >
> > > > > Are you planning to include the changes related to I/O throttling
> > > > > based on the discussion in the nearby thread [1]?  I think you can do
> > > > > that if you agree with the conclusion in the last email[1], otherwise,
> > > > > we can explore it separately.
> > > >
> > > > Yes I agreed. I'm going to include that changes in the next version
> > > > patches. And I think we will be able to do more discussion based on
> > > > the patch.
> > > >
> > >
> > > I've attached the latest version patch set. The patch set includes all
> > > discussed points regarding index AM options as well as shared cost
> > > balance. Also I added some test cases used all types of index AM.
> > >
> > > During developments I had one concern about the number of parallel
> > > workers to launch. In current design each index AMs can choose the
> > > participation of parallel bulk-deletion and parallel cleanup. That
> > > also means the number of parallel worker to launch might be different
> > > for each time of parallel bulk-deletion and parallel cleanup. In
> > > current patch the leader will always launch the number of indexes that
> > > support either one but it would not be efficient in some cases. For
> > > example, if we have 3 indexes supporting only parallel bulk-deletion
> > > and 2 indexes supporting only parallel index cleanup, we would launch
> > > 5 workers for each execution but some workers will do nothing at all.
> > > To deal with this problem, I wonder if we can improve the parallel
> > > query so that the leader process creates a parallel context with the
> > > maximum number of indexes and can launch a part of workers instead of
> > > all of them.
> > >
> > +
> > + /* compute new balance by adding the local value */
> > + shared_balance = pg_atomic_read_u32(VacuumSharedCostBalance);
> > + new_balance = shared_balance + VacuumCostBalance;
> >
> > + /* also compute the total local balance */
> > + local_balance = VacuumCostBalanceLocal + VacuumCostBalance;
> > +
> > + if ((new_balance >= VacuumCostLimit) &&
> > + (local_balance > 0.5 * (VacuumCostLimit / nworkers)))
> > + {
> > + /* compute sleep time based on the local cost balance */
> > + msec = VacuumCostDelay * VacuumCostBalanceLocal / VacuumCostLimit;
> > + new_balance = shared_balance - VacuumCostBalanceLocal;
> > + VacuumCostBalanceLocal = 0;
> > + }
> > +
> > + if (pg_atomic_compare_exchange_u32(VacuumSharedCostBalance,
> > +&shared_balance,
> > +new_balance))
> > + {
> > + /* Updated successfully, break */
> > + break;
> > + }
> > While looking at the shared costing delay part, I have noticed that
> > while checking the delay condition, we are considering local_balance
> 

Re: Ordering of header file inclusion

2019-11-21 Thread Amit Kapila
On Sat, Nov 16, 2019 at 7:01 AM vignesh C  wrote:
>
> On Tue, Nov 12, 2019 at 11:19 AM Amit Kapila  wrote:
> >
> > On Tue, Nov 12, 2019 at 6:33 AM vignesh C  wrote:
> > >
> > >
> > > Thanks Amit for your comments. Please find the updated patch which
> > > does not include the changes mentioned above.
> > >
> >
> > Thanks for working on this.  I have pushed your latest patch.
> >
>
> Thanks Amit for pushing the patch. I have re-verified and found that
> changes need  to be done in few more places. The main changes are made
> in the header file and plpython source files. The attached patch
> handles the same. I have verified make check and make check-world
> including --with-python & --with-perl in the following:
> CentOS Linux release 7.7.1908
> Red Hat Enterprise Linux Server release 7.1
>
> I have verified including --llvm in CentOS Linux release 7.7.1908.
>

Thanks for finding the remaining places, the patch looks good to me.
I hope this covers the entire code.  BTW, are you using some script to
find this or is this a result of manual inspection of code?  I have
modified the commit message in the attached patch.  I will commit this
early next week unless someone else wants to review it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


0001-Make-the-order-of-the-header-file-includes-consisten.patch
Description: Binary data


Re: function calls optimization

2019-11-21 Thread Andrzej Barszcz
I think your first thought was good.
How high?  I think it's a matter of convention; certainly more than the
default of 100.



czw., 21 lis 2019 o 02:05 Andy Fan  napisał(a):

>
>
> On Thu, Oct 31, 2019 at 11:07 PM Tom Lane  wrote:
>
>>
>>
>> Possibly this could be finessed by only trying to find duplicates of
>> functions that have high cost estimates.  Not sure how high is high
>> enough.
>
>
> Can we just add a flag on pg_proc to show whether the cost is high or not?
> If users are not happy with that, they can change it by updating the value.
> Given that most function call costs are low, this may be helpful for the
> search for duplicate expressions.
>


Re: [HACKERS] Block level parallel vacuum

2019-11-21 Thread Masahiko Sawada
On Thu, 21 Nov 2019 at 14:16, Dilip Kumar  wrote:
>
> On Wed, Nov 20, 2019 at 11:01 AM Masahiko Sawada
>  wrote:
> >
> > On Mon, 18 Nov 2019 at 15:38, Masahiko Sawada
> >  wrote:
> > >
> > > On Mon, 18 Nov 2019 at 15:34, Amit Kapila  wrote:
> > > >
> > > > On Mon, Nov 18, 2019 at 11:37 AM Masahiko Sawada
> > > >  wrote:
> > > > >
> > > > > On Wed, 13 Nov 2019 at 14:31, Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > > Based on these needs, we came up with a way to allow users to 
> > > > > > specify
> > > > > > this information for IndexAm's. Basically, Indexam will expose a
> > > > > > variable amparallelvacuumoptions which can have below options
> > > > > >
> > > > > > VACUUM_OPTION_NO_PARALLEL   1 << 0 # vacuum (neither bulkdelete nor
> > > > > > vacuumcleanup) can't be performed in parallel
> > > > >
> > > > > I think VACUUM_OPTION_NO_PARALLEL can be 0 so that index AMs who don't
> > > > > want to support parallel vacuum don't have to set anything.
> > > > >
> > > >
> > > > make sense.
> > > >
> > > > > > VACUUM_OPTION_PARALLEL_BULKDEL   1 << 1 # bulkdelete can be done in
> > > > > > parallel (Indexes nbtree, hash, gin, gist, spgist, bloom will set 
> > > > > > this
> > > > > > flag)
> > > > > > VACUUM_OPTION_PARALLEL_COND_CLEANUP  1 << 2 # vacuumcleanup can be
> > > > > > done in parallel if bulkdelete is not performed (Indexes nbtree, 
> > > > > > brin,
> > > > > > gin, gist,
> > > > > > spgist, bloom will set this flag)
> > > > > > VACUUM_OPTION_PARALLEL_CLEANUP  1 << 3 # vacuumcleanup can be done 
> > > > > > in
> > > > > > parallel even if bulkdelete is already performed (Indexes gin, brin,
> > > > > > and bloom will set this flag)
> > > > >
> > > > > I think gin and bloom don't need to set both but should set only
> > > > > VACUUM_OPTION_PARALLEL_CLEANUP.
> > > > >
> > > > > And I'm going to disallow index AMs to set both
> > > > > VACUUM_OPTION_PARALLEL_COND_CLEANUP and VACUUM_OPTION_PARALLEL_CLEANUP
> > > > > by assertions, is that okay?
> > > > >
> > > >
> > > > Sounds reasonable to me.
> > > >
> > > > Are you planning to include the changes related to I/O throttling
> > > > based on the discussion in the nearby thread [1]?  I think you can do
> > > > that if you agree with the conclusion in the last email[1], otherwise,
> > > > we can explore it separately.
> > >
> > > Yes I agreed. I'm going to include that changes in the next version
> > > patches. And I think we will be able to do more discussion based on
> > > the patch.
> > >
> >
> > I've attached the latest version patch set. The patch set includes all
> > discussed points regarding index AM options as well as shared cost
> > balance. Also I added some test cases used all types of index AM.
> >
> > During developments I had one concern about the number of parallel
> > workers to launch. In current design each index AMs can choose the
> > participation of parallel bulk-deletion and parallel cleanup. That
> > also means the number of parallel worker to launch might be different
> > for each time of parallel bulk-deletion and parallel cleanup. In
> > current patch the leader will always launch the number of indexes that
> > support either one but it would not be efficient in some cases. For
> > example, if we have 3 indexes supporting only parallel bulk-deletion
> > and 2 indexes supporting only parallel index cleanup, we would launch
> > 5 workers for each execution but some workers will do nothing at all.
> > To deal with this problem, I wonder if we can improve the parallel
> > query so that the leader process creates a parallel context with the
> > maximum number of indexes and can launch a part of workers instead of
> > all of them.
> >
> +
> + /* compute new balance by adding the local value */
> + shared_balance = pg_atomic_read_u32(VacuumSharedCostBalance);
> + new_balance = shared_balance + VacuumCostBalance;
>
> + /* also compute the total local balance */
> + local_balance = VacuumCostBalanceLocal + VacuumCostBalance;
> +
> + if ((new_balance >= VacuumCostLimit) &&
> + (local_balance > 0.5 * (VacuumCostLimit / nworkers)))
> + {
> + /* compute sleep time based on the local cost balance */
> + msec = VacuumCostDelay * VacuumCostBalanceLocal / VacuumCostLimit;
> + new_balance = shared_balance - VacuumCostBalanceLocal;
> + VacuumCostBalanceLocal = 0;
> + }
> +
> + if (pg_atomic_compare_exchange_u32(VacuumSharedCostBalance,
> +&shared_balance,
> +new_balance))
> + {
> + /* Updated successfully, break */
> + break;
> + }
> While looking at the shared costing delay part, I have noticed that
> while checking the delay condition, we are considering local_balance
> which is VacuumCostBalanceLocal + VacuumCostBalance, but while
> computing the new balance we only reduce shared balance by
> VacuumCostBalanceLocal,  I think it should be reduced with
> local_balance?

 Right.

> I see that later we are adding VacuumCostBalance to
> the VacuumCostBalanceLocal so we are not loosing accounting for this
> balance.  But, I

Re: Attempt to consolidate reading of XLOG page

2019-11-21 Thread Michael Paquier
On Wed, Nov 20, 2019 at 03:50:29PM +0100, Antonin Houska wrote:
> Thus the use of pg_pread() makes the code quite a bit simpler, so I
> re-introduced it. If you decide that an explicit lseek() should still be
> used, just let me know.

Skimming through the code, it looks like in a good state.  The
failures of test_deconding are fixed, and all the changes from Alvaro
have been added.

+   fatal_error("could not read in file %s, offset %u, length %zu: %s",
+   fname, seg->ws_off, (Size) errinfo.wre_req,
+   strerror(errinfo.wre_errno));
You should be able to use %m here instead of strerror().
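
That is, something like the following (an untested sketch, setting errno so
that %m reports the saved error):

    errno = errinfo.wre_errno;
    fatal_error("could not read in file %s, offset %u, length %zu: %m",
                fname, seg->ws_off, (Size) errinfo.wre_req);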

It seems to me that it is important not to make such changes completely
blindly, so that this does not become an issue for recovery later on.
FWIW, I ran a small set of tests with WAL
segment sizes of 1MB and 1GB (fsync = off, max_wal_size/min_wal_size
set very high, 1 billion rows in single-column table followed by a
series of updates):
- Created a primary and a standby which archive_mode set.
- Stopped the standby.
- Produced close to 12GB worth of WAL.
- Restarted the standby with restore_command and compared the time it
takes for recovery to complete all the segments with HEAD and your
refactoring:
1GB + HEAD: 7min52s
1GB + patch: 8min10s
1MB + HEAD: 10min17s
1MB + patch: 12min1s

And with WAL segments at 1MB, I was seeing quite a slowdown with the
patch...  Then I did an extra test with pg_waldump on the
segments generated previously, with the output redirected to /dev/null.
Going through 512 segments takes 15.730s with HEAD (average of 3 runs)
and 15.851s with the patch.
--
Michael


signature.asc
Description: PGP signature