Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.

2020-02-14 Thread Bryn Llewellyn
Thank you both, Vik and David, for being so quick to respond. All is clear now. 
It seems to me that the price (giving up the ability to say explicitly what 
primitive JSON values you want) is too great to pay for the benefit (being able 
to build the semantic equivalent of a variadic list of actual arguments as text).

So I wrote my own wrapper for jsonb_build_array() and jsonb_build_object():

create function my_jsonb_build(
  kind in varchar,
  variadic_elements in varchar)
  returns jsonb
  immutable
  language plpgsql
as $body$
declare
  stmt varchar :=
    case kind
      when 'array' then
        'select jsonb_build_array('||variadic_elements||')'
      when 'object' then
        'select jsonb_build_object('||variadic_elements||')'
    end;
  j jsonb;
begin
  execute stmt into j;
  return j;
end;
$body$;

create type t1 as(a int, b varchar);

--------
-- Test it.

select jsonb_pretty(my_jsonb_build(
  'array',
  $$
17::integer, 'dog'::varchar, true::boolean
  $$));

select jsonb_pretty(my_jsonb_build(
  'array',
  $$
17::integer,
'dog'::varchar,
true::boolean,
(17::int, 'dog'::varchar)::t1
  $$));

select jsonb_pretty(my_jsonb_build(
  'object',
  $$
'a'::varchar,  17::integer,
'b'::varchar,  'dog'::varchar,
'c'::varchar,  true::boolean
  $$));

It produces the result that I want. And I’m prepared to pay the price of using 
$$ quoting to avoid doubling up interior single quotes.
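
For reference, the third ('object') call should print something like this 
(assuming jsonb_pretty's default four-space indentation):

{
    "a": 17,
    "b": "dog",
    "c": true
}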

On 14-Feb-2020, at 19:24, David G. Johnston  wrote:

On Friday, February 14, 2020, Bryn Llewellyn  wrote:

select jsonb_pretty(jsonb_object(
 '{a, 17, b, "dog", c, true}'::varchar[]
 ))

In other words, do the double quotes around "dog" have no effect? That would be 
a bad thing—and it would limit the usefulness of the jsonb_object() function.

The double quotes serve a specific purpose, to allow values containing commas 
to be treated as a single value (see syntax details for the exact rules) in the 
resulting array of text values.  The fact you don’t have to quote the other 
strings is a convenience behavior of the feature.

David J.



Re: Parallel copy

2020-02-14 Thread Amit Kapila
On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner  wrote:
>
> On Fri, 14 Feb 2020 at 11:57, Amit Kapila  wrote:
>>
>> On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro  wrote:
>> >
>> > On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila  
>> > wrote:
>
>  ...
>>
>> > > Another approach that came up during an offlist discussion with Robert
>> > > is that we have one dedicated worker for reading the chunks from file
>> > > and it copies the complete tuples of one chunk in the shared memory
>> > > and once that is done, a handover that chunks to another worker which
>> > > can process tuples in that area.  We can imagine that the reader
>> > > worker is responsible to form some sort of work queue that can be
>> > > processed by the other workers.  In this idea, we won't be able to get
>> > > the benefit of initial tokenization (forming tuple boundaries) via
>> > > parallel workers and might need some additional memory processing as
>> > > after reader worker has handed the initial shared memory segment, we
>> > > need to somehow identify tuple boundaries and then process them.
>
>
> Parsing rows from the raw input (the work done by CopyReadLine()) in a single 
> process would accommodate line returns in quoted fields. I don't think 
> there's a way of getting parallel workers to manage the in-quote/out-of-quote 
> state required.
>

AFAIU, the whole of this in-quote/out-of-quote state is managed inside
CopyReadLineText, which will be done by each of the parallel workers,
something along the lines of what Thomas did in his patch [1].
Basically, we need to invent a mechanism to allocate chunks to
individual workers and then the whole processing will be done as we
are doing now, except for special handling for partial tuples which I
have explained in my previous email.  Am I missing something here?

>>
>> >
>
> ...
>>
>>
>> > > Another thing we need to figure out is the how many workers to use for
>> > > the copy command.  I think we can use it based on the file size which
>> > > needs some experiments or may be based on user input.
>> >
>> > It seems like we don't even really have a general model for that sort
>> > of thing in the rest of the system yet, and I guess some kind of
>> > fairly dumb explicit system would make sense in the early days...
>> >
>>
>> makes sense.
>
> The ratio between chunking or line parsing processes and the parallel worker 
> pool would vary with the width of the table, complexity of the data or file 
> (dates, encoding conversions), complexity of constraints and acceptable 
> impact of the load. Being able to control it through user input would be 
> great.
>

Okay, I think one simple way could be that we compute the number of
workers based on the file size (some experiments are required to determine
this) unless the user has given an input.  If the user has provided an
input, then we can use that, with an upper limit of
max_parallel_workers.
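
For illustration only, a hypothetical sketch of what such user input could look
like (no such COPY option exists today; the option name and placement are
assumptions, not part of any patch):

-- hypothetical syntax: let the user cap the number of parallel workers
COPY lineitem FROM '/path/to/lineitem.csv' WITH (FORMAT csv, PARALLEL 8);
-- with no PARALLEL option given, the server would derive a worker count
-- from the file size, capped at max_parallel_workers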


[1] - 
https://www.postgresql.org/message-id/CA%2BhUKGKZu8fpZo0W%3DPOmQEN46kXhLedzqqAnt5iJZy7tD0x6sw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.

2020-02-14 Thread David G. Johnston
On Friday, February 14, 2020, Bryn Llewellyn  wrote:
>
>
> select jsonb_pretty(jsonb_object(
>  '{a, 17, b, "dog", c, true}'::varchar[]
>  ))
>
> In other words, do the double quotes around "dog" have no effect? That
> would be a bad thing—and it would limit the usefulness of the
> jsonb_object() function.
>

The double quotes serve a specific purpose, to allow values containing
commas to be treated as a single value (see syntax details for the exact
rules) in the resulting array of text values.  The fact you don’t have to
quote the other strings is a convenience behavior of the feature.
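
A minimal illustration of that rule (the sample values are mine, not from the
original report):

select ('{a, "one, two", b}'::text[])[2] as second_element;
-- returns a single value: one, two
-- without the double quotes, the literal would parse as four elements: a, one, two, b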

David J.


Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.

2020-02-14 Thread David G. Johnston
On Friday, February 14, 2020, Bryn Llewellyn  wrote:
>
> The doc (“Builds a JSON object out of a text array.”) is simply too terse
> to inform an answer to this question.
>

It does presume knowledge but it precisely defines the outcome:

PostgreSQL arrays are typed and all members are of the same type.  A text
array’s members are all text.

Given the above knowledge the fact that the resultant json object contains
exclusively text keys and text values directly follows.
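
A short way to see that (the query is mine, just to make the point concrete):

select x, pg_typeof(x)
from unnest('{a, 17, "dog", true}'::varchar[]) as t(x);
-- every row's type is reported as "character varying"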

David J.


Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.

2020-02-14 Thread Vik Fearing
On 15/02/2020 04:07, Bryn Llewellyn wrote:
> This:
> 
> select jsonb_pretty(jsonb_build_object(
>  'a'::varchar, 1.7::numeric,
>  'b'::varchar, 'dog'::varchar,
>  'c'::varchar, true::boolean
>  ))
> 
> allows me to express what I want. That’s a good thing. Are you saying that 
> this:
> 
> select jsonb_pretty(jsonb_object(
>  '{a, 17, b, "dog", c, true}'::varchar[]
>  ))
> 
> simply lacks that power of expression and that every item in the array is 
> assumed to be intended to end up as a JSON text primitive value? In other 
> words, do the double quotes around "dog" have no effect?

That is correct.

> That would be a bad thing—and it would limit the usefulness of the 
> jsonb_object() function.

Use the long form if you need to mix datatypes.
-- 
Vik Fearing




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-14 Thread Amit Kapila
On Fri, Feb 14, 2020 at 9:13 PM Tom Lane  wrote:
>
> Robert Haas  writes:
> > On Wed, Feb 12, 2020 at 11:53 AM Tom Lane  wrote:
>
> > That's an interesting idea, but it doesn't make the lock acquisition
> > itself interruptible, which seems pretty important to me in this case.
>
> Good point: if you think the contained operation might run too long to
> suit you, then you don't want other backends to be stuck behind it for
> the same amount of time.
>

It is not clear to me why we should add that as a requirement for this
patch when other places like WALWriteLock, etc. have similar coding
patterns and we haven't heard a ton of complaints about making them
interruptible; if there are any, I am not aware of them.

> > I wonder if we could have an LWLockAcquireInterruptibly() or some such
> > that allows the lock acquisition itself to be interruptible. I think
> > that would require some rejiggering but it might be doable.
>
> Yeah, I had the impression from a brief look at LWLockAcquire that
> it was itself depending on not throwing errors partway through.
> But with careful and perhaps-a-shade-slower coding, we could probably
> make a version that didn't require that.
>

If this becomes a requirement for moving this patch forward, then surely we
can do that.  BTW, what exactly do we need to ensure for it?  Is it something
along the lines of ensuring that in the error path the state of the lock is
cleared?  Are we worried that the interrupt handler might do something
that changes the state of the lock we are acquiring?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.

2020-02-14 Thread Bryn Llewellyn
This:

select jsonb_pretty(jsonb_build_object(
 'a'::varchar, 1.7::numeric,
 'b'::varchar, 'dog'::varchar,
 'c'::varchar, true::boolean
 ))

allows me to express what I want. That’s a good thing. Are you saying that this:

select jsonb_pretty(jsonb_object(
 '{a, 17, b, "dog", c, true}'::varchar[]
 ))

simply lacks that power of expression and that every item in the array is 
assumed to be intended to end up as a JSON text primitive value? In other 
words, do the double quotes around "dog" have no effect? That would be a bad 
thing—and it would limit the usefulness of the jsonb_object() function.

The doc (“Builds a JSON object out of a text array.”) is simply too terse to 
inform an answer to this question.

On 14-Feb-2020, at 18:28, Vik Fearing  wrote:

On 15/02/2020 03:21, Bryn Llewellyn wrote:
> Now execute this supposed functional equivalent:
> 
> select jsonb_pretty(jsonb_object(
>  '{a, 17, b, "dog", c, true}'::varchar[]
>  ))
> 
> It is meant to be a nice alternative when you want to build an object (rather 
> than an array) because the syntax is less verbose.
> 
> However, it gets the wrong answer, thus:
> 
> {  +
> "a": "17", +
> "b": "dog",+
> "c": "true"+
> }
> 
> Now, the numeric value and the boolean value are double-quoted—in other 
> words, they have been implicitly converted to JSON primitive text values.

They haven't been implicitly converted, you gave an array of varchars.
How should it know that you don't want texts?

> Do you agree that this is a bug?
No.
-- 
Vik Fearing





Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-14 Thread Amit Kapila
On Fri, Feb 14, 2020 at 8:12 PM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > I think MaxBackends will generally limit the number of different
> > relations that can simultaneously extend, but maybe tables with many
> > partitions might change the situation.  You are right that some tests
> > might suggest a good number, let Mahendra write a patch and then we
> > can test it.  Do you have any better idea?
>
> In the first place, there certainly isn't more than one extension
> happening at a time per backend, else the entire premise of this
> thread is wrong.  Handwaving about partitions won't change that.
>

Having a larger number of partitions theoretically increases the chances
of false sharing with the same number of concurrent sessions.  For example,
two sessions operating on two plain relations vs. two sessions working on
two relations with 100 partitions each: the latter increases the chances of
false sharing.  Sawada-San and Mahendra have done many tests and some
monitoring on different systems with the previous patch, and with a decent
number of fixed slots (1024) the false sharing was very low, and even where
it did occur the effect was close to nothing.  So, in short, this is not
the point to worry about; rather, we need to ensure that we don't create
any significant regressions in this area.

> In the second place, it's ludicrous to expect that the underlying
> platform/filesystem can support an infinite number of concurrent
> file-extension operations.  At some level (e.g. where disk blocks
> are handed out, or where a record of the operation is written to
> a filesystem journal) it's quite likely that things are bottlenecked
> down to *one* such operation at a time per filesystem.  So I'm not
> that concerned about occasional false-sharing limiting our ability
> to issue concurrent requests.  There are probably worse restrictions
> at lower levels.
>

Agreed and what we have observed during the tests is what you have
said in this paragraph.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.

2020-02-14 Thread Vik Fearing
On 15/02/2020 03:21, Bryn Llewellyn wrote:
> Now execute this supposed functional equivalent:
> 
> select jsonb_pretty(jsonb_object(
>   '{a, 17, b, "dog", c, true}'::varchar[]
>   ))
> 
> It is meant to be a nice alternative when you want to build an object (rather 
> than an array) because the syntax is less verbose.
> 
> However, it gets the wrong answer, thus:
> 
>  {  +
>  "a": "17", +
>  "b": "dog",+
>  "c": "true"+
>  }
> 
> Now, the numeric value and the boolean value are double-quoted—in other 
> words, they have been implicitly converted to JSON primitive text values.

They haven't been implicitly converted, you gave an array of varchars.
How should it know that you don't want texts?
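
To see what jsonb_object() actually receives (example mine):

select '{a, 17, b, "dog", c, true}'::varchar[];
-- result: {a,17,b,dog,c,true}
-- six varchar values; the double quotes around "dog" are consumed by the
-- array-literal parser and carry no type information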

> Do you agree that this is a bug?
No.
-- 
Vik Fearing




jsonb_object() seems to be buggy. jsonb_build_object() is good.

2020-02-14 Thread Bryn Llewellyn
Execute this:

select jsonb_pretty(jsonb_build_object(
  'a'::varchar, 1.7::numeric,
  'b'::varchar, 'dog'::varchar,
  'c'::varchar, true::boolean
  ))

It produces the result that I expect:

 {  +
 "a": 1.7,  +
 "b": "dog",+
 "c": true  +
 }

Notice that the numeric, text, and boolean primitive values are properly 
rendered with the text value double-quoted and the numeric and boolean values 
unquoted.

Now execute this supposed functional equivalent:

select jsonb_pretty(jsonb_object(
  '{a, 17, b, "dog", c, true}'::varchar[]
  ))

It is meant to be a nice alternative when you want to build an object (rather 
than an array) because the syntax is less verbose.

However, it gets the wrong answer, thus:

 {  +
 "a": "17", +
 "b": "dog",+
 "c": "true"+
 }

Now, the numeric value and the boolean value are double-quoted—in other words, 
they have been implicitly converted to JSON primitive text values.

Do you agree that this is a bug?

Notice that I see this behavior in vanilla PostgreSQL 11.2 and in YugabyteDB 
Version 2.0.11.0. See this blogpost:

“Distributed PostgreSQL on a Google Spanner Architecture—Query Layer”
https://blog.yugabyte.com/distributed-postgresql-on-a-google-spanner-architecture-query-layer/

YugabyteDB uses the PostgreSQL source code for its SQL upper half.

Regards, Bryn Llewellyn, Yugabyte





Re: Use LN_S instead of "ln -s" in Makefile

2020-02-14 Thread Tom Lane
Ashwin Agrawal  writes:
> On Fri, Feb 14, 2020 at 4:57 PM Tom Lane  wrote:
>> Hm.  So, these oversights are certainly bugs in narrow terms.  However,
>> they've all been like that for a *long* time --- the three instances
>> you found date to 2005, 2006, and 2008 according to "git blame".
>> The complete lack of complaints shows that nobody cares.  So I think
>> a fairly strong case could be made for going the other way, ie
>> s/$(LN_S)/ln -s/g and get rid of the configure-time cycles wasted in
>> setting up that variable.

> I accede to it.

Oh ... 2005 was just the last time anybody touched that particular
line in backend/Makefile.  Further digging shows that we've been
installing the postmaster -> postgres symlink with raw "ln -s"
clear back to the Postgres95 virgin sources.  I didn't bother to
chase down the oldest instances of the other two cases.

(Man, "git blame" is such a great tool for software archaeology.)

regards, tom lane




Re: Use LN_S instead of "ln -s" in Makefile

2020-02-14 Thread Ashwin Agrawal
On Fri, Feb 14, 2020 at 4:57 PM Tom Lane  wrote:

> Ashwin Agrawal  writes:
> > In general, the variable LN_S is 'ln -s', however, LN_S changes to 'cp
> > -pR' if configure finds the file system does not support symbolic
> > links.
> > I was playing with 'ln' for some other reason where I symbolic linked
> > it to '/bin/false'. That revealed the failure for
> > src/backend/Makefile. Greping for 'ln -s' revealed three places it's
> > used. Attaching the patch to fix the same.
>
> Hm.  So, these oversights are certainly bugs in narrow terms.  However,
> they've all been like that for a *long* time --- the three instances
> you found date to 2005, 2006, and 2008 according to "git blame".
> The complete lack of complaints shows that nobody cares.  So I think
> a fairly strong case could be made for going the other way, ie
> s/$(LN_S)/ln -s/g and get rid of the configure-time cycles wasted in
> setting up that variable.
>

I accede to it.


Re: Just for fun: Postgres 20?

2020-02-14 Thread Peter Geoghegan
On Thu, Feb 13, 2020 at 1:34 PM Andrew Dunstan
 wrote:
> I also object because 20 is *my* unlucky number ...

I don't think we're going to do this, so you don't have to worry on that score.

-- 
Peter Geoghegan




Re: Just for fun: Postgres 20?

2020-02-14 Thread Peter Geoghegan
On Fri, Feb 14, 2020 at 4:19 PM Tom Lane  wrote:
> Not sure how serious Andrew is being here, but it does open up an
> important point: there are varying opinions on which numbers are unlucky.
> The idea that 13 is unlucky is Western, and maybe even only common in
> English-speaking countries.

I would wager that this superstition is the main reason why Oracle 12c
was followed by Oracle 18c rather than Oracle 13c. I have no evidence
for this -- I take it on faith.

I feel that I should take the proposal seriously for at least a
moment. The proposal doesn't affect anybody who isn't into numerology.
At the same time, it makes the superstitious people happy (leaving
aside the icosaphobes). Airlines do this with row numbers -- what's
the harm?

There is a real downside to this, though. It is a bad idea, even on
its own terms. If we take the idea seriously, then it has every chance
of being noticed and becoming a big distraction in all sorts of ways.
That might happen anyway, but I think it's less likely this way.

ISTM that the smart thing to do is to ignore it completely. Don't even
try to preempt a silly headline written by some tech journalist
wiseacre.

--
Peter Geoghegan




Re: Use LN_S instead of "ln -s" in Makefile

2020-02-14 Thread Tom Lane
Ashwin Agrawal  writes:
> In general, the variable LN_S is 'ln -s', however, LN_S changes to 'cp
> -pR' if configure finds the file system does not support symbolic
> links.
> I was playing with 'ln' for some other reason where I symbolic linked
> it to '/bin/false'. That revealed the failure for
> src/backend/Makefile. Greping for 'ln -s' revealed three places it's
> used. Attaching the patch to fix the same.

Hm.  So, these oversights are certainly bugs in narrow terms.  However,
they've all been like that for a *long* time --- the three instances
you found date to 2005, 2006, and 2008 according to "git blame".
The complete lack of complaints shows that nobody cares.  So I think
a fairly strong case could be made for going the other way, ie
s/$(LN_S)/ln -s/g and get rid of the configure-time cycles wasted in
setting up that variable.  Otherwise I fear somebody will "break"
it again soon, and it will stay "broken" for another 15 years till
someone happens to notice.  We have better things to do than spend
our time maintaining such nonfunctional differences.

regards, tom lane




Re: Standards compliance of SET ROLE / SET SESSION AUTHORIZATION

2020-02-14 Thread Chapman Flack
On 02/14/20 18:43, Tom Lane wrote:

> I suppose it could be argued that that's a bug in the interpretation
> of role membership: arguably, if you're a member of some superuser
> role, that ought to give you membership in anything else.  IOW, a
> superuser's implicit membership in every role isn't transitive,
> and maybe it should be.  But I'm not sure that I want to change that;
> it feels like doing so might have surprising side-effects.

I have a tendency to create roles like postgres_assumable or
dba_assumable, which are themselves members of the indicated
roles, but without rolinherit, and then grant those to my own
role. That way in my day to day faffing about, I don't get to
make superuser-powered mistakes, but I can 'set role postgres'
when needed.
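
Concretely, the setup looks roughly like this (role names are just examples):

create role postgres_assumable noinherit;
grant postgres to postgres_assumable;
grant postgres_assumable to chap;
-- chap does not inherit superuser privileges for day-to-day work
-- (postgres_assumable is NOINHERIT), but role membership still allows
-- a deliberate escalation when needed:
set role postgres;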

Would it make sense for a proposed transitive superuser-membership-
in-everything also to stop at a role without rolinherit? Clearly
it would just add one extra step to 'set role anybody', but sometimes
one extra step inspires a useful extra moment of thought.

Regards,
-Chap




Use LN_S instead of "ln -s" in Makefile

2020-02-14 Thread Ashwin Agrawal
In general, the variable LN_S is 'ln -s', however, LN_S changes to 'cp
-pR' if configure finds the file system does not support symbolic
links.

I was playing with 'ln' for some other reason and symbolically linked
it to '/bin/false'. That revealed the failure for
src/backend/Makefile. Grepping for 'ln -s' revealed three places where it's
used. Attaching a patch to fix them.
From 60b3374de1e4d0fd143c2a7b3bf78893e32579af Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Fri, 14 Feb 2020 16:16:37 -0800
Subject: [PATCH v1] Use LN_S instead of "ln -s" in Makefile

---
 src/backend/Makefile  | 2 +-
 src/interfaces/ecpg/test/Makefile | 2 +-
 src/makefiles/pgxs.mk | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/Makefile b/src/backend/Makefile
index 9706a95848..f21a454286 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -224,7 +224,7 @@ install-bin: postgres $(POSTGRES_IMP) installdirs
 	$(INSTALL_PROGRAM) postgres$(X) '$(DESTDIR)$(bindir)/postgres$(X)'
 ifneq ($(PORTNAME), win32)
 	@rm -f '$(DESTDIR)$(bindir)/postmaster$(X)'
-	ln -s postgres$(X) '$(DESTDIR)$(bindir)/postmaster$(X)'
+	$(LN_S) postgres$(X) '$(DESTDIR)$(bindir)/postmaster$(X)'
 else
 	$(INSTALL_PROGRAM) postgres$(X) '$(DESTDIR)$(bindir)/postmaster$(X)'
 endif
diff --git a/src/interfaces/ecpg/test/Makefile b/src/interfaces/ecpg/test/Makefile
index be53b7b94d..11d55bde3e 100644
--- a/src/interfaces/ecpg/test/Makefile
+++ b/src/interfaces/ecpg/test/Makefile
@@ -73,7 +73,7 @@ remaining_files_build := $(patsubst $(srcdir)/%, $(abs_builddir)/%, $(remaining_
 
 all: $(remaining_files_build)
 $(remaining_files_build): $(abs_builddir)/%: $(srcdir)/%
-	ln -s $< $@
+	$(LN_S) $< $@
 endif
 
 # Common options for tests. Also pick up anything passed in EXTRA_REGRESS_OPTS
diff --git a/src/makefiles/pgxs.mk b/src/makefiles/pgxs.mk
index 271e7eaba8..8c7d126593 100644
--- a/src/makefiles/pgxs.mk
+++ b/src/makefiles/pgxs.mk
@@ -397,7 +397,7 @@ test_files_build := $(patsubst $(srcdir)/%, $(abs_builddir)/%, $(test_files_src)
 all: $(test_files_build)
 $(test_files_build): $(abs_builddir)/%: $(srcdir)/%
 	$(MKDIR_P) $(dir $@)
-	ln -s $< $@
+	$(LN_S) $< $@
 endif # VPATH
 endif # REGRESS
 
-- 
2.25.0



Re: Just for fun: Postgres 20?

2020-02-14 Thread Tom Lane
Andrew Dunstan  writes:
> I also object because 20 is *my* unlucky number ...

Not sure how serious Andrew is being here, but it does open up an
important point: there are varying opinions on which numbers are unlucky.
The idea that 13 is unlucky is Western, and maybe even only common in
English-speaking countries.  In Asia, numbers containing the digit 4
are considered unlucky [1], and there are probably other rules in other
cultures.  If we establish a precedent that we'll skip release numbers
for non-technical reasons, I'm afraid we'll be right back in the mess
we sought to avoid, whereby nearly every year we had an argument about
what the next release number would be.  So let's not go there.

regards, tom lane

[1] https://en.wikipedia.org/wiki/Tetraphobia




Re: Standards compliance of SET ROLE / SET SESSION AUTHORIZATION

2020-02-14 Thread Tom Lane
I wrote:
> What I'm now thinking is that we shouldn't mess with the behavior of
> SET ROLE, as I mused about doing yesterday in [1].  It's spec-compliant,
> or close enough, so let's leave it be.  On the other hand, changing the
> behavior of SET SESSION AUTHORIZATION is not constrained by spec
> compliance concerns, only backwards compatibility.  We could address the
> pg_dump concerns I had in [1] by tweaking what SET SESSION AUTHORIZATION
> can do and then adjusting pg_dump to swap its usage of SET SESSION
> AUTHORIZATION (do that just once, in response to --role) and SET ROLE
> (do that per-object, to establish ownership).

Concretely, I propose the following semantics:

* SET SESSION AUTHORIZATION is allowed if your original login role
is a member of the target role.  If successful, it resets the role
to "NONE", ie session authorization and effective role both become
the stated role.

* SET ROLE is allowed if your session authorization is a member
of the target role.  If successful, it sets the effective role to
the target role.  SET ROLE NONE resets effective role to the
current session authorization.

This is the same behavior we have now for SET ROLE.  The difference
for SET SESSION AUTHORIZATION is that currently that requires your
login role to be superuser or equal to the target role, so the
above is a strictly weaker check.
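
To make the difference concrete, a sketch with hypothetical roles (alice is a
non-superuser login role that has been granted bob):

-- setup, as a superuser:
create role bob;
create role alice login;
grant bob to alice;

-- in a session logged in as alice:
set role bob;                    -- allowed today: alice is a member of bob
set session authorization bob;   -- today: an error unless alice is a superuser;
                                 -- under the proposal: allowed, and the
                                 -- effective role is reset to NONE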

The reason this is interesting is that currently, if you log in
as somebody who isn't superuser but is allowed to become superuser
(ie, has been granted a superuser role), you're not allowed to
SET SESSION AUTHORIZATION to the superuser, only SET ROLE to it.
And that in turn means that you can't necessarily SET ROLE to any
random other userid, which is a weird restriction that breaks
the "pg_restore --role" use-case for this whole thing [1].

I suppose it could be argued that that's a bug in the interpretation
of role membership: arguably, if you're a member of some superuser
role, that ought to give you membership in anything else.  IOW, a
superuser's implicit membership in every role isn't transitive,
and maybe it should be.  But I'm not sure that I want to change that;
it feels like doing so might have surprising side-effects.

Note that AFAICS, this is just as spec-compliant as our current
behavior.  The spec only constrains what SET ROLE does.

regards, tom lane

[1] https://www.postgresql.org/message-id/11496.1581634533%40sss.pgh.pa.us




Re: Memory-Bounded Hash Aggregation

2020-02-14 Thread Jeff Davis
On Wed, 2020-02-12 at 21:51 -0800, Jeff Davis wrote:
> On Mon, 2020-02-10 at 15:57 -0800, Jeff Davis wrote:
> > Attaching latest version (combined logtape changes along with main
> > HashAgg patch).
> 
> I ran a matrix of small performance tests to look for regressions.

I ran some more tests, this time comparing Hash Aggregation to
Sort+Group.

Summary of trends:

  group key complexity : favors Hash
  group key size   : favors Hash
  group size   : favors Hash
  higher work_mem  : favors Sort[1]
  data set size: favors Sort[1]
  number of aggregates : favors Hash[2]

  [1] I have closed the gap a bit with some post-experiment tuning.
  I have just begun to analyze this case so I think there is
  quite a bit more room for improvement.
  [2] Could use more exploration -- I don't have an explanation.

Data sets:
  t20m_1_int4: ~20 million groups of size ~1 (uniform)
  t1m_20_int4: ~1 million groups of size ~20 (uniform)
  t1k_20k_int4: ~1k groups of size ~20k (uniform)

  also, text versions of each of those with collate "C.UTF-8"

Results:

1. A general test to vary the group size, key type, and work_mem.

Query:
  select i from $TABLE group by i offset 1;

work_mem='4MB'

+--------------+----------+-------------+--------------+
|              | sort(ms) | hashagg(ms) | sort/hashagg |
+--------------+----------+-------------+--------------+
| t20m_1_int4  |    11852 |       10640 |         1.11 |
| t1m_20_int4  |    11108 |        8109 |         1.37 |
| t1k_20k_int4 |     8575 |        2732 |         3.14 |
| t20m_1_text  |    80463 |       12902 |         6.24 |
| t1m_20_text  |    58586 |        9252 |         6.33 |
| t1k_20k_text |    21781 |        5739 |         3.80 |
+--------------+----------+-------------+--------------+

work_mem='32MB'

+--------------+----------+-------------+--------------+
|              | sort(ms) | hashagg(ms) | sort/hashagg |
+--------------+----------+-------------+--------------+
| t20m_1_int4  |     9656 |       11702 |         0.83 |
| t1m_20_int4  |     8870 |        9804 |         0.90 |
| t1k_20k_int4 |     6359 |        1852 |         3.43 |
| t20m_1_text  |    74266 |       14434 |         5.15 |
| t1m_20_text  |    56549 |       10180 |         5.55 |
| t1k_20k_text |    21407 |        3989 |         5.37 |
+--------------+----------+-------------+--------------+

2. Test group key size

data set:
   20m rows, four int4 columns.
   Columns a,b,c are all the constant value 1, forcing each
   comparison to look at all four columns.
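
As an aside, a rough sketch of how such a table could be generated (the exact
load script isn't given here; the distribution of column d below is an
assumption):

create table wide as
select 1::int4 as a,
       1::int4 as b,
       1::int4 as c,
       (random() * 20000000)::int4 as d
from generate_series(1, 20000000);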

Query: select a,b,c,d from wide group by a,b,c,d offset 1;

  work_mem='4MB'
  Sort : 30852ms
  HashAgg  : 12343ms
  Sort/HashAgg : 2.50

In theory, if the first grouping column is highly selective, then Sort
may have a slight advantage because it can look at only the first
column, while HashAgg needs to look at all 4. But HashAgg only needs to
perform this calculation once and it seems hard enough to show this in
practice that I consider it an edge case. In "normal" cases, it appears
that more grouping columns significantly favors Hash Agg.

3. Test number of aggregates

Data Set: same as for test #2 (group key size).

Query: select d, count(a),sum(b),avg(c),min(d)
   from wide group by d offset 1;

  work_mem='4MB'
  Sort : 22373ms
  HashAgg  : 17338ms
  Sort/HashAgg : 1.29

I don't have an explanation of why HashAgg is doing better here. Both
of them are using JIT and essentially doing the same number of
advancements. This could use more exploration, but the effect isn't
major.

4. Test data size

Data: 400 million rows of four random int8s. Group size of one.

Query: select a from t400m_1_int8 group by a offset 10;

  work_mem='32MB'
  Sort : 300675ms
  HashAgg  : 560740ms
  Sort/HashAgg : 0.54

I tried increasing the max number of partitions and brought the HashAgg
runtime down to 481985 (using 1024 partitions), which closes the gap to
0.62. That's not too bad for HashAgg considering this is a group size
of one with a simple group key. A bit more tuning might be able to
close the gap further.

Conclusion:

HashAgg is winning in a lot of cases, and this will be an important
improvement for many workloads. Not only is it faster in a lot of
cases, but it's also less risky. When an input has unknown group size,
it's much easier for the planner to choose HashAgg -- a small downside
and a big upside.

Regards,
Jeff Davis






Re: Standards compliance of SET ROLE / SET SESSION AUTHORIZATION

2020-02-14 Thread Tom Lane
Chapman Flack  writes:
> On 2/14/20 4:01 PM, Tom Lane wrote:
>> ...  A protocol-level message
>> to set session auth could also be possible, of course.

> I'll once again whimper softly and perhaps ineffectually that an
> SQL-exposed equivalent like

>  SET SESSION AUTHORIZATION foo WITH RESET COOKIE 'lkjhikuhoihkihlj';

> would seem to suit the same purpose, with the advantage of being
> immediately usable by any kind of front- or middle-end code the
> instant there is a server version that supports it, without having
> to wait for something new at the protocol level to trickle through
> to n different driver implementations.

Yeah, I'm not that thrilled with the idea of a protocol message
that's not equivalent to any SQL-level functionality, either.

But the immediate point here is that I think we could get away with
playing around with SET SESSION AUTHORIZATION's semantics.  Or,
seeing that that's just syntactic sugar for "SET session_authorization",
we could invent some new GUCs that allow control over this, rather
than new syntax.

regards, tom lane




Re: Standards compliance of SET ROLE / SET SESSION AUTHORIZATION

2020-02-14 Thread Chapman Flack
On 2/14/20 4:01 PM, Tom Lane wrote:
> Robert Haas  writes:
>> It wouldn't be difficult to introduce a new protocol-level option that
>> prohibits RESET SESSION AUTHORIZATION; and it would also be possible
>> to introduce a new protocol message that has the same effect as RESET
>> SESSION AUTHORIZATION. If you do those two things, then it's possible
>> to create a sandbox which the end client cannot escape but which the
>> pooler can escape easily.
> ...
>   SET SESSION AUTHORIZATION foo PERMANENT;
> ...  A protocol-level message
> to set session auth could also be possible, of course.

I'll once again whimper softly and perhaps ineffectually that an
SQL-exposed equivalent like

 SET SESSION AUTHORIZATION foo WITH RESET COOKIE 'lkjhikuhoihkihlj';

would seem to suit the same purpose, with the advantage of being
immediately usable by any kind of front- or middle-end code the
instant there is a server version that supports it, without having
to wait for something new at the protocol level to trickle through
to n different driver implementations.

Regards,
-Chap




Re: [DOC] Document auto vacuum interruption

2020-02-14 Thread James Coleman
On Thu, Sep 19, 2019 at 5:34 AM Amit Kapila  wrote:
>
> On Wed, Sep 18, 2019 at 10:25 AM Amit Kapila  wrote:
> >
> > On Tue, Sep 17, 2019 at 5:48 PM James Coleman  wrote:
> > >
> > > On Tue, Sep 17, 2019 at 2:21 AM Amit Kapila  
> > > wrote:
> > > >
> > > >
> > > > Let me know what you think of attached?  I think we can back-patch
> > > > this patch.  What do you think?  Does anyone else have an opinion on
> > > > this patch especially if we see any problem in back-patching this?
> > >
> > > The attached looks great!
> > >
> > > I was working on HEAD for the patch, but this concern has been an
> > > issue for quite a long time. We were running into it on 9.6 in
> > > production, for example. And given how frequently it seems like there
> > > are large-scale production issues related to auto vacuum, I think any
> > > amount of back patching we can do to make that footgun less likely
> > > would be a good thing.
> > >
> >
> > Okay, I will commit this tomorrow unless someone has any comments or 
> > objections.
> >
>
> Pushed with minor changes.  There was one extra space in a few lines
> and the tag for back-branches (from 10~9.4) was slightly different.

I completely forgot to reply to this; thanks Amit for working on this.

James




Re: [DOC] Document concurrent index builds waiting on each other

2020-02-14 Thread James Coleman
On Sun, Sep 29, 2019 at 9:24 PM Michael Paquier  wrote:
>
> On Sat, Sep 28, 2019 at 10:22:28PM -0300, Alvaro Herrera wrote:
> > I always thought that create index concurrently was prevented from
> > running concurrently in a table by the ShareUpdateExclusive lock that's
> > held during the operation.
>
> REINDEX CONCURRENTLY and CIC can deadlock while waiting for each other
> to finish after their validation phase, see:
> https://www.postgresql.org/message-id/20190507030756.gd1...@paquier.xyz
> https://www.postgresql.org/message-id/20190507032543.gh1...@paquier.xyz

Michael,

Thanks for the cross-link. Do you think this would be valuable to
document at the same time? Or did you just want to ensure we were also
aware of this particular pitfall? If the latter, I appreciate it,
it's helpful info. If the former, let me know, and I'll try to update
the patch.

Thanks,
James




Re: [DOC] Document concurrent index builds waiting on each other

2020-02-14 Thread James Coleman
I went ahead and registered this in the current only CF as
https://commitfest.postgresql.org/27/2454/

Alvaro: Would you like to be added as a reviewer?

James




Standards compliance of SET ROLE / SET SESSION AUTHORIZATION

2020-02-14 Thread Tom Lane
[ Starting a new thread about this, since the old one about GUC reporting
is only marginally related to this point ... if it were more so, maybe I'd
have found it when I went looking for it yesterday ]

Robert Haas  writes:
> On Tue, Nov 5, 2019 at 10:02 AM Alvaro Herrera  
> wrote:
>> There's a reason the SQL standard defines SET SESSION AUTHORIZATION but
>> no RESET SESSION AUTHORIZATION: once you enter a security context, you
>> cannot escape it.  ISTM that essentially we broke feature F321 "User
>> authorization" by adding RESET into the mix.  (I think RESET ROLE breaks
>> the spirit of feature T331 too.)  The SQL:2016 standard describes how
>> this is supposed to work in Foundation "4.40.1.1 SQL-session
>> authorization identifiers" (same section is numbered 4.35.1.1 in
>> SQL:2011), and ISTM we made a huge mess of it.
>> 
>> I don't see how to fix it, though.  If we were to adopt the standard's
>> mechanism, we'd probably break tons of existing code.

> It wouldn't be difficult to introduce a new protocol-level option that
> prohibits RESET SESSION AUTHORIZATION; and it would also be possible
> to introduce a new protocol message that has the same effect as RESET
> SESSION AUTHORIZATION. If you do those two things, then it's possible
> to create a sandbox which the end client cannot escape but which the
> pooler can escape easily.

I went looking into the SQL standard to see just what it says about this,
and I'm darned if I see anything supporting Alvaro's argument.  I do not
have SQL:2016 at hand, but in SQL:2011 what I see is that section 4.35.1.1
describes a stack of authorization identifiers and/or roles that controls
the currently-applicable privileges.  It says

Let E be an externally-invoked procedure, SQL-invoked routine,
triggered action, prepared statement, or directly executed
statement. When E is invoked, a copy of the top cell is pushed onto
the authorization stack. If the invocation of E is to be under
definer's rights, then the contents of the top cell are replaced with
the authorization identifier of the owner of E. On completion of the
execution of E, the top cell is removed.
...
The <set session user identifier statement> changes the value of the
current user identifier and of the SQL-session user identifier. The
<set role statement> changes the value of the current role name.
...
The term current authorization identifier denotes an authorization
identifier in the top cell of the authorization stack.

There is nothing anywhere in 4.35 that constrains the allowable
transitions of authorization identifiers.  The only thing I can find on
that point is in the General Rules of 19.2 <set session user identifier statement> (a/k/a SET SESSION AUTHORIZATION), which says:

4) If V is not equal to the current value of the SQL-session user
identifier of the current SQL-session context, then the restrictions
on the permissible values for V are implementation-defined.

5) If the current user identifier and the current role name are
restricted from setting the user identifier to V, then an exception
condition is raised: invalid authorization specification.

So as far as I can see, restrictions on what SET SESSION AUTHORIZATION
can set the authorization ID to are implementation-defined, full stop.
There might be considerable value in the semantics Alvaro suggests,
but I think arguing that the spec requires 'em is just wrong.

On the other hand, the restrictions on SET ROLE in 19.3 are much less
squishy:

3) If <role specification> contains a <value specification>, then:

c) If no role authorization descriptor exists that indicates that
the role identified by V has been granted to either the current
user identifier or to PUBLIC, then an exception condition is
raised: invalid role specification.

d) The SQL-session role name and the current role name are set to
V.

4) If NONE is specified, then the current role name is removed.

As best I can tell, we actually are entirely compliant with that, modulo
the fact that we don't think of the current state as an 
pair.  What you can SET ROLE to is determined by your authorization
identifier, not your current role, and so doing a SET ROLE doesn't change
what you can SET ROLE to later.  The argument that "RESET ROLE" is somehow
invalid seems a little silly when "SET ROLE NONE" does the same thing.
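
For example (with a hypothetical role bob that the session user is a member of):

set role bob;    -- effective role becomes bob
reset role;      -- back to the session authorization
set role none;   -- same effect: the current role is cleared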

What I'm now thinking is that we shouldn't mess with the behavior of
SET ROLE, as I mused about doing yesterday in [1].  It's spec-compliant,
or close enough, so let's leave it be.  On the other hand, changing the
behavior of SET SESSION AUTHORIZATION is not constrained by spec
compliance concerns, only backwards compatibility.  We could address the
pg_dump concerns I had in [1] by tweaking what SET SESSION AUTHORIZATION
can do and then adjusting pg_dump to swap its usage of SET SESSION
AUTHORIZATION (do that just once, in response to --role) and SET ROLE
(do that per-object, to establish ownership).

The only thing stopping us from addressing Alvaro's concern is bac

Re: Error on failed COMMIT

2020-02-14 Thread Dave Cramer
On Fri, 14 Feb 2020 at 15:19, Alvaro Herrera 
wrote:

> On 2020-Feb-14, Dave Cramer wrote:
>
> > I offered to make the behaviour in the JDBC driver dependent on a
> > configuration parameter
>
> Do you mean "if con.commit() results in a rollback, then an exception is
> thrown, unless the parameter XYZ is set to PQR"?
>


No. JDBC has a number of connection parameters which change internal
semantics of various things.

I was proposing to have a connection parameter which would be
throwExceptionOnFailedCommit (or something) which would do what it says.

None of this would touch the server. It would just change the driver
semantics.


Dave Cramer
www.postgres.rocks


Re: Zedstore - compressed in-core columnar storage

2020-02-14 Thread Soumyadeep Chakraborty
Hello,

We (David and I) recently observed that a Zedstore table can be considerably
bloated when we load data into it with concurrent copies. Also, we found
that
concurrent insert performance was less than desirable. This is a detailed
analysis of the extent of the problem, the cause of the problem and how we
fixed it. This has input from much of our team: Alex, Ashwin, David, Heikki,
Melanie, Taylor and myself.

An example of the bloat that we observed:
TPC-DS scale = 270:
Table       heap   zed(serial)   zed(16 parallel COPYs)
web_sales   39G    19G           39G

We found that it was caused by inefficient page splits resulting from
out-of-TID-order inserts into full (or nearly full) attribute tree leaf pages.
The degree of under-utilization was significant - attribute tree leaves from a
serial data load held 6-8x more datums than the attribute tree leaves resulting
from a parallel load with 16 sessions.

Consider the scenario below:

Assumptions:
1. Let us consider two concurrent copy commands executing (sessions S1 and
S2).
2. The table has only one (fixed-length for sake of argument) attribute 'a'.
3. For attribute 'a', a full attribute tree leaf page can accommodate 1500
datums.

TID allocations:
S1: 1-1000
S2: 1001-2000, 2001-3000

Order of operations:

1. S2 writes datums for tids 1001-2000, 2001-3000.
The resulting leaves are:
L1:
  lokey    = 1       hikey   = 2500
  firsttid = 1001    lasttid = 2500
L2:
  lokey    = 2501    hikey   = MaxZSTid
  firsttid = 2501    lasttid = 3000

2. S1 now writes datums for tids 1-1000.
We have to split L1 into L1' and L1''.
L1':
  lokey    = 1       hikey   = 1500
  firsttid = 1       lasttid = 1500
L1'': [under-utilized page]
  lokey    = 1501    hikey   = 2000
  firsttid = 1501    lasttid = 2000
L2:
  lokey    = 2501    hikey   = MaxZSTid
  firsttid = 2501    lasttid = 3000

Note: The lokeys/hikeys reflect ranges of what CAN be inserted whereas
firsttid
and lasttid reflect what actually have been inserted.

L1'' will be an under-utilized page that is not going to be filled again
because
it inherits the tight hikey from L1. In this example, space wastage in L1''
is
66% but it could very easily be close to 100%, especially under concurrent
workloads which mixes single and multi-inserts, or even unequally sized
multi-inserts.

Solution (kudos to Ashwin!):

For every multi-insert (and only multi-insert, not for singleton inserts),
allocate N times more tids. Each session will keep these extra tids in a
buffer. Subsequent calls to multi-insert would use these buffered tids. If
at
any time a tid allocation request cannot be met by the remaining buffered
tids,
a new batch of N times the number of tids requested will again be allocated.

If we take the same example above and say we allocated N=5 times the number of
tids upon the first request for 1000 tids:

TID allocations:
S1: 1-5000
S2: 5001-10000

Order of operations:

1. S2 writes datums for tids 5001-6000, 6001-7000.
The resulting leaves are:
L1:
  lokey    = 1       hikey   = 6500
  firsttid = 5001    lasttid = 6500
L2:
  lokey    = 6501    hikey   = MaxZSTid
  firsttid = 6501    lasttid = 7000

2. S1 writes datums for tids 1-1000.
L1 will be split into L1' and L1''.

L1':
  lokey    = 1       hikey   = 5500
  firsttid = 1       lasttid = 1000
L1'' [under-utilized page]:
  lokey    = 5501    hikey   = 6500
  firsttid = 5501    lasttid = 6500
L2:
  lokey    = 6501    hikey   = MaxZSTid
  firsttid = 6501    lasttid = 7000

Subsequent inserts by S1 will land on L1' whose hikey isn't restrictive.

However, we do end up with the inefficient page L1''. With a high enough value
of N, we reduce the frequency of such pages. We could further reduce this
wastage by incorporating a special left split (since L1 was already full, we
don't change it at all -> we simply update its lokey -> L1 becomes L1'' and we
fork off a new leaf to its left: L1'). This would look like:

L1':
lokey  = 1   hikey   = 5000
firsttid = 1   lasttid  = 1000

L1'':
lokey  = 5001  hikey   = 6500
firsttid = 5001  lasttid  = 6500

We found that with a high enough value of N, we did not get significant
space
benefits from the left split. Thus, we decided to only incorporate N.

Results: [TPC-DS scale = 50, 16 conc copies]

Table            zed    N=10   N=100   N=1000   heap   zed(serial)
catalog_sales    15G    9.1G   7.7G    7.5G     15G    8.0G
catalog_returns  1.5G   0.9G   0.7G    0.7G     1.2G   0.8G
store_returns    2.1G   1.2G   1.1G    1.1G     1.9G   1.2G
store_sales      17G    11G    10.1G   10.1G    21G    10G

Load time:
  N=10      30min
  N=100     10min
  N=1000     7min
  zed      100min
  heap       8min

'zed' refers to the zedstore branch without our fix. We see that with N =
10, we
get closer to what we get with serial inserts. For N = 100, we even be

Re: Error on failed COMMIT

2020-02-14 Thread Alvaro Herrera
On 2020-Feb-14, Dave Cramer wrote:

> I offered to make the behaviour in the JDBC driver dependent on a
> configuration parameter

Do you mean "if con.commit() results in a rollback, then an exception is
thrown, unless the parameter XYZ is set to PQR"?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Error on failed COMMIT

2020-02-14 Thread Dave Cramer
On Fri, 14 Feb 2020 at 15:07, Tom Lane  wrote:

> Dave Cramer  writes:
> > On Fri, 14 Feb 2020 at 14:37, Robert Haas  wrote:
> >> I'm not trying to deny that you might find some other server behavior
> >> more convenient. You might. And, to Vik's original point, it might be
> >> more compliant with the spec, too. But since changing that would have
> >> a pretty big blast radius at this stage, I think it's worth trying to
> >> make things work as well as they can with the server behavior that we
> >> already have. And I don't really see anything preventing the driver
> >> from doing that technically. I don't understand the idea that the
> >> driver is somehow not allowed to notice the command tag.
>
> > We have the same blast radius.
> > I have offered to make the behaviour requested dependent on a
> configuration
> > parameter but apparently this is not sufficient.
>
> Nope, that is absolutely not happening.


I should have been more specific.

I offered to make the behaviour in the JDBC driver dependent on a
configuration parameter

Dave Cramer
www.postgres.rocks

>
>


Re: Error on failed COMMIT

2020-02-14 Thread Tom Lane
Dave Cramer  writes:
> On Fri, 14 Feb 2020 at 14:37, Robert Haas  wrote:
>> I'm not trying to deny that you might find some other server behavior
>> more convenient. You might. And, to Vik's original point, it might be
>> more compliant with the spec, too. But since changing that would have
>> a pretty big blast radius at this stage, I think it's worth trying to
>> make things work as well as they can with the server behavior that we
>> already have. And I don't really see anything preventing the driver
>> from doing that technically. I don't understand the idea that the
>> driver is somehow not allowed to notice the command tag.

> We have the same blast radius.
> I have offered to make the behaviour requested dependent on a configuration
> parameter but apparently this is not sufficient.

Nope, that is absolutely not happening.  We learned very painfully, back
around 7.3 when we tried to put in autocommit on/off, that if server
behaviors like this are configurable then most client code has to be
prepared to work with every possible setting.  The argument that "you can
just set it to work the way you expect" is a dangerous falsehood.  I see
no reason to think that a change like this wouldn't suffer the same sort
of embarrassing and expensive failure that autocommit did.

regards, tom lane




Re: Error on failed COMMIT

2020-02-14 Thread Dave Cramer
On Fri, 14 Feb 2020 at 14:37, Robert Haas  wrote:

> On Fri, Feb 14, 2020 at 2:08 PM Dave Cramer 
> wrote:
> > Well now you are asking the driver to re-interpret the results in a
> different way than the server which is not what we tend to do.
> >
> > The server throws an error we throw an error. We really aren't in the
> business of re-interpreting the servers responses.
>
> I don't really see a reason why the driver has to throw an exception
> if and only if there is an ERROR on the PostgreSQL side. But even if
> you want to make that rule for some reason, it doesn't preclude
> correct behavior here. All you really need is to have con.commit()
> return some indication of what the command tag was, just as, say, psql
> would do. If the server provides you with status information and you
> throw it out instead of passing it along to the application, that's
> not ideal.
>

Well con.commit() returns void :(


>
> Another thing that kinda puzzles me about this situation is that, as
> far as I know, the only time COMMIT returns ROLLBACK is if the
> transaction has already previously reported an ERROR. But if an ERROR
> gets turned into an exception, then, in the code snippet previously
> provided, we'd never reach con.commit() in the first place.
>

 The OP is building a framework where it's possible for the exception to be
swallowed by the consumer of the framework.


> I'm not trying to deny that you might find some other server behavior
> more convenient. You might. And, to Vik's original point, it might be
> more compliant with the spec, too. But since changing that would have
> a pretty big blast radius at this stage, I think it's worth trying to
> make things work as well as they can with the server behavior that we
> already have. And I don't really see anything preventing the driver
> from doing that technically. I don't understand the idea that the
> driver is somehow not allowed to notice the command tag.
>

We have the same blast radius.
I have offered to make the behaviour requested dependent on a configuration
parameter but apparently this is not sufficient.



Dave Cramer
www.postgres.rocks


Re: Error on failed COMMIT

2020-02-14 Thread Robert Haas
On Fri, Feb 14, 2020 at 2:08 PM Dave Cramer  wrote:
> Well now you are asking the driver to re-interpret the results in a different 
> way than the server which is not what we tend to do.
>
> The server throws an error we throw an error. We really aren't in the 
> business of re-interpreting the servers responses.

I don't really see a reason why the driver has to throw an exception
if and only if there is an ERROR on the PostgreSQL side. But even if
you want to make that rule for some reason, it doesn't preclude
correct behavior here. All you really need is to have con.commit()
return some indication of what the command tag was, just as, say, psql
would do. If the server provides you with status information and you
throw it out instead of passing it along to the application, that's
not ideal.

Another thing that kinda puzzles me about this situation is that, as
far as I know, the only time COMMIT returns ROLLBACK is if the
transaction has already previously reported an ERROR. But if an ERROR
gets turned into an exception, then, in the code snippet previously
provided, we'd never reach con.commit() in the first place.
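
For example, the sequence under discussion looks like this in psql:

begin;
select 1/0;   -- ERROR: division by zero; the transaction is now aborted
commit;       -- the server responds with a ROLLBACK command tag, not an error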

I'm not trying to deny that you might find some other server behavior
more convenient. You might. And, to Vik's original point, it might be
more compliant with the spec, too. But since changing that would have
a pretty big blast radius at this stage, I think it's worth trying to
make things work as well as they can with the server behavior that we
already have. And I don't really see anything preventing the driver
from doing that technically. I don't understand the idea that the
driver is somehow not allowed to notice the command tag.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Error on failed COMMIT

2020-02-14 Thread Dave Cramer
On Fri, 14 Feb 2020 at 13:29, Robert Haas  wrote:

> On Fri, Feb 14, 2020 at 1:04 PM Dave Cramer 
> wrote:
> > Thing is that con.commit() DOESN'T return a status code, nor does it
> throw an exception as we silently ROLLBACK here.
>
> Why not? There's nothing keeping the driver from doing either of those
> things, is there? I mean, if using libpq, you can use PQcmdStatus() to
> get the command tag, and find out whether it's COMMIT or ROLLBACK. If
> you're implementing the wire protocol directly, you can do something
> similar.
>
>
> https://www.postgresql.org/docs/current/libpq-exec.html#LIBPQ-EXEC-NONSELECT


Well, now you are asking the driver to re-interpret the results in a
different way than the server does, which is not what we tend to do.

The server throws an error we throw an error. We really aren't in the
business of re-interpreting the servers responses.

Dave Cramer
www.postgres.rocks


Re: assert pg_class.relnatts is consistent

2020-02-14 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Feb-14, John Naylor wrote:
>> One possible objection to what I wrote above is that it adds a
>> different kind of special case, but in a sneaky way. Perhaps it would
>> be more principled to treat it the same as oid after all. If we do
>> that, it would help to add a comment that we can't treat relnatts like
>> pronangs, since we need more information than what's in each pg_class
>> row.

> How about something like this? (untested)

I think John's idea of setting a dummy BKI_DEFAULT value is better,
as that means the only code that has to worry about this directly
is the code that's actually filling in relnatts.  As far as said
code goes, we don't need an additional global variable when we can
just look in the $catalogs data structure; and I'm not a fan of
cramming this into the OID-assignment logic just to save a loop.
So that leads me to the attached.

(I agree with Alvaro's thought of shortening AddDefaultValues,
but didn't do that here.)

regards, tom lane

diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl
index 8032512..ad24f4d 100644
--- a/src/backend/catalog/genbki.pl
+++ b/src/backend/catalog/genbki.pl
@@ -184,6 +184,15 @@ my $PG_CATALOG_NAMESPACE =
 	'PG_CATALOG_NAMESPACE');
 
 
+# Fill in pg_class.relnatts by looking at the referenced catalog's schema.
+# This is ugly but there's no better place; Catalog::AddDefaultValues
+# can't do it, for lack of easy access to the other catalog.
+foreach my $row (@{ $catalog_data{pg_class} })
+{
+	$row->{relnatts} = scalar(@{ $catalogs{ $row->{relname} }->{columns} });
+}
+
+
 # Build lookup tables.
 
 # access method OID lookup
diff --git a/src/include/catalog/pg_class.dat b/src/include/catalog/pg_class.dat
index f70d5ba..b28bba7 100644
--- a/src/include/catalog/pg_class.dat
+++ b/src/include/catalog/pg_class.dat
@@ -14,51 +14,15 @@
 
 # Note: only bootstrap catalogs, ie those marked BKI_BOOTSTRAP, need to
 # have entries here.  Be sure that the OIDs listed here match those given in
-# their CATALOG and BKI_ROWTYPE_OID macros, and that the relnatts values are
-# correct.
-
-# Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
-# similarly, "1" in relminmxid stands for FirstMultiXactId
+# their CATALOG and BKI_ROWTYPE_OID macros.
 
 { oid => '1247',
-  relname => 'pg_type', reltype => 'pg_type', relam => 'heap',
-  relfilenode => '0', relpages => '0', reltuples => '0', relallvisible => '0',
-  reltoastrelid => '0', relhasindex => 'f', relisshared => 'f',
-  relpersistence => 'p', relkind => 'r', relnatts => '31', relchecks => '0',
-  relhasrules => 'f', relhastriggers => 'f', relhassubclass => 'f',
-  relrowsecurity => 'f', relforcerowsecurity => 'f', relispopulated => 't',
-  relreplident => 'n', relispartition => 'f', relfrozenxid => '3',
-  relminmxid => '1', relacl => '_null_', reloptions => '_null_',
-  relpartbound => '_null_' },
+  relname => 'pg_type', reltype => 'pg_type' },
 { oid => '1249',
-  relname => 'pg_attribute', reltype => 'pg_attribute', relam => 'heap',
-  relfilenode => '0', relpages => '0', reltuples => '0', relallvisible => '0',
-  reltoastrelid => '0', relhasindex => 'f', relisshared => 'f',
-  relpersistence => 'p', relkind => 'r', relnatts => '25', relchecks => '0',
-  relhasrules => 'f', relhastriggers => 'f', relhassubclass => 'f',
-  relrowsecurity => 'f', relforcerowsecurity => 'f', relispopulated => 't',
-  relreplident => 'n', relispartition => 'f', relfrozenxid => '3',
-  relminmxid => '1', relacl => '_null_', reloptions => '_null_',
-  relpartbound => '_null_' },
+  relname => 'pg_attribute', reltype => 'pg_attribute' },
 { oid => '1255',
-  relname => 'pg_proc', reltype => 'pg_proc', relam => 'heap',
-  relfilenode => '0', relpages => '0', reltuples => '0', relallvisible => '0',
-  reltoastrelid => '0', relhasindex => 'f', relisshared => 'f',
-  relpersistence => 'p', relkind => 'r', relnatts => '29', relchecks => '0',
-  relhasrules => 'f', relhastriggers => 'f', relhassubclass => 'f',
-  relrowsecurity => 'f', relforcerowsecurity => 'f', relispopulated => 't',
-  relreplident => 'n', relispartition => 'f', relfrozenxid => '3',
-  relminmxid => '1', relacl => '_null_', reloptions => '_null_',
-  relpartbound => '_null_' },
+  relname => 'pg_proc', reltype => 'pg_proc' },
 { oid => '1259',
-  relname => 'pg_class', reltype => 'pg_class', relam => 'heap',
-  relfilenode => '0', relpages => '0', reltuples => '0', relallvisible => '0',
-  reltoastrelid => '0', relhasindex => 'f', relisshared => 'f',
-  relpersistence => 'p', relkind => 'r', relnatts => '33', relchecks => '0',
-  relhasrules => 'f', relhastriggers => 'f', relhassubclass => 'f',
-  relrowsecurity => 'f', relforcerowsecurity => 'f', relispopulated => 't',
-  relreplident => 'n', relispartition => 'f', relfrozenxid => '3',
-  relminmxid => '1', relacl => '_null_', reloptions => '_null_',
-  relpartbound => '_null_' },
+  relname => 'pg_class', reltype => 'pg_class' },

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-14 Thread Robert Haas
On Fri, Feb 14, 2020 at 1:07 PM Andres Freund  wrote:
> Yea, that seems possible.  I'm not really sure it's needed however? As
> long as you're not teaching the locking mechanism new tricks that
> influence the wait graph, why would the deadlock detector care? That's
> quite different from the group locking case, where you explicitly needed
> to teach it something fairly fundamental.

Well, you have to teach it that locks of certain types conflict even
if they are in the same group, and that bleeds over pretty quickly
into the whole area of deadlock detection, because lock waits are the
edges in the graph that the deadlock detector processes.

> It might still be a good idea independently to add the rule & enforce
> that acquiring heavyweight locks while holding certain classes of locks is
> not allowed.

I think that's absolutely essential, if we're going to continue using
the main lock manager for this. I remain somewhat unconvinced that
doing so is the best way forward, but it is *a* way forward.

> Right. For me that's *the* fundamental service that lock.c delivers. And
> it's the fundamental bit this thread so far largely has been focusing
> on.

For me, the deadlock detection is the far more complicated and problematic bit.

> Isn't that mostly true to varying degrees for the majority of lock types
> in lock.c? Sure, perhaps historically that's a misuse of lock.c, but
> it's been pretty ingrained by now.  I just don't see where leaving out
> any of these features is going to give us fundamental advantages
> justifying a different locking infrastructure.

I think the group locking + deadlock detection things are more
fundamental than you might be crediting, but I agree that having
parallel mechanisms has its own set of pitfalls.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Error on failed COMMIT

2020-02-14 Thread Robert Haas
On Fri, Feb 14, 2020 at 1:04 PM Dave Cramer  wrote:
> Thing is that con.commit() DOESN'T return a status code, nor does it throw an 
> exception as we silently ROLLBACK here.

Why not? There's nothing keeping the driver from doing either of those
things, is there? I mean, if using libpq, you can use PQcmdStatus() to
get the command tag, and find out whether it's COMMIT or ROLLBACK. If
you're implementing the wire protocol directly, you can do something
similar.

https://www.postgresql.org/docs/current/libpq-exec.html#LIBPQ-EXEC-NONSELECT
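
For illustration only (no driver is obliged to do it this way), a minimal
libpq-level check might look like:

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

/* Detect a COMMIT that the server silently turned into a ROLLBACK. */
static int
commit_or_report(PGconn *conn)
{
    PGresult   *res = PQexec(conn, "COMMIT");
    int         ok = 0;

    if (PQresultStatus(res) == PGRES_COMMAND_OK)
    {
        /* PQcmdStatus() gives the command tag, e.g. "COMMIT" or "ROLLBACK" */
        if (strcmp(PQcmdStatus(res), "ROLLBACK") == 0)
            fprintf(stderr, "server rolled back instead of committing\n");
        else
            ok = 1;
    }
    else
        fprintf(stderr, "COMMIT failed: %s", PQerrorMessage(conn));

    PQclear(res);
    return ok;
}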

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-14 Thread Andres Freund
Hi,

On 2020-02-14 12:08:45 -0500, Robert Haas wrote:
> On Fri, Feb 14, 2020 at 11:40 AM Andres Freund  wrote:
> > IMO the right thing here is to extend lock.c so we can better represent
> > whether certain types of lockmethods (& levels ?) are [not] to be
> > shared.
> 
> The part that I find awkward about that is the whole thing with the
> deadlock detector. The deadlock detection code is old, crufty,
> complex, and very difficult to test (or at least I have found it so).
> A bug that I introduced when inventing group locking took like 5 years
> for somebody to find.

Oh, I agree, lock.c and surrounding code is pretty crufty. Doubtful that
just building up a largely parallel piece of infrastructure next to it
is a good answer though.


> One way of looking at the requirement that we have here is that
> certain kinds of locks need to be exempted from group locking.
> Basically, these are because they are a lower-level concept: a lock on
> a relation is more of a "logical" concept, and you hold the lock until
> eoxact, whereas a lock taken to extend the relation is more of a
> "physical" concept, and you give it up as soon as you are done. Page
> locks are like relation extension locks in this regard. Unlike locks
> on SQL-level objects, these should not be shared between members of a
> lock group.
> 
> Now, if it weren't for the deadlock detector, that would be easy
> enough. But figuring out what to do with the deadlock detector seems
> really painful to me. I wonder if there's some way we can make an end
> run around that problem. For instance, if we could make (and enforce)
> a coding rule that you cannot acquire a heavyweight lock while holding
> a relation extension or page lock, then maybe we could somehow teach
> the deadlock detector to just ignore those kinds of locks, and teach
> the lock acquisition machinery that they conflict between lock group
> members.

Yea, that seems possible.  I'm not really sure it's needed however? As
long as you're not teaching the locking mechanism new tricks that
influence the wait graph, why would the deadlock detector care? That's
quite different from the group locking case, where you explicitly needed
to teach it something fairly fundamental.

It might still be a good idea independently to add the rule & enforce
that acquiring heavyweight locks while holding certain classes of locks is
not allowed.


> On the other hand, I think you might also be understating the
> differences between these kinds of locks and other heavyweight locks.
> I suspect that the reason why we use lwlocks for buffers and
> heavyweight locks here is because there are a conceptually infinite
> number of relations, and lwlocks can't handle that.

Right. For me that's *the* fundamental service that lock.c delivers. And
it's the fundamental bit this thread so far largely has been focusing
on.


> The only mechanism we currently have that does handle that is the
> heavyweight lock mechanism, and from my point of view, somebody just
> beat it with a stick to make it fit this application. But the fact
> that it has been made to fit does not mean that it is really fit for
> purpose. We use 2 of 9 lock levels, we don't need deadlock detection,
> we need different behavior when group locking is in use, we release
> locks right away rather than at eoxact. I don't think it's crazy to
> think that those differences are significant enough to justify having
> a separate mechanism, even if the one that is currently on the table
> is not exactly what we want.

Isn't that mostly true to varying degrees for the majority of lock types
in lock.c? Sure, perhaps historically that's a misuse of lock.c, but
it's been pretty ingrained by now.  I just don't see where leaving out
any of these features is going to give us fundamental advantages
justifying a different locking infrastructure.

E.g. not needing to support "conceptually infinite" number of relations
IMO does provide a fundamental advantage - no need for a mapping.  I'm
not yet seeing anything equivalent for the extension vs. lock.c style
lock case.

Greetings,

Andres Freund




Re: Error on failed COMMIT

2020-02-14 Thread Dave Cramer
On Fri, 14 Feb 2020 at 12:37, Robert Haas  wrote:

> On Thu, Feb 13, 2020 at 2:38 AM Haumacher, Bernhard 
> wrote:
> > Assume the application is written in Java and sees Postgres through the
> > JDBC driver:
> >
> > composeTransaction() {
> > Connection con = getConnection(); // implicitly "begin"
> > try {
> >insertFrameworkLevelState(con);
> >insertApplicationLevelState(con);
> >con.commit();
> >publishNewState();
> > } catch (Throwable ex) {
> >con.rollback();
> > }
> > }
> >
> > With the current implementation, it is possible, that the control flow
> > reaches "publishNewState()" without the changes done in
> > "insertFrameworkLevelState()" have been made persistent - without the
> > framework-level code (which is everything except
> > "insertApplicationLevelState()") being able to detect the problem (e.g.
> > if "insertApplicationLevelState()" tries add a null into a non-null
> > column catching the exception or any other application-level error that
> > is not properly handled through safepoints).
> >
> >  From a framework's perspective, this behavior is absolutely
> > unacceptable. Here, the framework-level code sees a database that
> > commits successfully but does not make its changes persistent.
> > Therefore, I don't think that altering this behavior has more downside
> > than upside.
>
> I am not sure that this example really proves anything. If
> insertFrameworkLevelState(con), insertApplicationLevelState(con), and
> con.commit() throw exceptions, or if they return a status code and you
> check it and throw an exception if it's not what you expect, then it
> will work.


Thing is that con.commit() DOESN'T return a status code, nor does it throw
an exception as we silently ROLLBACK here.

As noted upthread it's somewhat worse, as depending on how the transaction
failed we seem to do different things.

In Tom's example we do issue a warning and say there is no transaction
running. I would guess we silently rolled back earlier.
In my example we don't issue the warning, we just roll back.

I do agree with Tom that changing this behaviour at this point would make
things worse for more people than it would help so I am not advocating
throwing an error here.

I would however advocate for consistently doing the same thing with failed
transactions

Dave Cramer

www.postgres.rocks


Re: Marking some contrib modules as trusted extensions

2020-02-14 Thread Tom Lane
Andres Freund  writes:
> On 2020-02-13 18:57:10 -0500, Tom Lane wrote:
>> Maybe we could decide that the time for supporting easy updates from
>> pre-9.1 is past, and just remove all the unpackaged-to-XXX scripts?
>> Maybe even remove the "FROM version" option altogether.

> Yea, that strikes me as a reasonable thing to do. These days that just
> seems to be dangerous, without much advantage.

Here's a patch to remove the core-code support and documentation for
that.  I have not included the actual deletion of the contrib modules'
'unpackaged' scripts, as that seems both long and boring.

regards, tom lane

diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 08bb110..261a559 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -88,22 +88,6 @@ CREATE EXTENSION module_name;
  
 
  
-  If your database was brought forward by dump and reload from a pre-9.1
-  version of PostgreSQL, and you had been using the pre-9.1
-  version of the module in it, you should instead do
-
-
-CREATE EXTENSION module_name FROM unpackaged;
-
-
-  This will update the pre-9.1 objects of the module into a proper
-  extension object.  Future updates to the module will be
-  managed by .
-  For more information about extension updates, see
-  .
- 
-
- 
   Note, however, that some of these modules are not extensions
   in this sense, but are loaded into the server in some other way, for instance
   by way of
diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml
index ffe068b..9ec1af7 100644
--- a/doc/src/sgml/extend.sgml
+++ b/doc/src/sgml/extend.sgml
@@ -917,33 +917,6 @@ SELECT pg_catalog.pg_extension_config_dump('my_config', 'WHERE NOT standard_entr
 
 
 
- The update mechanism can be used to solve an important special case:
- converting a loose collection of objects into an extension.
- Before the extension mechanism was added to
- PostgreSQL (in 9.1), many people wrote
- extension modules that simply created assorted unpackaged objects.
- Given an existing database containing such objects, how can we convert
- the objects into a properly packaged extension?  Dropping them and then
- doing a plain CREATE EXTENSION is one way, but it's not
- desirable if the objects have dependencies (for example, if there are
- table columns of a data type created by the extension).  The way to fix
- this situation is to create an empty extension, then use ALTER
- EXTENSION ADD to attach each pre-existing object to the extension,
- then finally create any new objects that are in the current extension
- version but were not in the unpackaged release.  CREATE
- EXTENSION supports this case with its FROM old_version option, which causes it to not run the
- normal installation script for the target version, but instead the update
- script named
- extension--old_version--target_version.sql.
- The choice of the dummy version name to use as old_version is up to the extension author, though
- unpackaged is a common convention.  If you have multiple
- prior versions you need to be able to update into extension style, use
- multiple dummy version names to identify them.
-
-
-
  ALTER EXTENSION is able to execute sequences of update
  script files to achieve a requested update.  For example, if only
  foo--1.0--1.1.sql and foo--1.1--2.0.sql are
diff --git a/doc/src/sgml/ref/create_extension.sgml b/doc/src/sgml/ref/create_extension.sgml
index d76ac3e..6a21bff 100644
--- a/doc/src/sgml/ref/create_extension.sgml
+++ b/doc/src/sgml/ref/create_extension.sgml
@@ -24,7 +24,6 @@ PostgreSQL documentation
 CREATE EXTENSION [ IF NOT EXISTS ] extension_name
 [ WITH ] [ SCHEMA schema_name ]
  [ VERSION version ]
- [ FROM old_version ]
  [ CASCADE ]
 
  
@@ -48,8 +47,9 @@ CREATE EXTENSION [ IF NOT EXISTS ] extension_name
 
   
The user who runs CREATE EXTENSION becomes the
-   owner of the extension for purposes of later privilege checks, as well
-   as the owner of any objects created by the extension's script.
+   owner of the extension for purposes of later privilege checks, and
+   normally also becomes the owner of any objects created by the
+   extension's script.
   
 
   
@@ -142,33 +142,6 @@ CREATE EXTENSION [ IF NOT EXISTS ] extension_name
  
 
  
-  old_version
-  
-   
-FROM old_version
-must be specified when, and only when, you are attempting to install
-an extension that replaces an old style module that is just
-a collection of objects not packaged into an extension.  This option
-causes CREATE EXTENSION to run an alternative installation
-script that absorbs the existing objects into the extension, instead
-of creating new objects.  Be careful that SCHEMA specifies
-the schema containing these pre-existing objects.
-   
-
-   
-  

Re: Error on failed COMMIT

2020-02-14 Thread Robert Haas
On Thu, Feb 13, 2020 at 2:38 AM Haumacher, Bernhard  wrote:
> Assume the application is written in Java and sees Postgres through the
> JDBC driver:
>
> composeTransaction() {
> Connection con = getConnection(); // implicitly "begin"
> try {
>insertFrameworkLevelState(con);
>insertApplicationLevelState(con);
>con.commit();
>publishNewState();
> } catch (Throwable ex) {
>con.rollback();
> }
> }
>
> With the current implementation, it is possible, that the control flow
> reaches "publishNewState()" without the changes done in
> "insertFrameworkLevelState()" have been made persistent - without the
> framework-level code (which is everything except
> "insertApplicationLevelState()") being able to detect the problem (e.g.
> if "insertApplicationLevelState()" tries add a null into a non-null
> column catching the exception or any other application-level error that
> is not properly handled through safepoints).
>
>  From a framework's perspective, this behavior is absolutely
> unacceptable. Here, the framework-level code sees a database that
> commits successfully but does not make its changes persistent.
> Therefore, I don't think that altering this behavior has more downside
> than upside.

I am not sure that this example really proves anything. If
insertFrameworkLevelState(con), insertApplicationLevelState(con), and
con.commit() throw exceptions, or if they return a status code and you
check it and throw an exception if it's not what you expect, then it
will work. If database errors that occur during those operations are
ignored, then you've got a problem, but it does not seem necessary to
change the behavior of the database to fix that problem.

I think one of the larger issues in this area is that people have
script that go:

BEGIN;
-- do stuff
COMMIT;
BEGIN;
-- do more stuff
COMMIT;

...and they run these scripts by piping them into psql. Now, if the
COMMIT leaves the session in a transaction state, this is going to
have pretty random behavior. Like your example, this can be fixed by
having proper error checking in the application, but that does require
that your application is something more than a psql script.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-14 Thread Robert Haas
On Fri, Feb 14, 2020 at 11:40 AM Andres Freund  wrote:
> IMO the right thing here is to extend lock.c so we can better represent
> whether certain types of lockmethods (& levels ?) are [not] to be
> shared.

The part that I find awkward about that is the whole thing with the
deadlock detector. The deadlock detection code is old, crufty,
complex, and very difficult to test (or at least I have found it so).
A bug that I introduced when inventing group locking took like 5 years
for somebody to find.

One way of looking at the requirement that we have here is that
certain kinds of locks need to be exempted from group locking.
Basically, these are because they are a lower-level concept: a lock on
a relation is more of a "logical" concept, and you hold the lock until
eoxact, whereas a lock taken to extend the relation is more of a
"physical" concept, and you give it up as soon as you are done. Page
locks are like relation extension locks in this regard. Unlike locks
on SQL-level objects, these should not be shared between members of a
lock group.

Now, if it weren't for the deadlock detector, that would be easy
enough. But figuring out what to do with the deadlock detector seems
really painful to me. I wonder if there's some way we can make an end
run around that problem. For instance, if we could make (and enforce)
a coding rule that you cannot acquire a heavyweight lock while holding
a relation extension or page lock, then maybe we could somehow teach
the deadlock detector to just ignore those kinds of locks, and teach
the lock acquisition machinery that they conflict between lock group
members.
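
To make that concrete, here is a purely hypothetical sketch of such a rule;
none of these names exist in lock.c, it only shows the shape of the
bookkeeping that would let the deadlock detector ignore these lock types:

#include <assert.h>

static int  nLowLevelLocksHeld = 0;  /* relation extension + page locks */

static void
NoteLowLevelLockAcquired(void)
{
    nLowLevelLocksHeld++;
}

static void
NoteLowLevelLockReleased(void)
{
    assert(nLowLevelLocksHeld > 0);
    nLowLevelLocksHeld--;
}

static void
AssertNoLowLevelLockHeld(void)
{
    /* would run at the top of the ordinary heavyweight-lock acquisition path */
    assert(nLowLevelLocksHeld == 0);
}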

On the other hand, I think you might also be understating the
differences between these kinds of locks and other heavyweight locks.
I suspect that the reason why we use lwlocks for buffers and
heavyweight locks here is because there are a conceptually infinite
number of relations, and lwlocks can't handle that. The only mechanism
we currently have that does handle that is the heavyweight lock
mechanism, and from my point of view, somebody just beat it with a
stick to make it fit this application. But the fact that it has been
made to fit does not mean that it is really fit for purpose. We use 2
of 9 lock levels, we don't need deadlock detection, we need different
behavior when group locking is in use, we release locks right away
rather than at eoxact. I don't think it's crazy to think that those
differences are significant enough to justify having a separate
mechanism, even if the one that is currently on the table is not
exactly what we want.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-14 Thread Andres Freund
Hi,

On 2020-02-14 09:42:40 -0500, Tom Lane wrote:
> In the second place, it's ludicrous to expect that the underlying
> platform/filesystem can support an infinite number of concurrent
> file-extension operations.  At some level (e.g. where disk blocks
> are handed out, or where a record of the operation is written to
> a filesystem journal) it's quite likely that things are bottlenecked
> down to *one* such operation at a time per filesystem.

That's probably true to some degree from a theoretical POV, but I think
it's so far from where we are at, that it's effectively wrong. I can
concurrently extend a few files at close to 10GB/s on a set of fast
devices below a *single* filesystem. Whereas postgres bottlenecks far
far before this.  Given that a lot of today's storage has latencies in
the 10-100s of microseconds, a journal flush doesn't necessarily cause
that much serialization - and OS journals do group commit like things
too.

Greetings,

Andres Freund




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-14 Thread Andres Freund
Hi,

On 2020-02-12 11:53:49 -0500, Tom Lane wrote:
> In general, if we think there are issues with LWLock, it seems to me
> we'd be better off to try to fix them, not to invent a whole new
> single-purpose lock manager that we'll have to debug and maintain.

My impression is that what's being discussed here is doing exactly that,
except with s/lwlock/heavyweight locks/. We're basically replacing the
lock.c lock mapping table with an ad-hoc implementation, and now we're
also reinventing interruptibility etc.

I still find the performance arguments pretty ludicrous, to be honest -
I think the numbers I posted about how much time we spend with the locks
held, back that up.  I have a bit more understanding for the parallel
worker arguments, but only a bit:

I think if we develop a custom solution for the extension lock, we're
just going to end up having to develop another custom solution for a
bunch of other types of locks.  It seems quite likely that we'll end up
also wanting TUPLE and also SPECULATIVE and PAGE type locks that we
don't want to share between leader & workers.

IMO the right thing here is to extend lock.c so we can better represent
whether certain types of lockmethods (& levels ?) are [not] to be
shared.

Greetings,

Andres Freund




Re: allow frontend use of the backend's core hashing functions

2020-02-14 Thread Mark Dilger



> On Feb 14, 2020, at 8:29 AM, David Fetter  wrote:
> 
> On Fri, Feb 14, 2020 at 08:16:47AM -0800, Mark Dilger wrote:
>>> On Feb 14, 2020, at 8:15 AM, David Fetter  wrote:
>>> 
>>> On Fri, Feb 14, 2020 at 10:33:04AM -0500, Robert Haas wrote:
 On Thu, Feb 13, 2020 at 11:26 AM Mark Dilger
  wrote:
> I have made these changes and rebased Robert’s patches but
> otherwise changed nothing.  Here they are:
 
 Thanks. Anyone else have comments? I think this is pretty
 straightforward and unobjectionable work so I'm inclined to press
 forward with committing it fairly soon, but if someone feels
 otherwise, please speak up.
>>> 
>>> One question. It might be possible to make these functions faster
>>> using compiler intrinsics. Would those still be available to front-end
>>> code?
>> 
>> Do you have a specific proposal that would preserve on-disk compatibility?
> 
> I hadn't planned on changing the representation, just cutting
> instructions out of the calculation of same.

Ok, I misunderstood.

I thought the question was about using compiler intrinsics to implement an 
altogether different hashing algorithm than the one currently in use and
whether exposing the hash function to frontend code would lock down the
algorithm in a way that would make it harder to change in the future.  That
led me to the question of whether we had sufficient flexibility to entertain
changing the hashing algorithm anyway.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: allow frontend use of the backend's core hashing functions

2020-02-14 Thread David Fetter
On Fri, Feb 14, 2020 at 08:16:47AM -0800, Mark Dilger wrote:
> > On Feb 14, 2020, at 8:15 AM, David Fetter  wrote:
> > 
> > On Fri, Feb 14, 2020 at 10:33:04AM -0500, Robert Haas wrote:
> >> On Thu, Feb 13, 2020 at 11:26 AM Mark Dilger
> >>  wrote:
> >>> I have made these changes and rebased Robert’s patches but
> >>> otherwise changed nothing.  Here they are:
> >> 
> >> Thanks. Anyone else have comments? I think this is pretty
> >> straightforward and unobjectionable work so I'm inclined to press
> >> forward with committing it fairly soon, but if someone feels
> >> otherwise, please speak up.
> > 
> > One question. It might be possible to make these functions faster
> > using compiler intrinsics. Would those still be available to front-end
> > code?
> 
> Do you have a specific proposal that would preserve on-disk compatibility?

I hadn't planned on changing the representation, just cutting
instructions out of the calculation of same.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: allow frontend use of the backend's core hashing functions

2020-02-14 Thread Robert Haas
On Fri, Feb 14, 2020 at 11:15 AM David Fetter  wrote:
> One question. It might be possible to make these functions faster
> using compiler intrinsics. Would those still be available to front-end
> code?

Why not? We use the same compiler for frontend code as we do for
backend code. Possibly you might need some more header-file
rejiggering someplace, but I don't know of anything specific or see
any reason why it would be particularly painful if we had to do
something.
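
As a rough, purely illustrative sketch (not a proposal to change any of the
hash algorithms), an intrinsic can be guarded so the same translation unit
builds in both frontend and backend, with a portable fallback:

#include <stdint.h>
#if defined(__SSE4_2__)
#include <nmmintrin.h>
#endif

static inline uint32_t
mix_step(uint32_t acc, uint32_t value)
{
#if defined(__SSE4_2__)
    return _mm_crc32_u32(acc, value);   /* hardware CRC32C instruction */
#else
    acc ^= value;                       /* arbitrary portable fallback */
    acc *= 0x85ebca6b;
    return acc ^ (acc >> 13);
#endif
}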

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-14 Thread Robert Haas
On Fri, Feb 14, 2020 at 10:43 AM Tom Lane  wrote:
> I remain unconvinced ... wouldn't both of those claims apply to any disk
> I/O request?  Are we going to try to ensure that no I/O ever happens
> while holding an LWLock, and if so how?  (Again, CheckpointLock is a
> counterexample, which has been that way for decades without reported
> problems.  But actually I think buffer I/O locks are an even more
> direct counterexample.)

Yes, that's a problem. I proposed a patch a few years ago that
replaced the buffer I/O locks with condition variables, and I think
that's a good idea for lots of reasons, including this one. I never
quite got around to pushing that through to commit, but I think we
should do that. Aside from fixing this problem, it also prevents
certain scenarios where we can currently busy-loop.
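
For illustration, the shape of that approach might be something like the
sketch below; the struct, field and wait-event names are invented here, not
the actual bufmgr definitions, and the real code would check the flag under
the buffer header lock:

#include "postgres.h"
#include "pgstat.h"
#include "storage/condition_variable.h"

typedef struct BufferIOState
{
    bool        io_in_progress;
    ConditionVariable io_cv;
} BufferIOState;

static void
WaitForBufferIO(BufferIOState *buf)
{
    ConditionVariablePrepareToSleep(&buf->io_cv);
    while (buf->io_in_progress)
        ConditionVariableSleep(&buf->io_cv, PG_WAIT_BUFFER_PIN); /* placeholder event */
    ConditionVariableCancelSleep();
}

static void
FinishBufferIO(BufferIOState *buf)
{
    buf->io_in_progress = false;
    ConditionVariableBroadcast(&buf->io_cv);    /* wake all waiters: no lost wakeups, no busy-loop */
}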

I do realize that we're unlikely to ever solve this problem
completely, but I don't think that should discourage us from making
incremental progress. Just as debuggability is a sticking point for
you, what I'm going to call operate-ability is a sticking point for
me. My work here at EnterpriseDB exposes me on a fairly regular basis
to real broken systems, and I'm therefore really sensitive to the
concerns that people have when trying to recover after a system has
become, for one reason or another, really broken.

Interruptibility may not be the #1 concern in that area, but it's very
high on the list. EnterpriseDB customers, as a rule, *really* hate
being told to restart the database because one session is stuck. It
causes a lot of disruption for them and the person who does the
restart gets yelled at by their boss, and maybe their boss's boss and
the boss above that. It means that their whole application, which may
be mission-critical, is down until the database finishes restarting,
and that is not always a quick process, especially after an immediate
shutdown. I don't think we can ever make everything that can get stuck
interruptible, but the more we can do the better.

The work you and others have done over the years to add
CHECK_FOR_INTERRUPTS() to more places pays real dividends. Making
sessions that are blocked on disk I/O interruptible in at least some
of the more common cases would be a huge win. Other people may well
have different experiences, but my experience is that the disk
deciding to conk out for a while or just respond very very slowly is a
very common problem even (and sometimes especially) on very expensive
hardware. Obviously that's not great and you're in lots of trouble,
but being able to hit ^C and get control back significantly improves
your chances of being able to understand what has happened and recover
from it.

> > That's an interesting idea, but it doesn't make the lock acquisition
> > itself interruptible, which seems pretty important to me in this case.
>
> Good point: if you think the contained operation might run too long to
> suit you, then you don't want other backends to be stuck behind it for
> the same amount of time.

Right.

> > I wonder if we could have an LWLockAcquireInterruptibly() or some such
> > that allows the lock acquisition itself to be interruptible. I think
> > that would require some rejiggering but it might be doable.
>
> Yeah, I had the impression from a brief look at LWLockAcquire that
> it was itself depending on not throwing errors partway through.
> But with careful and perhaps-a-shade-slower coding, we could probably
> make a version that didn't require that.

Yeah, that was my thought, too, but I didn't study it that carefully,
so somebody would need to do that.
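
Just to illustrate the idea (this is a crude polling approximation, not how a
real implementation would work; the actual change would rework the wait loop
inside lwlock.c rather than spin like this):

#include "postgres.h"
#include "miscadmin.h"
#include "storage/lwlock.h"

static void
LWLockAcquireInterruptibly(LWLock *lock, LWLockMode mode)
{
    while (!LWLockConditionalAcquire(lock, mode))
    {
        CHECK_FOR_INTERRUPTS();   /* query cancel / die() can escape here */
        pg_usleep(1000L);         /* placeholder backoff instead of sleeping on the lock's semaphore */
    }
}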

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: allow frontend use of the backend's core hashing functions

2020-02-14 Thread Mark Dilger



> On Feb 14, 2020, at 8:15 AM, David Fetter  wrote:
> 
> On Fri, Feb 14, 2020 at 10:33:04AM -0500, Robert Haas wrote:
>> On Thu, Feb 13, 2020 at 11:26 AM Mark Dilger
>>  wrote:
>>> I have made these changes and rebased Robert’s patches but
>>> otherwise changed nothing.  Here they are:
>> 
>> Thanks. Anyone else have comments? I think this is pretty
>> straightforward and unobjectionable work so I'm inclined to press
>> forward with committing it fairly soon, but if someone feels
>> otherwise, please speak up.
> 
> One question. It might be possible to make these functions faster
> using compiler intrinsics. Would those still be available to front-end
> code?

Do you have a specific proposal that would preserve on-disk compatibility?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: allow frontend use of the backend's core hashing functions

2020-02-14 Thread David Fetter
On Fri, Feb 14, 2020 at 10:33:04AM -0500, Robert Haas wrote:
> On Thu, Feb 13, 2020 at 11:26 AM Mark Dilger
>  wrote:
> > I have made these changes and rebased Robert’s patches but
> > otherwise changed nothing.  Here they are:
> 
> Thanks. Anyone else have comments? I think this is pretty
> straightforward and unobjectionable work so I'm inclined to press
> forward with committing it fairly soon, but if someone feels
> otherwise, please speak up.

One question. It might be possible to make these functions faster
using compiler intrinsics. Would those still be available to front-end
code?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Internal key management system

2020-02-14 Thread Robert Haas
On Thu, Feb 6, 2020 at 9:19 PM Masahiko Sawada
 wrote:
> This feature protects data from disk thefts but cannot protect data
> from attackers who are able to access PostgreSQL server. In this
> design application side still is responsible for managing the wrapped
> secret in order to protect it from attackers. This is the same as when
> we use pgcrypto now. The difference is that data is safe even if
> attackers steal the wrapped secret and the disk. The data cannot be
> decrypted either without the passphrase which can be stored to other
> key management systems or without accessing postgres server. IOW for
> example, attackers can get the data if they get the wrapped secret
> managed by application side then run pg_kmgr_unwrap() to get the
> secret and then steal the disk.

If you only care about protecting against the theft of the disk, you
might as well just encrypt the whole filesystem, which will probably
perform better and probably be a lot harder to break since you won't
have short encrypted strings but instead large encrypted blocks of
data. Moreover, I think a lot of people who are interested in these
kinds of features are hoping for more than just protecting against the
theft of the disk. While some people may be hoping for too much in
this area, setting your sights only on encryption at rest seems like a
fairly low bar.

It also doesn't seem very likely to actually provide any security.
You're talking about sending the encryption key in the query string,
which means that there's a good chance it's going to end up in a log
file someplace. One way that could happen is if the user has
configured log_statement=all or log_min_duration_statement, but it
could also happen any time the query throws an error. In theory, you
might arrange for the log messages to be sent to another server that
is protected by separate layers of security, but a lot of people are
going to just log locally. And, even if you do have a separate server,
do you really want to have the logfile over there be full of
passwords? I know I can be awfully negative some times, but that it
seems like a weakness so serious as to make this whole thing
effectively useless.

One way to plug this hole is to use new protocol messages for key
exchanges. For example, suppose that after authentication is complete,
you can send the server a new protocol message: KeyPassphrase
 . The server stores the passphrase in
backend-private memory and returns ReadyForQuery, and does not log the
message payload anywhere. Now you do this:

INSERT INTO tbl VALUES (pg_encrypt('user data', 'key-name'));
SELECT pg_decrypt(secret_column, 'key-name') FROM tbl;

If the passphrase for the named key has not been loaded into the
current session's memory, this produces an error; otherwise, it looks
up the passphrase and uses it to do the decryption. Now the passphrase
never gets logged anywhere, and, also, the user can't persuade the
server to provide it with the encryption key, because there's no
SQL-level function to access that data.

We could take it a step further: suppose that encryption is a column
property, and the value of the property is a key name. If the user
hasn't sent a KeyPassphrase message with the relevant key name,
attempts to access that column just error out. If they have, then the
server just does the encryption and decryption automatically.  Now the
user can just do:

INSERT INTO tbl VALUES ('user data');
SELECT secret_column FROM tbl;

It's a huge benefit if the SQL doesn't need to be changed. All that an
application needs to do in order to use encryption in this scenario is
use PQsetKeyPassphrase() or whatever before doing whatever else they
want to do.
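
Purely as a sketch of that hypothetical API (neither the function nor the
protocol message exists today), the client side could be as small as:

/* hypothetical call; the name follows the suggestion above */
if (!PQsetKeyPassphrase(conn, "key-name", passphrase))
{
    fprintf(stderr, "could not register passphrase: %s", PQerrorMessage(conn));
    exit(1);
}
/* from here on, plain INSERT/SELECT against the encrypted column */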

Even with these changes, the security of this whole approach can be
criticized on the basis that a good amount of information about the
data can be inferred without decrypting anything. You can tell which
encrypted values are long and which are short. If someone builds an
index on the column, you can tell the order of all the encrypted
values even though you may not know what the actual values are. Those
could well be meaningful information leaks, but I think such a system
might still be of use for certain purposes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-14 Thread Tom Lane
Robert Haas  writes:
> On Wed, Feb 12, 2020 at 11:53 AM Tom Lane  wrote:
>> * I see no reason to think that a relation extension lock would ever
>> be held long enough for noninterruptibility to be a real issue.  Our
>> expectations for query cancel response time are in the tens to
>> hundreds of msec anyway.

> I don't agree, because (1) the time to perform a relation extension on
> a busy system can be far longer than that and (2) if the disk is
> failing, then it can be *really* long, or indefinite.

I remain unconvinced ... wouldn't both of those claims apply to any disk
I/O request?  Are we going to try to ensure that no I/O ever happens
while holding an LWLock, and if so how?  (Again, CheckpointLock is a
counterexample, which has been that way for decades without reported
problems.  But actually I think buffer I/O locks are an even more
direct counterexample.)

>> * There are other places where an LWLock can be held for a *long* time,
>> notably the CheckpointLock.  If we do think this is an issue, we could
>> devise a way to not insist on noninterruptibility.  The easiest fix
>> is just to do a matching RESUME_INTERRUPTS after getting the lock and
>> HOLD_INTERRUPTS again before releasing it; though maybe it'd be worth
>> offering some slightly cleaner way.  Point here is that LWLockAcquire
>> only does that because it's useful to the majority of callers, not
>> because it's graven in stone that it must be like that.

> That's an interesting idea, but it doesn't make the lock acquisition
> itself interruptible, which seems pretty important to me in this case.

Good point: if you think the contained operation might run too long to
suit you, then you don't want other backends to be stuck behind it for
the same amount of time.

> I wonder if we could have an LWLockAcquireInterruptibly() or some such
> that allows the lock acquisition itself to be interruptible. I think
> that would require some rejiggering but it might be doable.

Yeah, I had the impression from a brief look at LWLockAcquire that
it was itself depending on not throwing errors partway through.
But with careful and perhaps-a-shade-slower coding, we could probably
make a version that didn't require that.

regards, tom lane




Re: Internal key management system

2020-02-14 Thread Robert Haas
On Thu, Feb 6, 2020 at 4:37 PM Cary Huang  wrote:
> Since the user does not need to know the master secret key used to cipher the 
> data, I don't think we should expose "pg_kmgr_unwrap("")" SQL function to 
> the user at all.
> The wrapped key "" is stored in control data and it is possible to obtain 
> by malicious user and steal the key by running SELECT pg_kmgr_unwrap("").
> Even if the user is righteous, it may not be very straightforward for that user
> to obtain the wrapped key "" to use in the unwrap function.

I agree.

> so instead of:
> --
> INSERT INTO tbl VALUES (pg_encrypt('user data', pg_kmgr_unwrap('x')));
> SELECT pg_decrypt(secret_column, pg_kmgr_unwrap('x')) FROM tbl;
>
> it would become:
> --
> INSERT INTO tbl VALUES (pg_encrypt('user data', 'cluster_pass_phrase'));
> SELECT pg_decrypt(secret_column, 'cluster_pass_phrase') FROM tbl;

The second one is certainly better than the first one, as it prevents
the key from being stolen. It's still pretty bad, though, because the
supposedly-secret passphrase will end up in the server log.

I have a hard time believing that this feature as currently proposed
is worth anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: error context for vacuum to include block number

2020-02-14 Thread Justin Pryzby
On Fri, Feb 14, 2020 at 12:30:25PM +0900, Masahiko Sawada wrote:
> * I think the function name is too generic. init_vacuum_error_callback
> or init_vacuum_errcallback is better.

> * The comment of this function is not accurate since this function is
> not only for heap vacuum but also index vacuum. How about just
> "Initialize vacuum error callback"?

> * I think it's easier to read the code if we set the relname and
> indname in the same order.

> * The comment I wrote in the previous mail seems better, because in
> this function the reader might get confused that 'rel' is a relation
> or an index depending on the phase but that comment helps it.

Fixed these

> * rel->rd_index->indexrelid should be rel->rd_index->indrelid.

Ack.  I think that's been wrong since I first wrote it two weeks ago :(
The error is probably more obvious due to the switch statement you proposed.

Thanks for continued reviews.

-- 
Justin
>From 94768a134118d30853b75a96b90166363f0fef5b Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 12 Dec 2019 20:54:37 -0600
Subject: [PATCH v19] vacuum errcontext to show block being processed

Discussion:
https://www.postgresql.org/message-id/20191120210600.gc30...@telsasoft.com
---
 src/backend/access/heap/vacuumlazy.c | 120 +++
 1 file changed, 120 insertions(+)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a23cdef..ebfb2e7 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -292,6 +292,14 @@ typedef struct LVRelStats
 	bool		lock_waiter_detected;
 } LVRelStats;
 
+typedef struct
+{
+	char 		*relnamespace;
+	char		*relname;
+	char 		*indname;
+	BlockNumber blkno;	/* used only for heap operations */
+	int			phase;	/* Reusing same enums as for progress reporting */
+} vacuum_error_callback_arg;
 
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
@@ -361,6 +369,9 @@ static void end_parallel_vacuum(Relation *Irel, IndexBulkDeleteResult **stats,
 LVParallelState *lps, int nindexes);
 static LVSharedIndStats *get_indstats(LVShared *lvshared, int n);
 static bool skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared);
+static void vacuum_error_callback(void *arg);
+static void init_vacuum_error_callback(ErrorContextCallback *errcallback,
+		vacuum_error_callback_arg *errcbarg, Relation onerel, int phase);
 
 
 /*
@@ -724,6 +735,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		PROGRESS_VACUUM_MAX_DEAD_TUPLES
 	};
 	int64		initprog_val[3];
+	ErrorContextCallback errcallback;
+	vacuum_error_callback_arg errcbarg;
 
 	pg_rusage_init(&ru0);
 
@@ -870,6 +883,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	else
 		skipping_blocks = false;
 
+	/* Setup error traceback support for ereport() */
+	init_vacuum_error_callback(&errcallback, &errcbarg, onerel, PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+	error_context_stack = &errcallback;
+
 	for (blkno = 0; blkno < nblocks; blkno++)
 	{
 		Buffer		buf;
@@ -891,6 +908,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 #define FORCE_CHECK_PAGE() \
 		(blkno == nblocks - 1 && should_attempt_truncation(params, vacrelstats))
 
+		errcbarg.blkno = blkno;
+
 		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
 		if (blkno == next_unskippable_block)
@@ -987,6 +1006,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 vmbuffer = InvalidBuffer;
 			}
 
+			/* Pop the error context stack while calling vacuum */
+			error_context_stack = errcallback.previous;
+
 			/* Work on all the indexes, then the heap */
 			lazy_vacuum_all_indexes(onerel, Irel, indstats,
 	vacrelstats, lps, nindexes);
@@ -1011,6 +1033,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			/* Report that we are once again scanning the heap */
 			pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 		 PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+			/* Set the error context while continuing heap scan */
+			error_context_stack = &errcallback;
 		}
 
 		/*
@@ -1597,6 +1622,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			RecordPageWithFreeSpace(onerel, blkno, freespace);
 	}
 
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
 	/* report that everything is scanned and vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
@@ -1772,11 +1800,19 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 	int			npages;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
+	ErrorContextCallback errcallback;
+	vacuum_error_callback_arg errcbarg;
 
 	/* Report that we are now vacuuming the heap */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
  PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
 
+	/*
+	 * Se

Re: allow frontend use of the backend's core hashing functions

2020-02-14 Thread Robert Haas
On Thu, Feb 13, 2020 at 11:26 AM Mark Dilger
 wrote:
> I have made these changes and rebased Robert’s patches but otherwise changed 
> nothing.  Here they are:

Thanks. Anyone else have comments? I think this is pretty
straightforward and unobjectionable work so I'm inclined to press
forward with committing it fairly soon, but if someone feels
otherwise, please speak up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: assert pg_class.relnatts is consistent

2020-02-14 Thread Alvaro Herrera
On 2020-Feb-14, John Naylor wrote:

> One possible objection to what I wrote above is that it adds a
> different kind of special case, but in a sneaky way. Perhaps it would
> be more principled to treat it the same as oid after all. If we do
> that, it would help to add a comment that we can't treat relnatts like
> pronangs, since we need more information than what's in each pg_class
> row.

How about something like this? (untested)

# oids are a special case; ignore
next if $attname eq 'oid';
# pg_class.relnatts is computed from pg_attribute rows; ignore
next if $catname eq 'pg_class' and $attname eq 'relnatts';

# Raise error unless a value exists.
die "strip_default_values: $catname.$attname undefined\n"
  if !defined $row->{$attname};

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-14 Thread Robert Haas
On Wed, Feb 12, 2020 at 11:53 AM Tom Lane  wrote:
> Yeah.  I would say a couple more things:
>
> * I see no reason to think that a relation extension lock would ever
> be held long enough for noninterruptibility to be a real issue.  Our
> expectations for query cancel response time are in the tens to
> hundreds of msec anyway.

I don't agree, because (1) the time to perform a relation extension on
a busy system can be far longer than that and (2) if the disk is
failing, then it can be *really* long, or indefinite.

> * There are other places where an LWLock can be held for a *long* time,
> notably the CheckpointLock.  If we do think this is an issue, we could
> devise a way to not insist on noninterruptibility.  The easiest fix
> is just to do a matching RESUME_INTERRUPTS after getting the lock and
> HOLD_INTERRUPTS again before releasing it; though maybe it'd be worth
> offering some slightly cleaner way.  Point here is that LWLockAcquire
> only does that because it's useful to the majority of callers, not
> because it's graven in stone that it must be like that.

That's an interesting idea, but it doesn't make the lock acquisition
itself interruptible, which seems pretty important to me in this case.
I wonder if we could have an LWLockAcquireInterruptibly() or some such
that allows the lock acquisition itself to be interruptible. I think
that would require some rejiggering but it might be doable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BRIN cost estimate breaks geometric indexes

2020-02-14 Thread Komяpa
Hi,

Patch may look as simple as this one:
https://patch-diff.githubusercontent.com/raw/postgres/postgres/pull/49.diff

Previous mention in -hackers is available at
https://postgrespro.com/list/id/cakjs1f9n-wapop5xz1dtgdpdqmzegqqk4sv2mk-zzugfc14...@mail.gmail.com
-
it seems everyone overlooked back then that the patch breaks geometric
indexing.

On Tue, Jan 21, 2020 at 2:07 AM Egor Rogov  wrote:

> On 21.01.2020 0:00, Darafei "Komяpa" Praliaskouski wrote:
> > Hi,
> >
> > Found out today that BRIN indexes don't really work for PostGIS and
> > box datatypes.
> >
> > Since
> >
> https://github.com/postgres/postgres/commit/7e534adcdc70866e7be74d626b0ed067c890a251
>  Postgres
>
> > requires datatype to provide correlation statistics. Such statistics
> > wasn't provided by PostGIS and box types.
> >
> > Today I tried to replace a 200gb gist index with 8mb brin index and
> > queries didn't work as expected - it was never used. set
> > enable_seqscan=off helped for a bit but that's not a permanent solution.
> > Plans for context:
> > https://gist.github.com/Komzpa/2cd396ec9b65e2c93341e9934d974826
> >
> > Debugging session on #postgis IRC channel leads to this ticket to
> > create a (not that meaningful) correlation statistics for geometry
> > datatype: https://trac.osgeo.org/postgis/ticket/4625#ticket
> >
> > Postgres Professional mentioned symptoms of the issue in their
> > in-depth manual:
> > https://habr.com/ru/company/postgrespro/blog/346460/ - box datatype
> > showed same unusable BRIN symptoms for them.
>
>
> (Translated to English:
> https://habr.com/en/company/postgrespro/blog/452900/)
>
>
> > A reasonable course of action on Postgres side seems to be to not
> > assume selectivity of 1 in absence of correlation statistics, but
> > something that would prefer such an index to a parallel seq scan, but
> > higher than similar GIST.
> >
> > Any other ideas?
>
>
> As far as I understand, correlation is computed only for sortable types,
> which means that the current concept of correlation works as intended
> only for B-tree indexes.
>
> Ideally, correlation should be computed for (attribute, index) pair,
> taking into account order of values returned by the index scan. Less
> ideal but more easier approach can be to ignore the computed correlation
> for any index access except B-tree, and just assume some predefined
> constant.
>
>
>
>
>

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: assert pg_class.relnatts is consistent

2020-02-14 Thread Alvaro Herrera
I propose this more concise coding for AddDefaultValues,

# Now fill in defaults, and note any columns that remain undefined.
foreach my $column (@$schema)
{
my $attname = $column->{name};
my $atttype = $column->{type};

# Skip if a value already exists
next if defined $row->{$attname};

# 'oid' and 'relnatts' are special cases. Ignore.
next if $attname eq 'oid';
next if $attname eq 'relnatts';

# This column has a default value.  Fill it in.
if (defined $column->{default})
{
$row->{$attname} = $column->{default};
next;
}

# Failed to find a value.
push @missing_fields, $attname;
}

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Fix compiler warnings on 64-bit Windows

2020-02-14 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-02-13 16:19, Tom Lane wrote:
>> According to C99 and POSIX, intptr_t should be provided by <stdint.h> ...
>> now that we're requiring C99, can we get away with just #include'ing
>> that directly in these test files?

> I think in the past we were worried about the C library not being fully 
> C99.  But the build farm indicates that even the trailing edge OS X and 
> HP-UX members have it, so I'm content to require it.  Then we should 
> probably remove the Autoconf tests altogether.

Yeah, I think that the C99 requirement has obsoleted a number of configure
tests and related hackery in c.h.  We just haven't got round to cleaning
that up yet.

BTW: I'm still concerned about the possibility of the C library being
less than C99.  The model that was popular back then, and which still
exists on e.g. gaur, was that you could install a C99 *compiler* on
a pre-C99 system, and the compiler would bring its own standard header
files as necessary.  While I don't have the machine booted up to check,
I'm pretty sure that gaur's <stdint.h> is being supplied by the gcc
installation not directly from /usr/include.  On the other hand, that
compiler installation is still dependent on the vendor-supplied libc.

So the sorts of tests I think we can get away with removing have to do
with the presence of C99-required headers, macros, typedefs, etc.
Anything that is checking the presence or behavior of code in libc,
we probably need to be more careful about.

regards, tom lane




Re: Wait event that should be reported while waiting for WAL archiving to finish

2020-02-14 Thread Robert Haas
On Thu, Feb 13, 2020 at 10:47 PM Fujii Masao
 wrote:
> Fixed. Thanks for the review!

I think it would be safer to just report the wait event during
pg_usleep(100L) rather than putting those calls around the whole
loop. It does not seem impossible that ereport() or
CHECK_FOR_INTERRUPTS() could do something that reports a wait event
internally.
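
Something like this placement, in other words; the wait event name below is
just a placeholder for whichever event the patch defines, and the interval is
whatever the existing loop already uses:

    pgstat_report_wait_start(WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE);  /* placeholder name */
    pg_usleep(1000000L);
    pgstat_report_wait_end();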

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-14 Thread Tom Lane
Amit Kapila  writes:
> I think MaxBackends will generally limit the number of different
> relations that can simultaneously extend, but maybe tables with many
> partitions might change the situation.  You are right that some tests
> might suggest a good number, let Mahendra write a patch and then we
> can test it.  Do you have any better idea?

In the first place, there certainly isn't more than one extension
happening at a time per backend, else the entire premise of this
thread is wrong.  Handwaving about partitions won't change that.

In the second place, it's ludicrous to expect that the underlying
platform/filesystem can support an infinite number of concurrent
file-extension operations.  At some level (e.g. where disk blocks
are handed out, or where a record of the operation is written to
a filesystem journal) it's quite likely that things are bottlenecked
down to *one* such operation at a time per filesystem.  So I'm not
that concerned about occasional false-sharing limiting our ability
to issue concurrent requests.  There are probably worse restrictions
at lower levels.

regards, tom lane




Re: assert pg_class.relnatts is consistent

2020-02-14 Thread Amit Langote
Hi John,

On Fri, Feb 14, 2020 at 6:50 PM John Naylor  wrote:
> On Fri, Feb 14, 2020 at 5:00 PM Amit Langote  wrote:
> > I tried and think it works but not sure if that's good Perl
> > programming.  See the attached.
>
> Hi Amit,
> I took this for a spin -- I just have a couple comments.

Thanks for chiming in.

> + elsif ($attname eq 'relnatts')
> + {
> + ;
> + }
>
> With your patch, I get this when running
> src/include/catalog/reformat_dat_file.pl:
>
> strip_default_values: pg_class.relnatts undefined

I think I have fixed this in the attached.

> + if ($catname eq "pg_class" && $attname eq "relnatts")
> + {
> + $bki_values{$attname} = $catalog_ncols{$bki_values{relname}};
> + }
> +
>
> You could avoid the name/attr checks if you do it while building the
> pg_class lookup table, like this:
>
>  foreach my $row (@{ $catalog_data{pg_class} })
>  {
>   $classoids{ $row->{relname} } = $row->{oid};
> +
> + # Also fill in correct value for relnatts.
> + $row->{relnatts} = $catalog_ncols{ $row->{relname} };
>  }

Did this too.  Attached updated patch, which also addresses Michael's comment.

I'm still trying to understand your comment about using placeholder
BKI_DEFAULT...

Thanks,
Amit


v2-0001-Don-t-require-relnatts-to-be-specified-in-pg_clas.patch
Description: Binary data


Re: Parallel copy

2020-02-14 Thread Alastair Turner
On Fri, 14 Feb 2020 at 11:57, Amit Kapila  wrote:

> On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro 
> wrote:
> >
> > On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila 
> wrote:

 ...

> > > Another approach that came up during an offlist discussion with Robert
> > > is that we have one dedicated worker for reading the chunks from file
> > > and it copies the complete tuples of one chunk in the shared memory
> > > and once that is done, a handover that chunks to another worker which
> > > can process tuples in that area.  We can imagine that the reader
> > > worker is responsible to form some sort of work queue that can be
> > > processed by the other workers.  In this idea, we won't be able to get
> > > the benefit of initial tokenization (forming tuple boundaries) via
> > > parallel workers and might need some additional memory processing as
> > > after reader worker has handed the initial shared memory segment, we
> > > need to somehow identify tuple boundaries and then process them.
>

Parsing rows from the raw input (the work done by CopyReadLine()) in a
single process would accommodate embedded newlines in quoted fields. I don't
think there's a way of getting parallel workers to manage the
in-quote/out-of-quote state required. A single worker could also process a
stream without having to reread/rewind so it would be able to process input
from STDIN or PROGRAM sources, making the improvements applicable to load
operations done by third party tools and scripted \copy in psql.
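
As a simplified standalone sketch (not copy.c itself), the row-boundary
decision depends on quote state accumulated from the very first byte of
the input, which is why the raw stream can't safely be split at arbitrary
offsets:

#include <stdbool.h>
#include <stddef.h>

/*
 * Count data rows in a CSV buffer: a newline ends a row only when we are
 * not inside a quoted field (escaped quotes and multi-byte encodings are
 * omitted for brevity).
 */
int
count_csv_rows(const char *buf, size_t len)
{
	bool		in_quote = false;
	int			nrows = 0;

	for (size_t i = 0; i < len; i++)
	{
		if (buf[i] == '"')
			in_quote = !in_quote;
		else if (buf[i] == '\n' && !in_quote)
			nrows++;
	}
	return nrows;
}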


> >

...

>
> > > Another thing we need to figure out is the how many workers to use for
> > > the copy command.  I think we can use it based on the file size which
> > > needs some experiments or may be based on user input.
> >
> > It seems like we don't even really have a general model for that sort
> > of thing in the rest of the system yet, and I guess some kind of
> > fairly dumb explicit system would make sense in the early days...
> >
>
> makes sense.
>
The ratio between chunking or line parsing processes and the parallel
worker pool would vary with the width of the table, complexity of the data
or file (dates, encoding conversions), complexity of constraints and
acceptable impact of the load. Being able to control it through user input
would be great.

--
Alastair


Re: assert pg_class.relnatts is consistent

2020-02-14 Thread Amit Langote
On Fri, Feb 14, 2020 at 6:47 PM Michael Paquier  wrote:
> On Fri, Feb 14, 2020 at 06:00:05PM +0900, Amit Langote wrote:
> > On Fri, Feb 14, 2020 at 2:58 PM Amit Langote  
> > wrote:
> > > On Fri, Feb 14, 2020 at 1:04 AM Tom Lane  wrote:
> > > > I've been burnt by this too :-(.  However, I think this patch is
> > > > completely the wrong way to go about improving this.  What we should
> > > > be doing, now that we have all that perl code generating postgres.bki,
> > > > is eliminating the problem at the source.  That is, drop the hand-coded
> > > > relnatts values from pg_class.dat altogether, and let the perl code fill
> > > > it in --- compare the handling of pg_proc.pronargs for instance.
> > >
> > > I can't write Perl myself (maybe Justin), but +1 to this idea.
> >
> > I tried and think it works but not sure if that's good Perl
> > programming.  See the attached.
>
> I quite like what you have here.  Please note that this comment in
> genbki.pl is incorrect regarding relnatts (the last part could just be
> deleted):
> # Note: only bootstrap catalogs, ie those marked BKI_BOOTSTRAP, need to
> # have entries here.  Be sure that the OIDs listed here match those given in
> # their CATALOG and BKI_ROWTYPE_OID macros, and that the relnatts values are
> # correct.

You're right, although this comment is in pg_class.dat.

Thanks,
Amit




Re: [HACKERS] [PATCH] Generic type subscripting

2020-02-14 Thread Dmitry Dolgov
> On Thu, Feb 13, 2020 at 02:25:46PM +0100, Pavel Stehule wrote:
>
> > > I like patch 0006 - filling gaps by NULLs - it fixed my objections if I
> > > remember correctly.  Patch 0005 - polymorphic subscribing - I had not a
> > > idea, what is a use case? Maybe can be good to postpone this patch. I
> > have
> > > not strong opinion about it, but generally is good to reduce size of
> > > initial patch. I have nothing against a compatibility with SQL, but this
> > > case doesn't looks too realistic for me, and can be postponed without
> > > future compatibility issues.
> >
> > The idea about 0005 is mostly performance related, since this change
> > (aside from being more pedantic with the standard) also allows to
> > squeeze out some visible processing time improvement. But I agree that
> > the patch series itself is too big to add something more, that's why I
> > concider 0005/0006 mosly as interesting ideas for the future.
> >
>
> patch 0006 is necessary from my perspective. Without it, behave of update
> is not practical. I didn't review of this patch mainly due issues that was
> fixed by 0006 patch

Oh, I see. The thing is that, as it is implemented right now, 0006
depends on 0005. Originally I was against doing anything different from
what the regular jsonb functionality would do, but after the discussion
about jsonb_set and null arguments I figured that it probably does make
sense to deviate in certain cases. Eventually it depends on the
community feedback, so I can try to make 0006 an independent change
and we will see.

> > Yep, I wasn't paying much attention recently to this patch, will post
> > rebased and fixed version soon. At the same time I must admit, even if
> > at the moment I can pursue two goals - either to make this feature
> > accepted somehow, or make a longest living CF item ever - neither of
> > those goals seems reachable.
> >
>
> I think so this feature is not important for existing applications. But it
> allows to work with JSON data (or any other) more comfortable (creative) in
> plpgsql.

Yes, hopefully.




Re: LOCK TABLE and DROP TABLE on temp tables of other sessions

2020-02-14 Thread Ashutosh Bapat
On Fri, Feb 14, 2020 at 11:35 AM Michael Paquier 
wrote:

> On Thu, Feb 13, 2020 at 09:05:01PM +0530, Ashutosh Bapat wrote:
> > That looks odd. Other sessions are able to see temporary tables of a
> given
> > session because they are stored in the same catalog which is accessible
> to
> > all the sessions. But ideally, a temporary table should be visible only
> to
> > the session which created it (GTT is an exception). So LOCK and DROP
> table
> > should not succeed.
>
> One thing that we need to consider is if there are applications which
> take advantage of LOCK allowed on temp relations from other backends
> or not.  One downside is that if one backend takes a lock on a temp
> table from a different session, then this other session would not
> completely shut down (still report the shutdown to the client),
> and would remain blocked during the temp schema cleanup until the
> transaction of the session locking the temp relation commits.  This
> blocks access to one connection slot, still we are talking about an
> operation where the owner of the temp schema wants to do the lock.
>

That might be disastrous if it happens by accident, eating up most of the
available connection slots.

Whatever the user wants to achieve using LOCK [temp] TABLE on another
session's table can, I guess, be achieved by other means or can be shown to
have a disastrous effect. So that kind of usage pattern would be better
forced to change.


>
> > Thinking from a different perspective, DROP TABLE being able to drop a
> > temporary table seems a good tool in case a temporary table is left
> behind
> > by a finished session. But that doesn't seem like a good reason to have
> it
> > and I don't see much use of LOCK TABLE there.
>
> Yep.  Robert had actually this argument with DROP SCHEMA pg_temp not
> so long ago with me.
>
>
DROP SCHEMA might be better for mass cleanup. That actually makes DROP
[other session temp] TABLE useless.


-- 
--
Best Wishes,
Ashutosh Bapat


Re: Index Skip Scan

2020-02-14 Thread Dmitry Dolgov
> On Fri, Feb 14, 2020 at 05:23:13PM +0900, Kyotaro Horiguchi wrote:
> The first attached (renamed to .txt not to confuse the cfbots) is a
> small patch that makes sure if _bt_readpage is called with the proper
> condition as written in its comment, that is, caller must have pinned
> and read-locked so->currPos.buf. This patch reveals many instances of
> breakage of the contract.

Thanks! On top of which patch version can one apply it? I'm asking
because I believe I've addressed similar issues in the last version, and
the last proposed diff (after resolving some conflicts) breaks tests for
me, so I'm not sure if I'm missing something.

At the same time, if you and Tomas strongly agree that it actually makes
sense to make the moving-forward/reading-backward case work correctly with
dead tuples, I'll take a shot and try to teach the code around _bt_skip to
do what is required for that. I can merge your changes there and we can
see what the result would be.




Re: Parallel copy

2020-02-14 Thread Amit Kapila
On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro  wrote:
>
> On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila  wrote:
> > This work is to parallelize the copy command and in particular "Copy
> > <table_name> from 'filename' Where <condition>;" command.
>
> Nice project, and a great stepping stone towards parallel DML.
>

Thanks.

> > The first idea is that we allocate each chunk to a worker and once the
> > worker has finished processing the current chunk, it can start with
> > the next unprocessed chunk.  Here, we need to see how to handle the
> > partial tuples at the end or beginning of each chunk.  We can read the
> > chunks in dsa/dsm instead of in local buffer for processing.
> > Alternatively, if we think that accessing shared memory can be costly
> > we can read the entire chunk in local memory, but copy the partial
> > tuple at the beginning of a chunk (if any) to a dsa.  We mainly need
> > partial tuple in the shared memory area. The worker which has found
> > the initial part of the partial tuple will be responsible to process
> > the entire tuple. Now, to detect whether there is a partial tuple at
> > the beginning of a chunk, we always start reading one byte, prior to
> > the start of the current chunk and if that byte is not a terminating
> > line byte, we know that it is a partial tuple.  Now, while processing
> > the chunk, we will ignore this first line and start after the first
> > terminating line.
>
> That's quite similar to the approach I took with a parallel file_fdw
> patch[1], which mostly consisted of parallelising the reading part of
> copy.c, except that...
>
> > To connect the partial tuple in two consecutive chunks, we need to
> > have another data structure (for the ease of reference in this email,
> > I call it CTM (chunk-tuple-map)) in shared memory where we store some
> > per-chunk information like the chunk-number, dsa location of that
> > chunk and a variable which indicates whether we can free/reuse the
> > current entry.  Whenever we encounter the partial tuple at the
> > beginning of a chunk we note down its chunk number, and dsa location
> > in CTM.  Next, whenever we encounter any partial tuple at the end of
> > the chunk, we search CTM for next chunk-number and read from
> > corresponding dsa location till we encounter terminating line byte.
> > Once we have read and processed this partial tuple, we can mark the
> > entry as available for reuse.  There are some loose ends here like how
> > many entries shall we allocate in this data structure.  It depends on
> > whether we want to allow the worker to start reading the next chunk
> > before the partial tuple of the previous chunk is processed.  To keep
> > it simple, we can allow the worker to process the next chunk only when
> > the partial tuple in the previous chunk is processed.  This will allow
> > us to keep the entries equal to a number of workers in CTM.  I think
> > we can easily improve this if we want but I don't think it will matter
> > too much as in most cases by the time we processed the tuples in that
> > chunk, the partial tuple would have been consumed by the other worker.
>
> ... I didn't use a shm 'partial tuple' exchanging mechanism, I just
> had each worker follow the final tuple in its chunk into the next
> chunk, and have each worker ignore the first tuple in chunk after
> chunk 0 because it knows someone else is looking after that.  That
> means that there was some double reading going on near the boundaries,
>

Right, and especially if the part in the second chunk is bigger, we
might need to read most of the second chunk.

> and considering how much I've been complaining about bogus extra
> system calls on this mailing list lately, yeah, your idea of doing a
> bit more coordination is a better idea.  If you go this way, you might
> at least find the copy.c part of the patch I wrote useful as stand-in
> scaffolding code in the meantime while you prototype the parallel
> writing side, if you don't already have something better for this?
>

No, I haven't started writing anything yet, but I have some ideas on
how to achieve this.  I quickly skimmed through your patch and I think
it can be used as a starting point, though if we decide to go with
accumulating the partial tuple or all the data in shm, then things
might differ.

> > Another approach that came up during an offlist discussion with Robert
> > is that we have one dedicated worker for reading the chunks from file
> > and it copies the complete tuples of one chunk in the shared memory
> > and once that is done, a handover that chunks to another worker which
> > can process tuples in that area.  We can imagine that the reader
> > worker is responsible to form some sort of work queue that can be
> > processed by the other workers.  In this idea, we won't be able to get
> > the benefit of initial tokenization (forming tuple boundaries) via
> > parallel workers and might need some additional memory processing as
> > after reader worker has handed the initial shared

Re: [PATCH] libpq improvements and fixes

2020-02-14 Thread Ranier Vilela
Em sex., 14 de fev. de 2020 às 03:13, Michael Paquier 
escreveu:

> On Thu, Feb 13, 2020 at 02:22:36PM -0300, Ranier Vilela wrote:
> > I just kept it, even if I duplicated the error message, the style was
> kept
> > and in my opinion it is much more coherent and readable.
> > But your solution is also good, and yes, it is worth it, because even
> with
> > small benefits, the change improves the code and prevents Coverity or
> > another tool from continuing to report false positives or not.
>
> Complaints from static analyzers need to be taken with a pinch of
> salt, and I agree with Tom here.
>
That's right, I will try to avoid sending patches that only satisfy static
analysis tools.


> > Virtually no code will break for the change, since bool and int are
> > internally the same types.
> > I believe that no code will have either adjusted to work with corrected
> > functions, even if they use compiled libraries.
> > And again, it is worth correcting at least the static ones, because the
> > goal here, too, is to improve readability.
>
> FWIW, looking at the patch from upthread, I think that it is not that
> wise to blindly break the error compatibility handling of all PQsend*
> routines by switching the error handling of the connection to be after
> the compatibility checks, and all the other changes don't justify a
> breakage making back-patching more complicated nor do they improve
> readability at great lengths.
>

It is difficult to understand what you consider to be an improvement.

Another programming principle I follow is to move anything invariant out of
a loop so that it is executed outside the loop.
In this specific case, two branches were removed from the loop modified in
fe-exec; is this an improvement for you or not?
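
As a generic illustration of that principle (the names here are made up;
this is not the fe-exec code itself):

	/* Before: the test is loop-invariant but evaluated on every iteration. */
	for (i = 0; i < nparams; i++)
	{
		if (conn_is_idle)			/* does not change inside the loop */
			send_param(conn, i);
	}

	/* After: the invariant test is hoisted and the loop body loses a branch. */
	if (conn_is_idle)
	{
		for (i = 0; i < nparams; i++)
			send_param(conn, i);
	}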

See patch attached.

regards,
Ranier Vilela


fe-exec.patch
Description: Binary data


Re: assert pg_class.relnatts is consistent

2020-02-14 Thread John Naylor
> pronangs, since we need more information than what's in each pg_class

Sigh, and of course I meant pg_proc.pronargs.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: assert pg_class.relnatts is consistent

2020-02-14 Thread John Naylor
I wrote:
> + elsif ($attname eq 'relnatts')
> + {
> + ;
> + }
>
> With your patch, I get this when running
> src/include/catalog/reformat_dat_file.pl:
>
> strip_default_values: pg_class.relnatts undefined
>
> Rather than adding this one-off case to AddDefaultValues and then
> another special case to strip_default_values, maybe it would be better
> to just add a placeholder BKI_DEFAULT(0) to pg_class.h, with a comment
> that it's just a placeholder.

One possible objection to what I wrote above is that it adds a
different kind of special case, but in a sneaky way. Perhaps it would
be more principled to treat it the same as oid after all. If we do
that, it would help to add a comment that we can't treat relnatts like
pronangs, since we need more information than what's in each pg_class
row.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-02-14 Thread Kuntal Ghosh
On Sun, Feb 9, 2020 at 9:18 AM Amit Kapila  wrote:
>
> It seems for this we formed a cache of max_cached_tuplebufs number of
> objects and we don't need to allocate more than that number of tuples
> of size MaxHeapTupleSize because we will anyway return that memory to
> aset.c.
>
In the approach suggested by Amit (approach 1), once we allocate the
max_cached_tuplebufs number of MaxHeapTupleSize, we can use the actual
length of the tuple for allocating memory. So, if we have m
subtransactions, the memory usage at worst case will be,

(max_cached_tuplebufs * MaxHeapTupleSize) cache +
(Maximum changes in a subtransaction before spilling) * m * (Actual tuple size)

= 64 MB cache + 4095 * m * (Actual tuple size)

In the approach suggested by Andres (approach 2), we're going to
reduce the size of a cached tuple to 1024 bytes. So, if we have m
sub-transactions, the memory usage at worst case will be,

(max_cached_tuplebufs * 1024 bytes) cache + (Maximum changes in a
subtransaction before spilling) * m * 1024 bytes

= 8 MB cache + 4095 * m * 1024 (considering the size of the tuple is
less than 1024 bytes)

Once the cache is filled, for 1000 sub-transactions operating on tuples
of, say, 100 bytes, approach 1 will allocate about 390 MB of memory,
whereas approach 2 will allocate approximately 4 GB of memory. If there
is no obvious error that I'm missing, I think we should implement the
first approach.
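
For what it's worth, a standalone back-of-the-envelope check of those
figures (assuming max_cached_tuplebufs = 8192 and MaxHeapTupleSize of
roughly 8160 bytes; both values are assumptions made here for the
calculation, not taken from the patches):

#include <stdio.h>

int
main(void)
{
	const double MB = 1024.0 * 1024.0;
	const double cached = 8192;		/* assumed max_cached_tuplebufs */
	const double spill = 4095;		/* changes per subxact before spilling */
	const double m = 1000;			/* number of sub-transactions */

	/* approach 1: full-size cached tuples, actual-size (100 B) after */
	printf("approach 1: cache ~%.0f MB + ~%.0f MB\n",
		   cached * 8160 / MB, spill * m * 100 / MB);

	/* approach 2: 1024-byte cached tuples, 1024-byte allocations after */
	printf("approach 2: cache ~%.0f MB + ~%.0f MB\n",
		   cached * 1024 / MB, spill * m * 1024 / MB);
	return 0;
}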

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-02-14 Thread Thomas Munro
On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila  wrote:
> This work is to parallelize the copy command and in particular "Copy
> <table_name> from 'filename' Where <condition>;" command.

Nice project, and a great stepping stone towards parallel DML.

> The first idea is that we allocate each chunk to a worker and once the
> worker has finished processing the current chunk, it can start with
> the next unprocessed chunk.  Here, we need to see how to handle the
> partial tuples at the end or beginning of each chunk.  We can read the
> chunks in dsa/dsm instead of in local buffer for processing.
> Alternatively, if we think that accessing shared memory can be costly
> we can read the entire chunk in local memory, but copy the partial
> tuple at the beginning of a chunk (if any) to a dsa.  We mainly need
> partial tuple in the shared memory area. The worker which has found
> the initial part of the partial tuple will be responsible to process
> the entire tuple. Now, to detect whether there is a partial tuple at
> the beginning of a chunk, we always start reading one byte, prior to
> the start of the current chunk and if that byte is not a terminating
> line byte, we know that it is a partial tuple.  Now, while processing
> the chunk, we will ignore this first line and start after the first
> terminating line.

That's quite similar to the approach I took with a parallel file_fdw
patch[1], which mostly consisted of parallelising the reading part of
copy.c, except that...

> To connect the partial tuple in two consecutive chunks, we need to
> have another data structure (for the ease of reference in this email,
> I call it CTM (chunk-tuple-map)) in shared memory where we store some
> per-chunk information like the chunk-number, dsa location of that
> chunk and a variable which indicates whether we can free/reuse the
> current entry.  Whenever we encounter the partial tuple at the
> beginning of a chunk we note down its chunk number, and dsa location
> in CTM.  Next, whenever we encounter any partial tuple at the end of
> the chunk, we search CTM for next chunk-number and read from
> corresponding dsa location till we encounter terminating line byte.
> Once we have read and processed this partial tuple, we can mark the
> entry as available for reuse.  There are some loose ends here like how
> many entries shall we allocate in this data structure.  It depends on
> whether we want to allow the worker to start reading the next chunk
> before the partial tuple of the previous chunk is processed.  To keep
> it simple, we can allow the worker to process the next chunk only when
> the partial tuple in the previous chunk is processed.  This will allow
> us to keep the entries equal to a number of workers in CTM.  I think
> we can easily improve this if we want but I don't think it will matter
> too much as in most cases by the time we processed the tuples in that
> chunk, the partial tuple would have been consumed by the other worker.

... I didn't use a shm 'partial tuple' exchanging mechanism, I just
had each worker follow the final tuple in its chunk into the next
chunk, and have each worker ignore the first tuple in chunk after
chunk 0 because it knows someone else is looking after that.  That
means that there was some double reading going on near the boundaries,
and considering how much I've been complaining about bogus extra
system calls on this mailing list lately, yeah, your idea of doing a
bit more coordination is a better idea.  If you go this way, you might
at least find the copy.c part of the patch I wrote useful as stand-in
scaffolding code in the meantime while you prototype the parallel
writing side, if you don't already have something better for this?

> Another approach that came up during an offlist discussion with Robert
> is that we have one dedicated worker for reading the chunks from file
> and it copies the complete tuples of one chunk in the shared memory
> and once that is done, a handover that chunks to another worker which
> can process tuples in that area.  We can imagine that the reader
> worker is responsible to form some sort of work queue that can be
> processed by the other workers.  In this idea, we won't be able to get
> the benefit of initial tokenization (forming tuple boundaries) via
> parallel workers and might need some additional memory processing as
> after reader worker has handed the initial shared memory segment, we
> need to somehow identify tuple boundaries and then process them.

Yeah, I have also wondered about something like this in a slightly
different context.  For parallel query in general, I wondered if there
should be a Parallel Scatter node, that can be put on top of any
parallel-safe plan, and it runs it in a worker process that just
pushes tuples into a single-producer multi-consumer shm queue, and
then other workers read from that whenever they need a tuple.  Hmm,
but for COPY, I suppose you'd want to push the raw lines with minimal
examination, not tuples, into a shm queue, so I guess th

Re: Extracting only the columns needed for a query

2020-02-14 Thread Pengzhou Tang
> > > On Sat, Jun 15, 2019 at 10:02 AM Tom Lane  wrote:
> > >
> > > Another reason for having the planner do this is that presumably, in
> > > an AM that's excited about this, the set of fetched columns should
> > > play into the cost estimates for the scan.  I've not been paying
> > > enough attention to the tableam work to know if we've got hooks for
> > > the AM to affect scan costing ... but if we don't, that seems like
> > > a hole that needs plugged.
> >
> > AM callback relation_estimate_size exists currently which planner
> leverages.
> > Via this callback it fetches tuples, pages, etc.. So, our thought is to
> extend
> > this API if possible to pass down needed column and help perform better
> costing
> > for the query. Though we think if wish to leverage this function, need
> to know
> > list of columns before planning hence might need to use query tree.
>
> I believe it would be beneficial to add this potential API extension patch
> into
> the thread (as an example of an interface defining how scanCols could be
> used)
> and review them together.
>
Thanks for your suggestion. We paste one potential API extension change
below for zedstore to use scanCols.

The change contains 3 patches to clarify our idea.
0001-ANALYZE.patch is a generic patch for the ANALYZE API extension; we
developed it to make the analysis of zedstore tables more accurate. It is
more flexible now, e.g., a table AM can provide a logical block number as
the random sample seed, can analyze only specified columns, and can provide
extra info besides the data tuple.

0002-Planner.patch is the real patch showing how we use rte->scanCols for
the cost estimate. The main idea is adding a new metric 'stadiskfrac' to
the catalog pg_statistic; 'stadiskfrac' is the physical size ratio of a
column and is calculated when ANALYZE is performed (0001-ANALYZE.patch
helps to provide the extra disk size info). So when set_plain_rel_size()
is called by the planner, it uses rte->scanCols and 'stadiskfrac' to
adjust rel->pages; please see set_plain_rel_page_estimates().
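
For illustration, the kind of page-count adjustment described above could
look roughly like this (the function name and the exact formula are a loose
sketch for this email, not the actual 0002 patch; the usual backend headers
are assumed):

/*
 * Sketch only: scale the page estimate by the summed per-column disk
 * fraction of the columns the query actually reads, clamped to a sane range.
 */
static BlockNumber
adjusted_scan_pages(BlockNumber rel_pages, Bitmapset *scanCols,
					const double *stadiskfrac)	/* per-attnum fraction */
{
	double		frac = 0.0;
	int			col = -1;

	while ((col = bms_next_member(scanCols, col)) >= 0)
		frac += stadiskfrac[col];

	frac = Min(Max(frac, 0.0), 1.0);
	return (BlockNumber) Max(1, ceil(rel_pages * frac));
}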

0003-ZedStore.patch is an example of how zedstore uses the extended ANALYZE
API; I paste it here anyway, in case someone is interested in it.

Thanks,
Pengzhou
From 77d257ab002a5e1b6a2f65e359cbfd7978e3cff5 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang 
Date: Wed, 20 Nov 2019 06:42:37 -0500
Subject: [PATCH 1/3] ANALYZE tableam API change

Extended three ANALYZE-related tableam APIs so AMs can take more control
of ANALYZE progress:
- scan_analyze_beginscan() : so AMs can has more flexible sampling strategy
- scan_analyze_sample_tuple() : so ANALYZE can get extra info as needed
- scan_analyze_endscan() :

Also use struct AnalyzeSampleContext to provide more convenience, with it
tableam analyze routines can provide extra info except the real data,
for example: physical size or compression ratio.
---
 contrib/file_fdw/file_fdw.c  |  35 +++---
 contrib/postgres_fdw/postgres_fdw.c  |  56 +
 src/backend/access/heap/heapam_handler.c | 109 ++--
 src/backend/access/table/tableam.c   | 209 +++
 src/backend/commands/analyze.c   | 181 --
 src/include/access/tableam.h | 138 +---
 src/include/foreign/fdwapi.h |   7 +-
 7 files changed, 530 insertions(+), 205 deletions(-)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 549821c..2344f01 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -19,6 +19,7 @@
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_foreign_table.h"
 #include "commands/copy.h"
@@ -157,10 +158,8 @@ static void estimate_size(PlannerInfo *root, RelOptInfo *baserel,
 static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
 		   FileFdwPlanState *fdw_private,
 		   Cost *startup_cost, Cost *total_cost);
-static int	file_acquire_sample_rows(Relation onerel, int elevel,
-	 HeapTuple *rows, int targrows,
-	 double *totalrows, double *totaldeadrows);
-
+static void file_acquire_sample_rows(Relation onerel, int elevel,
+	 AnalyzeSampleContext *context);
 
 /*
  * Foreign-data wrapper handler function: return a struct with pointers
@@ -1091,14 +1090,16 @@ estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
  * may be meaningless, but it's OK because we don't use the estimates
  * currently (the planner only pays attention to correlation for indexscans).
  */
-static int
+static void
 file_acquire_sample_rows(Relation onerel, int elevel,
-		 HeapTuple *rows, int targrows,
-		 double *totalrows, double *totaldeadrows)
+		 AnalyzeSampleContext *context)
 {
 	int			numrows = 0;
+	int			targrows = 0;
+	double		totalrows = 0;
 	double		rowstoskip = -1;	/* -1 means not set yet */
 	ReservoirStateData rstate;
+	HeapTuple	tuple;
 	TupleDesc	tupDe

Re: assert pg_class.relnatts is consistent

2020-02-14 Thread John Naylor
On Fri, Feb 14, 2020 at 5:00 PM Amit Langote  wrote:
> I tried and think it works but not sure if that's good Perl
> programming.  See the attached.

Hi Amit,
I took this for a spin -- I just have a couple comments.

+ elsif ($attname eq 'relnatts')
+ {
+ ;
+ }

With your patch, I get this when running
src/include/catalog/reformat_dat_file.pl:

strip_default_values: pg_class.relnatts undefined

Rather than adding this one-off case to AddDefaultValues and then
another special case to strip_default_values, maybe it would be better
to just add a placeholder BKI_DEFAULT(0) to pg_class.h, with a comment
that it's just a placeholder.


+ if ($catname eq "pg_class" && $attname eq "relnatts")
+ {
+ $bki_values{$attname} = $catalog_ncols{$bki_values{relname}};
+ }
+

You could avoid the name/attr checks if you do it while building the
pg_class lookup table, like this:

 foreach my $row (@{ $catalog_data{pg_class} })
 {
  $classoids{ $row->{relname} } = $row->{oid};
+
+ # Also fill in correct value for relnatts.
+ $row->{relnatts} = $catalog_ncols{ $row->{relname} };
 }

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: assert pg_class.relnatts is consistent

2020-02-14 Thread Michael Paquier
On Fri, Feb 14, 2020 at 06:00:05PM +0900, Amit Langote wrote:
> On Fri, Feb 14, 2020 at 2:58 PM Amit Langote  wrote:
> > On Fri, Feb 14, 2020 at 1:04 AM Tom Lane  wrote:
> > > I've been burnt by this too :-(.  However, I think this patch is
> > > completely the wrong way to go about improving this.  What we should
> > > be doing, now that we have all that perl code generating postgres.bki,
> > > is eliminating the problem at the source.  That is, drop the hand-coded
> > > relnatts values from pg_class.dat altogether, and let the perl code fill
> > > it in --- compare the handling of pg_proc.pronargs for instance.
> >
> > I can't write Perl myself (maybe Justin), but +1 to this idea.
> 
> I tried and think it works but not sure if that's good Perl
> programming.  See the attached.

I quite like what you have here.  Please note that this comment in
genbki.pl is incorrect regarding relnatts (the last part could just be
deleted):
# Note: only bootstrap catalogs, ie those marked BKI_BOOTSTRAP, need to
# have entries here.  Be sure that the OIDs listed here match those given in
# their CATALOG and BKI_ROWTYPE_OID macros, and that the relnatts values are
# correct.
--
Michael


signature.asc
Description: PGP signature


Re: [Proposal] Global temporary tables

2020-02-14 Thread Pavel Stehule
čt 30. 1. 2020 v 15:21 odesílatel Pavel Stehule 
napsal:

>
>
> čt 30. 1. 2020 v 15:17 odesílatel 曾文旌(义从) 
> napsal:
>
>>
>>
>> > 2020年1月29日 下午9:48,Robert Haas  写道:
>> >
>> > On Tue, Jan 28, 2020 at 12:12 PM 曾文旌(义从) 
>> wrote:
>> >>> Opinion by Pavel
>> >>> + rel->rd_islocaltemp = true;  <<< if this is valid, then the
>> name of field "rd_islocaltemp" is not probably best
>> >>> I renamed rd_islocaltemp
>> >>
>> >> I don't see any change?
>> >>
>> >> Rename rd_islocaltemp to rd_istemp  in
>> global_temporary_table_v8-pg13.patch
>> >
>> > In view of commit 6919b7e3294702adc39effd16634b2715d04f012, I think
>> > that this has approximately a 0% chance of being acceptable. If you're
>> > setting a field in a way that is inconsistent with the current use of
>> > the field, you're probably doing it wrong, because the field has an
>> > existing purpose to which new code must conform. And if you're not
>> > doing that, then you don't need to rename it.
>> Thank you for pointing it out.
>> I've rolled back the rename.
>> But I still need rd_localtemp to be true. The reasons are:
>> 1. The GTT needs to support DML in read-only transactions, like a local
>> temp table.
>> 2. The GTT does not need to hold the lock before modifying the index
>> buffer, also like a local temp table.
>>
>> Please give me feedback.
>>
>
> maybe some like
>
> rel->rd_globaltemp = true;
>
> and somewhere else
>
> if (rel->rd_localtemp || rel->rd_globaltemp)
> {
>   ...
> }
>
>
I tested this patch again and I am very satisfied with its behavior.

What still doesn't work is the TRUNCATE statement:

postgres=# insert into foo select generate_series(1,1);
INSERT 0 1
postgres=# \dt+ foo
                           List of relations
┌────────┬──────┬───────┬───────┬─────────────┬────────┬─────────────┐
│ Schema │ Name │ Type  │ Owner │ Persistence │  Size  │ Description │
╞════════╪══════╪═══════╪═══════╪═════════════╪════════╪═════════════╡
│ public │ foo  │ table │ pavel │ session     │ 384 kB │             │
└────────┴──────┴───────┴───────┴─────────────┴────────┴─────────────┘
(1 row)

postgres=# truncate foo;
TRUNCATE TABLE
postgres=# \dt+ foo
                          List of relations
┌────────┬──────┬───────┬───────┬─────────────┬───────┬─────────────┐
│ Schema │ Name │ Type  │ Owner │ Persistence │ Size  │ Description │
╞════════╪══════╪═══════╪═══════╪═════════════╪═══════╪═════════════╡
│ public │ foo  │ table │ pavel │ session     │ 16 kB │             │
└────────┴──────┴───────┴───────┴─────────────┴───────┴─────────────┘
(1 row)

I expect zero size after truncate.

Regards

Pavel


>>
>> Wenjing
>>
>>
>>
>>
>> >
>> > --
>> > Robert Haas
>> > EnterpriseDB: http://www.enterprisedb.com
>> > The Enterprise PostgreSQL Company
>>
>>


Re: Fix compiler warnings on 64-bit Windows

2020-02-14 Thread Peter Eisentraut

On 2020-02-13 16:19, Tom Lane wrote:

According to C99 and POSIX, intptr_t should be provided by <stdint.h> ...
now that we're requiring C99, can we get away with just #include'ing
that directly in these test files?


I think in the past we were worried about the C library not being fully 
C99.  But the build farm indicates that even the trailing edge OS X and 
HP-UX members have it, so I'm content to require it.  Then we should 
probably remove the Autoconf tests altogether.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: assert pg_class.relnatts is consistent

2020-02-14 Thread Amit Langote
On Fri, Feb 14, 2020 at 2:58 PM Amit Langote  wrote:
> On Fri, Feb 14, 2020 at 1:04 AM Tom Lane  wrote:
> > I've been burnt by this too :-(.  However, I think this patch is
> > completely the wrong way to go about improving this.  What we should
> > be doing, now that we have all that perl code generating postgres.bki,
> > is eliminating the problem at the source.  That is, drop the hand-coded
> > relnatts values from pg_class.dat altogether, and let the perl code fill
> > it in --- compare the handling of pg_proc.pronargs for instance.
>
> I can't write Perl myself (maybe Justin), but +1 to this idea.

I tried and think it works but not sure if that's good Perl
programming.  See the attached.

Thanks,
Amit
From ec690b6e78176354ae033073cda1b0770e956a72 Mon Sep 17 00:00:00 2001
From: Amit Langote 
Date: Fri, 14 Feb 2020 16:04:48 +0900
Subject: [PATCH] Don't require relnatts to be specified in pg_class.dat

---
 src/backend/catalog/Catalog.pm   |  4 
 src/backend/catalog/genbki.pl|  7 +++
 src/include/catalog/pg_class.dat | 11 +++
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/Catalog.pm b/src/backend/catalog/Catalog.pm
index c089b1d71d..8e84bf9d2d 100644
--- a/src/backend/catalog/Catalog.pm
+++ b/src/backend/catalog/Catalog.pm
@@ -341,6 +341,10 @@ sub AddDefaultValues
{
;
}
+   elsif ($attname eq 'relnatts')
+   {
+   ;
+   }
elsif (defined $column->{default})
{
$row->{$attname} = $column->{default};
diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl
index 803251207b..eac4542ade 100644
--- a/src/backend/catalog/genbki.pl
+++ b/src/backend/catalog/genbki.pl
@@ -56,6 +56,7 @@ my %catalog_data;
 my @toast_decls;
 my @index_decls;
 my %oidcounts;
+my %catalog_ncols;
 
 foreach my $header (@ARGV)
 {
@@ -71,6 +72,7 @@ foreach my $header (@ARGV)
{
push @catnames, $catname;
$catalogs{$catname} = $catalog;
+   $catalog_ncols{$catname} = scalar(@$schema);
}
 
# While checking for duplicated OIDs, we ignore the pg_class OID and
@@ -524,6 +526,11 @@ EOM
my $attname = $column->{name};
my $atttype = $column->{type};
 
+   if ($catname eq "pg_class" && $attname eq "relnatts")
+   {
+   $bki_values{$attname} = 
$catalog_ncols{$bki_values{relname}};
+   }
+
# Assign oid if oid column exists and no explicit 
assignment in row
if ($attname eq "oid" and not defined 
$bki_values{$attname})
{
diff --git a/src/include/catalog/pg_class.dat b/src/include/catalog/pg_class.dat
index f70d5bacb9..d901988858 100644
--- a/src/include/catalog/pg_class.dat
+++ b/src/include/catalog/pg_class.dat
@@ -20,11 +20,14 @@
 # Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
 # similarly, "1" in relminmxid stands for FirstMultiXactId
 
+# Note: relnatts is computed automatically by genbki.pl, so need not be
+# specified here.
+
 { oid => '1247',
   relname => 'pg_type', reltype => 'pg_type', relam => 'heap',
   relfilenode => '0', relpages => '0', reltuples => '0', relallvisible => '0',
   reltoastrelid => '0', relhasindex => 'f', relisshared => 'f',
-  relpersistence => 'p', relkind => 'r', relnatts => '31', relchecks => '0',
+  relpersistence => 'p', relkind => 'r', relchecks => '0',
   relhasrules => 'f', relhastriggers => 'f', relhassubclass => 'f',
   relrowsecurity => 'f', relforcerowsecurity => 'f', relispopulated => 't',
   relreplident => 'n', relispartition => 'f', relfrozenxid => '3',
@@ -34,7 +37,7 @@
   relname => 'pg_attribute', reltype => 'pg_attribute', relam => 'heap',
   relfilenode => '0', relpages => '0', reltuples => '0', relallvisible => '0',
   reltoastrelid => '0', relhasindex => 'f', relisshared => 'f',
-  relpersistence => 'p', relkind => 'r', relnatts => '25', relchecks => '0',
+  relpersistence => 'p', relkind => 'r', relchecks => '0',
   relhasrules => 'f', relhastriggers => 'f', relhassubclass => 'f',
   relrowsecurity => 'f', relforcerowsecurity => 'f', relispopulated => 't',
   relreplident => 'n', relispartition => 'f', relfrozenxid => '3',
@@ -44,7 +47,7 @@
   relname => 'pg_proc', reltype => 'pg_proc', relam => 'heap',
   relfilenode => '0', relpages => '0', reltuples => '0', relallvisible => '0',
   reltoastrelid => '0', relhasindex => 'f', relisshared => 'f',
-  relpersistence => 'p', relkind => 'r', relnatts => '29', relchecks => '0',
+  relpersistence => 'p', relkind => 'r', relchecks => '0',
   relhasrules => 'f', relhastriggers => 'f', relhassubclass => 'f',
   relrowsecurity => 'f', relforcerowsecurity => 'f', relispopulated => 't',
   relreplident => 'n', relispartit

Re: Index Skip Scan

2020-02-14 Thread Kyotaro Horiguchi
Thank you very much for the benchmarking!

A slightly different topic from the latest branch...

At Sat, 8 Feb 2020 14:11:59 +0100, Tomas Vondra  
wrote in 
> >Yes, I've mentioned that already in one of the previous emails :) The
> >simplest way I see to achieve what we want is to do something like in
> >attached modified version with a new hasDeclaredCursor field. It's not
> >a
> >final version though, but posted just for discussion, so feel free to
> >suggest any improvements or alternatives.
> 
> IMO the proper fix for this case (moving forward, reading backwards)
> is
> simply making it work by properly checking deleted tuples etc. Not
> sure
> why that would be so much complex (haven't tried implementing it)?

I don't think it's that complex either. But I suspect that it might be a
bit harder starting from the current shape.

The first attached (renamed to .txt so as not to confuse the cfbots) is a
small patch that makes sure _bt_readpage is called under the condition
written in its comment, that is, the caller must have pinned and
read-locked so->currPos.buf. This patch reveals many instances of
breakage of that contract.

The second is a crude fix for the breakages, but the result seems far from
neat. I think we need to rethink this, taking modification of the support
functions into consideration.

> I think making this depend on things like declared cursor etc. is
> going
> to be tricky, may easily be more complex than checking deleted tuples,
> and the behavior may be quite surprising.

Sure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From de129e5a261ed43f002c1684dc9d6575f3880b16 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 6 Feb 2020 14:31:36 +0900
Subject: [PATCH 1/2] debug aid

---
 src/backend/access/nbtree/nbtsearch.c |  1 +
 src/backend/storage/buffer/bufmgr.c   | 13 +
 src/include/storage/bufmgr.h  |  1 +
 3 files changed, 15 insertions(+)

diff --git a/src/backend/access/nbtree/nbtsearch.c 
b/src/backend/access/nbtree/nbtsearch.c
index c5f5d228f2..5cd97d8bb5 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1785,6 +1785,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, 
OffsetNumber offnum)
 * used here; this function is what makes it good for currPos.
 */
Assert(BufferIsValid(so->currPos.buf));
+   Assert(BufferLockAndPinHeldByMe(so->currPos.buf));
 
page = BufferGetPage(so->currPos.buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
diff --git a/src/backend/storage/buffer/bufmgr.c 
b/src/backend/storage/buffer/bufmgr.c
index aba3960481..08a75a6846 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1553,6 +1553,19 @@ ReleaseAndReadBuffer(Buffer buffer,
return ReadBuffer(relation, blockNum);
 }
 
+/* tmp function for debugging */
+bool
+BufferLockAndPinHeldByMe(Buffer buffer)
+{
+   BufferDesc  *b = GetBufferDescriptor(buffer - 1);
+
+   if (BufferIsPinned(buffer) &&
+   LWLockHeldByMe(BufferDescriptorGetContentLock(b)))
+   return true;
+
+   return false;
+}
+
 /*
  * PinBuffer -- make buffer unavailable for replacement.
  *
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 73c7e9ba38..8e5fc639a0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -177,6 +177,7 @@ extern void MarkBufferDirty(Buffer buffer);
 extern void IncrBufferRefCount(Buffer buffer);
 extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
   BlockNumber 
blockNum);
+extern bool BufferLockAndPinHeldByMe(Buffer buffer);
 
 extern void InitBufferPool(void);
 extern void InitBufferPoolAccess(void);
-- 
2.18.2

>From 912bad2ec8c66ccd01cebf1f69233b004c633243 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 6 Feb 2020 19:09:09 +0900
Subject: [PATCH 2/2] crude fix

---
 src/backend/access/nbtree/nbtsearch.c | 43 +--
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c 
b/src/backend/access/nbtree/nbtsearch.c
index 5cd97d8bb5..1f18b38ca5 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1619,6 +1619,9 @@ _bt_skip(IndexScanDesc scan, ScanDirection dir,
 
nextOffset = startOffset = 
ItemPointerGetOffsetNumber(&scan->xs_itup->t_tid);
 
+   if (nextOffset != startOffset)
+   LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+
while (nextOffset == startOffset)
{
IndexTuple itup;
@@ -1653,7 +1656,7 @@ _bt_skip(IndexScanDesc scan, ScanDirection dir,
offnum = OffsetNumberPrev(offnum);
 
/* Check if _bt_readpage returns already fou

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-14 Thread Amit Kapila
On Fri, Feb 14, 2020 at 11:42 AM Masahiko Sawada
 wrote:
>
> On Thu, 13 Feb 2020 at 13:16, Amit Kapila  wrote:
> >
> > On Tue, Feb 11, 2020 at 9:13 PM Tom Lane  wrote:
> > >
> > > I took a brief look through this patch.  I agree with the fundamental
> > > idea that we shouldn't need to use the heavyweight lock manager for
> > > relation extension, since deadlock is not a concern and no backend
> > > should ever need to hold more than one such lock at once.  But it feels
> > > to me like this particular solution is rather seriously overengineered.
> > > I would like to suggest that we do something similar to Robert Haas'
> > > excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
> > > that is,
> > >
> > > * Create some predetermined number N of LWLocks for relation extension.
> > > * When we want to extend some relation R, choose one of those locks
> > >   (say, R's relfilenode number mod N) and lock it.
> > >
> >
> > I am imagining something on the lines of BufferIOLWLockArray (here it
> > will be RelExtLWLockArray).  The size (N) could MaxBackends or some
> > percentage of it (depending on testing) and indexing into an array
> > could be as suggested (R's relfilenode number mod N).
>
> I'm not sure it's good that the contention of LWLock slot depends on
> MaxBackends. Because it means that the more MaxBackends is larger, the
> less the LWLock slot conflicts, even if the same number of backends
> actually connecting. Normally we don't want to increase unnecessarily
> MaxBackends for security reasons. In the current patch we defined a
> fixed length of array for extension lock but I agree that we need to
> determine what approach is the best depending on testing.
>

I think MaxBackends will generally limit the number of different
relations that can be extended simultaneously, but maybe tables with many
partitions might change the situation.  You are right that some tests
might suggest a good number; let Mahendra write a patch and then we can
test it.  Do you have any better idea?
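
For illustration, a minimal sketch of that scheme (the array name, its
size, and its placement are placeholders to be settled by testing; in a
real patch the array would live in shared memory):

#define N_REL_EXTENSION_LOCKS 128		/* placeholder, to be tuned */

static LWLockPadded *RelExtLWLockArray;	/* allocated in shared memory */

static inline LWLock *
RelExtensionLockFor(Oid relfilenode)
{
	return &RelExtLWLockArray[relfilenode % N_REL_EXTENSION_LOCKS].lock;
}

/*
 * Extension would then do roughly:
 *   LWLockAcquire(RelExtensionLockFor(rnode), LW_EXCLUSIVE);
 *   ... extend the relation ...
 *   LWLockRelease(RelExtensionLockFor(rnode));
 */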


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Parallel copy

2020-02-14 Thread Amit Kapila
This work is to parallelize the copy command and in particular "Copy
<table_name> from 'filename' Where <condition>;" command.

Before going into how and what portion of 'copy command' processing we
can parallelize, let us briefly see what top-level operations we perform
while copying from the file into a table.  We read the file in 64KB
chunks, then find the line endings and process that data line by line,
where each line corresponds to one tuple.  We first form the tuple (in
the form of a values/nulls array) from that line, check whether it
satisfies the WHERE condition and, if it does, perform constraint checks
and a few other checks and then finally store it in a local tuple array.
Once we have accumulated 1000 tuples or consumed 64KB (whichever occurs
first), we insert them together via the table_multi_insert API and then
for each tuple insert into the index(es) and execute after row triggers.
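
In rough pseudo-C, the flow described above is (all helper names below
are made up for illustration; only table_multi_insert is a real API):

	/* Pseudocode sketch of the existing serial flow, not actual copy.c code. */
	while (read_64kb_chunk(&chunk))
	{
		while (next_line(&chunk, &line))
		{
			form_tuple(line, values, nulls);		/* parse one line */
			if (!passes_where_clause(values, nulls))
				continue;
			run_remaining_checks(values, nulls);	/* constraints etc. */
			buffer_tuple(values, nulls);
			if (buffered_tuples == 1000 || buffered_bytes >= 64 * 1024)
				flush_batch();	/* table_multi_insert, then indexes, triggers */
		}
	}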

So, as we can see, we do a lot of work after reading each 64KB chunk.
We can read the next chunk only after all the tuples are processed in
the previous chunk we read.  This gives us an opportunity to parallelize
the processing of each 64KB chunk.  I think we can do this in more than
one way.

The first idea is that we allocate each chunk to a worker and once the
worker has finished processing the current chunk, it can start with
the next unprocessed chunk.  Here, we need to see how to handle the
partial tuples at the end or beginning of each chunk.  We can read the
chunks into dsa/dsm instead of into a local buffer for processing.
Alternatively, if we think that accessing shared memory can be costly,
we can read the entire chunk into local memory, but copy the partial
tuple at the beginning of a chunk (if any) to a dsa.  We mainly need the
partial tuple in the shared memory area. The worker which has found
the initial part of the partial tuple will be responsible for processing
the entire tuple. Now, to detect whether there is a partial tuple at
the beginning of a chunk, we always start reading one byte prior to
the start of the current chunk, and if that byte is not a terminating
line byte, we know that it is a partial tuple.  Now, while processing
the chunk, we will ignore this first line and start after the first
terminating line.
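
For illustration, that boundary test could look roughly like this (the
names are made up; this is only a sketch):

/*
 * Does this chunk begin in the middle of a tuple?  Look one byte back:
 * if the previous byte is not a line terminator, the first "line" of
 * this chunk is the tail of a tuple owned by the previous chunk.
 */
static bool
chunk_starts_with_partial_tuple(const char *filebuf, size_t chunk_start)
{
	if (chunk_start == 0)
		return false;				/* the very first chunk */
	return filebuf[chunk_start - 1] != '\n';
}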

To connect the partial tuple in two consecutive chunks, we need to
have another data structure (for the ease of reference in this email,
I call it CTM (chunk-tuple-map)) in shared memory where we store some
per-chunk information like the chunk-number, dsa location of that
chunk and a variable which indicates whether we can free/reuse the
current entry.  Whenever we encounter the partial tuple at the
beginning of a chunk we note down its chunk number, and dsa location
in CTM.  Next, whenever we encounter any partial tuple at the end of
the chunk, we search CTM for next chunk-number and read from
corresponding dsa location till we encounter terminating line byte.
Once we have read and processed this partial tuple, we can mark the
entry as available for reuse.  There are some loose ends here like how
many entries shall we allocate in this data structure.  It depends on
whether we want to allow the worker to start reading the next chunk
before the partial tuple of the previous chunk is processed.  To keep
it simple, we can allow the worker to process the next chunk only when
the partial tuple in the previous chunk is processed.  This will allow
us to keep the entries equal to a number of workers in CTM.  I think
we can easily improve this if we want but I don't think it will matter
too much as in most cases by the time we processed the tuples in that
chunk, the partial tuple would have been consumed by the other worker.
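
For illustration, the CTM entries described above might look something
like this (names and layout are purely illustrative, not a settled design):

/* One entry per worker, per the simplification above. */
typedef struct ChunkTupleMapEntry
{
	uint64		chunk_number;	/* which chunk this entry describes */
	dsa_pointer	chunk_data;		/* dsa location of that chunk */
	bool		reusable;		/* partial tuple consumed; entry is free */
} ChunkTupleMapEntry;

typedef struct ChunkTupleMap
{
	int					nentries;	/* == number of workers, to keep it simple */
	ChunkTupleMapEntry	entries[FLEXIBLE_ARRAY_MEMBER];
} ChunkTupleMap;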

Another approach that came up during an offlist discussion with Robert
is that we have one dedicated worker for reading the chunks from the file;
it copies the complete tuples of one chunk into shared memory and, once
that is done, hands over that chunk to another worker which can process
the tuples in that area.  We can imagine that the reader worker is
responsible for forming some sort of work queue that can be processed by
the other workers.  In this idea, we won't be able to get the benefit of
initial tokenization (forming tuple boundaries) via parallel workers and
might need some additional memory processing, as after the reader worker
has handed over the initial shared memory segment, we need to somehow
identify tuple boundaries and then process them.

Another thing we need to figure out is how many workers to use for
the copy command.  I think we can decide that based on the file size,
which needs some experiments, or maybe based on user input.

I think we have two related problems to solve for this: (a) the relation
extension lock (required for extending the relation), which won't
conflict among workers due to group locking; we are working on a
solution for this in another thread [1]; (b) the use of page locks in
GIN indexes; we can probably disallow parallelism if the table has a GIN
index, which is not a great thing but not bad either.

To be clear