Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.
Thank you both, Vik and David, for being so quick to respond. All is clear now. It seems to me that the price (giving up the ability to say explicitly what primitive JSON values you want) is too great to pay for the benefit (being able to build the semantic equivalent of a variadic list of actual arguments as text). So I wrote my own wrapper for jsonb_build_array() and jsonb_build_object():

create function my_jsonb_build(
  kind in varchar,
  variadic_elements in varchar)
  returns jsonb
  immutable
  language plpgsql
as $body$
declare
  stmt varchar :=
    case kind
      when 'array'  then 'select jsonb_build_array('||variadic_elements||')'
      when 'object' then 'select jsonb_build_object('||variadic_elements||')'
    end;
  j jsonb;
begin
  execute stmt into j;
  return j;
end;
$body$;

create type t1 as (a int, b varchar);

-- Test it.
select jsonb_pretty(my_jsonb_build(
  'array',
  $$17::integer, 'dog'::varchar, true::boolean$$));

select jsonb_pretty(my_jsonb_build(
  'array',
  $$17::integer, 'dog'::varchar, true::boolean, (17::int, 'dog'::varchar)::t1$$));

select jsonb_pretty(my_jsonb_build(
  'object',
  $$'a'::varchar, 17::integer, 'b'::varchar, 'dog'::varchar, 'c'::varchar, true::boolean$$));

It produces the result that I want. And I'm prepared to pay the price of using $$ to avoid doubling up interior single quotes.

On 14-Feb-2020, at 19:24, David G. Johnston wrote:

On Friday, February 14, 2020, Bryn Llewellyn wrote:
>
> select jsonb_pretty(jsonb_object(
>   '{a, 17, b, "dog", c, true}'::varchar[]
>   ))
>
> In other words, do the double quotes around "dog" have no effect? That
> would be a bad thing—and it would limit the usefulness of the
> jsonb_object() function.

The double quotes serve a specific purpose, to allow values containing commas to be treated as a single value (see syntax details for the exact rules) in the resulting array of text values. The fact you don't have to quote the other strings is a convenience behavior of the feature.

David J.
Re: Parallel copy
On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner wrote:
>
> On Fri, 14 Feb 2020 at 11:57, Amit Kapila wrote:
>>
>> On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro wrote:
>> >
>> > On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila wrote:
>> > > ...
>> > >
>> > > Another approach that came up during an offlist discussion with Robert
>> > > is that we have one dedicated worker for reading the chunks from file
>> > > and it copies the complete tuples of one chunk in the shared memory
>> > > and once that is done, hands over that chunk to another worker which
>> > > can process tuples in that area. We can imagine that the reader
>> > > worker is responsible for forming some sort of work queue that can be
>> > > processed by the other workers. In this idea, we won't be able to get
>> > > the benefit of initial tokenization (forming tuple boundaries) via
>> > > parallel workers and might need some additional memory processing as,
>> > > after the reader worker has handed over the initial shared memory
>> > > segment, we need to somehow identify tuple boundaries and then process
>> > > them.
>
> Parsing rows from the raw input (the work done by CopyReadLine()) in a single
> process would accommodate line returns in quoted fields. I don't think
> there's a way of getting parallel workers to manage the in-quote/out-of-quote
> state required.
>

AFAIU, the whole of this in-quote/out-of-quote state is managed inside CopyReadLineText, which will be done by each of the parallel workers, something on the lines of what Thomas did in his patch [1]. Basically, we need to invent a mechanism to allocate chunks to individual workers, and then the whole processing will be done as we are doing now, except for the special handling of partial tuples which I have explained in my previous email. Am I missing something here?

>> > > ...
>> > >
>> > > Another thing we need to figure out is how many workers to use for
>> > > the copy command.
>> > > I think we can use it based on the file size, which
>> > > needs some experiments, or maybe based on user input.
>> >
>> > It seems like we don't even really have a general model for that sort
>> > of thing in the rest of the system yet, and I guess some kind of
>> > fairly dumb explicit system would make sense in the early days...
>> >
>>
>> makes sense.
>
> The ratio between chunking or line parsing processes and the parallel worker
> pool would vary with the width of the table, complexity of the data or file
> (dates, encoding conversions), complexity of constraints and acceptable
> impact of the load. Being able to control it through user input would be
> great.
>

Okay, I think one simple way could be that we compute the number of workers based on the file size (some experiments are required to determine this) unless the user has given the input. If the user has provided the input then we can use that, with an upper limit of max_parallel_workers.

[1] - https://www.postgresql.org/message-id/CA%2BhUKGKZu8fpZo0W%3DPOmQEN46kXhLedzqqAnt5iJZy7tD0x6sw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
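The in-quote/out-of-quote difficulty Alastair describes can be illustrated with a deliberately simplified sketch (Python, for illustration only; it models none of COPY's escape or encoding handling): whether a newline at a given offset is a row boundary depends on the quote state accumulated over the entire preceding input, which is why a worker dropped at an arbitrary file offset cannot split rows on its own.

```python
def split_rows(data: str, quote: str = '"') -> list[str]:
    """Split CSV-like text into rows, honoring newlines inside quoted fields.

    The in_quote flag flips at every quote character, so the meaning of a
    newline at offset N depends on the entire prefix data[:N] -- this is
    why row boundaries cannot be determined from an arbitrary mid-file
    offset without a serial scan.
    """
    rows, start, in_quote = [], 0, False
    for i, ch in enumerate(data):
        if ch == quote:
            in_quote = not in_quote
        elif ch == "\n" and not in_quote:
            rows.append(data[start:i])
            start = i + 1
    if start < len(data):
        rows.append(data[start:])
    return rows

csv = 'a,"multi\nline",1\nb,plain,2\n'
print(split_rows(csv))  # ['a,"multi\nline",1', 'b,plain,2']
```

A real splitter must also handle the configured escape character and doubled quotes, but the serial dependency on quote state is the same.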
Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.
On Friday, February 14, 2020, Bryn Llewellyn wrote: > > > select jsonb_pretty(jsonb_object( > '{a, 17, b, "dog", c, true}'::varchar[] > )) > > In other words, do the double quotes around "dog" have no effect? That > would be a bad thing—and it would limit the usefulness of the > jsonb_object() function. > The double quotes serve a specific purpose, to allow values containing commas to be treated as a single value (see syntax details for the exact rules) in the resulting array of text values. The fact you don’t have to quote the other strings is a convenience behavior of the feature. David J.
Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.
On Friday, February 14, 2020, Bryn Llewellyn wrote: > > The doc (“Builds a JSON object out of a text array.”) is simply too terse > to inform an answer to this question. > It does presume knowledge but it precisely defines the outcome: PostgreSQL arrays are typed and all members are of the same type. A text array’s members are all text. Given the above knowledge the fact that the resultant json object contains exclusively text keys and text values directly follows. David J.
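The distinction David describes can be modeled outside SQL (an illustrative Python sketch, not PostgreSQL internals): jsonb_object() receives only an array of text, so every value necessarily serializes as a JSON string, while jsonb_build_object() sees each argument with its original type.

```python
import json

def jsonb_object_model(text_array):
    """Model of jsonb_object(text[]): alternating keys/values, all text."""
    it = iter(text_array)
    return json.dumps(dict(zip(it, it)))

def jsonb_build_object_model(*args):
    """Model of jsonb_build_object(): each argument keeps its own type."""
    it = iter(args)
    return json.dumps(dict(zip(it, it)))

print(jsonb_object_model(["a", "17", "b", "dog", "c", "true"]))
# {"a": "17", "b": "dog", "c": "true"}  -- everything is a JSON string
print(jsonb_build_object_model("a", 17, "b", "dog", "c", True))
# {"a": 17, "b": "dog", "c": true}      -- numbers and booleans survive
```

The type information is lost before jsonb_object() ever runs, exactly as with the text array above: there is nothing in its input from which to recover "17 was a number".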
Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.
On 15/02/2020 04:07, Bryn Llewellyn wrote: > This: > > select jsonb_pretty(jsonb_build_object( > 'a'::varchar, 1.7::numeric, > 'b'::varchar, 'dog'::varchar, > 'c'::varchar, true::boolean > )) > > allows me to express what I want. That’s a good thing. Are you saying that > this: > > select jsonb_pretty(jsonb_object( > '{a, 17, b, "dog", c, true}'::varchar[] > )) > > simply lacks that power of expression and that every item in the array is > assumed to be intended to end up as a JSON text primitive value? In other > words, do the double quotes around "dog" have no effect? That is correct. > That would be a bad thing—and it would limit the usefulness of the > jsonb_object() function. Use the long form if you need to mix datatypes. -- Vik Fearing
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Fri, Feb 14, 2020 at 9:13 PM Tom Lane wrote:
>
> Robert Haas writes:
> > On Wed, Feb 12, 2020 at 11:53 AM Tom Lane wrote:
> > That's an interesting idea, but it doesn't make the lock acquisition
> > itself interruptible, which seems pretty important to me in this case.
>
> Good point: if you think the contained operation might run too long to
> suit you, then you don't want other backends to be stuck behind it for
> the same amount of time.
>

It is not clear to me why we should add that as a requirement for this patch when other places like WALWriteLock, etc. have similar coding patterns and we haven't heard a ton of complaints about making it interruptible, or if there are any then I am not aware of them.

> > I wonder if we could have an LWLockAcquireInterruptibly() or some such
> > that allows the lock acquisition itself to be interruptible. I think
> > that would require some rejiggering but it might be doable.
>
> Yeah, I had the impression from a brief look at LWLockAcquire that
> it was itself depending on not throwing errors partway through.
> But with careful and perhaps-a-shade-slower coding, we could probably
> make a version that didn't require that.
>

If this becomes a requirement to move this patch, then surely we can do that. BTW, what exactly do we need to ensure for it? Is it something on the lines of ensuring that in the error path the state of the lock is cleared? Are we worried that an interrupt handler might do something which will change the state of the lock we are acquiring?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.
This: select jsonb_pretty(jsonb_build_object( 'a'::varchar, 1.7::numeric, 'b'::varchar, 'dog'::varchar, 'c'::varchar, true::boolean )) allows me to express what I want. That’s a good thing. Are you saying that this: select jsonb_pretty(jsonb_object( '{a, 17, b, "dog", c, true}'::varchar[] )) simply lacks that power of expression and that every item in the array is assumed to be intended to end up as a JSON text primitive value? In other words, do the double quotes around "dog" have no effect? That would be a bad thing—and it would limit the usefulness of the jsonb_object() function. The doc (“Builds a JSON object out of a text array.”) is simply too terse to inform an answer to this question. On 14-Feb-2020, at 18:28, Vik Fearing wrote: On 15/02/2020 03:21, Bryn Llewellyn wrote: > Now execute this supposed functional equivalent: > > select jsonb_pretty(jsonb_object( > '{a, 17, b, "dog", c, true}'::varchar[] > )) > > It is meant to be a nice alternative when you want to build an object (rather > than an array) because the syntax is less verbose. > > However, it gets the wrong answer, thus: > > { + > "a": "17", + > "b": "dog",+ > "c": "true"+ > } > > Now, the numeric value and the boolean value are double-quoted—in other > words, they have been implicitly converted to JSON primitive text values. They haven't been implicitly converted, you gave an array of varchars. How should it know that you don't want texts? > Do you agree that this is a bug? No. -- Vik Fearing
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Fri, Feb 14, 2020 at 8:12 PM Tom Lane wrote:
>
> Amit Kapila writes:
> > I think MaxBackends will generally limit the number of different
> > relations that can simultaneously extend, but maybe tables with many
> > partitions might change the situation. You are right that some tests
> > might suggest a good number, let Mahendra write a patch and then we
> > can test it. Do you have any better idea?
>
> In the first place, there certainly isn't more than one extension
> happening at a time per backend, else the entire premise of this
> thread is wrong. Handwaving about partitions won't change that.
>

Having a larger number of partitions theoretically increases the chances of false-sharing with the same number of concurrent sessions. For example, two sessions operating on two relations vs. two sessions working on two relations with 100 partitions each would increase the chances of false-sharing. Sawada-San and Mahendra have done many tests on different systems and some monitoring with the previous patch, showing that with a decent number of fixed slots (1024), there was very little false-sharing, and even where it occurred the effect was close to nothing. So, in short, this is not the point to worry about; rather, we should ensure that we don't create any significant regressions in this area.

> In the second place, it's ludicrous to expect that the underlying
> platform/filesystem can support an infinite number of concurrent
> file-extension operations. At some level (e.g. where disk blocks
> are handed out, or where a record of the operation is written to
> a filesystem journal) it's quite likely that things are bottlenecked
> down to *one* such operation at a time per filesystem. So I'm not
> that concerned about occasional false-sharing limiting our ability
> to issue concurrent requests. There are probably worse restrictions
> at lower levels.
>

Agreed, and what we have observed during the tests is what you have said in this paragraph.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.
On 15/02/2020 03:21, Bryn Llewellyn wrote: > Now execute this supposed functional equivalent: > > select jsonb_pretty(jsonb_object( > '{a, 17, b, "dog", c, true}'::varchar[] > )) > > It is meant to be a nice alternative when you want to build an object (rather > than an array) because the syntax is less verbose. > > However, it gets the wrong answer, thus: > > { + > "a": "17", + > "b": "dog",+ > "c": "true"+ > } > > Now, the numeric value and the boolean value are double-quoted—in other > words, they have been implicitly converted to JSON primitive text values. They haven't been implicitly converted, you gave an array of varchars. How should it know that you don't want texts? > Do you agree that this is a bug? No. -- Vik Fearing
jsonb_object() seems to be buggy. jsonb_build_object() is good.
Execute this:

select jsonb_pretty(jsonb_build_object(
  'a'::varchar, 1.7::numeric,
  'b'::varchar, 'dog'::varchar,
  'c'::varchar, true::boolean
  ))

It produces the result that I expect:

{
    "a": 1.7,
    "b": "dog",
    "c": true
}

Notice that the numeric, text, and boolean primitive values are properly rendered, with the text value double-quoted and the numeric and boolean values unquoted. Now execute this supposed functional equivalent:

select jsonb_pretty(jsonb_object(
  '{a, 17, b, "dog", c, true}'::varchar[]
  ))

It is meant to be a nice alternative when you want to build an object (rather than an array) because the syntax is less verbose. However, it gets the wrong answer, thus:

{
    "a": "17",
    "b": "dog",
    "c": "true"
}

Now, the numeric value and the boolean value are double-quoted—in other words, they have been implicitly converted to JSON primitive text values. Do you agree that this is a bug?

Notice that I see this behavior in vanilla PostgreSQL 11.2 and in YugabyteDB Version 2.0.11.0. See this blog post: "Distributed PostgreSQL on a Google Spanner Architecture—Query Layer" https://blog.yugabyte.com/distributed-postgresql-on-a-google-spanner-architecture-query-layer/ YugabyteDB uses the PostgreSQL source code for its SQL upper half.

Regards, Bryn Llewellyn, Yugabyte
Re: Use LN_S instead of "ln -s" in Makefile
Ashwin Agrawal writes: > On Fri, Feb 14, 2020 at 4:57 PM Tom Lane wrote: >> Hm. So, these oversights are certainly bugs in narrow terms. However, >> they've all been like that for a *long* time --- the three instances >> you found date to 2005, 2006, and 2008 according to "git blame". >> The complete lack of complaints shows that nobody cares. So I think >> a fairly strong case could be made for going the other way, ie >> s/$(LN_S)/ln -s/g and get rid of the configure-time cycles wasted in >> setting up that variable. > I accede to it. Oh ... 2005 was just the last time anybody touched that particular line in backend/Makefile. Further digging shows that we've been installing the postmaster -> postgres symlink with raw "ln -s" clear back to the Postgres95 virgin sources. I didn't bother to chase down the oldest instances of the other two cases. (Man, "git blame" is such a great tool for software archaeology.) regards, tom lane
Re: Use LN_S instead of "ln -s" in Makefile
On Fri, Feb 14, 2020 at 4:57 PM Tom Lane wrote: > Ashwin Agrawal writes: > > In general, the variable LN_S is 'ln -s', however, LN_S changes to 'cp > > -pR' if configure finds the file system does not support symbolic > > links. > > I was playing with 'ln' for some other reason where I symbolic linked > > it to '/bin/false'. That revealed the failure for > > src/backend/Makefile. Greping for 'ln -s' revealed three places it's > > used. Attaching the patch to fix the same. > > Hm. So, these oversights are certainly bugs in narrow terms. However, > they've all been like that for a *long* time --- the three instances > you found date to 2005, 2006, and 2008 according to "git blame". > The complete lack of complaints shows that nobody cares. So I think > a fairly strong case could be made for going the other way, ie > s/$(LN_S)/ln -s/g and get rid of the configure-time cycles wasted in > setting up that variable. > I accede to it.
Re: Just for fun: Postgres 20?
On Thu, Feb 13, 2020 at 1:34 PM Andrew Dunstan wrote: > I also object because 20 is *my* unlucky number ... I don't think we're going to do this, so you don't have to worry on that score. -- Peter Geoghegan
Re: Just for fun: Postgres 20?
On Fri, Feb 14, 2020 at 4:19 PM Tom Lane wrote: > Not sure how serious Andrew is being here, but it does open up an > important point: there are varying opinions on which numbers are unlucky. > The idea that 13 is unlucky is Western, and maybe even only common in > English-speaking countries. I would wager that this superstition is the main reason why Oracle 12c was followed by Oracle 18c rather than Oracle 13c. I have no evidence for this -- I take it on faith. I feel that I should take the proposal seriously for at least a moment. The proposal doesn't affect anybody who isn't into numerology. At the same time, it makes the superstitious people happy (leaving aside the icosaphobes). Airlines do this with row numbers -- what's the harm? There is a real downside to this, though. It is a bad idea, even on its own terms. If we take the idea seriously, then it has every chance of being noticed and becoming a big distraction in all sorts of ways. That might happen anyway, but I think it's less likely this way. ISTM that the smart thing to do is to ignore it completely. Don't even try to preempt a silly headline written by some tech journalist wiseacre. -- Peter Geoghegan
Re: Use LN_S instead of "ln -s" in Makefile
Ashwin Agrawal writes: > In general, the variable LN_S is 'ln -s', however, LN_S changes to 'cp > -pR' if configure finds the file system does not support symbolic > links. > I was playing with 'ln' for some other reason where I symbolic linked > it to '/bin/false'. That revealed the failure for > src/backend/Makefile. Greping for 'ln -s' revealed three places it's > used. Attaching the patch to fix the same. Hm. So, these oversights are certainly bugs in narrow terms. However, they've all been like that for a *long* time --- the three instances you found date to 2005, 2006, and 2008 according to "git blame". The complete lack of complaints shows that nobody cares. So I think a fairly strong case could be made for going the other way, ie s/$(LN_S)/ln -s/g and get rid of the configure-time cycles wasted in setting up that variable. Otherwise I fear somebody will "break" it again soon, and it will stay "broken" for another 15 years till someone happens to notice. We have better things to do than spend our time maintaining such nonfunctional differences. regards, tom lane
Re: Standards compliance of SET ROLE / SET SESSION AUTHORIZATION
On 02/14/20 18:43, Tom Lane wrote: > I suppose it could be argued that that's a bug in the interpretation > of role membership: arguably, if you're a member of some superuser > role, that ought to give you membership in anything else. IOW, a > superuser's implicit membership in every role isn't transitive, > and maybe it should be. But I'm not sure that I want to change that; > it feels like doing so might have surprising side-effects. I have a tendency to create roles like postgres_assumable or dba_assumable, which are themselves members of the indicated roles, but without rolinherit, and then grant those to my own role. That way in my day to day faffing about, I don't get to make superuser-powered mistakes, but I can 'set role postgres' when needed. Would it make sense for a proposed transitive superuser-membership- in-everything also to stop at a role without rolinherit? Clearly it would just add one extra step to 'set role anybody', but sometimes one extra step inspires a useful extra moment of thought. Regards, -Chap
Use LN_S instead of "ln -s" in Makefile
In general, the variable LN_S is 'ln -s', however, LN_S changes to 'cp -pR' if configure finds the file system does not support symbolic links. I was playing with 'ln' for some other reason where I symbolically linked it to '/bin/false'. That revealed the failure for src/backend/Makefile. Grepping for 'ln -s' revealed three places it's used. Attaching the patch to fix the same.

From 60b3374de1e4d0fd143c2a7b3bf78893e32579af Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal
Date: Fri, 14 Feb 2020 16:16:37 -0800
Subject: [PATCH v1] Use LN_S instead of "ln -s" in Makefile

---
 src/backend/Makefile              | 2 +-
 src/interfaces/ecpg/test/Makefile | 2 +-
 src/makefiles/pgxs.mk             | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/Makefile b/src/backend/Makefile
index 9706a95848..f21a454286 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -224,7 +224,7 @@ install-bin: postgres $(POSTGRES_IMP) installdirs
 	$(INSTALL_PROGRAM) postgres$(X) '$(DESTDIR)$(bindir)/postgres$(X)'
 ifneq ($(PORTNAME), win32)
 	@rm -f '$(DESTDIR)$(bindir)/postmaster$(X)'
-	ln -s postgres$(X) '$(DESTDIR)$(bindir)/postmaster$(X)'
+	$(LN_S) postgres$(X) '$(DESTDIR)$(bindir)/postmaster$(X)'
 else
 	$(INSTALL_PROGRAM) postgres$(X) '$(DESTDIR)$(bindir)/postmaster$(X)'
 endif
diff --git a/src/interfaces/ecpg/test/Makefile b/src/interfaces/ecpg/test/Makefile
index be53b7b94d..11d55bde3e 100644
--- a/src/interfaces/ecpg/test/Makefile
+++ b/src/interfaces/ecpg/test/Makefile
@@ -73,7 +73,7 @@ remaining_files_build := $(patsubst $(srcdir)/%, $(abs_builddir)/%, $(remaining_
 all: $(remaining_files_build)
 
 $(remaining_files_build): $(abs_builddir)/%: $(srcdir)/%
-	ln -s $< $@
+	$(LN_S) $< $@
 endif
 
 # Common options for tests. Also pick up anything passed in EXTRA_REGRESS_OPTS
diff --git a/src/makefiles/pgxs.mk b/src/makefiles/pgxs.mk
index 271e7eaba8..8c7d126593 100644
--- a/src/makefiles/pgxs.mk
+++ b/src/makefiles/pgxs.mk
@@ -397,7 +397,7 @@ test_files_build := $(patsubst $(srcdir)/%, $(abs_builddir)/%, $(test_files_src)
 all: $(test_files_build)
 
 $(test_files_build): $(abs_builddir)/%: $(srcdir)/%
 	$(MKDIR_P) $(dir $@)
-	ln -s $< $@
+	$(LN_S) $< $@
 endif # VPATH
 endif # REGRESS
-- 
2.25.0
Re: Just for fun: Postgres 20?
Andrew Dunstan writes: > I also object because 20 is *my* unlucky number ... Not sure how serious Andrew is being here, but it does open up an important point: there are varying opinions on which numbers are unlucky. The idea that 13 is unlucky is Western, and maybe even only common in English-speaking countries. In Asia, numbers containing the digit 4 are considered unlucky [1], and there are probably other rules in other cultures. If we establish a precedent that we'll skip release numbers for non-technical reasons, I'm afraid we'll be right back in the mess we sought to avoid, whereby nearly every year we had an argument about what the next release number would be. So let's not go there. regards, tom lane [1] https://en.wikipedia.org/wiki/Tetraphobia
Re: Standards compliance of SET ROLE / SET SESSION AUTHORIZATION
I wrote: > What I'm now thinking is that we shouldn't mess with the behavior of > SET ROLE, as I mused about doing yesterday in [1]. It's spec-compliant, > or close enough, so let's leave it be. On the other hand, changing the > behavior of SET SESSION AUTHORIZATION is not constrained by spec > compliance concerns, only backwards compatibility. We could address the > pg_dump concerns I had in [1] by tweaking what SET SESSION AUTHORIZATION > can do and then adjusting pg_dump to swap its usage of SET SESSION > AUTHORIZATION (do that just once, in response to --role) and SET ROLE > (do that per-object, to establish ownership). Concretely, I propose the following semantics: * SET SESSION AUTHORIZATION is allowed if your original login role is a member of the target role. If successful, it resets the role to "NONE", ie session authorization and effective role both become the stated role. * SET ROLE is allowed if your session authorization is a member of the target role. If successful, it sets the effective role to the target role. SET ROLE NONE resets effective role to the current session authorization. This is the same behavior we have now for SET ROLE. The difference for SET SESSION AUTHORIZATION is that currently that requires your login role to be superuser or equal to the target role, so the above is a strictly weaker check. The reason this is interesting is that currently, if you log in as somebody who isn't superuser but is allowed to become superuser (ie, has been granted a superuser role), you're not allowed to SET SESSION AUTHORIZATION to the superuser, only SET ROLE to it. And that in turn means that you can't necessarily SET ROLE to any random other userid, which is a weird restriction that breaks the "pg_restore --role" use-case for this whole thing [1]. I suppose it could be argued that that's a bug in the interpretation of role membership: arguably, if you're a member of some superuser role, that ought to give you membership in anything else. 
IOW, a superuser's implicit membership in every role isn't transitive, and maybe it should be. But I'm not sure that I want to change that; it feels like doing so might have surprising side-effects. Note that AFAICS, this is just as spec-compliant as our current behavior. The spec only constrains what SET ROLE does. regards, tom lane [1] https://www.postgresql.org/message-id/11496.1581634533%40sss.pgh.pa.us
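Tom's proposed semantics can be modeled as a toy membership check (an illustrative Python sketch with a hypothetical membership graph, not the catalog-based implementation): SET SESSION AUTHORIZATION checks the original login role's membership, while SET ROLE checks the current session authorization's membership.

```python
# Hypothetical membership graph: alice is a member of admin, admin of dba.
MEMBERSHIPS = {"alice": {"admin"}, "admin": {"dba"}}

def is_member(role: str, target: str) -> bool:
    """Direct-or-transitive membership; a role is always a member of itself."""
    seen, stack = set(), [role]
    while stack:
        r = stack.pop()
        if r == target:
            return True
        if r not in seen:
            seen.add(r)
            stack.extend(MEMBERSHIPS.get(r, ()))
    return False

class Session:
    def __init__(self, login_role: str):
        self.login_role = login_role
        self.session_auth = login_role
        self.effective_role = login_role

    def set_session_authorization(self, target: str) -> None:
        # Allowed if the ORIGINAL login role is a member of the target;
        # on success the role is reset to NONE (both become target).
        if not is_member(self.login_role, target):
            raise PermissionError(f"cannot set session authorization {target!r}")
        self.session_auth = self.effective_role = target

    def set_role(self, target: str) -> None:
        # Allowed if the CURRENT session authorization is a member of target.
        if not is_member(self.session_auth, target):
            raise PermissionError(f"cannot set role {target!r}")
        self.effective_role = target

s = Session("alice")
s.set_session_authorization("admin")     # alice is a member of admin: allowed
s.set_role("dba")                        # admin is a member of dba: allowed
print(s.session_auth, s.effective_role)  # admin dba
```

The point of the weaker SET SESSION AUTHORIZATION check shows up in the sketch: once alice has become admin, subsequent SET ROLE checks run against admin's memberships, not alice's.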
Re: Memory-Bounded Hash Aggregation
On Wed, 2020-02-12 at 21:51 -0800, Jeff Davis wrote:
> On Mon, 2020-02-10 at 15:57 -0800, Jeff Davis wrote:
> > Attaching latest version (combined logtape changes along with main
> > HashAgg patch).
>
> I ran a matrix of small performance tests to look for regressions.

I ran some more tests, this time comparing Hash Aggregation to Sort+Group.

Summary of trends:

  group key complexity : favors Hash
  group key size       : favors Hash
  group size           : favors Hash
  higher work_mem      : favors Sort [1]
  data set size        : favors Sort [1]
  number of aggregates : favors Hash [2]

[1] I have closed the gap a bit with some post-experiment tuning. I have just begun to analyze this case so I think there is quite a bit more room for improvement.
[2] Could use more exploration -- I don't have an explanation.

Data sets:

  t20m_1_int4  : ~20 million groups of size ~1 (uniform)
  t1m_20_int4  : ~1 million groups of size ~20 (uniform)
  t1k_20k_int4 : ~1k groups of size ~20k (uniform)

also, text versions of each of those with collate "C.UTF-8"

Results:

1. A general test to vary the group size, key type, and work_mem.

Query: select i from $TABLE group by i offset 1;

work_mem='4MB'

  +--------------+----------+-------------+--------------+
  |              | sort(ms) | hashagg(ms) | sort/hashagg |
  +--------------+----------+-------------+--------------+
  | t20m_1_int4  |    11852 |       10640 |         1.11 |
  | t1m_20_int4  |    11108 |        8109 |         1.37 |
  | t1k_20k_int4 |     8575 |        2732 |         3.14 |
  | t20m_1_text  |    80463 |       12902 |         6.24 |
  | t1m_20_text  |    58586 |        9252 |         6.33 |
  | t1k_20k_text |    21781 |        5739 |         3.80 |
  +--------------+----------+-------------+--------------+

work_mem='32MB'

  +--------------+----------+-------------+--------------+
  |              | sort(ms) | hashagg(ms) | sort/hashagg |
  +--------------+----------+-------------+--------------+
  | t20m_1_int4  |     9656 |       11702 |         0.83 |
  | t1m_20_int4  |     8870 |        9804 |         0.90 |
  | t1k_20k_int4 |     6359 |        1852 |         3.43 |
  | t20m_1_text  |    74266 |       14434 |         5.15 |
  | t1m_20_text  |    56549 |       10180 |         5.55 |
  | t1k_20k_text |    21407 |        3989 |         5.37 |
  +--------------+----------+-------------+--------------+

2. Test group key size

Data set: 20m rows, four int4 columns. Columns a,b,c are all the constant value 1, forcing each comparison to look at all four columns.

Query: select a,b,c,d from wide group by a,b,c,d offset 1;

work_mem='4MB'

  Sort         : 30852ms
  HashAgg      : 12343ms
  Sort/HashAgg : 2.50

In theory, if the first grouping column is highly selective, then Sort may have a slight advantage because it can look at only the first column, while HashAgg needs to look at all 4. But HashAgg only needs to perform this calculation once, and it seems hard enough to show this in practice that I consider it an edge case. In "normal" cases, it appears that more grouping columns significantly favor HashAgg.

3. Test number of aggregates

Data set: same as for test #2 (group key size).

Query: select d, count(a),sum(b),avg(c),min(d) from wide group by d offset 1;

work_mem='4MB'

  Sort         : 22373ms
  HashAgg      : 17338ms
  Sort/HashAgg : 1.29

I don't have an explanation of why HashAgg is doing better here. Both of them are using JIT and essentially doing the same number of advancements. This could use more exploration, but the effect isn't major.

4. Test data size

Data: 400 million rows of four random int8s. Group size of one.

Query: select a from t400m_1_int8 group by a offset 10;

work_mem='32MB'

  Sort         : 300675ms
  HashAgg      : 560740ms
  Sort/HashAgg : 0.54

I tried increasing the max number of partitions and brought the HashAgg runtime down to 481985ms (using 1024 partitions), which closes the gap to 0.62. That's not too bad for HashAgg considering this is a group size of one with a simple group key. A bit more tuning might be able to close the gap further.

Conclusion: HashAgg is winning in a lot of cases, and this will be an important improvement for many workloads. Not only is it faster in a lot of cases, but it's also less risky. When an input has unknown group size, it's much easier for the planner to choose HashAgg -- a small downside and a big upside.

Regards,
Jeff Davis
Re: Standards compliance of SET ROLE / SET SESSION AUTHORIZATION
Chapman Flack writes: > On 2/14/20 4:01 PM, Tom Lane wrote: >> ... A protocol-level message >> to set session auth could also be possible, of course. > I'll once again whimper softly and perhaps ineffectually that an > SQL-exposed equivalent like > SET SESSION AUTHORIZATION foo WITH RESET COOKIE 'lkjhikuhoihkihlj'; > would seem to suit the same purpose, with the advantage of being > immediately usable by any kind of front- or middle-end code the > instant there is a server version that supports it, without having > to wait for something new at the protocol level to trickle through > to n different driver implementations. Yeah, I'm not that thrilled with the idea of a protocol message that's not equivalent to any SQL-level functionality, either. But the immediate point here is that I think we could get away with playing around with SET SESSION AUTHORIZATION's semantics. Or, seeing that that's just syntactic sugar for "SET session_authorization", we could invent some new GUCs that allow control over this, rather than new syntax. regards, tom lane
Re: Standards compliance of SET ROLE / SET SESSION AUTHORIZATION
On 2/14/20 4:01 PM, Tom Lane wrote: > Robert Haas writes: >> It wouldn't be difficult to introduce a new protocol-level option that >> prohibits RESET SESSION AUTHORIZATION; and it would also be possible >> to introduce a new protocol message that has the same effect as RESET >> SESSION AUTHORIZATION. If you do those two things, then it's possible >> to create a sandbox which the end client cannot escape but which the >> pooler can escape easily. > ... > SET SESSION AUTHORIZATION foo PERMANENT; > ... A protocol-level message > to set session auth could also be possible, of course. I'll once again whimper softly and perhaps ineffectually that an SQL-exposed equivalent like SET SESSION AUTHORIZATION foo WITH RESET COOKIE 'lkjhikuhoihkihlj'; would seem to suit the same purpose, with the advantage of being immediately usable by any kind of front- or middle-end code the instant there is a server version that supports it, without having to wait for something new at the protocol level to trickle through to n different driver implementations. Regards, -Chap
Re: [DOC] Document auto vacuum interruption
On Thu, Sep 19, 2019 at 5:34 AM Amit Kapila wrote: > > On Wed, Sep 18, 2019 at 10:25 AM Amit Kapila wrote: > > > > On Tue, Sep 17, 2019 at 5:48 PM James Coleman wrote: > > > > > > On Tue, Sep 17, 2019 at 2:21 AM Amit Kapila > > > wrote: > > > > > > > > > > > > Let me know what you think of attached? I think we can back-patch > > > > this patch. What do you think? Does anyone else have an opinion on > > > > this patch especially if we see any problem in back-patching this? > > > > > > The attached looks great! > > > > > > I was working on HEAD for the patch, but this concern has been an > > > issue for quite a long time. We were running into it on 9.6 in > > > production, for example. And given how frequently it seems like there > > > are large-scale production issues related to auto vacuum, I think any > > > amount of back patching we can do to make that footgun less likely > > > would be a good thing. > > > > > > > Okay, I will commit this tomorrow unless someone has any comments or > > objections. > > > > Pushed with minor changes. There was one extra space in a few lines > and the tag for back-branches (from 10~9.4) was slightly different. I completely forgot to reply to this; thanks Amit for working on this. James
Re: [DOC] Document concurrent index builds waiting on each other
On Sun, Sep 29, 2019 at 9:24 PM Michael Paquier wrote: > > On Sat, Sep 28, 2019 at 10:22:28PM -0300, Alvaro Herrera wrote: > > I always thought that create index concurrently was prevented from > > running concurrently in a table by the ShareUpdateExclusive lock that's > > held during the operation. > > REINDEX CONCURRENTLY and CIC can deadlock while waiting for each other > to finish after their validation phase, see: > https://www.postgresql.org/message-id/20190507030756.gd1...@paquier.xyz > https://www.postgresql.org/message-id/20190507032543.gh1...@paquier.xyz Michael, Thanks for the cross-link. Do you think this would be valuable to document at the same time? Or did you just want to ensure we were also aware of this particular downfall? If the latter, I appreciate it; it's helpful info. If the former, let me know, and I'll try to update the patch. Thanks, James
Re: [DOC] Document concurrent index builds waiting on each other
I went ahead and registered this in the currently open CF as https://commitfest.postgresql.org/27/2454/ Alvaro: Would you like to be added as a reviewer? James
Standards compliance of SET ROLE / SET SESSION AUTHORIZATION
[ Starting a new thread about this, since the old one about GUC reporting is only marginally related to this point ... if it were more so, maybe I'd have found it when I went looking for it yesterday ] Robert Haas writes: > On Tue, Nov 5, 2019 at 10:02 AM Alvaro Herrera > wrote: >> There's a reason the SQL standard defines SET SESSION AUTHORIZATION but >> no RESET SESSION AUTHORIZATION: once you enter a security context, you >> cannot escape it. ISTM that essentially we broke feature F321 "User >> authorization" by adding RESET into the mix. (I think RESET ROLE breaks >> the spirit of feature T331 too.) The SQL:2016 standard describes how >> this is supposed to work in Foundation "4.40.1.1 SQL-session >> authorization identifiers" (same section is numbered 4.35.1.1 in >> SQL:2011), and ISTM we made a huge mess of it. >> >> I don't see how to fix it, though. If we were to adopt the standard's >> mechanism, we'd probably break tons of existing code. > It wouldn't be difficult to introduce a new protocol-level option that > prohibits RESET SESSION AUTHORIZATION; and it would also be possible > to introduce a new protocol message that has the same effect as RESET > SESSION AUTHORIZATION. If you do those two things, then it's possible > to create a sandbox which the end client cannot escape but which the > pooler can escape easily. I went looking into the SQL standard to see just what it says about this, and I'm darned if I see anything supporting Alvaro's argument. I do not have SQL:2016 at hand, but in SQL:2011 what I see is that section 4.35.1.1 describes a stack of authorization identifiers and/or roles that controls the currently-applicable privileges. It says Let E be an externally-invoked procedure, SQL-invoked routine, triggered action, prepared statement, or directly executed statement. When E is invoked, a copy of the top cell is pushed onto the authorization stack. 
If the invocation of E is to be under definer's rights, then the contents of the top cell are replaced with the authorization identifier of the owner of E. On completion of the execution of E, the top cell is removed. ... The <set session user identifier statement> changes the value of the current user identifier and of the SQL-session user identifier. The <set role statement> changes the value of the current role name. ... The term current authorization identifier denotes an authorization identifier in the top cell of the authorization stack. There is nothing anywhere in 4.35 that constrains the allowable transitions of authorization identifiers. The only thing I can find on that point is in the General Rules of 19.2 (a/k/a SET SESSION AUTHORIZATION), which says: 4) If V is not equal to the current value of the SQL-session user identifier of the current SQL-session context, then the restrictions on the permissible values for V are implementation-defined. 5) If the current user identifier and the current role name are restricted from setting the user identifier to V, then an exception condition is raised: invalid authorization specification. So as far as I can see, restrictions on what SET SESSION AUTHORIZATION can set the authorization ID to are implementation-defined, full stop. There might be considerable value in the semantics Alvaro suggests, but I think arguing that the spec requires 'em is just wrong. On the other hand, the restrictions on SET ROLE in 19.3 are much less squishy: 3) If <role specification> contains a <value specification>, then: c) If no role authorization descriptor exists that indicates that the role identified by V has been granted to either the current user identifier or to PUBLIC, then an exception condition is raised: invalid role specification. d) The SQL-session role name and the current role name are set to V. 4) If NONE is specified, then the current role name is removed. As best I can tell, we actually are entirely compliant with that, modulo the fact that we don't think of the current state as an <authorization identifier, role name> pair. 
What you can SET ROLE to is determined by your authorization identifier, not your current role, and so doing a SET ROLE doesn't change what you can SET ROLE to later. The argument that "RESET ROLE" is somehow invalid seems a little silly when "SET ROLE NONE" does the same thing. What I'm now thinking is that we shouldn't mess with the behavior of SET ROLE, as I mused about doing yesterday in [1]. It's spec-compliant, or close enough, so let's leave it be. On the other hand, changing the behavior of SET SESSION AUTHORIZATION is not constrained by spec compliance concerns, only backwards compatibility. We could address the pg_dump concerns I had in [1] by tweaking what SET SESSION AUTHORIZATION can do and then adjusting pg_dump to swap its usage of SET SESSION AUTHORIZATION (do that just once, in response to --role) and SET ROLE (do that per-object, to establish ownership). The only thing stopping us from addressing Alvaro's concern is bac
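[Editor's illustration, not part of the thread.] The SQL:2011 semantics Tom quotes — a stack of authorization cells, definer's-rights replacement, and a SET ROLE check made against the user identifier rather than the current role — can be sketched as a small simulation. Everything here (class names, the `granted` map) is my own assumption, intended only to make the push/replace/pop rules and the "SET ROLE doesn't change what you can SET ROLE to" point concrete.

```python
# Sketch of the SQL:2011 authorization stack described above (assumptions mine).
class AuthStack:
    def __init__(self, session_user):
        # Each cell is a (user identifier, role name) pair; role may be None.
        self.cells = [(session_user, None)]
        # granted[user] = roles granted to that user (or to "PUBLIC").
        self.granted = {}

    def top(self):
        return self.cells[-1]

    def invoke(self, definer=None):
        # "When E is invoked, a copy of the top cell is pushed onto the
        # authorization stack"; under definer's rights, the new top cell is
        # replaced with the authorization identifier of the owner of E.
        user, role = self.top()
        self.cells.append((definer, None) if definer else (user, role))

    def complete(self):
        # "On completion of the execution of E, the top cell is removed."
        self.cells.pop()

    def set_role(self, role):
        # Per 19.3: the role must be granted to the current *user identifier*
        # (or PUBLIC). The current role is irrelevant, so one SET ROLE never
        # changes what you may SET ROLE to later. role=None models SET ROLE NONE.
        user, _ = self.top()
        if role is not None and role not in self.granted.get(user, set()) \
                and role not in self.granted.get("PUBLIC", set()):
            raise PermissionError("invalid role specification")
        self.cells[-1] = (user, role)

s = AuthStack("alice")
s.granted["alice"] = {"admin", "auditor"}
s.set_role("admin")
s.set_role("auditor")    # still allowed: checked against user, not role
s.invoke(definer="bob")  # a definer's-rights routine runs as bob
s.complete()             # ... and its cell is popped on completion
assert s.top() == ("alice", "auditor")
```

Note how `set_role(None)` (i.e. SET ROLE NONE) and a hypothetical RESET ROLE would land in the same state, which is Tom's point about the "RESET ROLE is invalid" argument.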
Re: Error on failed COMMIT
On Fri, 14 Feb 2020 at 15:19, Alvaro Herrera wrote: > On 2020-Feb-14, Dave Cramer wrote: > > > I offered to make the behaviour in the JDBC driver dependent on a > > configuration parameter > > Do you mean "if con.commit() results in a rollback, then an exception is > thrown, unless the parameter XYZ is set to PQR"? > No. JDBC has a number of connection parameters which change internal semantics of various things. I was proposing to have a connection parameter which would be throwExceptionOnFailedCommit (or something) which would do what it says. None of this would touch the server. It would just change the driver semantics. Dave Cramer www.postgres.rocks
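[Editor's illustration, not part of the thread.] The driver-side behavior Dave describes — a connection option gating whether commit() raises when the server answers COMMIT with a ROLLBACK tag — can be sketched in miniature. This is a hypothetical model, not JDBC code: the option name mirrors Dave's suggested `throwExceptionOnFailedCommit`, and the wire protocol is replaced by a stub. Nothing here touches or describes actual server behavior beyond what the thread states (COMMIT in an aborted transaction reports the ROLLBACK command tag).

```python
# Hypothetical sketch of the proposed driver-side option (names invented here).
class FailedCommitError(Exception):
    pass

class Connection:
    def __init__(self, throw_exception_on_failed_commit=False):
        self.throw_exception_on_failed_commit = throw_exception_on_failed_commit

    def _send_commit(self):
        # Stand-in for the wire protocol: per the thread, a COMMIT issued in
        # an aborted transaction comes back with the command tag "ROLLBACK".
        return "ROLLBACK"

    def commit(self):
        tag = self._send_commit()
        if tag != "COMMIT" and self.throw_exception_on_failed_commit:
            raise FailedCommitError(f"commit failed; server reported {tag}")
        return tag  # default: surface the tag, matching today's semantics

con = Connection(throw_exception_on_failed_commit=True)
try:
    con.commit()
except FailedCommitError:
    pass  # framework-level code can now detect the silent rollback
```

With the option off, the driver keeps its current behavior; with it on, the framework scenario from upthread (commit "succeeding" without persisting) becomes detectable, all without server changes.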
Re: Zedstore - compressed in-core columnar storage
Hello, We (David and I) recently observed that a Zedstore table can be considerably bloated when we load data into it with concurrent copies. Also, we found that concurrent insert performance was less than desirable. This is a detailed analysis of the extent of the problem, the cause of the problem, and how we fixed it. This has input from much of our team: Alex, Ashwin, David, Heikki, Melanie, Taylor and myself.

An example of the bloat that we observed, at TPC-DS scale = 270:

Table       heap   zed (serial)   zed (16 parallel COPYs)
web_sales   39G    19G            39G

We found that it was caused by inefficient page splits resulting from out-of-tid-order inserts into full/full-ish attribute tree leaf pages. The degree of under-utilization was significant - attribute tree leaves with a serial data load had 6-8x more datums than the attribute tree leaves resulting from a parallel load of 16 sessions.

Consider the scenario below.

Assumptions:
1. Let us consider two concurrent copy commands executing (sessions S1 and S2).
2. The table has only one (fixed-length, for the sake of argument) attribute 'a'.
3. For attribute 'a', a full attribute tree leaf page can accommodate 1500 datums.

TID allocations:
S1: 1-1000
S2: 1001-2000, 2001-3000

Order of operations:

1. S2 writes datums for tids 1001-2000, 2001-3000. The resulting leaves are:

L1: lokey = 1, hikey = 2500, firsttid = 1001, lasttid = 2500
L2: lokey = 2501, hikey = MaxZSTid, firsttid = 2501, lasttid = 3000

2. S1 now writes datums for tids 1-1000. We have to split L1 into L1' and L1''.

L1': lokey = 1, hikey = 1500, firsttid = 1, lasttid = 1500
L1'' [under-utilized page]: lokey = 1501, hikey = 2000, firsttid = 1501, lasttid = 2000
L2: lokey = 2501, hikey = MaxZSTid, firsttid = 2501, lasttid = 3000

Note: The lokeys/hikeys reflect ranges of what CAN be inserted, whereas firsttid and lasttid reflect what actually has been inserted. L1'' will be an under-utilized page that is not going to be filled again because it inherits the tight hikey from L1. 
In this example, space wastage in L1'' is 66%, but it could very easily be close to 100%, especially under concurrent workloads which mix single and multi-inserts, or even unequally sized multi-inserts.

Solution (kudos to Ashwin!):

For every multi-insert (and only multi-insert, not for singleton inserts), allocate N times more tids. Each session will keep these extra tids in a buffer. Subsequent calls to multi-insert would use these buffered tids. If at any time a tid allocation request cannot be met by the remaining buffered tids, a new batch of N times the number of tids requested will again be allocated.

If we take the same example above and say we allocated N=5 times the number of tids upon the first request for 1000 tids:

TID allocations:
S1: 1-5000
S2: 5001-10000

Order of operations:

1. S2 writes datums for tids 5001-6000, 6001-7000. The resulting leaves are:

L1: lokey = 1, hikey = 6500, firsttid = 5001, lasttid = 6500
L2: lokey = 6501, hikey = MaxZSTid, firsttid = 6501, lasttid = 7000

2. S1 writes datums for tids 1-1000. L1 will be split into L1' and L1''.

L1': lokey = 1, hikey = 5500, firsttid = 1, lasttid = 1000
L1'' [under-utilized page]: lokey = 5501, hikey = 6500, firsttid = 5501, lasttid = 6500
L2: lokey = 6501, hikey = MaxZSTid, firsttid = 6501, lasttid = 7000

Subsequent inserts by S1 will land on L1', whose hikey isn't restrictive. However, we do end up with the inefficient page L1''. With a high enough value of N, we reduce the frequency of such pages. We could further reduce this wastage by incorporating a special left split (since L1 was already full, we don't change it at all -> we simply update its lokey -> L1 becomes L1'' and we fork off a new leaf to its left: L1'). This would look like:

L1': lokey = 1, hikey = 5000, firsttid = 1, lasttid = 1000
L1'': lokey = 5001, hikey = 6500, firsttid = 5001, lasttid = 6500

We found that with a high enough value of N, we did not get significant space benefits from the left split. Thus, we decided to only incorporate N. 
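[Editor's illustration, not part of the thread.] The per-session tid buffering described above is easy to state precisely in a few lines. This is a rough sketch under my own assumptions (a single shared counter stands in for the real tid allocator); it shows only the buffering logic: a multi-insert that cannot be satisfied from the buffer allocates N times what it needs, so each session ends up writing long contiguous tid runs.

```python
# Sketch of the batched tid allocation (assumptions mine, not Zedstore code).
class TidAllocator:
    """Global, shared allocator handing out contiguous tid ranges."""
    def __init__(self):
        self.next_tid = 1

    def allocate(self, n):
        first = self.next_tid
        self.next_tid += n
        return list(range(first, first + n))

class Session:
    def __init__(self, allocator, n_factor=10):
        self.allocator = allocator
        self.n_factor = n_factor  # the "N" from the discussion above
        self.buffered = []

    def get_tids_for_multi_insert(self, nneeded):
        # Refill with N x nneeded tids when the buffer can't satisfy the
        # request; singleton inserts would bypass this buffering entirely.
        if len(self.buffered) < nneeded:
            self.buffered = self.allocator.allocate(self.n_factor * nneeded)
        tids, self.buffered = self.buffered[:nneeded], self.buffered[nneeded:]
        return tids

alloc = TidAllocator()
s1, s2 = Session(alloc, n_factor=5), Session(alloc, n_factor=5)
batch1 = s1.get_tids_for_multi_insert(1000)  # S1 buffers tids 1-5000
batch2 = s2.get_tids_for_multi_insert(1000)  # S2 buffers tids 5001-10000
assert (batch1[0], batch1[-1]) == (1, 1000)
assert (batch2[0], batch2[-1]) == (5001, 6000)
# S1's next batch stays inside its own contiguous range:
assert s1.get_tids_for_multi_insert(1000)[0] == 1001
```

The trade-off the results below quantify is visible here: larger N means longer private runs (fewer bad splits) at the cost of larger unused tid gaps when a session finishes with tids still buffered.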
Results [TPC-DS scale = 50, 16 concurrent copies]:

Table             zed    N=10   N=100   N=1000   heap   zed(serial)
catalog_sales     15G    9.1G   7.7G    7.5G     15G    8.0G
catalog_returns   1.5G   0.9G   0.7G    0.7G     1.2G   0.8G
store_returns     2.1G   1.2G   1.1G    1.1G     1.9G   1.2G
store_sales       17G    11G    10.1G   10.1G    21G    10G

Load time:
N=10     30min
N=100    10min
N=1000   7min
zed      100min
heap     8min

'zed' refers to the zedstore branch without our fix. We see that with N = 10, we get closer to what we get with serial inserts. For N = 100, we even be
Re: Error on failed COMMIT
On 2020-Feb-14, Dave Cramer wrote: > I offered to make the behaviour in the JDBC driver dependent on a > configuration parameter Do you mean "if con.commit() results in a rollback, then an exception is thrown, unless the parameter XYZ is set to PQR"? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Error on failed COMMIT
On Fri, 14 Feb 2020 at 15:07, Tom Lane wrote: > Dave Cramer writes: > > On Fri, 14 Feb 2020 at 14:37, Robert Haas wrote: > >> I'm not trying to deny that you might find some other server behavior > >> more convenient. You might. And, to Vik's original point, it might be > >> more compliant with the spec, too. But since changing that would have > >> a pretty big blast radius at this stage, I think it's worth trying to > >> make things work as well as they can with the server behavior that we > >> already have. And I don't really see anything preventing the driver > >> from doing that technically. I don't understand the idea that the > >> driver is somehow not allowed to notice the command tag. > > > We have the same blast radius. > > I have offered to make the behaviour requested dependent on a > configuration > > parameter but apparently this is not sufficient. > > Nope, that is absolutely not happening. I should have been more specific. I offered to make the behaviour in the JDBC driver dependent on a configuration parameter. Dave Cramer www.postgres.rocks
Re: Error on failed COMMIT
Dave Cramer writes: > On Fri, 14 Feb 2020 at 14:37, Robert Haas wrote: >> I'm not trying to deny that you might find some other server behavior >> more convenient. You might. And, to Vik's original point, it might be >> more compliant with the spec, too. But since changing that would have >> a pretty big blast radius at this stage, I think it's worth trying to >> make things work as well as they can with the server behavior that we >> already have. And I don't really see anything preventing the driver >> from doing that technically. I don't understand the idea that the >> driver is somehow not allowed to notice the command tag. > We have the same blast radius. > I have offered to make the behaviour requested dependent on a configuration > parameter but apparently this is not sufficient. Nope, that is absolutely not happening. We learned very painfully, back around 7.3 when we tried to put in autocommit on/off, that if server behaviors like this are configurable then most client code has to be prepared to work with every possible setting. The argument that "you can just set it to work the way you expect" is a dangerous falsehood. I see no reason to think that a change like this wouldn't suffer the same sort of embarrassing and expensive failure that autocommit did. regards, tom lane
Re: Error on failed COMMIT
On Fri, 14 Feb 2020 at 14:37, Robert Haas wrote: > On Fri, Feb 14, 2020 at 2:08 PM Dave Cramer > wrote: > > Well now you are asking the driver to re-interpret the results in a > different way than the server which is not what we tend to do. > > > > The server throws an error we throw an error. We really aren't in the > business of re-interpreting the servers responses. > > I don't really see a reason why the driver has to throw an exception > if and only if there is an ERROR on the PostgreSQL side. But even if > you want to make that rule for some reason, it doesn't preclude > correct behavior here. All you really need is to have con.commit() > return some indication of what the command tag was, just as, say, psql > would do. If the server provides you with status information and you > throw it out instead of passing it along to the application, that's > not ideal. > Well con.commit() returns void :( > > Another thing that kinda puzzles me about this situation is that, as > far as I know, the only time COMMIT returns ROLLBACK is if the > transaction has already previously reported an ERROR. But if an ERROR > gets turned into an exception, then, in the code snippet previously > provided, we'd never reach con.commit() in the first place. > The OP is building a framework where it's possible for the exception to be swallowed by the consumer of the framework. > I'm not trying to deny that you might find some other server behavior > more convenient. You might. And, to Vik's original point, it might be > more compliant with the spec, too. But since changing that would have > a pretty big blast radius at this stage, I think it's worth trying to > make things work as well as they can with the server behavior that we > already have. And I don't really see anything preventing the driver > from doing that technically. I don't understand the idea that the > driver is somehow not allowed to notice the command tag. > We have the same blast radius. 
I have offered to make the behaviour requested dependent on a configuration parameter but apparently this is not sufficient. Dave Cramer www.postgres.rocks
Re: Error on failed COMMIT
On Fri, Feb 14, 2020 at 2:08 PM Dave Cramer wrote: > Well now you are asking the driver to re-interpret the results in a different > way than the server which is not what we tend to do. > > The server throws an error we throw an error. We really aren't in the > business of re-interpreting the servers responses. I don't really see a reason why the driver has to throw an exception if and only if there is an ERROR on the PostgreSQL side. But even if you want to make that rule for some reason, it doesn't preclude correct behavior here. All you really need is to have con.commit() return some indication of what the command tag was, just as, say, psql would do. If the server provides you with status information and you throw it out instead of passing it along to the application, that's not ideal. Another thing that kinda puzzles me about this situation is that, as far as I know, the only time COMMIT returns ROLLBACK is if the transaction has already previously reported an ERROR. But if an ERROR gets turned into an exception, then, in the code snippet previously provided, we'd never reach con.commit() in the first place. I'm not trying to deny that you might find some other server behavior more convenient. You might. And, to Vik's original point, it might be more compliant with the spec, too. But since changing that would have a pretty big blast radius at this stage, I think it's worth trying to make things work as well as they can with the server behavior that we already have. And I don't really see anything preventing the driver from doing that technically. I don't understand the idea that the driver is somehow not allowed to notice the command tag. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Error on failed COMMIT
On Fri, 14 Feb 2020 at 13:29, Robert Haas wrote: > On Fri, Feb 14, 2020 at 1:04 PM Dave Cramer > wrote: > > Thing is that con.commit() DOESN'T return a status code, nor does it > throw an exception as we silently ROLLBACK here. > > Why not? There's nothing keeping the driver from doing either of those > things, is there? I mean, if using libpq, you can use PQcmdStatus() to > get the command tag, and find out whether it's COMMIT or ROLLBACK. If > you're implementing the wire protocol directly, you can do something > similar. > > > https://www.postgresql.org/docs/current/libpq-exec.html#LIBPQ-EXEC-NONSELECT Well now you are asking the driver to re-interpret the results in a different way than the server which is not what we tend to do. The server throws an error we throw an error. We really aren't in the business of re-interpreting the servers responses. Dave Cramer www.postgres.rocks
Re: assert pg_class.relnatts is consistent
Alvaro Herrera writes: > On 2020-Feb-14, John Naylor wrote: >> One possible objection to what I wrote above is that it adds a >> different kind of special case, but in a sneaky way. Perhaps it would >> be more principled to treat it the same as oid after all. If we do >> that, it would help to add a comment that we can't treat relnatts like >> pronangs, since we need more information than what's in each pg_class >> row. > How about something like this? (untested) I think John's idea of setting a dummy BKI_DEFAULT value is better, as that means the only code that has to worry about this directly is the code that's actually filling in relnatts. As far as said code goes, we don't need an additional global variable when we can just look in the $catalogs data structure; and I'm not a fan of cramming this into the OID-assignment logic just to save a loop. So that leads me to the attached. (I agree with Alvaro's thought of shortening AddDefaultValues, but didn't do that here.) regards, tom lane diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl index 8032512..ad24f4d 100644 --- a/src/backend/catalog/genbki.pl +++ b/src/backend/catalog/genbki.pl @@ -184,6 +184,15 @@ my $PG_CATALOG_NAMESPACE = 'PG_CATALOG_NAMESPACE'); +# Fill in pg_class.relnatts by looking at the referenced catalog's schema. +# This is ugly but there's no better place; Catalog::AddDefaultValues +# can't do it, for lack of easy access to the other catalog. +foreach my $row (@{ $catalog_data{pg_class} }) +{ + $row->{relnatts} = scalar(@{ $catalogs{ $row->{relname} }->{columns} }); +} + + # Build lookup tables. # access method OID lookup diff --git a/src/include/catalog/pg_class.dat b/src/include/catalog/pg_class.dat index f70d5ba..b28bba7 100644 --- a/src/include/catalog/pg_class.dat +++ b/src/include/catalog/pg_class.dat @@ -14,51 +14,15 @@ # Note: only bootstrap catalogs, ie those marked BKI_BOOTSTRAP, need to # have entries here. 
Be sure that the OIDs listed here match those given in -# their CATALOG and BKI_ROWTYPE_OID macros, and that the relnatts values are -# correct. - -# Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId; -# similarly, "1" in relminmxid stands for FirstMultiXactId +# their CATALOG and BKI_ROWTYPE_OID macros. { oid => '1247', - relname => 'pg_type', reltype => 'pg_type', relam => 'heap', - relfilenode => '0', relpages => '0', reltuples => '0', relallvisible => '0', - reltoastrelid => '0', relhasindex => 'f', relisshared => 'f', - relpersistence => 'p', relkind => 'r', relnatts => '31', relchecks => '0', - relhasrules => 'f', relhastriggers => 'f', relhassubclass => 'f', - relrowsecurity => 'f', relforcerowsecurity => 'f', relispopulated => 't', - relreplident => 'n', relispartition => 'f', relfrozenxid => '3', - relminmxid => '1', relacl => '_null_', reloptions => '_null_', - relpartbound => '_null_' }, + relname => 'pg_type', reltype => 'pg_type' }, { oid => '1249', - relname => 'pg_attribute', reltype => 'pg_attribute', relam => 'heap', - relfilenode => '0', relpages => '0', reltuples => '0', relallvisible => '0', - reltoastrelid => '0', relhasindex => 'f', relisshared => 'f', - relpersistence => 'p', relkind => 'r', relnatts => '25', relchecks => '0', - relhasrules => 'f', relhastriggers => 'f', relhassubclass => 'f', - relrowsecurity => 'f', relforcerowsecurity => 'f', relispopulated => 't', - relreplident => 'n', relispartition => 'f', relfrozenxid => '3', - relminmxid => '1', relacl => '_null_', reloptions => '_null_', - relpartbound => '_null_' }, + relname => 'pg_attribute', reltype => 'pg_attribute' }, { oid => '1255', - relname => 'pg_proc', reltype => 'pg_proc', relam => 'heap', - relfilenode => '0', relpages => '0', reltuples => '0', relallvisible => '0', - reltoastrelid => '0', relhasindex => 'f', relisshared => 'f', - relpersistence => 'p', relkind => 'r', relnatts => '29', relchecks => '0', - relhasrules => 'f', relhastriggers => 
'f', relhassubclass => 'f', - relrowsecurity => 'f', relforcerowsecurity => 'f', relispopulated => 't', - relreplident => 'n', relispartition => 'f', relfrozenxid => '3', - relminmxid => '1', relacl => '_null_', reloptions => '_null_', - relpartbound => '_null_' }, + relname => 'pg_proc', reltype => 'pg_proc' }, { oid => '1259', - relname => 'pg_class', reltype => 'pg_class', relam => 'heap', - relfilenode => '0', relpages => '0', reltuples => '0', relallvisible => '0', - reltoastrelid => '0', relhasindex => 'f', relisshared => 'f', - relpersistence => 'p', relkind => 'r', relnatts => '33', relchecks => '0', - relhasrules => 'f', relhastriggers => 'f', relhassubclass => 'f', - relrowsecurity => 'f', relforcerowsecurity => 'f', relispopulated => 't', - relreplident => 'n', relispartition => 'f', relfrozenxid => '3', - relminmxid => '1', relacl => '_null_', reloptions => '_null_', - relpartbound => '_null_' }, + relname => 'pg_class', reltype => 'pg_c
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Fri, Feb 14, 2020 at 1:07 PM Andres Freund wrote: > Yea, that seems possible. I'm not really sure it's needed however? As > long as you're not teaching the locking mechanism new tricks that > influence the wait graph, why would the deadlock detector care? That's > quite different from the group locking case, where you explicitly needed > to teach it something fairly fundamental. Well, you have to teach it that locks of certain types conflict even if they are in the same group, and that bleeds over pretty quickly into the whole area of deadlock detection, because lock waits are the edges in the graph that the deadlock detector processes. > It might still be a good idea independently to add the rule & enforce > that acquire heavyweight locks while holding certain classes of locks is > not allowed. I think that's absolutely essential, if we're going to continue using the main lock manager for this. I remain somewhat unconvinced that doing so is the best way forward, but it is *a* way forward. > Right. For me that's *the* fundamental service that lock.c delivers. And > it's the fundamental bit this thread so far largely has been focusing > on. For me, the deadlock detection is the far more complicated and problematic bit. > Isn't that mostly true to varying degrees for the majority of lock types > in lock.c? Sure, perhaps historically that's a misuse of lock.c, but > it's been pretty ingrained by now. I just don't see where leaving out > any of these features is going to give us fundamental advantages > justifying a different locking infrastructure. I think the group locking + deadlock detection things are more fundamental than you might be crediting, but I agree that having parallel mechanisms has its own set of pitfalls. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Error on failed COMMIT
On Fri, Feb 14, 2020 at 1:04 PM Dave Cramer wrote: > Thing is that con.commit() DOESN'T return a status code, nor does it throw an > exception as we silently ROLLBACK here. Why not? There's nothing keeping the driver from doing either of those things, is there? I mean, if using libpq, you can use PQcmdStatus() to get the command tag, and find out whether it's COMMIT or ROLLBACK. If you're implementing the wire protocol directly, you can do something similar. https://www.postgresql.org/docs/current/libpq-exec.html#LIBPQ-EXEC-NONSELECT -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
Hi, On 2020-02-14 12:08:45 -0500, Robert Haas wrote: > On Fri, Feb 14, 2020 at 11:40 AM Andres Freund wrote: > > IMO the right thing here is to extend lock.c so we can better represent > > whether certain types of lockmethods (& levels ?) are [not] to be > > shared. > > The part that I find awkward about that is the whole thing with the > deadlock detector. The deadlock detection code is old, crufty, > complex, and very difficult to test (or at least I have found it so). > A bug that I introduced when inventing group locking took like 5 years > for somebody to find. Oh, I agree, lock.c and surrounding code is pretty crufty. Doubtful that just building up a largely parallel piece of infrastructure next to it is a good answer though. > One way of looking at the requirement that we have here is that > certain kinds of locks need to be exempted from group locking. > Basically, these are because they are a lower-level concept: a lock on > a relation is more of a "logical" concept, and you hold the lock until > eoxact, whereas a lock on an extend the relation is more of a > "physical" concept, and you give it up as soon as you are done. Page > locks are like relation extension locks in this regard. Unlike locks > on SQL-level objects, these should not be shared between members of a > lock group. > > Now, if it weren't for the deadlock detector, that would be easy > enough. But figuring out what to do with the deadlock detector seems > really painful to me. I wonder if there's some way we can make an end > run around that problem. For instance, if we could make (and enforce) > a coding rule that you cannot acquire a heavyweight lock while holding > a relation extension or page lock, then maybe we could somehow teach > the deadlock detector to just ignore those kinds of locks, and teach > the lock acquisition machinery that they conflict between lock group > members. Yea, that seems possible. I'm not really sure it's needed however? 
As long as you're not teaching the locking mechanism new tricks that influence the wait graph, why would the deadlock detector care? That's quite different from the group locking case, where you explicitly needed to teach it something fairly fundamental. It might still be a good idea independently to add the rule & enforce that acquiring heavyweight locks while holding certain classes of locks is not allowed. > On the other hand, I think you might also be understating the > differences between these kinds of locks and other heavyweight locks. > I suspect that the reason why we use lwlocks for buffers and > heavyweight locks here is because there are a conceptually infinite > number of relations, and lwlocks can't handle that. Right. For me that's *the* fundamental service that lock.c delivers. And it's the fundamental bit this thread so far largely has been focusing on. > The only mechanism we currently have that does handle that is the > heavyweight lock mechanism, and from my point of view, somebody just > beat it with a stick to make it fit this application. But the fact > that it has been made to fit does not mean that it is really fit for > purpose. We use 2 of 9 lock levels, we don't need deadlock detection, > we need different behavior when group locking is in use, we release > locks right away rather than at eoxact. I don't think it's crazy to > think that those differences are significant enough to justify having > a separate mechanism, even if the one that is currently on the table > is not exactly what we want. Isn't that mostly true to varying degrees for the majority of lock types in lock.c? Sure, perhaps historically that's a misuse of lock.c, but it's been pretty ingrained by now. I just don't see where leaving out any of these features is going to give us fundamental advantages justifying a different locking infrastructure. E.g. 
not needing to support "conceptually infinite" number of relations IMO does provide a fundamental advantage - no need for a mapping. I'm not yet seeing anything equivalent for the extension vs. lock.c style lock case. Greetings, Andres Freund
Re: Error on failed COMMIT
On Fri, 14 Feb 2020 at 12:37, Robert Haas wrote: > On Thu, Feb 13, 2020 at 2:38 AM Haumacher, Bernhard > wrote: > > Assume the application is written in Java and sees Postgres through the > > JDBC driver: > > > > composeTransaction() { > > Connection con = getConnection(); // implicitly "begin" > > try { > >insertFrameworkLevelState(con); > >insertApplicationLevelState(con); > >con.commit(); > >publishNewState(); > > } catch (Throwable ex) { > >con.rollback(); > > } > > } > > > > With the current implementation, it is possible, that the control flow > > reaches "publishNewState()" without the changes done in > > "insertFrameworkLevelState()" have been made persistent - without the > > framework-level code (which is everything except > > "insertApplicationLevelState()") being able to detect the problem (e.g. > > if "insertApplicationLevelState()" tries add a null into a non-null > > column catching the exception or any other application-level error that > > is not properly handled through safepoints). > > > > From a framework's perspective, this behavior is absolutely > > unacceptable. Here, the framework-level code sees a database that > > commits successfully but does not make its changes persistent. > > Therefore, I don't think that altering this behavior has more downside > > than upside. > > I am not sure that this example really proves anything. If > insertFrameworkLevelState(con), insertApplicationLevelState(con), and > con.commit() throw exceptions, or if they return a status code and you > check it and throw an exception if it's not what you expect, then it > will work. Thing is that con.commit() DOESN'T return a status code, nor does it throw an exception as we silently ROLLBACK here. As noted up thread it's somewhat worse as depending on how the transaction failed we seem to do different things In Tom's example we do issue a warning and say there is no transaction running. I would guess we silently rolled back earlier. 
In my example we don't issue the warning; we just roll back. I do agree with Tom that changing this behaviour at this point would make things worse for more people than it would help, so I am not advocating throwing an error here. I would, however, advocate for consistently doing the same thing with failed transactions. Dave Cramer www.postgres.rocks
Re: Marking some contrib modules as trusted extensions
Andres Freund writes: > On 2020-02-13 18:57:10 -0500, Tom Lane wrote: >> Maybe we could decide that the time for supporting easy updates from >> pre-9.1 is past, and just remove all the unpackaged-to-XXX scripts? >> Maybe even remove the "FROM version" option altogether. > Yea, that strikes me as a reasonable thing to do. These days that just > seems to be dangerous, without much advantage. Here's a patch to remove the core-code support and documentation for that. I have not included the actual deletion of the contrib modules' 'unpackaged' scripts, as that seems both long and boring. regards, tom lane diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml index 08bb110..261a559 100644 --- a/doc/src/sgml/contrib.sgml +++ b/doc/src/sgml/contrib.sgml @@ -88,22 +88,6 @@ CREATE EXTENSION module_name; - If your database was brought forward by dump and reload from a pre-9.1 - version of PostgreSQL, and you had been using the pre-9.1 - version of the module in it, you should instead do - - -CREATE EXTENSION module_name FROM unpackaged; - - - This will update the pre-9.1 objects of the module into a proper - extension object. Future updates to the module will be - managed by . - For more information about extension updates, see - . - - - Note, however, that some of these modules are not extensions in this sense, but are loaded into the server in some other way, for instance by way of diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml index ffe068b..9ec1af7 100644 --- a/doc/src/sgml/extend.sgml +++ b/doc/src/sgml/extend.sgml @@ -917,33 +917,6 @@ SELECT pg_catalog.pg_extension_config_dump('my_config', 'WHERE NOT standard_entr - The update mechanism can be used to solve an important special case: - converting a loose collection of objects into an extension. - Before the extension mechanism was added to - PostgreSQL (in 9.1), many people wrote - extension modules that simply created assorted unpackaged objects. 
- Given an existing database containing such objects, how can we convert - the objects into a properly packaged extension? Dropping them and then - doing a plain CREATE EXTENSION is one way, but it's not - desirable if the objects have dependencies (for example, if there are - table columns of a data type created by the extension). The way to fix - this situation is to create an empty extension, then use ALTER - EXTENSION ADD to attach each pre-existing object to the extension, - then finally create any new objects that are in the current extension - version but were not in the unpackaged release. CREATE - EXTENSION supports this case with its FROM old_version option, which causes it to not run the - normal installation script for the target version, but instead the update - script named - extension--old_version--target_version.sql. - The choice of the dummy version name to use as old_version is up to the extension author, though - unpackaged is a common convention. If you have multiple - prior versions you need to be able to update into extension style, use - multiple dummy version names to identify them. - - - ALTER EXTENSION is able to execute sequences of update script files to achieve a requested update. For example, if only foo--1.0--1.1.sql and foo--1.1--2.0.sql are diff --git a/doc/src/sgml/ref/create_extension.sgml b/doc/src/sgml/ref/create_extension.sgml index d76ac3e..6a21bff 100644 --- a/doc/src/sgml/ref/create_extension.sgml +++ b/doc/src/sgml/ref/create_extension.sgml @@ -24,7 +24,6 @@ PostgreSQL documentation CREATE EXTENSION [ IF NOT EXISTS ] extension_name [ WITH ] [ SCHEMA schema_name ] [ VERSION version ] - [ FROM old_version ] [ CASCADE ] @@ -48,8 +47,9 @@ CREATE EXTENSION [ IF NOT EXISTS ] extension_name The user who runs CREATE EXTENSION becomes the - owner of the extension for purposes of later privilege checks, as well - as the owner of any objects created by the extension's script. 
+ owner of the extension for purposes of later privilege checks, and + normally also becomes the owner of any objects created by the + extension's script. @@ -142,33 +142,6 @@ CREATE EXTENSION [ IF NOT EXISTS ] extension_name - old_version - - -FROM old_version -must be specified when, and only when, you are attempting to install -an extension that replaces an old style module that is just -a collection of objects not packaged into an extension. This option -causes CREATE EXTENSION to run an alternative installation -script that absorbs the existing objects into the extension, instead -of creating new objects. Be careful that SCHEMA specifies -the schema containing these pre-existing objects. - - - -
Re: Error on failed COMMIT
On Thu, Feb 13, 2020 at 2:38 AM Haumacher, Bernhard wrote: > Assume the application is written in Java and sees Postgres through the > JDBC driver: > > composeTransaction() { > Connection con = getConnection(); // implicitly "begin" > try { >insertFrameworkLevelState(con); >insertApplicationLevelState(con); >con.commit(); >publishNewState(); > } catch (Throwable ex) { >con.rollback(); > } > } > > With the current implementation, it is possible, that the control flow > reaches "publishNewState()" without the changes done in > "insertFrameworkLevelState()" have been made persistent - without the > framework-level code (which is everything except > "insertApplicationLevelState()") being able to detect the problem (e.g. > if "insertApplicationLevelState()" tries add a null into a non-null > column catching the exception or any other application-level error that > is not properly handled through safepoints). > > From a framework's perspective, this behavior is absolutely > unacceptable. Here, the framework-level code sees a database that > commits successfully but does not make its changes persistent. > Therefore, I don't think that altering this behavior has more downside > than upside. I am not sure that this example really proves anything. If insertFrameworkLevelState(con), insertApplicationLevelState(con), and con.commit() throw exceptions, or if they return a status code and you check it and throw an exception if it's not what you expect, then it will work. If database errors that occur during those operations are ignored, then you've got a problem, but it does not seem necessary to change the behavior of the database to fix that problem. I think one of the larger issues in this area is that people have script that go: BEGIN; -- do stuff COMMIT; BEGIN; -- do more stuff COMMIT; ...and they run these scripts by piping them into psql. Now, if the COMMIT leaves the session in a transaction state, this is going to have pretty random behavior. 
Like your example, this can be fixed by having proper error checking in the application, but that does require that your application is something more than a psql script. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Fri, Feb 14, 2020 at 11:40 AM Andres Freund wrote: > IMO the right thing here is to extend lock.c so we can better represent > whether certain types of lockmethods (& levels ?) are [not] to be > shared. The part that I find awkward about that is the whole thing with the deadlock detector. The deadlock detection code is old, crufty, complex, and very difficult to test (or at least I have found it so). A bug that I introduced when inventing group locking took like 5 years for somebody to find. One way of looking at the requirement that we have here is that certain kinds of locks need to be exempted from group locking. Basically, that is because they are a lower-level concept: a lock on a relation is more of a "logical" concept, and you hold the lock until eoxact, whereas a lock on extending the relation is more of a "physical" concept, and you give it up as soon as you are done. Page locks are like relation extension locks in this regard. Unlike locks on SQL-level objects, these should not be shared between members of a lock group. Now, if it weren't for the deadlock detector, that would be easy enough. But figuring out what to do with the deadlock detector seems really painful to me. I wonder if there's some way we can make an end run around that problem. For instance, if we could make (and enforce) a coding rule that you cannot acquire a heavyweight lock while holding a relation extension or page lock, then maybe we could somehow teach the deadlock detector to just ignore those kinds of locks, and teach the lock acquisition machinery that they conflict between lock group members. On the other hand, I think you might also be understating the differences between these kinds of locks and other heavyweight locks. I suspect that the reason why we use lwlocks for buffers and heavyweight locks here is because there are a conceptually infinite number of relations, and lwlocks can't handle that.
The only mechanism we currently have that does handle that is the heavyweight lock mechanism, and from my point of view, somebody just beat it with a stick to make it fit this application. But the fact that it has been made to fit does not mean that it is really fit for purpose. We use 2 of 9 lock levels, we don't need deadlock detection, we need different behavior when group locking is in use, we release locks right away rather than at eoxact. I don't think it's crazy to think that those differences are significant enough to justify having a separate mechanism, even if the one that is currently on the table is not exactly what we want. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
Hi, On 2020-02-14 09:42:40 -0500, Tom Lane wrote: > In the second place, it's ludicrous to expect that the underlying > platform/filesystem can support an infinite number of concurrent > file-extension operations. At some level (e.g. where disk blocks > are handed out, or where a record of the operation is written to > a filesystem journal) it's quite likely that things are bottlenecked > down to *one* such operation at a time per filesystem. That's probably true to some degree from a theoretical POV, but I think it's so far from where we are at, that it's effectively wrong. I can concurrently extend a few files at close to 10GB/s on a set of fast devices below a *single* filesystem. Whereas postgres bottlenecks far far before this. Given that a lot of today's storage has latencies in the 10-100s of microseconds, a journal flush doesn't necessarily cause that much serialization - and OS journals do group commit like things too. Greetings, Andres Freund
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
Hi, On 2020-02-12 11:53:49 -0500, Tom Lane wrote: > In general, if we think there are issues with LWLock, it seems to me > we'd be better off to try to fix them, not to invent a whole new > single-purpose lock manager that we'll have to debug and maintain. My impression is that what's being discussed here is doing exactly that, except with s/lwlock/heavyweight locks/. We're basically replacing the lock.c lock mapping table with an ad-hoc implementation, and now we're also reinventing interruptibility etc. I still find the performance arguments pretty ludicrous, to be honest - I think the numbers I posted about how much time we spend with the locks held back that up. I have a bit more understanding for the parallel worker arguments, but only a bit: I think if we develop a custom solution for the extension lock, we're just going to end up having to develop another custom solution for a bunch of other types of locks. It seems quite likely that we'll end up also wanting TUPLE and also SPECULATIVE and PAGE type locks that we don't want to share between leader & workers. IMO the right thing here is to extend lock.c so we can better represent whether certain types of lockmethods (& levels ?) are [not] to be shared. Greetings, Andres Freund
Re: allow frontend use of the backend's core hashing functions
> On Feb 14, 2020, at 8:29 AM, David Fetter wrote: > > On Fri, Feb 14, 2020 at 08:16:47AM -0800, Mark Dilger wrote: >>> On Feb 14, 2020, at 8:15 AM, David Fetter wrote: >>> >>> On Fri, Feb 14, 2020 at 10:33:04AM -0500, Robert Haas wrote: On Thu, Feb 13, 2020 at 11:26 AM Mark Dilger wrote: > I have made these changes and rebased Robert’s patches but > otherwise changed nothing. Here they are: Thanks. Anyone else have comments? I think this is pretty straightforward and unobjectionable work so I'm inclined to press forward with committing it fairly soon, but if someone feels otherwise, please speak up. >>> >>> One question. It might be possible to make these functions faster >>> using compiler intrinsics. Would those still be available to front-end >>> code? >> >> Do you have a specific proposal that would preserve on-disk compatibility? > > I hadn't planned on changing the representation, just cutting > instructions out of the calculation of same. Ok, I misunderstood. I thought the question was about using compiler intrinsics to implement an altogether different hashing algorithm than the one currently in use, and whether exposing the hash function to frontend code would lock down the algorithm in a way that would make it harder to change in the future. That led me to the question of whether we had sufficient flexibility to entertain changing the hashing algorithm anyway. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: allow frontend use of the backend's core hashing functions
On Fri, Feb 14, 2020 at 08:16:47AM -0800, Mark Dilger wrote: > > On Feb 14, 2020, at 8:15 AM, David Fetter wrote: > > > > On Fri, Feb 14, 2020 at 10:33:04AM -0500, Robert Haas wrote: > >> On Thu, Feb 13, 2020 at 11:26 AM Mark Dilger > >> wrote: > >>> I have made these changes and rebased Robert’s patches but > >>> otherwise changed nothing. Here they are: > >> > >> Thanks. Anyone else have comments? I think this is pretty > >> straightforward and unobjectionable work so I'm inclined to press > >> forward with committing it fairly soon, but if someone feels > >> otherwise, please speak up. > > > > One question. It might be possible to make these functions faster > > using compiler intrinsics. Would those still be available to front-end > > code? > > Do you have a specific proposal that would preserve on-disk compatibility? I hadn't planned on changing the representation, just cutting instructions out of the calculation of same. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: allow frontend use of the backend's core hashing functions
On Fri, Feb 14, 2020 at 11:15 AM David Fetter wrote: > One question. It might be possible to make these functions faster > using compiler intrinsics. Would those still be available to front-end > code? Why not? We use the same compiler for frontend code as we do for backend code. Possibly you might need some more header-file rejiggering someplace, but I don't know of anything specific or see any reason why it would be particularly painful if we had to do something. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Fri, Feb 14, 2020 at 10:43 AM Tom Lane wrote: > I remain unconvinced ... wouldn't both of those claims apply to any disk > I/O request? Are we going to try to ensure that no I/O ever happens > while holding an LWLock, and if so how? (Again, CheckpointLock is a > counterexample, which has been that way for decades without reported > problems. But actually I think buffer I/O locks are an even more > direct counterexample.) Yes, that's a problem. I proposed a patch a few years ago that replaced the buffer I/O locks with condition variables, and I think that's a good idea for lots of reasons, including this one. I never quite got around to pushing that through to commit, but I think we should do that. Aside from fixing this problem, it also prevents certain scenarios where we can currently busy-loop. I do realize that we're unlikely to ever solve this problem completely, but I don't think that should discourage us from making incremental progress. Just as debuggability is a sticking point for you, what I'm going to call operate-ability is a sticking point for me. My work here at EnterpriseDB exposes me on a fairly regular basis to real broken systems, and I'm therefore really sensitive to the concerns that people have when trying to recover after a system has become, for one reason or another, really broken. Interruptibility may not be the #1 concern in that area, but it's very high on the list. EnterpriseDB customers, as a rule, *really* hate being told to restart the database because one session is stuck. It causes a lot of disruption for them, and the person who does the restart gets yelled at by their boss, and maybe their boss's boss and the boss above that. It means that their whole application, which may be mission-critical, is down until the database finishes restarting, and that is not always a quick process, especially after an immediate shutdown. I don't think we can ever make everything that can get stuck interruptible, but the more we can do the better. 
The work you and others have done over the years to add CHECK_FOR_INTERRUPTS() to more places pays real dividends. Making sessions that are blocked on disk I/O interruptible in at least some of the more common cases would be a huge win. Other people may well have different experiences, but my experience is that the disk deciding to conk out for a while or just respond very very slowly is a very common problem even (and sometimes especially) on very expensive hardware. Obviously that's not great and you're in lots of trouble, but being able to hit ^C and get control back significantly improves your chances of being able to understand what has happened and recover from it. > > That's an interesting idea, but it doesn't make the lock acquisition > > itself interruptible, which seems pretty important to me in this case. > > Good point: if you think the contained operation might run too long to > suit you, then you don't want other backends to be stuck behind it for > the same amount of time. Right. > > I wonder if we could have an LWLockAcquireInterruptibly() or some such > > that allows the lock acquisition itself to be interruptible. I think > > that would require some rejiggering but it might be doable. > > Yeah, I had the impression from a brief look at LWLockAcquire that > it was itself depending on not throwing errors partway through. > But with careful and perhaps-a-shade-slower coding, we could probably > make a version that didn't require that. Yeah, that was my thought, too, but I didn't study it that carefully, so somebody would need to do that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: allow frontend use of the backend's core hashing functions
> On Feb 14, 2020, at 8:15 AM, David Fetter wrote: > > On Fri, Feb 14, 2020 at 10:33:04AM -0500, Robert Haas wrote: >> On Thu, Feb 13, 2020 at 11:26 AM Mark Dilger >> wrote: >>> I have made these changes and rebased Robert’s patches but >>> otherwise changed nothing. Here they are: >> >> Thanks. Anyone else have comments? I think this is pretty >> straightforward and unobjectionable work so I'm inclined to press >> forward with committing it fairly soon, but if someone feels >> otherwise, please speak up. > > One question. It might be possible to make these functions faster > using compiler intrinsics. Would those still be available to front-end > code? Do you have a specific proposal that would preserve on-disk compatibility? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: allow frontend use of the backend's core hashing functions
On Fri, Feb 14, 2020 at 10:33:04AM -0500, Robert Haas wrote: > On Thu, Feb 13, 2020 at 11:26 AM Mark Dilger > wrote: > > I have made these changes and rebased Robert’s patches but > > otherwise changed nothing. Here they are: > > Thanks. Anyone else have comments? I think this is pretty > straightforward and unobjectionable work so I'm inclined to press > forward with committing it fairly soon, but if someone feels > otherwise, please speak up. One question. It might be possible to make these functions faster using compiler intrinsics. Would those still be available to front-end code? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: Internal key management system
On Thu, Feb 6, 2020 at 9:19 PM Masahiko Sawada wrote: > This feature protects data from disk thefts but cannot protect data > from attackers who are able to access PostgreSQL server. In this > design application side still is responsible for managing the wrapped > secret in order to protect it from attackers. This is the same as when > we use pgcrypto now. The difference is that data is safe even if > attackers steal the wrapped secret and the disk. The data cannot be > decrypted either without the passphrase which can be stored to other > key management systems or without accessing postgres server. IOW for > example, attackers can get the data if they get the wrapped secret > managed by application side then run pg_kmgr_unwrap() to get the > secret and then steal the disk. If you only care about protecting against the theft of the disk, you might as well just encrypt the whole filesystem, which will probably perform better and probably be a lot harder to break since you won't have short encrypted strings but instead large encrypted blocks of data. Moreover, I think a lot of people who are interested in these kinds of features are hoping for more than just protecting against the theft of the disk. While some people may be hoping for too much in this area, setting your sights only on encryption at rest seems like a fairly low bar. It also doesn't seem very likely to actually provide any security. You're talking about sending the encryption key in the query string, which means that there's a good chance it's going to end up in a log file someplace. One way that could happen is if the user has configured log_statement=all or log_min_duration_statement, but it could also happen any time the query throws an error. In theory, you might arrange for the log messages to be sent to another server that is protected by separate layers of security, but a lot of people are going to just log locally. 
And, even if you do have a separate server, do you really want to have the logfile over there be full of passwords? I know I can be awfully negative sometimes, but that seems like a weakness so serious as to make this whole thing effectively useless. One way to plug this hole is to use new protocol messages for key exchanges. For example, suppose that after authentication is complete, you can send the server a new protocol message: KeyPassphrase . The server stores the passphrase in backend-private memory and returns ReadyForQuery, and does not log the message payload anywhere. Now you do this: INSERT INTO tbl VALUES (pg_encrypt('user data', 'key-name')); SELECT pg_decrypt(secret_column, 'key-name') FROM tbl; If the passphrase for the named key has not been loaded into the current session's memory, this produces an error; otherwise, it looks up the passphrase and uses it to do the decryption. Now the passphrase never gets logged anywhere, and, also, the user can't persuade the server to provide it with the encryption key, because there's no SQL-level function to access that data. We could take it a step further: suppose that encryption is a column property, and the value of the property is a key name. If the user hasn't sent a KeyPassphrase message with the relevant key name, attempts to access that column just error out. If they have, then the server just does the encryption and decryption automatically. Now the user can just do: INSERT INTO tbl VALUES ('user data'); SELECT secret_column FROM tbl; It's a huge benefit if the SQL doesn't need to be changed. All that an application needs to do in order to use encryption in this scenario is use PQsetKeyPassphrase() or whatever before doing whatever else they want to do. Even with these changes, the security of this whole approach can be criticized on the basis that a good amount of information about the data can be inferred without decrypting anything. 
You can tell which encrypted values are long and which are short. If someone builds an index on the column, you can tell the order of all the encrypted values even though you may not know what the actual values are. Those could well be meaningful information leaks, but I think such a system might still be of use for certain purposes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
Robert Haas writes: > On Wed, Feb 12, 2020 at 11:53 AM Tom Lane wrote: >> * I see no reason to think that a relation extension lock would ever >> be held long enough for noninterruptibility to be a real issue. Our >> expectations for query cancel response time are in the tens to >> hundreds of msec anyway. > I don't agree, because (1) the time to perform a relation extension on > a busy system can be far longer than that and (2) if the disk is > failing, then it can be *really* long, or indefinite. I remain unconvinced ... wouldn't both of those claims apply to any disk I/O request? Are we going to try to ensure that no I/O ever happens while holding an LWLock, and if so how? (Again, CheckpointLock is a counterexample, which has been that way for decades without reported problems. But actually I think buffer I/O locks are an even more direct counterexample.) >> * There are other places where an LWLock can be held for a *long* time, >> notably the CheckpointLock. If we do think this is an issue, we could >> devise a way to not insist on noninterruptibility. The easiest fix >> is just to do a matching RESUME_INTERRUPTS after getting the lock and >> HOLD_INTERRUPTS again before releasing it; though maybe it'd be worth >> offering some slightly cleaner way. Point here is that LWLockAcquire >> only does that because it's useful to the majority of callers, not >> because it's graven in stone that it must be like that. > That's an interesting idea, but it doesn't make the lock acquisition > itself interruptible, which seems pretty important to me in this case. Good point: if you think the contained operation might run too long to suit you, then you don't want other backends to be stuck behind it for the same amount of time. > I wonder if we could have an LWLockAcquireInterruptibly() or some such > that allows the lock acquisition itself to be interruptible. I think > that would require some rejiggering but it might be doable. 
Yeah, I had the impression from a brief look at LWLockAcquire that it was itself depending on not throwing errors partway through. But with careful and perhaps-a-shade-slower coding, we could probably make a version that didn't require that. regards, tom lane
Re: Internal key management system
On Thu, Feb 6, 2020 at 4:37 PM Cary Huang wrote: > Since the user does not need to know the master secret key used to cipher the > data, I don't think we should expose "pg_kmgr_unwrap("")" SQL function to > the user at all. > The wrapped key "" is stored in control data and it is possible to obtain > by malicious user and steal the key by running SELECT pg_kmgr_unwrap(""). > Even the user is righteous, it may not be very straightforward for that user > to obtain the wrapped key "" to use in the unwrap function. I agree. > so instead of: > -- > INSERT INTO tbl VALUES (pg_encrypt('user data', pg_kmgr_unwrap('x')); > SELECT pg_decrypt(secret_column, pg_kmgr_unwrap('x')) FROM tbl; > > it would become: > -- > INSERT INTO tbl VALUES (pg_encrypt('user data', 'cluster_pass_phrase'); > SELECT pg_decrypt(secret_column, 'cluster_pass_phrase') FROM tbl; The second one is certainly better than the first one, as it prevents the key from being stolen. It's still pretty bad, though, because the supposedly-secret passphrase will end up in the server log. I have a hard time believing that this feature as currently proposed is worth anything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: error context for vacuum to include block number
On Fri, Feb 14, 2020 at 12:30:25PM +0900, Masahiko Sawada wrote: > * I think the function name is too generic. init_vacuum_error_callback > or init_vacuum_errcallback is better. > * The comment of this function is not accurate since this function is > not only for heap vacuum but also index vacuum. How about just > "Initialize vacuum error callback"? > * I think it's easier to read the code if we set the relname and > indname in the same order. > * The comment I wrote in the previous mail seems better, because in > this function the reader might get confused that 'rel' is a relation > or an index depending on the phase but that comment helps it. Fixed these > * rel->rd_index->indexrelid should be rel->rd_index->indrelid. Ack. I think that's been wrong since I first wrote it two weeks ago :( The error is probably more obvious due to the switch statement you proposed. Thanks for continued reviews. -- Justin >From 94768a134118d30853b75a96b90166363f0fef5b Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 12 Dec 2019 20:54:37 -0600 Subject: [PATCH v19] vacuum errcontext to show block being processed Discussion: https://www.postgresql.org/message-id/20191120210600.gc30...@telsasoft.com --- src/backend/access/heap/vacuumlazy.c | 120 +++ 1 file changed, 120 insertions(+) diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index a23cdef..ebfb2e7 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -292,6 +292,14 @@ typedef struct LVRelStats bool lock_waiter_detected; } LVRelStats; +typedef struct +{ + char *relnamespace; + char *relname; + char *indname; + BlockNumber blkno; /* used only for heap operations */ + int phase; /* Reusing same enums as for progress reporting */ +} vacuum_error_callback_arg; /* A few variables that don't seem worth passing around as parameters */ static int elevel = -1; @@ -361,6 +369,9 @@ static void end_parallel_vacuum(Relation *Irel, IndexBulkDeleteResult 
**stats, LVParallelState *lps, int nindexes); static LVSharedIndStats *get_indstats(LVShared *lvshared, int n); static bool skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared); +static void vacuum_error_callback(void *arg); +static void init_vacuum_error_callback(ErrorContextCallback *errcallback, + vacuum_error_callback_arg *errcbarg, Relation onerel, int phase); /* @@ -724,6 +735,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, PROGRESS_VACUUM_MAX_DEAD_TUPLES }; int64 initprog_val[3]; + ErrorContextCallback errcallback; + vacuum_error_callback_arg errcbarg; pg_rusage_init(&ru0); @@ -870,6 +883,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, else skipping_blocks = false; + /* Setup error traceback support for ereport() */ + init_vacuum_error_callback(&errcallback, &errcbarg, onerel, PROGRESS_VACUUM_PHASE_SCAN_HEAP); + error_context_stack = &errcallback; + for (blkno = 0; blkno < nblocks; blkno++) { Buffer buf; @@ -891,6 +908,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, #define FORCE_CHECK_PAGE() \ (blkno == nblocks - 1 && should_attempt_truncation(params, vacrelstats)) + errcbarg.blkno = blkno; + pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno); if (blkno == next_unskippable_block) @@ -987,6 +1006,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, vmbuffer = InvalidBuffer; } + /* Pop the error context stack while calling vacuum */ + error_context_stack = errcallback.previous; + /* Work on all the indexes, then the heap */ lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats, lps, nindexes); @@ -1011,6 +1033,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, /* Report that we are once again scanning the heap */ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, PROGRESS_VACUUM_PHASE_SCAN_HEAP); + + /* Set the error context while continuing heap scan 
*/ + error_context_stack = &errcallback; } /* @@ -1597,6 +1622,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, RecordPageWithFreeSpace(onerel, blkno, freespace); } + /* Pop the error context stack */ + error_context_stack = errcallback.previous; + /* report that everything is scanned and vacuumed */ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno); @@ -1772,11 +1800,19 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats) int npages; PGRUsage ru0; Buffer vmbuffer = InvalidBuffer; + ErrorContextCallback errcallback; + vacuum_error_callback_arg errcbarg; /* Report that we are now vacuuming the heap */ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, PROGRESS_VACUUM_PHASE_VACUUM_HEAP); + /* + * Se
Re: allow frontend use of the backend's core hashing functions
On Thu, Feb 13, 2020 at 11:26 AM Mark Dilger wrote: > I have made these changes and rebased Robert’s patches but otherwise changed > nothing. Here they are: Thanks. Anyone else have comments? I think this is pretty straightforward and unobjectionable work so I'm inclined to press forward with committing it fairly soon, but if someone feels otherwise, please speak up. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: assert pg_class.relnatts is consistent
On 2020-Feb-14, John Naylor wrote: > One possible objection to what I wrote above is that it adds a > different kind of special case, but in a sneaky way. Perhaps it would > be more principled to treat it the same as oid after all. If we do > that, it would help to add a comment that we can't treat relnatts like > pronangs, since we need more information than what's in each pg_class > row. How about something like this? (untested)

    # oids are a special case; ignore
    next if $attname eq 'oid';

    # pg_class.relnatts is computed from pg_attribute rows; ignore
    next if $catname eq 'pg_class' and $attname eq 'relnatts';

    # Raise error unless a value exists.
    die "strip_default_values: $catname.$attname undefined\n"
        if !defined $row->{$attname};

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Wed, Feb 12, 2020 at 11:53 AM Tom Lane wrote: > Yeah. I would say a couple more things: > > * I see no reason to think that a relation extension lock would ever > be held long enough for noninterruptibility to be a real issue. Our > expectations for query cancel response time are in the tens to > hundreds of msec anyway. I don't agree, because (1) the time to perform a relation extension on a busy system can be far longer than that and (2) if the disk is failing, then it can be *really* long, or indefinite. > * There are other places where an LWLock can be held for a *long* time, > notably the CheckpointLock. If we do think this is an issue, we could > devise a way to not insist on noninterruptibility. The easiest fix > is just to do a matching RESUME_INTERRUPTS after getting the lock and > HOLD_INTERRUPTS again before releasing it; though maybe it'd be worth > offering some slightly cleaner way. Point here is that LWLockAcquire > only does that because it's useful to the majority of callers, not > because it's graven in stone that it must be like that. That's an interesting idea, but it doesn't make the lock acquisition itself interruptible, which seems pretty important to me in this case. I wonder if we could have an LWLockAcquireInterruptibly() or some such that allows the lock acquisition itself to be interruptible. I think that would require some rejiggering but it might be doable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BRIN cost estimate breaks geometric indexes
Hi, The patch may be as simple as this one: https://patch-diff.githubusercontent.com/raw/postgres/postgres/pull/49.diff A previous mention on -hackers is available at https://postgrespro.com/list/id/cakjs1f9n-wapop5xz1dtgdpdqmzegqqk4sv2mk-zzugfc14...@mail.gmail.com - it seems everyone overlooked back then that the patch breaks geometric indexing. On Tue, Jan 21, 2020 at 2:07 AM Egor Rogov wrote: > On 21.01.2020 0:00, Darafei "Komяpa" Praliaskouski wrote: > > Hi, > > > > Found out today that BRIN indexes don't really work for PostGIS and > > box datatypes. > > > > Since > > > https://github.com/postgres/postgres/commit/7e534adcdc70866e7be74d626b0ed067c890a251 > Postgres > > > requires datatype to provide correlation statistics. Such statistics > > wasn't provided by PostGIS and box types. > > > > Today I tried to replace a 200gb gist index with 8mb brin index and > > queries didn't work as expected - it was never used. set > > enable_seqscan=off helped for a bit but that's not a permanent solution. > > Plans for context: > > https://gist.github.com/Komzpa/2cd396ec9b65e2c93341e9934d974826 > > > > Debugging session on #postgis IRC channel leads to this ticket to > > create a (not that meaningful) correlation statistics for geometry > > datatype: https://trac.osgeo.org/postgis/ticket/4625#ticket > > > > Postgres Professional mentioned symptoms of the issue in their > > in-depth manual: > > https://habr.com/ru/company/postgrespro/blog/346460/ - box datatype > > showed same unusable BRIN symptoms for them. > > > (Translated to English: > https://habr.com/en/company/postgrespro/blog/452900/) > > > > A reasonable course of action on Postgres side seems to be to not > > assume selectivity of 1 in absence of correlation statistics, but > > something that would prefer such an index to a parallel seq scan, but > > higher than similar GIST. > > > > Any other ideas? 
> > > As far as I understand, correlation is computed only for sortable types, > which means that the current concept of correlation works as intended > only for B-tree indexes. > > Ideally, correlation should be computed for (attribute, index) pair, > taking into account order of values returned by the index scan. Less > ideal but more easier approach can be to ignore the computed correlation > for any index access except B-tree, and just assume some predefined > constant. > > > > > -- Darafei Praliaskouski Support me: http://patreon.com/komzpa
Re: assert pg_class.relnatts is consistent
I propose this more concise coding for AddDefaultValues:

    # Now fill in defaults, and note any columns that remain undefined.
    foreach my $column (@$schema)
    {
        my $attname = $column->{name};
        my $atttype = $column->{type};

        # Skip if a value already exists
        next if defined $row->{$attname};

        # 'oid' and 'relnatts' are special cases. Ignore.
        next if $attname eq 'oid';
        next if $attname eq 'relnatts';

        # This column has a default value. Fill it in.
        if (defined $column->{default})
        {
            $row->{$attname} = $column->{default};
            next;
        }

        # Failed to find a value.
        push @missing_fields, $attname;
    }

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Fix compiler warnings on 64-bit Windows
Peter Eisentraut writes: > On 2020-02-13 16:19, Tom Lane wrote: >> According to C99 and POSIX, intptr_t should be provided by ... >> now that we're requiring C99, can we get away with just #include'ing >> that directly in these test files? > I think in the past we were worried about the C library not being fully > C99. But the build farm indicates that even the trailing edge OS X and > HP-UX members have it, so I'm content to require it. Then we should > probably remove the Autoconf tests altogether. Yeah, I think that the C99 requirement has obsoleted a number of configure tests and related hackery in c.h. We just haven't got round to cleaning that up yet. BTW: I'm still concerned about the possibility of the C library being less than C99. The model that was popular back then, and which still exists on e.g. gaur, was that you could install a C99 *compiler* on a pre-C99 system, and the compiler would bring its own standard header files as necessary. While I don't have the machine booted up to check, I'm pretty sure that gaur's is being supplied by the gcc installation not directly from /usr/include. On the other hand, that compiler installation is still dependent on the vendor-supplied libc. So the sorts of tests I think we can get away with removing have to do with the presence of C99-required headers, macros, typedefs, etc. Anything that is checking the presence or behavior of code in libc, we probably need to be more careful about. regards, tom lane
Re: Wait event that should be reported while waiting for WAL archiving to finish
On Thu, Feb 13, 2020 at 10:47 PM Fujii Masao wrote: > Fixed. Thanks for the review! I think it would be safer to just report the wait event during pg_usleep(100L) rather than putting those calls around the whole loop. It does not seem impossible that ereport() or CHECK_FOR_INTERRUPTS() could do something that reports a wait event internally. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
Amit Kapila writes: > I think MaxBackends will generally limit the number of different > relations that can simultaneously extend, but maybe tables with many > partitions might change the situation. You are right that some tests > might suggest a good number, let Mahendra write a patch and then we > can test it. Do you have any better idea? In the first place, there certainly isn't more than one extension happening at a time per backend, else the entire premise of this thread is wrong. Handwaving about partitions won't change that. In the second place, it's ludicrous to expect that the underlying platform/filesystem can support an infinite number of concurrent file-extension operations. At some level (e.g. where disk blocks are handed out, or where a record of the operation is written to a filesystem journal) it's quite likely that things are bottlenecked down to *one* such operation at a time per filesystem. So I'm not that concerned about occasional false-sharing limiting our ability to issue concurrent requests. There are probably worse restrictions at lower levels. regards, tom lane
Re: assert pg_class.relnatts is consistent
Hi John, On Fri, Feb 14, 2020 at 6:50 PM John Naylor wrote: > On Fri, Feb 14, 2020 at 5:00 PM Amit Langote wrote: > > I tried and think it works but not sure if that's good Perl > > programming. See the attached. > > Hi Amit, > I took this for a spin -- I just have a couple comments. Thanks for chiming in. > + elsif ($attname eq 'relnatts') > + { > + ; > + } > > With your patch, I get this when running > src/include/catalog/reformat_dat_file.pl: > > strip_default_values: pg_class.relnatts undefined I think I have fixed this in the attached. > + if ($catname eq "pg_class" && $attname eq "relnatts") > + { > + $bki_values{$attname} = $catalog_ncols{$bki_values{relname}}; > + } > + > > You could avoid the name/attr checks if you do it while building the > pg_class lookup table, like this: > > foreach my $row (@{ $catalog_data{pg_class} }) > { > $classoids{ $row->{relname} } = $row->{oid}; > + > + # Also fill in correct value for relnatts. > + $row->{relnatts} = $catalog_ncols{ $row->{relname} }; > } Did this too. Attached updated patch, which also addresses Michael's comment. I'm still trying to understand your comment about using placeholder BKI_DEFAULT... Thanks, Amit v2-0001-Don-t-require-relnatts-to-be-specified-in-pg_clas.patch Description: Binary data
Re: Parallel copy
On Fri, 14 Feb 2020 at 11:57, Amit Kapila wrote: > On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro > wrote: > > > > On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila > wrote: ... > > > Another approach that came up during an offlist discussion with Robert > > > is that we have one dedicated worker for reading the chunks from file > > > and it copies the complete tuples of one chunk in the shared memory > > > and once that is done, a handover that chunks to another worker which > > > can process tuples in that area. We can imagine that the reader > > > worker is responsible to form some sort of work queue that can be > > > processed by the other workers. In this idea, we won't be able to get > > > the benefit of initial tokenization (forming tuple boundaries) via > > > parallel workers and might need some additional memory processing as > > > after reader worker has handed the initial shared memory segment, we > > > need to somehow identify tuple boundaries and then process them. > Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required. A single worker could also process a stream without having to reread/rewind so it would be able to process input from STDIN or PROGRAM sources, making the improvements applicable to load operations done by third party tools and scripted \copy in psql. > > ... > > > > Another thing we need to figure out is the how many workers to use for > > > the copy command. I think we can use it based on the file size which > > > needs some experiments or may be based on user input. > > > > It seems like we don't even really have a general model for that sort > > of thing in the rest of the system yet, and I guess some kind of > > fairly dumb explicit system would make sense in the early days... > > > > makes sense. 
The ratio between chunking or line-parsing processes and the parallel worker pool would vary with the width of the table, the complexity of the data or file (dates, encoding conversions), the complexity of constraints, and the acceptable impact of the load. Being able to control it through user input would be great. -- Alastair
Re: assert pg_class.relnatts is consistent
On Fri, Feb 14, 2020 at 6:47 PM Michael Paquier wrote: > On Fri, Feb 14, 2020 at 06:00:05PM +0900, Amit Langote wrote: > > On Fri, Feb 14, 2020 at 2:58 PM Amit Langote > > wrote: > > > On Fri, Feb 14, 2020 at 1:04 AM Tom Lane wrote: > > > > I've been burnt by this too :-(. However, I think this patch is > > > > completely the wrong way to go about improving this. What we should > > > > be doing, now that we have all that perl code generating postgres.bki, > > > > is eliminating the problem at the source. That is, drop the hand-coded > > > > relnatts values from pg_class.dat altogether, and let the perl code fill > > > > it in --- compare the handling of pg_proc.pronargs for instance. > > > > > > I can't write Perl myself (maybe Justin), but +1 to this idea. > > > > I tried and think it works but not sure if that's good Perl > > programming. See the attached. > > I quite like what you have here. Please note that this comment in > genbki.pl is incorrect regarding relnatts (the last part could just be > deleted): > # Note: only bootstrap catalogs, ie those marked BKI_BOOTSTRAP, need to > # have entries here. Be sure that the OIDs listed here match those given in > # their CATALOG and BKI_ROWTYPE_OID macros, and that the relnatts values are > # correct. You're right, although this comment is in pg_class.dat. Thanks, Amit
Re: [HACKERS] [PATCH] Generic type subscripting
> On Thu, Feb 13, 2020 at 02:25:46PM +0100, Pavel Stehule wrote: > > > > I like patch 0006 - filling gaps by NULLs - it fixed my objections if I > > > remember correctly. Patch 0005 - polymorphic subscribing - I had not a > > > idea, what is a use case? Maybe can be good to postpone this patch. I > > have > > > not strong opinion about it, but generally is good to reduce size of > > > initial patch. I have nothing against a compatibility with SQL, but this > > > case doesn't looks too realistic for me, and can be postponed without > > > future compatibility issues. > > > > The idea about 0005 is mostly performance related, since this change > > (aside from being more pedantic with the standard) also allows to > > squeeze out some visible processing time improvement. But I agree that > > the patch series itself is too big to add something more, that's why I > > consider 0005/0006 mostly as interesting ideas for the future. > > > > patch 0006 is necessary from my perspective. Without it, the behavior of update > is not practical. I didn't review this patch mainly due to issues that were > fixed by the 0006 patch Oh, I see. The thing is that, in how it is implemented right now, 0006 depends on 0005. Originally I was against doing anything different than regular jsonb functionality would do, but after the discussion about jsonb_set and null arguments I figured that it indeed probably makes sense to deviate in certain cases. Eventually it depends on the community feedback, so I can try to make 0006 an independent change and we will see. > > Yep, I wasn't paying much attention recently to this patch, will post > > rebased and fixed version soon. At the same time I must admit, even if > > at the moment I can pursue two goals - either to make this feature > > accepted somehow, or make a longest living CF item ever - neither of > > those goals seems reachable. > > > > I think so this feature is not important for existing applications. 
> But it allows to work with JSON data (or any other) more comfortably (creatively) in plpgsql. Yes, hopefully.
Re: LOCK TABLE and DROP TABLE on temp tables of other sessions
On Fri, Feb 14, 2020 at 11:35 AM Michael Paquier wrote: > On Thu, Feb 13, 2020 at 09:05:01PM +0530, Ashutosh Bapat wrote: > > That looks odd. Other sessions are able to see temporary tables of a > given > > session because they are stored in the same catalog which is accessible > to > > all the sessions. But ideally, a temporary table should be visible only > to > > the session which created it (GTT is an exception). So LOCK and DROP > table > > should not succeed. > > One thing that we need to consider is if there are applications which > take advantage of LOCK allowed on temp relations from other backends > or not. One downside is that if one backend takes a lock on a temp > table from a different session, then this other session would not > completely shut down (still report the shutdown to the client), > and would remain blocked during the temp schema cleanup until the > transaction of the session locking the temp relation commits. This > blocks access to one connection slot, still we are talking about an > operation where the owner of the temp schema wants to do the lock. > That might be disastrous if it happens by accident, eating up most of the available connection slots. Whatever the user wants to achieve using LOCK [temp] TABLE of another session can, I guess, be achieved by other means or can be shown to have a disastrous effect. So that kind of usage pattern would better be forced to change. > > > Thinking from a different perspective, DROP TABLE being able to drop a > > temporary table seems a good tool in case a temporary table is left > behind > > by a finished session. But that doesn't seem like a good reason to have > it > > and I don't see much use of LOCK TABLE there. > > Yep. Robert had actually this argument with DROP SCHEMA pg_temp not > so long ago with me. > > DROP SCHEMA might be better for mass cleanup. That actually makes DROP [other session temp] TABLE useless. -- Best Wishes, Ashutosh Bapat
Re: Index Skip Scan
> On Fri, Feb 14, 2020 at 05:23:13PM +0900, Kyotaro Horiguchi wrote: > The first attached (renamed to .txt not to confuse the cfbots) is a > small patch that makes sure if _bt_readpage is called with the proper > condition as written in its comment, that is, caller must have pinned > and read-locked so->currPos.buf. This patch reveals many instances of > breakage of the contract. Thanks! Which patch version should it be applied on top of? I'm asking because I believe I've addressed similar issues in the last version, and the last proposed diff (after resolving some conflicts) breaks tests for me, so I'm not sure if I'm missing something. At the same time, if you and Tomas strongly agree that it actually makes sense to make the moving-forward/reading-backward case work correctly with dead tuples, I'll take a shot and try to teach the code around _bt_skip to do what is required for that. I can merge your changes there and we can see what the result would be.
Re: Parallel copy
On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro wrote: > > On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila wrote: > > This work is to parallelize the copy command and in particular "Copy > > from 'filename' Where ;" command. > > Nice project, and a great stepping stone towards parallel DML. > Thanks. > > The first idea is that we allocate each chunk to a worker and once the > > worker has finished processing the current chunk, it can start with > > the next unprocessed chunk. Here, we need to see how to handle the > > partial tuples at the end or beginning of each chunk. We can read the > > chunks in dsa/dsm instead of in local buffer for processing. > > Alternatively, if we think that accessing shared memory can be costly > > we can read the entire chunk in local memory, but copy the partial > > tuple at the beginning of a chunk (if any) to a dsa. We mainly need > > partial tuple in the shared memory area. The worker which has found > > the initial part of the partial tuple will be responsible to process > > the entire tuple. Now, to detect whether there is a partial tuple at > > the beginning of a chunk, we always start reading one byte, prior to > > the start of the current chunk and if that byte is not a terminating > > line byte, we know that it is a partial tuple. Now, while processing > > the chunk, we will ignore this first line and start after the first > > terminating line. > > That's quiet similar to the approach I took with a parallel file_fdw > patch[1], which mostly consisted of parallelising the reading part of > copy.c, except that... > > > To connect the partial tuple in two consecutive chunks, we need to > > have another data structure (for the ease of reference in this email, > > I call it CTM (chunk-tuple-map)) in shared memory where we store some > > per-chunk information like the chunk-number, dsa location of that > > chunk and a variable which indicates whether we can free/reuse the > > current entry. 
Whenever we encounter the partial tuple at the > > beginning of a chunk we note down its chunk number, and dsa location > > in CTM. Next, whenever we encounter any partial tuple at the end of > > the chunk, we search CTM for next chunk-number and read from > > corresponding dsa location till we encounter terminating line byte. > > Once we have read and processed this partial tuple, we can mark the > > entry as available for reuse. There are some loose ends here like how > > many entries shall we allocate in this data structure. It depends on > > whether we want to allow the worker to start reading the next chunk > > before the partial tuple of the previous chunk is processed. To keep > > it simple, we can allow the worker to process the next chunk only when > > the partial tuple in the previous chunk is processed. This will allow > > us to keep the entries equal to a number of workers in CTM. I think > > we can easily improve this if we want but I don't think it will matter > > too much as in most cases by the time we processed the tuples in that > > chunk, the partial tuple would have been consumed by the other worker. > > ... I didn't use a shm 'partial tuple' exchanging mechanism, I just > had each worker follow the final tuple in its chunk into the next > chunk, and have each worker ignore the first tuple in chunk after > chunk 0 because it knows someone else is looking after that. That > means that there was some double reading going on near the boundaries, > Right and especially if the part in the second chunk is bigger, then we might need to read most of the second chunk. > and considering how much I've been complaining about bogus extra > system calls on this mailing list lately, yeah, your idea of doing a > bit more coordination is a better idea. 
If you go this way, you might > at least find the copy.c part of the patch I wrote useful as stand-in > scaffolding code in the meantime while you prototype the parallel > writing side, if you don't already have something better for this? > No, I haven't started writing anything yet, but I have some ideas on how to achieve this. I quickly skimmed through your patch and I think that can be used as a starting point though if we decide to go with accumulating the partial tuple or all the data in shm, then the things might differ. > > Another approach that came up during an offlist discussion with Robert > > is that we have one dedicated worker for reading the chunks from file > > and it copies the complete tuples of one chunk in the shared memory > > and once that is done, a handover that chunks to another worker which > > can process tuples in that area. We can imagine that the reader > > worker is responsible to form some sort of work queue that can be > > processed by the other workers. In this idea, we won't be able to get > > the benefit of initial tokenization (forming tuple boundaries) via > > parallel workers and might need some additional memory processing as > > after reader worker has handed the initial shared
Re: [PATCH] libpq improvements and fixes
On Fri, Feb 14, 2020 at 03:13, Michael Paquier wrote: > On Thu, Feb 13, 2020 at 02:22:36PM -0300, Ranier Vilela wrote: > > I just kept it, even if I duplicated the error message, the style was > kept > > and in my opinion it is much more coherent and readable. > > But your solution is also good, and yes, it is worth it, because even > with > > small benefits, the change improves the code and prevents Coverity or > > another tool from continuing to report false positives or not. > > Complaints from static analyzers need to be taken with a pinch of > salt, and I agree with Tom here. > That's right, I will try to avoid sending patches that only satisfy static analysis tools. > > Virtually no code will break for the change, since bool and int are > > internally the same types. > > I believe that no code will have either adjusted to work with corrected > > functions, even if they use compiled libraries. > > And again, it is worth correcting at least the static ones, because the > > goal here, too, is to improve readability. > > FWIW, looking at the patch from upthread, I think that it is not that > wise to blindly break the error compatibility handling of all PQsend* > routines by switching the error handling of the connection to be after > the compatibility checks, and all the other changes don't justify a > breakage making back-patching more complicated nor do they improve > readability at great lengths. > It is difficult to understand what you consider an improvement. Another programming principle I follow is to hoist out of loops anything invariant that can be executed outside the loop. In this specific case, two branches were removed from the loop modified in fe-exec; is that an improvement in your view? See the attached patch. regards, Ranier Vilela fe-exec.patch Description: Binary data
Re: assert pg_class.relnatts is consistent
> pronangs, since we need more information than what's in each pg_class Sigh, and of course I meant pg_proc.pronargs. -- John Naylor  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: assert pg_class.relnatts is consistent
I wrote: > + elsif ($attname eq 'relnatts') > + { > + ; > + } > > With your patch, I get this when running > src/include/catalog/reformat_dat_file.pl: > > strip_default_values: pg_class.relnatts undefined > > Rather than adding this one-off case to AddDefaultValues and then > another special case to strip_default_values, maybe it would be better > to just add a placeholder BKI_DEFAULT(0) to pg_class.h, with a comment > that it's just a placeholder. One possible objection to what I wrote above is that it adds a different kind of special case, but in a sneaky way. Perhaps it would be more principled to treat it the same as oid after all. If we do that, it would help to add a comment that we can't treat relnatts like pronangs, since we need more information than what's in each pg_class row. -- John Naylor  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Sun, Feb 9, 2020 at 9:18 AM Amit Kapila wrote: > > It seems for this we formed a cache of max_cached_tuplebufs number of > objects and we don't need to allocate more than that number of tuples > of size MaxHeapTupleSize because we will anyway return that memory to > aset.c. > In the approach suggested by Amit (approach 1), once we allocate the max_cached_tuplebufs number of MaxHeapTupleSize tuples, we can use the actual length of the tuple for allocating memory. So, if we have m subtransactions, the worst-case memory usage will be:

    (max_cached_tuplebufs * MaxHeapTupleSize) cache
        + (maximum changes in a subtransaction before spilling) * m * (actual tuple size)
    = 64 MB cache + 4095 * m * (actual tuple size)

In the approach suggested by Andres (approach 2), we're going to reduce the size of a cached tuple to 1024 bytes. So, if we have m sub-transactions, the worst-case memory usage will be:

    (max_cached_tuplebufs * 1024 bytes) cache
        + (maximum changes in a subtransaction before spilling) * m * 1024 bytes
    = 8 MB cache + 4095 * m * 1024 bytes (assuming each tuple is no larger than 1024 bytes)

Once the cache is filled, for 1000 sub-transactions operating on tuples of, say, 100 bytes each, approach 1 will allocate about 390 MB of memory, whereas approach 2 will allocate approximately 4 GB. If there is no obvious error that I'm missing, I think we should implement the first approach. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila wrote: > This work is to parallelize the copy command and in particular "Copy > from 'filename' Where ;" command. Nice project, and a great stepping stone towards parallel DML. > The first idea is that we allocate each chunk to a worker and once the > worker has finished processing the current chunk, it can start with > the next unprocessed chunk. Here, we need to see how to handle the > partial tuples at the end or beginning of each chunk. We can read the > chunks in dsa/dsm instead of in local buffer for processing. > Alternatively, if we think that accessing shared memory can be costly > we can read the entire chunk in local memory, but copy the partial > tuple at the beginning of a chunk (if any) to a dsa. We mainly need > partial tuple in the shared memory area. The worker which has found > the initial part of the partial tuple will be responsible to process > the entire tuple. Now, to detect whether there is a partial tuple at > the beginning of a chunk, we always start reading one byte, prior to > the start of the current chunk and if that byte is not a terminating > line byte, we know that it is a partial tuple. Now, while processing > the chunk, we will ignore this first line and start after the first > terminating line. That's quite similar to the approach I took with a parallel file_fdw patch[1], which mostly consisted of parallelising the reading part of copy.c, except that... > To connect the partial tuple in two consecutive chunks, we need to > have another data structure (for the ease of reference in this email, > I call it CTM (chunk-tuple-map)) in shared memory where we store some > per-chunk information like the chunk-number, dsa location of that > chunk and a variable which indicates whether we can free/reuse the > current entry. Whenever we encounter the partial tuple at the > beginning of a chunk we note down its chunk number, and dsa location > in CTM. 
Next, whenever we encounter any partial tuple at the end of > the chunk, we search CTM for next chunk-number and read from > corresponding dsa location till we encounter terminating line byte. > Once we have read and processed this partial tuple, we can mark the > entry as available for reuse. There are some loose ends here like how > many entries shall we allocate in this data structure. It depends on > whether we want to allow the worker to start reading the next chunk > before the partial tuple of the previous chunk is processed. To keep > it simple, we can allow the worker to process the next chunk only when > the partial tuple in the previous chunk is processed. This will allow > us to keep the entries equal to a number of workers in CTM. I think > we can easily improve this if we want but I don't think it will matter > too much as in most cases by the time we processed the tuples in that > chunk, the partial tuple would have been consumed by the other worker. ... I didn't use a shm 'partial tuple' exchanging mechanism, I just had each worker follow the final tuple in its chunk into the next chunk, and have each worker ignore the first tuple in chunk after chunk 0 because it knows someone else is looking after that. That means that there was some double reading going on near the boundaries, and considering how much I've been complaining about bogus extra system calls on this mailing list lately, yeah, your idea of doing a bit more coordination is a better idea. If you go this way, you might at least find the copy.c part of the patch I wrote useful as stand-in scaffolding code in the meantime while you prototype the parallel writing side, if you don't already have something better for this? 
> Another approach that came up during an offlist discussion with Robert > is that we have one dedicated worker for reading the chunks from file > and it copies the complete tuples of one chunk in the shared memory > and once that is done, a handover that chunks to another worker which > can process tuples in that area. We can imagine that the reader > worker is responsible to form some sort of work queue that can be > processed by the other workers. In this idea, we won't be able to get > the benefit of initial tokenization (forming tuple boundaries) via > parallel workers and might need some additional memory processing as > after reader worker has handed the initial shared memory segment, we > need to somehow identify tuple boundaries and then process them. Yeah, I have also wondered about something like this in a slightly different context. For parallel query in general, I wondered if there should be a Parallel Scatter node, that can be put on top of any parallel-safe plan, and it runs it in a worker process that just pushes tuples into a single-producer multi-consumer shm queue, and then other workers read from that whenever they need a tuple. Hmm, but for COPY, I suppose you'd want to push the raw lines with minimal examination, not tuples, into a shm queue, so I guess th
Re: Extracting only the columns needed for a query
> > > On Sat, Jun 15, 2019 at 10:02 AM Tom Lane wrote: > > > > > > Another reason for having the planner do this is that presumably, in > > > an AM that's excited about this, the set of fetched columns should > > > play into the cost estimates for the scan. I've not been paying > > > enough attention to the tableam work to know if we've got hooks for > > > the AM to affect scan costing ... but if we don't, that seems like > > > a hole that needs plugged. > > > > The AM callback relation_estimate_size exists currently, which the planner > leverages. > > Via this callback it fetches tuples, pages, etc. So, our thought is to > extend > > this API if possible to pass down the needed columns and help perform better > costing > > for the query. Though we think that if we wish to leverage this function, we need > to know the > > list of columns before planning, hence we might need to use the query tree. > > I believe it would be beneficial to add this potential API extension patch > into > the thread (as an example of an interface defining how scanCols could be > used) > and review them together. > > Thanks for your suggestion; we paste one potential API extension change below for zedstore to use scanCols. The change contains three patches to clarify our idea. 0001-ANALYZE.patch is a generic patch for the ANALYZE API extension; we developed it to make the analysis of zedstore tables more accurate. It is more flexible now, e.g., a TableAm can provide a logical block number as the random sample seed, analyze only specified columns, and provide extra info besides the data tuple. 0002-Planner.patch is the real patch showing how we use rte->scanCols for cost estimation; the main idea is adding a new metric, 'stadiskfrac', to the catalog pg_statistic. 'stadiskfrac' is the physical size ratio of a column; it is calculated when ANALYZE is performed, and 0001-ANALYZE.patch helps provide the extra disk-size info. 
So when set_plain_rel_size() is called by the planner, it uses rte->scanCols and 'stadiskfrac' to adjust rel->pages; please see set_plain_rel_page_estimates(). 0003-ZedStore.patch is an example of how zedstore uses the extended ANALYZE API; I paste it here anyway, in case someone is interested in it. Thanks, Pengzhou From 77d257ab002a5e1b6a2f65e359cbfd7978e3cff5 Mon Sep 17 00:00:00 2001 From: Pengzhou Tang Date: Wed, 20 Nov 2019 06:42:37 -0500 Subject: [PATCH 1/3] ANALYZE tableam API change Extended three ANALYZE-related tableam APIs so AMs can take more control of ANALYZE progress: - scan_analyze_beginscan() : so AMs can have a more flexible sampling strategy - scan_analyze_sample_tuple() : so ANALYZE can get extra info as needed - scan_analyze_endscan() : Also use struct AnalyzeSampleContext to provide more convenience; with it, tableam analyze routines can provide extra info besides the real data, for example physical size or compression ratio. --- contrib/file_fdw/file_fdw.c | 35 +++--- contrib/postgres_fdw/postgres_fdw.c | 56 + src/backend/access/heap/heapam_handler.c | 109 ++-- src/backend/access/table/tableam.c | 209 +++ src/backend/commands/analyze.c | 181 -- src/include/access/tableam.h | 138 +--- src/include/foreign/fdwapi.h | 7 +- 7 files changed, 530 insertions(+), 205 deletions(-) diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c index 549821c..2344f01 100644 --- a/contrib/file_fdw/file_fdw.c +++ b/contrib/file_fdw/file_fdw.c @@ -19,6 +19,7 @@ #include "access/reloptions.h" #include "access/sysattr.h" #include "access/table.h" +#include "access/tableam.h" #include "catalog/pg_authid.h" #include "catalog/pg_foreign_table.h" #include "commands/copy.h" @@ -157,10 +158,8 @@ static void estimate_size(PlannerInfo *root, RelOptInfo *baserel, static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel, FileFdwPlanState *fdw_private, Cost *startup_cost, Cost *total_cost); -static int file_acquire_sample_rows(Relation onerel, int elevel, - 
HeapTuple *rows, int targrows, - double *totalrows, double *totaldeadrows); - +static void file_acquire_sample_rows(Relation onerel, int elevel, + AnalyzeSampleContext *context); /* * Foreign-data wrapper handler function: return a struct with pointers @@ -1091,14 +1090,16 @@ estimate_costs(PlannerInfo *root, RelOptInfo *baserel, * may be meaningless, but it's OK because we don't use the estimates * currently (the planner only pays attention to correlation for indexscans). */ -static int +static void file_acquire_sample_rows(Relation onerel, int elevel, - HeapTuple *rows, int targrows, - double *totalrows, double *totaldeadrows) + AnalyzeSampleContext *context) { int numrows = 0; + int targrows = 0; + double totalrows = 0; double rowstoskip = -1; /* -1 means not set yet */ ReservoirStateData rstate; + HeapTuple tuple; TupleDesc tupDe
Re: assert pg_class.relnatts is consistent
On Fri, Feb 14, 2020 at 5:00 PM Amit Langote wrote: > I tried and think it works but not sure if that's good Perl > programming. See the attached. Hi Amit, I took this for a spin -- I just have a couple comments. + elsif ($attname eq 'relnatts') + { + ; + } With your patch, I get this when running src/include/catalog/reformat_dat_file.pl: strip_default_values: pg_class.relnatts undefined Rather than adding this one-off case to AddDefaultValues and then another special case to strip_default_values, maybe it would be better to just add a placeholder BKI_DEFAULT(0) to pg_class.h, with a comment that it's just a placeholder. + if ($catname eq "pg_class" && $attname eq "relnatts") + { + $bki_values{$attname} = $catalog_ncols{$bki_values{relname}}; + } + You could avoid the name/attr checks if you do it while building the pg_class lookup table, like this: foreach my $row (@{ $catalog_data{pg_class} }) { $classoids{ $row->{relname} } = $row->{oid}; + + # Also fill in correct value for relnatts. + $row->{relnatts} = $catalog_ncols{ $row->{relname} }; } -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: assert pg_class.relnatts is consistent
On Fri, Feb 14, 2020 at 06:00:05PM +0900, Amit Langote wrote: > On Fri, Feb 14, 2020 at 2:58 PM Amit Langote wrote: > > On Fri, Feb 14, 2020 at 1:04 AM Tom Lane wrote: > > > I've been burnt by this too :-(. However, I think this patch is > > > completely the wrong way to go about improving this. What we should > > > be doing, now that we have all that perl code generating postgres.bki, > > > is eliminating the problem at the source. That is, drop the hand-coded > > > relnatts values from pg_class.dat altogether, and let the perl code fill > > > it in --- compare the handling of pg_proc.pronargs for instance. > > > > I can't write Perl myself (maybe Justin), but +1 to this idea. > > I tried and think it works but not sure if that's good Perl > programming. See the attached. I quite like what you have here. Please note that this comment in genbki.pl is incorrect regarding relnatts (the last part could just be deleted): # Note: only bootstrap catalogs, ie those marked BKI_BOOTSTRAP, need to # have entries here. Be sure that the OIDs listed here match those given in # their CATALOG and BKI_ROWTYPE_OID macros, and that the relnatts values are # correct. -- Michael signature.asc Description: PGP signature
Re: [Proposal] Global temporary tables
On Thu, Jan 30, 2020 at 15:21, Pavel Stehule wrote: > > > On Thu, Jan 30, 2020 at 15:17, 曾文旌(义从) > wrote: > >> >> >> > On Jan 29, 2020, at 9:48 PM, Robert Haas wrote: >> > >> > On Tue, Jan 28, 2020 at 12:12 PM 曾文旌(义从) >> wrote: >> >>> Opinion by Pavel >> >>> + rel->rd_islocaltemp = true; <<< if this is valid, then the >> name of the field "rd_islocaltemp" is probably not the best >> >>> I renamed rd_islocaltemp >> >> >> >> I don't see any change? >> >> >> >> Renamed rd_islocaltemp to rd_istemp in >> global_temporary_table_v8-pg13.patch >> > >> > In view of commit 6919b7e3294702adc39effd16634b2715d04f012, I think >> > that this has approximately a 0% chance of being acceptable. If you're >> > setting a field in a way that is inconsistent with the current use of >> > the field, you're probably doing it wrong, because the field has an >> > existing purpose to which new code must conform. And if you're not >> > doing that, then you don't need to rename it. >> Thank you for pointing it out. >> I've rolled back the rename. >> But I still need rd_localtemp to be true. The reasons are that >> 1. A GTT needs to support DML in read-only transactions, like a local >> temp table. >> 2. A GTT does not need to hold the lock before modifying the index buffer, >> also like a local temp table. >> >> Please give me feedback. >> > > maybe something like > > rel->rd_globaltemp = true; > > and somewhere else > > if (rel->rd_localtemp || rel->rd_globaltemp) > { > ... > } > > I tested this patch again and I am very well satisfied with the behavior. 
What still doesn't work: the TRUNCATE statement.

postgres=# insert into foo select generate_series(1,1);
INSERT 0 1
postgres=# \dt+ foo
                      List of relations
 Schema | Name | Type  | Owner | Persistence |  Size  | Description
--------+------+-------+-------+-------------+--------+-------------
 public | foo  | table | pavel | session     | 384 kB |
(1 row)

postgres=# truncate foo;
TRUNCATE TABLE
postgres=# \dt+ foo
                      List of relations
 Schema | Name | Type  | Owner | Persistence | Size  | Description
--------+------+-------+-------+-------------+-------+-------------
 public | foo  | table | pavel | session     | 16 kB |
(1 row)

I expect zero size after truncate. Regards Pavel >> Wenjing >> > -- >> > Robert Haas >> > EnterpriseDB: http://www.enterprisedb.com >> > The Enterprise PostgreSQL Company
Re: Fix compiler warnings on 64-bit Windows
On 2020-02-13 16:19, Tom Lane wrote: > According to C99 and POSIX, intptr_t should be provided by <stdint.h> ... now that we're requiring C99, can we get away with just #include'ing that directly in these test files? I think in the past we were worried about the C library not being fully C99. But the buildfarm indicates that even the trailing-edge OS X and HP-UX members have it, so I'm content to require it. Then we should probably remove the Autoconf tests altogether. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: assert pg_class.relnatts is consistent
On Fri, Feb 14, 2020 at 2:58 PM Amit Langote wrote: > On Fri, Feb 14, 2020 at 1:04 AM Tom Lane wrote: > > I've been burnt by this too :-(. However, I think this patch is > > completely the wrong way to go about improving this. What we should > > be doing, now that we have all that perl code generating postgres.bki, > > is eliminating the problem at the source. That is, drop the hand-coded > > relnatts values from pg_class.dat altogether, and let the perl code fill > > it in --- compare the handling of pg_proc.pronargs for instance. > > I can't write Perl myself (maybe Justin), but +1 to this idea. I tried and think it works but not sure if that's good Perl programming. See the attached. Thanks, Amit From ec690b6e78176354ae033073cda1b0770e956a72 Mon Sep 17 00:00:00 2001 From: Amit Langote Date: Fri, 14 Feb 2020 16:04:48 +0900 Subject: [PATCH] Don't require relnatts to be specified in pg_class.dat --- src/backend/catalog/Catalog.pm | 4 src/backend/catalog/genbki.pl| 7 +++ src/include/catalog/pg_class.dat | 11 +++ 3 files changed, 18 insertions(+), 4 deletions(-) diff --git a/src/backend/catalog/Catalog.pm b/src/backend/catalog/Catalog.pm index c089b1d71d..8e84bf9d2d 100644 --- a/src/backend/catalog/Catalog.pm +++ b/src/backend/catalog/Catalog.pm @@ -341,6 +341,10 @@ sub AddDefaultValues { ; } + elsif ($attname eq 'relnatts') + { + ; + } elsif (defined $column->{default}) { $row->{$attname} = $column->{default}; diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl index 803251207b..eac4542ade 100644 --- a/src/backend/catalog/genbki.pl +++ b/src/backend/catalog/genbki.pl @@ -56,6 +56,7 @@ my %catalog_data; my @toast_decls; my @index_decls; my %oidcounts; +my %catalog_ncols; foreach my $header (@ARGV) { @@ -71,6 +72,7 @@ foreach my $header (@ARGV) { push @catnames, $catname; $catalogs{$catname} = $catalog; + $catalog_ncols{$catname} = scalar(@$schema); } # While checking for duplicated OIDs, we ignore the pg_class OID and @@ -524,6 +526,11 @@ EOM 
my $attname = $column->{name}; my $atttype = $column->{type}; + if ($catname eq "pg_class" && $attname eq "relnatts") + { + $bki_values{$attname} = $catalog_ncols{$bki_values{relname}}; + } + # Assign oid if oid column exists and no explicit assignment in row if ($attname eq "oid" and not defined $bki_values{$attname}) { diff --git a/src/include/catalog/pg_class.dat b/src/include/catalog/pg_class.dat index f70d5bacb9..d901988858 100644 --- a/src/include/catalog/pg_class.dat +++ b/src/include/catalog/pg_class.dat @@ -20,11 +20,14 @@ # Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId; # similarly, "1" in relminmxid stands for FirstMultiXactId +# Note: relnatts is computed automatically by genbki.pl, so need not be +# specified here. + { oid => '1247', relname => 'pg_type', reltype => 'pg_type', relam => 'heap', relfilenode => '0', relpages => '0', reltuples => '0', relallvisible => '0', reltoastrelid => '0', relhasindex => 'f', relisshared => 'f', - relpersistence => 'p', relkind => 'r', relnatts => '31', relchecks => '0', + relpersistence => 'p', relkind => 'r', relchecks => '0', relhasrules => 'f', relhastriggers => 'f', relhassubclass => 'f', relrowsecurity => 'f', relforcerowsecurity => 'f', relispopulated => 't', relreplident => 'n', relispartition => 'f', relfrozenxid => '3', @@ -34,7 +37,7 @@ relname => 'pg_attribute', reltype => 'pg_attribute', relam => 'heap', relfilenode => '0', relpages => '0', reltuples => '0', relallvisible => '0', reltoastrelid => '0', relhasindex => 'f', relisshared => 'f', - relpersistence => 'p', relkind => 'r', relnatts => '25', relchecks => '0', + relpersistence => 'p', relkind => 'r', relchecks => '0', relhasrules => 'f', relhastriggers => 'f', relhassubclass => 'f', relrowsecurity => 'f', relforcerowsecurity => 'f', relispopulated => 't', relreplident => 'n', relispartition => 'f', relfrozenxid => '3', @@ -44,7 +47,7 @@ relname => 'pg_proc', reltype => 'pg_proc', relam => 'heap', relfilenode => '0', 
relpages => '0', reltuples => '0', relallvisible => '0', reltoastrelid => '0', relhasindex => 'f', relisshared => 'f', - relpersistence => 'p', relkind => 'r', relnatts => '29', relchecks => '0', + relpersistence => 'p', relkind => 'r', relchecks => '0', relhasrules => 'f', relhastriggers => 'f', relhassubclass => 'f', relrowsecurity => 'f', relforcerowsecurity => 'f', relispopulated => 't', relreplident => 'n', relispartit
Re: Index Skip Scan
Thank you very much for the benchmarking! This is a bit of a different topic from the latest branch. At Sat, 8 Feb 2020 14:11:59 +0100, Tomas Vondra wrote in > >Yes, I've mentioned that already in one of the previous emails :) The > >simplest way I see to achieve what we want is to do something like in > >attached modified version with a new hasDeclaredCursor field. It's not > >a > >final version though, but posted just for discussion, so feel free to > >suggest any improvements or alternatives. > > IMO the proper fix for this case (moving forward, reading backwards) > is > simply making it work by properly checking deleted tuples etc. Not > sure > why that would be so much complex (haven't tried implementing it)? I don't think it's so complex, either. But I suspect that it might be a bit harder starting from the current shape. The first attachment (renamed to .txt so as not to confuse the cfbots) is a small patch that makes sure _bt_readpage is called under the proper condition as written in its comment, that is, that the caller has pinned and read-locked so->currPos.buf. This patch reveals many instances of breakage of that contract. The second is a crude fix for the breakages, but the result seems far from neat. I think we need to rethink this, taking modification of the support functions into consideration. > I think making this depend on things like declared cursor etc. is > going > to be tricky, may easily be more complex than checking deleted tuples, > and the behavior may be quite surprising. Sure. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center >From de129e5a261ed43f002c1684dc9d6575f3880b16 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 6 Feb 2020 14:31:36 +0900 Subject: [PATCH 1/2] debug aid --- src/backend/access/nbtree/nbtsearch.c | 1 + src/backend/storage/buffer/bufmgr.c | 13 + src/include/storage/bufmgr.h | 1 + 3 files changed, 15 insertions(+) diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c index c5f5d228f2..5cd97d8bb5 100644 --- a/src/backend/access/nbtree/nbtsearch.c +++ b/src/backend/access/nbtree/nbtsearch.c @@ -1785,6 +1785,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) * used here; this function is what makes it good for currPos. */ Assert(BufferIsValid(so->currPos.buf)); + Assert(BufferLockAndPinHeldByMe(so->currPos.buf)); page = BufferGetPage(so->currPos.buf); opaque = (BTPageOpaque) PageGetSpecialPointer(page); diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index aba3960481..08a75a6846 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1553,6 +1553,19 @@ ReleaseAndReadBuffer(Buffer buffer, return ReadBuffer(relation, blockNum); } +/* tmp function for debugging */ +bool +BufferLockAndPinHeldByMe(Buffer buffer) +{ + BufferDesc *b = GetBufferDescriptor(buffer - 1); + + if (BufferIsPinned(buffer) && + LWLockHeldByMe(BufferDescriptorGetContentLock(b))) + return true; + + return false; +} + /* * PinBuffer -- make buffer unavailable for replacement. 
* diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index 73c7e9ba38..8e5fc639a0 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -177,6 +177,7 @@ extern void MarkBufferDirty(Buffer buffer); extern void IncrBufferRefCount(Buffer buffer); extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation, BlockNumber blockNum); +extern bool BufferLockAndPinHeldByMe(Buffer buffer); extern void InitBufferPool(void); extern void InitBufferPoolAccess(void); -- 2.18.2 >From 912bad2ec8c66ccd01cebf1f69233b004c633243 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 6 Feb 2020 19:09:09 +0900 Subject: [PATCH 2/2] crude fix --- src/backend/access/nbtree/nbtsearch.c | 43 +-- 1 file changed, 27 insertions(+), 16 deletions(-) diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c index 5cd97d8bb5..1f18b38ca5 100644 --- a/src/backend/access/nbtree/nbtsearch.c +++ b/src/backend/access/nbtree/nbtsearch.c @@ -1619,6 +1619,9 @@ _bt_skip(IndexScanDesc scan, ScanDirection dir, nextOffset = startOffset = ItemPointerGetOffsetNumber(&scan->xs_itup->t_tid); + if (nextOffset != startOffset) + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); + while (nextOffset == startOffset) { IndexTuple itup; @@ -1653,7 +1656,7 @@ _bt_skip(IndexScanDesc scan, ScanDirection dir, offnum = OffsetNumberPrev(offnum); /* Check if _bt_readpage returns already fou
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Fri, Feb 14, 2020 at 11:42 AM Masahiko Sawada wrote: > > On Thu, 13 Feb 2020 at 13:16, Amit Kapila wrote: > > > > On Tue, Feb 11, 2020 at 9:13 PM Tom Lane wrote: > > > > > > I took a brief look through this patch. I agree with the fundamental > > > idea that we shouldn't need to use the heavyweight lock manager for > > > relation extension, since deadlock is not a concern and no backend > > > should ever need to hold more than one such lock at once. But it feels > > > to me like this particular solution is rather seriously overengineered. > > > I would like to suggest that we do something similar to Robert Haas' > > > excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c, > > > that is, > > > > > > * Create some predetermined number N of LWLocks for relation extension. > > > * When we want to extend some relation R, choose one of those locks > > > (say, R's relfilenode number mod N) and lock it. > > > > > > > I am imagining something on the lines of BufferIOLWLockArray (here it > > will be RelExtLWLockArray). The size (N) could be MaxBackends or some > > percentage of it (depending on testing), and indexing into the array > > could be as suggested (R's relfilenode number mod N). > > I'm not sure it's good that the contention on an LWLock slot depends on > MaxBackends, because it means that the larger MaxBackends is, the less the > LWLock slots conflict, even if the same number of backends are actually > connecting. Normally we don't want to increase MaxBackends unnecessarily, > for security reasons. In the current patch we defined a fixed-length array > for extension locks, but I agree that we need to determine which approach > is best depending on testing. > I think MaxBackends will generally limit the number of different relations that can be simultaneously extended, but tables with many partitions might change the situation. You are right that some tests might suggest a good number; let Mahendra write a patch and then we can test it. 
Do you have any better idea? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Parallel copy
This work is to parallelize the copy command, and in particular the "Copy from 'filename' Where ;" command. Before going into how and what portion of 'copy command' processing we can parallelize, let us briefly see what top-level operations we perform while copying from a file into a table. We read the file in 64KB chunks, then find the line endings and process that data line by line, where each line corresponds to one tuple. We first form the tuple (in the form of a value/null array) from that line, check whether it qualifies the WHERE condition and, if it qualifies, then perform constraint checks and a few other checks and finally store it in a local tuple array. Once we have 1000 tuples or have consumed 64KB (whichever occurs first), we insert them together via the table_multi_insert API and then, for each tuple, insert into the index(es) and execute after-row triggers. So, as we can see, we do a lot of work after reading each 64KB chunk, and we can read the next chunk only after all the tuples in the previous chunk are processed. This gives us an opportunity to parallelize the processing of each 64KB chunk. I think we can do this in more than one way. The first idea is that we allocate each chunk to a worker and, once the worker has finished processing the current chunk, it can start with the next unprocessed chunk. Here, we need to see how to handle the partial tuples at the end or beginning of each chunk. We can read the chunks into dsa/dsm instead of into a local buffer for processing. Alternatively, if we think that accessing shared memory can be costly, we can read the entire chunk into local memory but copy the partial tuple at the beginning of a chunk (if any) to a dsa; we mainly need the partial tuple in the shared memory area. The worker which has found the initial part of the partial tuple will be responsible for processing the entire tuple. 
Now, to detect whether there is a partial tuple at the beginning of a chunk, we always start reading one byte prior to the start of the current chunk, and if that byte is not a terminating line byte, we know that the chunk begins with a partial tuple. Then, while processing the chunk, we ignore this first line and start after the first terminating line. To connect the partial tuple across two consecutive chunks, we need to have another data structure (for ease of reference in this email, I call it CTM (chunk-tuple-map)) in shared memory where we store some per-chunk information like the chunk number, the dsa location of that chunk, and a variable which indicates whether we can free/reuse the current entry. Whenever we encounter a partial tuple at the beginning of a chunk, we note down its chunk number and dsa location in CTM. Next, whenever we encounter a partial tuple at the end of a chunk, we search CTM for the next chunk number and read from the corresponding dsa location till we encounter a terminating line byte. Once we have read and processed this partial tuple, we can mark the entry as available for reuse. There are some loose ends here, like how many entries we should allocate in this data structure. It depends on whether we want to allow a worker to start reading the next chunk before the partial tuple of the previous chunk is processed. To keep it simple, we can allow a worker to process the next chunk only when the partial tuple in the previous chunk is processed; this will allow us to keep the number of entries in CTM equal to the number of workers. I think we can easily improve this if we want, but I don't think it will matter too much, as in most cases, by the time we have processed the tuples in a chunk, its partial tuple would have been consumed by the other worker. 
Another approach that came up during an offlist discussion with Robert is that we have one dedicated worker for reading the chunks from the file; it copies the complete tuples of one chunk into shared memory and, once that is done, hands over that chunk to another worker which can process the tuples in that area. We can imagine that the reader worker is responsible for forming some sort of work queue that can be processed by the other workers. In this idea, we won't be able to get the benefit of initial tokenization (forming tuple boundaries) via parallel workers and might need some additional memory processing, as after the reader worker has handed over the initial shared memory segment, we need to somehow identify tuple boundaries and then process them. Another thing we need to figure out is how many workers to use for the copy command. I think we can decide this based on the file size, which needs some experimentation, or maybe based on user input. I think we have two related problems to solve for this: (a) the relation extension lock (required for extending the relation), which won't conflict among workers due to group locking; we are working on a solution for this in another thread [1]; (b) use of page locks in Gin indexes; we can probably disallow parallelism if the table has a Gin index, which is not a great thing but not bad either. To be clear