Re: Replacing the EDH SKIP primes

2019-06-18 Thread Michael Paquier
On Wed, Jun 19, 2019 at 07:44:46AM +0200, Daniel Gustafsson wrote:
> I think this is v13 material, I’ll stick it in the next commitfest.

Thanks!
--
Michael




Re: Replacing the EDH SKIP primes

2019-06-18 Thread Daniel Gustafsson
> On 19 Jun 2019, at 05:40, Michael Paquier  wrote:

> Fine by me.  Let's stick with the 2048b-long one for now as we did in
> c0a15e0.  I am wondering if we should sneak that into v12, but I'd
> rather just wait for v13 to open.

I think this is v13 material, I’ll stick it in the next commitfest.

cheers ./daniel



Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-18 Thread Ian Barwick

On 6/18/19 12:41 AM, Stephen Frost wrote:
> Greetings,
>
> * Ian Barwick (ian.barw...@2ndquadrant.com) wrote:
(...)

>> I suggest explicitly documenting postgresql.auto.conf behaviour (and the
>> circumstances where it's acceptable to modify it outside of ALTER SYSTEM
>> calls) in the documentation (and possibly in the code), so anyone writing
>> utilities which need to append to postgresql.auto.conf knows what the
>> situation is.
>
> Yeah, I would think that, ideally, we'd have some code in the common
> library that other utilities could leverage and which the backend would
> also use.

So maybe something along the lines of creating a stripped-down variant of
AlterSystemSetConfigFile() (from "src/backend/utils/misc/guc.c") which can be
called from front-end code to safely modify .auto.conf while the server is *not*
running.

I'm not terribly familiar with the GUC code, but that would presumably mean
making parts of the GUC parsing/handling code linkable externally
(ParseConfigFp() etc.) as you'd need to parse the file before rewriting it.
Something like (minimal pseudo-code):

void
alter_system_set(char *name, char *value)
{
    /*
     * check here that the server is *not* running
     */
    ...
    ParseConfigFp(infile, AutoConfFileName, 0, LOG, &head, &tail);
    ...

    /*
     * some robust portable way of ensuring another process can't
     * modify the file(s) until we're done
     */
    lock_file(AutoConfFileName);

    replace_auto_config_value(&head, &tail, name, value);

    write_auto_conf_file(AutoConfTmpFileName, head);

    durable_rename(AutoConfTmpFileName, AutoConfFileName, ERROR);

    FreeConfigVariables(head);
    unlock_file(AutoConfFileName);
}

I'm not sure how feasible it is to validate the provided parameter
without the server running, but if not possible, that's not any worse than the
status quo, i.e. the utility has to be trusted to write the correct parameters
to the file anyway.

The question is though - is this a change which is practical to make at this
point in the release cycle for Pg12?


Regards

Ian Barwick



--
 Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Some reloptions non-initialized when loaded

2019-06-18 Thread Michael Paquier
Hi all,

While looking at this code, I have noticed that a couple of reloptions
which are not toast-specific don't get properly initialized.
toast_tuple_target and parallel_workers are the ones standing out.

Thoughts?
--
Michael
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index de06c92574..02ea6f748f 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -1482,6 +1482,8 @@ heap_reloptions(char relkind, Datum reloptions, bool validate)
 rdopts->fillfactor = 100;
 rdopts->autovacuum.analyze_threshold = -1;
 rdopts->autovacuum.analyze_scale_factor = -1;
+rdopts->parallel_workers = -1;
+rdopts->toast_tuple_target = TOAST_TUPLE_TARGET;
 			}
 			return (bytea *) rdopts;
 		case RELKIND_RELATION:




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-18 Thread Ian Barwick

On 6/19/19 12:46 PM, Amit Kapila wrote:
> On Mon, Jun 17, 2019 at 8:20 PM Ian Barwick  wrote:
>> On 6/15/19 1:08 AM, Stephen Frost wrote:
>>  > * Amit Kapila (amit.kapil...@gmail.com) wrote:
>>  >> Right.  I think if possible, it should use existing infrastructure to
>>  >> write to postgresql.auto.conf rather than inventing a new way to
>>  >> change it.  Apart from this issue, if we support multiple ways to edit
>>  >> postgresql.auto.conf, we might end up with more problems like this in
>>  >> the future where one system is not aware of the way file being edited
>>  >> by another system.
>>  >
>>  > I agree that there should have been some effort put into making the way
>>  > ALTER SYSTEM is modified be consistent between the backend and utilities
>>  > like pg_basebackup (which would also help third party tools understand
>>  > how a non-backend application should be modifying the file).
>>
>> Did you mean to say "the way postgresql.auto.conf is modified"?
>
> Yes, that is what we are discussing here.  I think what we can do here
> is to extract the functionality to set the parameter in .auto.conf
> from AlterSystemSetConfigFile and expose it via a function that takes
> (option_name, value) as a parameter.

Yup, I was just considering what's involved there, will reply to another
mail in the thread on that.

> Then we can expose it via some
> SQL function like set_auto_config (similar to what we have now for
> set_config/set_config_by_name).  I think if we have something like
> that then pg_basebackup or any other utility can use it in a
> consistent way.

Umm, but the point is here, the server will *not* be running at this point,
so calling an SQL function is out of the question. And if the server
is running, then you just execute "ALTER SYSTEM".


Regards

Ian Barwick

--
 Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-18 Thread Amit Kapila
On Mon, Jun 17, 2019 at 8:20 PM Ian Barwick  wrote:
> On 6/15/19 1:08 AM, Stephen Frost wrote:
>  > * Amit Kapila (amit.kapil...@gmail.com) wrote:
>  >> Right.  I think if possible, it should use existing infrastructure to
>  >> write to postgresql.auto.conf rather than inventing a new way to
>  >> change it.  Apart from this issue, if we support multiple ways to edit
>  >> postgresql.auto.conf, we might end up with more problems like this in
>  >> the future where one system is not aware of the way file being edited
>  >> by another system.
>  >
>  > I agree that there should have been some effort put into making the way
>  > ALTER SYSTEM is modified be consistent between the backend and utilities
>  > like pg_basebackup (which would also help third party tools understand
>  > how a non-backend application should be modifying the file).
>
> Did you mean to say "the way postgresql.auto.conf is modified"?
>

Yes, that is what we are discussing here.  I think what we can do here
is to extract the functionality to set the parameter in .auto.conf
from AlterSystemSetConfigFile and expose it via a function that takes
(option_name, value) as a parameter.  Then we can expose it via some
SQL function like set_auto_config (similar to what we have now for
set_config/set_config_by_name).  I think if we have something like
that then pg_basebackup or any other utility can use it in a
consistent way.
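
To illustrate the shape of the API I have in mind (the function name and
signature here are just a sketch, nothing is settled), usage from SQL could
look something like:

SELECT set_auto_config('primary_conninfo', 'host=192.0.2.1 user=replicator');
SELECT set_auto_config('primary_slot_name', 'standby_1');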

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Replacing the EDH SKIP primes

2019-06-18 Thread Michael Paquier
On Tue, Jun 18, 2019 at 02:05:00PM +0200, Daniel Gustafsson wrote:
> The current hardcoded EDH parameter fallback uses the old SKIP primes, for
> which the source disappeared from the web a long time ago.  Referencing a
> known dead source seems a bit silly, so I think we should either switch to a
> non-dead source of MODP primes or use an archive.org link for SKIP.
> Personally I prefer the former.

I agree with you that it sounds more sensible to switch to a new prime
instead of relying on an archive of the past one.

> This was touched upon, but never really discussed AFAICT, back when the EDH
> parameters were reworked a few years ago.  Instead of replacing with custom
> ones, as suggested in [1], we might as well replace with standardized ones as
> this is a fallback.  Custom ones won't make it more secure, just add more work
> for the project.  The attached patch replaces the SKIP prime with the 2048 bit
> MODP group from RFC 3526, which is the same change that OpenSSL did a few
> years back [2].

Fine by me.  Let's stick with the 2048b-long one for now as we did in
c0a15e0.  I am wondering if we should sneak that into v12, but I'd
rather just wait for v13 to open.
--
Michael




Re: openssl valgrind failures on skink are due to openssl issue

2019-06-18 Thread Michael Paquier
On Tue, Jun 18, 2019 at 07:44:26PM -0700, Andres Freund wrote:
> Really, I can do that?

Here is some of the stuff I use, just for the reference:
./Configure linux-x86_64 --prefix=$HOME/stable/openssl/1.1.1/
./config --prefix=$HOME/stable/openssl/1.1.1 shared
--
Michael




Re: fix "Success" error messages

2019-06-18 Thread Michael Paquier
On Tue, Jun 18, 2019 at 09:13:19AM -0700, Shawn Debnath wrote:
>>  		case SLRU_WRITE_FAILED:
>>  			ereport(ERROR,
>>  					(errcode_for_file_access(),
>>  					 errmsg("could not access status of transaction %u", xid),
>> -					 errdetail("Could not write to file \"%s\" at offset %u: %m.",
>> -							   path, offset)));
>> +					 errno == 0
>> +					 ? errdetail("Short write to file \"%s\" at offset %u.", path, offset)
>> +					 : errdetail("Could not write to file \"%s\" at offset %u: %m.",
>> +								 path, offset)));

There is no point to call errcode_for_file_access() if we know that
errno is 0.  Not a big deal, still..  The last time I looked at that,
the best way I could think of would be to replace slru_errcause with a
proper error string generated at error time.  Perhaps the partial
read/write case does not justify this extra facility.  I don't know.
At least that would be more flexible.

> Similarly, "Encountered partial write to file ..." would be better? Not 
> a 100% on using "Encountered" but "partial" seems to be the right word 
> to use here.

811b6e36 has done some work so as we have more consistency in the
error messages and I don't think we should introduce more flavors of
this stuff as that makes the life of translators harder.

-   if (CloseTransientFile(fd))
+   if (CloseTransientFile(fd) < 0)
Some spots are missing:
$ git grep "CloseTransientFile" | grep "))" | wc -l
30
--
Michael




Re: openssl valgrind failures on skink are due to openssl issue

2019-06-18 Thread Andres Freund
On 2019-06-19 11:28:16 +0900, Michael Paquier wrote:
> On Tue, Jun 18, 2019 at 04:34:07PM -0700, Andres Freund wrote:
> > It's merged to both branches that contain the broken code. Now we need
> > to wait for the next set of openssl releases, and then for distros to
> > pick that up. Based on the past release cadence
> > https://www.openssl.org/news/openssl-1.1.1-notes.html
> > that seems to be likely to happen within 2-3 months.
> 
> If that's for the buildfarm coverage, I would be of the opinion to
> wait a bit.

Yea. For now I've just disabled ssl support on skink, but that has its
own disadvantages.

> Another possibility is that you could compile your own
> version of OpenSSL with the patch included, say only 1.1.1c with the
> patch.

Really, I can do that?




RE: [PATCH] memory leak in ecpglib

2019-06-18 Thread kuroda.hay...@fujitsu.com
Dear Zhang,

Sorry for my late reply.
I'm now planning to refactor this functionality:
https://www.postgresql.org/message-id/osapr01mb20048298f882d25897c6ab23f5...@osapr01mb2004.jpnprd01.prod.outlook.com

If DECLARE STATEMENT and the other related statements are handled only in the
preprocessing step, this problem will be easily solved.

How about it?

Hayato Kuroda
Fujitsu LIMITED


-Original Message-
From: Zhang, Jie/张 杰 
Sent: Tuesday, June 11, 2019 5:14 PM
To: Kuroda, Hayato/黒田 隼人 ; Matsumura, Ryo/松村 量 
; pgsql-hackers@lists.postgresql.org
Subject: RE: [PATCH] memory leak in ecpglib

Hi Kuroda,

>If your patch is committed, in your example, any operation for cur1 will not 
>be accepted.
Although the return value after calling ecpg_get_con_name_by_cursor_name(cur1)
is NULL, in ecpg_get_connection(), actual_connection will be returned.
So, operation for cur1 will be accepted.

>p = ecpg_find_declared_statement(declared_name);
>if (p && p->cursor_name == cursor_name)
>p->cursor_name = ecpg_strdup(cursor_name, lineno);
Because the initial value of p->cursor_name is NULL, p->cursor_name will never 
be updated.

Best Regards!

-Original Message-
From: Kuroda, Hayato/黒田 隼人 
Sent: Tuesday, June 11, 2019 2:36 PM
To: Zhang, Jie/张 杰 ; Matsumura, Ryo/松村 量 
; pgsql-hackers@lists.postgresql.org
Subject: RE: [PATCH] memory leak in ecpglib

Dear Zhang,

# I resend the email

Thank you for reporting a bug. I didn't care about this case.

>> We should free p->cursor_name before p->cursor_name = 
>> ecpg_strdup(cursor_name, lineno).

I'm wondering whether this approach is correct or not.
If your patch is committed, in your example, any operation for cur1 will not be 
accepted.

My idea is changing ecpg_update_declare_statement() for permitting one-to-many 
relation between a declared name and cursors.
An example is as below:

p = ecpg_find_declared_statement(declared_name);
if (p && p->cursor_name == cursor_name)
p->cursor_name = ecpg_strdup(cursor_name, lineno);

Do you have any suggestions or comments for this?

Best Regards,
Hayato Kuroda
Fujitsu LIMITED




Re: openssl valgrind failures on skink are due to openssl issue

2019-06-18 Thread Michael Paquier
On Tue, Jun 18, 2019 at 04:34:07PM -0700, Andres Freund wrote:
> It's merged to both branches that contain the broken code. Now we need
> to wait for the next set of openssl releases, and then for distros to
> pick that up. Based on the past release cadence
> https://www.openssl.org/news/openssl-1.1.1-notes.html
> that seems to be likely to happen within 2-3 months.

If that's for the buildfarm coverage, I would be of the opinion to
wait a bit.  Another possibility is that you could compile your own
version of OpenSSL with the patch included, say only 1.1.1c with the
patch.  Still that would cause the plpython tests to complain as the
system's python may still link to the system's OpenSSL which is
broken?

Another possibility would be to move back to 1.1.1b for the time
being...
--
Michael




Re: pg_upgrade: Improve invalid option handling

2019-06-18 Thread Michael Paquier
On Tue, Jun 18, 2019 at 10:25:44AM +0200, Daniel Gustafsson wrote:
> Correct, that matches how pg_basebackup and psql does it.

Perhaps you have a patch at hand?  I can see four strings in
pg_upgrade, two in exec.c and two in option.c, which could be
improved.
--
Michael




Re: Still some references to configure-time WAL segment size option in msvc scripts

2019-06-18 Thread Michael Paquier
On Mon, Jun 17, 2019 at 04:32:28PM +0900, Michael Paquier wrote:
> I have just bumped into $subject, which makes no sense now as this is
> an init-time option.  Any objections if I remove this code as per the
> attached?

And committed.
--
Michael




Re: nbtdesc.c and nbtpage.c are inconsistent with XLOG_BTREE_META_CLEANUP (11~)

2019-06-18 Thread Michael Paquier
On Sun, Jun 16, 2019 at 07:14:05PM -0700, Peter Geoghegan wrote:
> The WAL record in question, XLOG_BTREE_META_CLEANUP, is certainly one
> of the less common record types used by nbtree. I agree that this
> should have been tested when it went in, but I'm not surprised that
> the bug remained undetected for a year. Not that many people use
> pg_waldump.

Actually, a simple installcheck generates a handful of them.  I have
not actually run into a crash, but this causes pg_waldump to describe
the record incorrectly.  Committed down to 11 after cross-checking
that the data inserted in the WAL record and what gets described are
both consistent.

_bt_restore_meta() does the right thing by the way when restoring the
page.
--
Michael




Tweaking DSM and DSA limits

2019-06-18 Thread Thomas Munro
Hello,

If you run a lot of parallel queries that use big parallel hash joins
simultaneously, you can run out of DSM slots (for example, when
testing many concurrent parallel queries).  That's because we allow 64
slots + 2 * MaxBackends, but allocating seriously large amounts of
dynamic shared memory requires lots of slots.

Originally the DSM system was designed to support one segment per
parallel query, but now we also use one for the session and any number
for parallel executor nodes that want space limited by work_mem.

The number of slots it takes for a given total amount of shared memory
depends on the macro DSA_NUM_SEGMENTS_AT_EACH_SIZE.  Since DSM slots
are relatively scarce (we use inefficient algorithms to access them,
and we think that some operating systems won't like us if we create
too many, so we impose this scarcity on ourselves), each DSA area
allocates bigger and bigger segments as it goes, starting with 1MB.
The approximate number of segments required to allocate various sizes
incrementally using different values of DSA_NUM_SEGMENTS_AT_EACH_SIZE
can be seen in this table:

 N =   1   2   3   4

  1MB  1   1   1   1
 64MB  6  10  13  16
512MB  9  16  22  28
  1GB 10  18  25  32
  8GB 13  24  34  44
 16GB 14  26  37  48
 32GB 15  28  40  52
  1TB 20  38  55  72

It's currently set to 4, but I now think that was too cautious.  It
tries to avoid fragmentation by ramping up slowly (that is, memory
allocated and in some cases committed by the operating system that we
don't turn out to need), but it's pretty wasteful of slots.  Perhaps
it should be set to 2?

Perhaps also the number of slots per backend should be dynamic, so
that you have the option to increase it from the current hard-coded
value of 2 if you don't want to increase max_connections but find
yourself running out of slots (this GUC was a request from Andres but
the name was made up by me -- if someone has a better suggestion I'm
all ears).
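
For instance, with the draft patch applied the budget could be raised with
something like the following (the GUC name is taken from the patch file name
and may well change; it would presumably need a restart, since the slot table
is sized at postmaster startup):

ALTER SYSTEM SET dynamic_shared_memory_segments_per_backend = 5;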

Also, there are some outdated comments near
PG_DYNSHMEM_SLOTS_PER_BACKEND's definition that we might as well drop
along with the macro.

Draft patch attached.

-- 
Thomas Munro
https://enterprisedb.com


0001-Add-dynamic_shared_memory_segments_per_backend-GUC.patch
Description: Binary data


Re: ANALYZE: ERROR: tuple already updated by self

2019-06-18 Thread Andres Freund
Hi,

On June 18, 2019 5:38:34 PM PDT, Justin Pryzby  wrote:
>On Tue, Jun 18, 2019 at 06:48:58PM -0500, Justin Pryzby wrote:
>> On Tue, Jun 18, 2019 at 06:12:33PM -0500, Justin Pryzby wrote:
>> > A customer's DB crashed due to OOM.  While investigating the issue in our
>> > report, I created MV stats, which causes this error:
>> > 
>> > ts=# CREATE STATISTICS sectors_stats (dependencies) ON site_id,sect_id
>> > FROM sectors;
>> > CREATE STATISTICS
>> > ts=# ANALYZE sectors;
>> > ERROR:  XX000: tuple already updated by self
>> > LOCATION:  simple_heap_update, heapam.c:4613
>> 
>> > I'm guessing the issue is with pg_statistic_ext, which I haven't touched.
>> > 
>> > Next step seems to be to truncate pg_statistic{,ext} and re-analyze the DB.
>> 
>> Confirmed the issue is there.
>> 
>> ts=# analyze sectors;
>> ERROR:  tuple already updated by self
>> ts=# begin; delete from pg_statistic_ext; analyze sectors;
>> BEGIN
>> DELETE 87
>> ANALYZE
>
>Why this works seems to me to be unexplained..

There's no extended stats to compute after that, thus we don't try to update 
the extended stats twice.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: ANALYZE: ERROR: tuple already updated by self

2019-06-18 Thread Justin Pryzby
On Tue, Jun 18, 2019 at 06:48:58PM -0500, Justin Pryzby wrote:
> On Tue, Jun 18, 2019 at 06:12:33PM -0500, Justin Pryzby wrote:
> > A customer's DB crashed due to OOM.  While investigating the issue in our
> > report, I created MV stats, which causes this error:
> > 
> > ts=# CREATE STATISTICS sectors_stats (dependencies) ON site_id,sect_id FROM 
> > sectors;
> > CREATE STATISTICS
> > ts=# ANALYZE sectors;
> > ERROR:  XX000: tuple already updated by self
> > LOCATION:  simple_heap_update, heapam.c:4613
> 
> > I'm guessing the issue is with pg_statistic_ext, which I haven't touched.
> > 
> > Next step seems to be to truncate pg_statistic{,ext} and re-analyze the DB.
> 
> Confirmed the issue is there.
> 
> ts=# analyze sectors;
> ERROR:  tuple already updated by self
> ts=# begin; delete from pg_statistic_ext; analyze sectors;
> BEGIN
> DELETE 87
> ANALYZE

Why this works seems to me to be unexplained..

Justin




Re: ANALYZE: ERROR: tuple already updated by self

2019-06-18 Thread Peter Geoghegan
On Tue, Jun 18, 2019 at 5:09 PM Andres Freund  wrote:
> > It might be interesting to set a breakpoint within heap_update(),
> > which is called by simple_heap_update() --technically, this is where
> > the reported failure occurs. From there, you could send an image of
> > the page to the list by following the procedure described here:

> Hm, what are you hoping to glean by doing so?

Nothing in particular. I see no reason to assume that we know what
that looks like, though. It's easy to check.

-- 
Peter Geoghegan




Re: ANALYZE: ERROR: tuple already updated by self

2019-06-18 Thread Tom Lane
Andres Freund  writes:
> I think the problem is pretty plainly that for inheritance tables we'll
> try to store extended statistics twice. And thus end up updating the
> same row twice.

They shouldn't be the same row though.  If we're to try to capture
ext-stats on inheritance trees --- and I think that's likely a good
idea --- then we need a bool corresponding to pg_statistic's stainherit
as part of pg_statistic_ext's primary key.

Since there is no such bool there now, and I assume that nobody wants
yet another pg_statistic_ext-driven catversion bump for v12, the only
fix is to get the stats machinery to not compute or store such stats.
For now.  But I think we ought to change that in v13.

regards, tom lane




Re: ANALYZE: ERROR: tuple already updated by self

2019-06-18 Thread Andres Freund
Hi,

On 2019-06-18 17:00:09 -0700, Peter Geoghegan wrote:
> On Tue, Jun 18, 2019 at 4:49 PM Justin Pryzby  wrote:
> > Sure:
> >
> > (gdb) bt
> > #0  errfinish (dummy=0) at elog.c:414
> > #1  0x0085e834 in elog_finish (elevel=, 
> > fmt=) at elog.c:1376
> > #2  0x004b93bd in simple_heap_update (relation=0x7fee161700c8, 
> > otid=0x1fb7f44, tup=0x1fb7f40) at heapam.c:4613
> > #3  0x0051bdb7 in CatalogTupleUpdate (heapRel=0x7fee161700c8, 
> > otid=0x1fb7f44, tup=0x1fb7f40) at indexing.c:234
> 
> It might be interesting to set a breakpoint within heap_update(),
> which is called by simple_heap_update() --technically, this is where
> the reported failure occurs. From there, you could send an image of
> the page to the list by following the procedure described here:
> 
> https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Dumping_a_page_image_from_within_GDB
> 
> You'll have to hit "next" a few times, until heap_update()'s "page"
> variable is initialized.

Hm, what are you hoping to glean by doing so?

Greetings,

Andres Freund




Re: ANALYZE: ERROR: tuple already updated by self

2019-06-18 Thread Andres Freund
Hi,

On 2019-06-18 18:48:58 -0500, Justin Pryzby wrote:
> Ah: the table is an inheritance parent.  If I uninherit its child, there's no
> error during ANALYZE.  MV stats on the child are ok:

It's a "classical" inheritance parent, not a builtin-partitioning type
of parent, right? And it contains data?

I assume it ought to not be too hard to come up with a reproducer
then...

I think the problem is pretty plainly that for inheritance tables we'll
try to store extended statistics twice. And thus end up updating the
same row twice.

> #6  0x00588346 in do_analyze_rel (onerel=0x7fee16140de8, options=2, 
> params=0x7ffe5b6bf8b0, va_cols=0x0, acquirefunc=0x492b4, relpages=36, 
> inh=true, in_outer_xact=false, elevel=13) at analyze.c:627

	/* Build extended statistics (if there are any). */
	BuildRelationExtStatistics(onerel, totalrows, numrows, rows, attr_cnt,
							   vacattrstats);

Note that for plain statistics we a) pass down the 'inh' flag to the
storage function, b) stainherit is part of pg_statistic's key:

Indexes:
    "pg_statistic_relid_att_inh_index" UNIQUE, btree (starelid, staattnum, stainherit)


Tomas, I think at the very least extended statistics shouldn't be built
when building inherited stats. But going forward I think we should
probably have it as part of the key for pg_statistic_ext.

Greetings,

Andres Freund




Re: ANALYZE: ERROR: tuple already updated by self

2019-06-18 Thread Peter Geoghegan
On Tue, Jun 18, 2019 at 4:49 PM Justin Pryzby  wrote:
> Sure:
>
> (gdb) bt
> #0  errfinish (dummy=0) at elog.c:414
> #1  0x0085e834 in elog_finish (elevel=, 
> fmt=) at elog.c:1376
> #2  0x004b93bd in simple_heap_update (relation=0x7fee161700c8, 
> otid=0x1fb7f44, tup=0x1fb7f40) at heapam.c:4613
> #3  0x0051bdb7 in CatalogTupleUpdate (heapRel=0x7fee161700c8, 
> otid=0x1fb7f44, tup=0x1fb7f40) at indexing.c:234

It might be interesting to set a breakpoint within heap_update(),
which is called by simple_heap_update() --technically, this is where
the reported failure occurs. From there, you could send an image of
the page to the list by following the procedure described here:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Dumping_a_page_image_from_within_GDB

You'll have to hit "next" a few times, until heap_update()'s "page"
variable is initialized.
-- 
Peter Geoghegan




Re: ANALYZE: ERROR: tuple already updated by self

2019-06-18 Thread Justin Pryzby
On Tue, Jun 18, 2019 at 06:48:58PM -0500, Justin Pryzby wrote:
> On Tue, Jun 18, 2019 at 06:12:33PM -0500, Justin Pryzby wrote:
> > ts=# ANALYZE sectors;
> > ERROR:  XX000: tuple already updated by self
> > LOCATION:  simple_heap_update, heapam.c:4613

> Ah: the table is an inheritence parent.  If I uninherit its child, there's no
> error during ANALYZE.

postgres=# CREATE TABLE t(i int,j int); CREATE TABLE u() INHERITS (t);
CREATE STATISTICS t_stats ON i,j FROM t; INSERT INTO t VALUES(1,1); ANALYZE t;
CREATE TABLE
CREATE TABLE
CREATE STATISTICS
INSERT 0 1
ERROR:  tuple already updated by self





Re: ANALYZE: ERROR: tuple already updated by self

2019-06-18 Thread Justin Pryzby
On Tue, Jun 18, 2019 at 06:12:33PM -0500, Justin Pryzby wrote:
> A customer's DB crashed due to OOM.  While investigating the issue in our
> report, I created MV stats, which causes this error:
> 
> ts=# CREATE STATISTICS sectors_stats (dependencies) ON site_id,sect_id FROM 
> sectors;
> CREATE STATISTICS
> ts=# ANALYZE sectors;
> ERROR:  XX000: tuple already updated by self
> LOCATION:  simple_heap_update, heapam.c:4613

> I'm guessing the issue is with pg_statistic_ext, which I haven't touched.
> 
> Next step seems to be to truncate pg_statistic{,ext} and re-analyze the DB.

Confirmed the issue is there.

ts=# analyze sectors;
ERROR:  tuple already updated by self
ts=# begin; delete from pg_statistic_ext; analyze sectors;
BEGIN
DELETE 87
ANALYZE

On Tue, Jun 18, 2019 at 04:30:33PM -0700, Andres Freund wrote:
> Any chance to get a backtrace for the error?

Sure:

(gdb) bt
#0  errfinish (dummy=0) at elog.c:414
#1  0x0085e834 in elog_finish (elevel=, fmt=) at elog.c:1376
#2  0x004b93bd in simple_heap_update (relation=0x7fee161700c8, 
otid=0x1fb7f44, tup=0x1fb7f40) at heapam.c:4613
#3  0x0051bdb7 in CatalogTupleUpdate (heapRel=0x7fee161700c8, 
otid=0x1fb7f44, tup=0x1fb7f40) at indexing.c:234
#4  0x0071e5ca in statext_store (onerel=0x7fee16140de8, 
totalrows=100843, numrows=100843, rows=0x1fd4028, natts=33260176, 
vacattrstats=0x1fb7ef0) at extended_stats.c:344
#5  BuildRelationExtStatistics (onerel=0x7fee16140de8, totalrows=100843, 
numrows=100843, rows=0x1fd4028, natts=33260176, vacattrstats=0x1fb7ef0) at 
extended_stats.c:130
#6  0x00588346 in do_analyze_rel (onerel=0x7fee16140de8, options=2, 
params=0x7ffe5b6bf8b0, va_cols=0x0, acquirefunc=0x492b4, relpages=36, inh=true, 
in_outer_xact=false, elevel=13) at analyze.c:627
#7  0x005891e1 in analyze_rel (relid=, 
relation=0x1ea22a0, options=2, params=0x7ffe5b6bf8b0, va_cols=0x0, 
in_outer_xact=false, bstrategy=0x1f38090) at analyze.c:317
#8  0x005fb689 in vacuum (options=2, relations=0x1f381f0, 
params=0x7ffe5b6bf8b0, bstrategy=, isTopLevel=) at vacuum.c:357
#9  0x005fbafe in ExecVacuum (vacstmt=, 
isTopLevel=) at vacuum.c:141
#10 0x00757a30 in standard_ProcessUtility (pstmt=0x1ea2410, 
queryString=0x1ea18c0 "ANALYZE sectors;", context=PROCESS_UTILITY_TOPLEVEL, 
params=0x0, queryEnv=0x0, dest=0x1ea26d0, completionTag=0x7ffe5b6bfdf0 "")
at utility.c:670
#11 0x7fee163a4344 in pgss_ProcessUtility (pstmt=0x1ea2410, 
queryString=0x1ea18c0 "ANALYZE sectors;", context=PROCESS_UTILITY_TOPLEVEL, 
params=0x0, queryEnv=0x0, dest=0x1ea26d0, completionTag=0x7ffe5b6bfdf0 "")
at pg_stat_statements.c:1005
#12 0x00753779 in PortalRunUtility (portal=0x1f1a8e0, pstmt=0x1ea2410, 
isTopLevel=, setHoldSnapshot=, 
dest=0x1ea26d0, completionTag=) at pquery.c:1178
#13 0x0075464d in PortalRunMulti (portal=0x1f1a8e0, isTopLevel=true, 
setHoldSnapshot=false, dest=0x1ea26d0, altdest=0x1ea26d0, 
completionTag=0x7ffe5b6bfdf0 "") at pquery.c:1331
#14 0x00754de8 in PortalRun (portal=0x1f1a8e0, 
count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x1ea26d0, 
altdest=0x1ea26d0, completionTag=0x7ffe5b6bfdf0 "") at pquery.c:799
#15 0x00751987 in exec_simple_query (query_string=0x1ea18c0 "ANALYZE 
sectors;") at postgres.c:1145
#16 0x00752931 in PostgresMain (argc=, argv=, dbname=0x1edbad8 "ts", username=) at 
postgres.c:4182
#17 0x006e1ba7 in BackendRun (argc=, argv=) at postmaster.c:4358
#18 BackendStartup (argc=, argv=) at 
postmaster.c:4030
#19 ServerLoop (argc=, argv=) at 
postmaster.c:1707
#20 PostmasterMain (argc=, argv=) at 
postmaster.c:1380
#21 0x00656210 in main (argc=3, argv=0x1e9c4d0) at main.c:228

#3  0x0051bdb7 in CatalogTupleUpdate (heapRel=0x7fee161700c8, 
otid=0x1fb7f44, tup=0x1fb7f40) at indexing.c:234
indstate = 0x1fb84a0
#4  0x0071e5ca in statext_store (onerel=0x7fee16140de8, 
totalrows=100843, numrows=100843, rows=0x1fd4028, natts=33260176, 
vacattrstats=0x1fb7ef0) at extended_stats.c:344
stup = 0x1fb7f40
oldtup = 0x7fee16158530
values = {0, 0, 0, 0, 0, 0, 0, 33260544}
nulls = {true, true, true, true, true, true, true, false}
replaces = {false, false, false, false, false, false, true, true}
#5  BuildRelationExtStatistics (onerel=0x7fee16140de8, totalrows=100843, 
numrows=100843, rows=0x1fd4028, natts=33260176, vacattrstats=0x1fb7ef0) at 
extended_stats.c:130
stat = 
stats = 
lc2 = 
ndistinct = 
dependencies = 
pg_stext = 0x7fee161700c8
lc = 0x1fb8290
stats = 0xfb6a172d
cxt = 0x1fb7de0
oldcxt = 0x1f6dd60
__func__ = "BuildRelationExtStatistics"


Ah: the table is an inheritance parent.  If I uninherit its child, there's no
error during ANALYZE.  MV stats on the child are ok:

ts=# CREATE STATISTICS vzw_sectors_stats (dependencies) ON site_id,sect_id FROM 
vzw_sectors;
CREATE 

Re: ANALYZE: ERROR: tuple already updated by self

2019-06-18 Thread Andres Freund
Hi,

On 2019-06-18 18:12:33 -0500, Justin Pryzby wrote:
> A customer's DB crashed due to OOM.  While investigating the issue in our
> report, I created MV stats, which causes this error:
> 
> ts=# CREATE STATISTICS sectors_stats (dependencies) ON site_id,sect_id FROM 
> sectors;
> CREATE STATISTICS
> ts=# ANALYZE sectors;
> ERROR:  XX000: tuple already updated by self
> LOCATION:  simple_heap_update, heapam.c:4613
> 
> The issue goes away if I drop the stats object and comes back if I recreate 
> it.
> 
> We're running 11.3; most of the (very few) reports of this error are from
> roughly 10+ years ago, running pg7.3 or similar.
> 
> I've taken a couple steps to resolve the issue (vacuum full and then reindex
> pg_statistic and its toast and the target table, which doesn't have a toast).
> 
> I'm guessing the issue is with pg_statistic_ext, which I haven't touched.
> 
> Next step seems to be to truncate pg_statistic{,ext} and re-analyze the DB.
> 
> Does anyone want debugging/diagnostic info before I do that ?

Any chance to get a backtrace for the error?

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

You should be able to set a breakpoint to just the location pointed out
in the error message.

Greetings,

Andres Freund




Re: openssl valgrind failures on skink are due to openssl issue

2019-06-18 Thread Tom Lane
Andres Freund  writes:
> What we could do is add a suppression like:

> {
>broken-openssl-accesses-random
>Memcheck:Cond
>...
>fun:pg_strong_random
>fun:InitProcessGlobals
>fun:PostmasterMain
>fun:main
> }

> (alternatively one suppression for each RAND_status, RAND_poll,
> RAND_bytes(), to avoid suppressing all of pg_strong_random itself)
   
> and then prevent spread of the uninitialized memory by adding a
>   VALGRIND_MAKE_MEM_DEFINED(buf, len);
> after a successful RAND_bytes() call.

> I tested that that quiesces the problem locally. Probably not worth
> pushing something like that though?

Yeah, that seems awfully aggressive to be pushing to machines that
don't have the problem.  Did you get any sense of how fast the
openssl fix is going to show up?

regards, tom lane




ANALYZE: ERROR: tuple already updated by self

2019-06-18 Thread Justin Pryzby
A customer's DB crashed due to OOM.  While investigating the issue in our
report, I created MV stats, which causes this error:

ts=# CREATE STATISTICS sectors_stats (dependencies) ON site_id,sect_id FROM 
sectors;
CREATE STATISTICS
ts=# ANALYZE sectors;
ERROR:  XX000: tuple already updated by self
LOCATION:  simple_heap_update, heapam.c:4613

The issue goes away if I drop the stats object and comes back if I recreate it.

We're running 11.3; most of the (very few) reports of this error are from
roughly 10+ years ago, running pg7.3 or similar.

I've taken a couple steps to resolve the issue (vacuum full and then reindex
pg_statistic and its toast and the target table, which doesn't have a toast).

I'm guessing the issue is with pg_statistic_ext, which I haven't touched.

Next step seems to be to truncate pg_statistic{,ext} and re-analyze the DB.

Does anyone want debugging/diagnostic info before I do that ?

Justin




Re: openssl valgrind failures on skink are due to openssl issue

2019-06-18 Thread Andres Freund
Hi,

On 2019-06-11 14:07:29 -0700, Andres Freund wrote:
> On 2019-06-11 16:55:28 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > I can't think of a better way to fix skink for now than just disabling
> > > openssl for skink, until 1.1.1d is released.
> > 
> > Couldn't you install a local valgrind exclusion matching this stack trace?
> 
> Unfortunately no. The error spreads through significant parts of openssl
> *and* postgres, because it taints the returned random value, which then
> is used in a number of places. We could try to block all of those, but
> that seems fairly painful. And one, to my knowledge, cannot do valgrind
> suppressions based on the source of uninitialized memory.

What we could do is add a suppression like:

{
   broken-openssl-accesses-random
   Memcheck:Cond
   ...
   fun:pg_strong_random
   fun:InitProcessGlobals
   fun:PostmasterMain
   fun:main
}

(alternatively one suppression for each RAND_status, RAND_poll,
RAND_bytes(), to avoid suppressing all of pg_strong_random itself)
   
and then prevent spread of the uninitialized memory by adding a
VALGRIND_MAKE_MEM_DEFINED(buf, len);
after a successful RAND_bytes() call.

I tested that that quiesces the problem locally. Probably not worth
pushing something like that though?

Greetings,

Andres Freund




Re: PG 12 beta 1 segfault during analyze

2019-06-18 Thread Andres Freund
Hi,

On 2019-06-17 21:46:02 -0400, Steve Singer wrote:
> On 6/15/19 10:18 PM, Tom Lane wrote:
> > Steve Singer  writes:
> > > I encountered the following segfault when running against a  PG 12 beta1
> > > during a analyze against a table.
> > Nobody else has reported this, so you're going to have to work on
> > producing a self-contained test case, or else debugging it yourself.
> > 
> > regards, tom lane
> > 
> > 
> > 
> The attached patch fixes the issue.

Thanks for the bug report, diagnosis and patch. Pushed.

I included a small testcase for ANALYZE running in a transaction that
also modified a few rows, after going back and forth on it for a
while. Seems unlikely that we'll reintroduce this specific bug, but it
seems good to have test coverage of at least some of the
HEAPTUPLE_DELETE_IN_PROGRESS path. We currently have none...

I think the testcase would catch the issue at hand on most machines, by
mixing live/dead/deleted-by-current-transaction rows.
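
For reference, the shape of the test is roughly like this (not the exact
committed test, just the idea):

CREATE TABLE analyze_test (i int);
INSERT INTO analyze_test SELECT generate_series(1, 1000);
DELETE FROM analyze_test WHERE i % 5 = 0;  -- committed delete: dead rows
BEGIN;
DELETE FROM analyze_test WHERE i % 3 = 0;  -- deleted by current transaction
ANALYZE analyze_test;                      -- sample sees all three kinds
COMMIT;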

Greetings,

Andres Freund




Re: pgsql: Avoid spurious deadlocks when upgrading a tuple lock

2019-06-18 Thread Alvaro Herrera
On 2019-Jun-18, Oleksii Kliukin wrote:

> Sorry, I was confused, as I was looking only at
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=de87a084c0a5ac927017cd0834b33a932651cfc9
> 
> without taking your subsequent commit that silences compiler warnings at
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3da73d6839dc47f1f47ca57974bf28e5abd9b572
> into consideration. With that commit, the danger is indeed in resetting the
> skip mechanism on each jump and potentially causing deadlocks.

Yeah, I understand the confusion.

Anyway, as bugs go, this one seems pretty benign.  It would result in an
unexplained deadlock, very rarely, and only for people who use a very
strange locking pattern that includes (row-level) lock upgrades.  I
think it also requires aborted savepoints too, though I don't rule out
the possibility that there might be a way to reproduce this without
that.

I pushed the patch again just now, with the new permutation.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-06-18 Thread Melanie Plageman
On Thu, Jun 13, 2019 at 7:10 AM Robert Haas  wrote:

> On Tue, Jun 11, 2019 at 2:35 PM Melanie Plageman
>  wrote:
> > Let me try to articulate what I think the bitmap implementation would
> look
> > like:
> >
> > Before doing chunked hashloop join for any batch, we would need to
> > know how many tuples are in the outer batch to make the bitmap the
> > correct size.
>
> I was thinking that we wouldn't need to know this, because if the
> bitmap is in a file, we can always extend it.  To imagine a needlessly
> dumb implementation, consider:
>
> set-bit(i):
>    let b = i / 8
>    while (b <= length of file in bytes)
>       append '\0' to file
>    read byte b from the file
>    modify the byte you read by setting bit i % 8
>    write the modified byte back to the file
>
> In reality, we'd have some kind of buffer.  I imagine locality of
> reference would be pretty good, because the outer tuples are coming to
> us in increasing-tuple-number order.
>
> If you want to prototype with an in-memory implementation, I'd suggest
> just pallocing 8kB initially and repallocing when the tuple number
> gets too big.  It'll be sorta inefficient, but who cares?  It's
> certainly way cheaper than an extra pass over the data, and for a POC
> it should be fine.
>
>
That approach makes sense. I have attached the first draft of a patch
I wrote to do parallel-oblivious hashjoin fallback. I haven't switched
to using the approach with a bitmap (or bytemap :) yet because I found
that using a linked list was easier to debug for now.

(Also, I did things like include the value of the outer tuple
attribute in the linked list nodes and assumed it was an int because
that is what I have been testing with--this would definitely be blown
away with everything else that is just there to help me with debugging
right now).

I am refactoring it now to change the state machine to make more sense
before changing the representation of the match statuses.

So, specifically, I am interested in high-level gut checks on the
state machine I am currently implementing (not reflected in this
patch).

This patch adds only one state -- HJ_ADAPTIVE_EMIT_UNMATCHED-- which
duplicates the logic of HJ_FILL_OUTER_TUPLE. Also, in this patch, the
existing HJ_NEED_NEW_BATCH state is used for new chunks. After
separating the logic that advanced the batches from that which loaded
a batch, it felt like NEED_NEW_CHUNK did not need to be its own state.
When a new chunk is required, if more exist, then the next one should
be loaded and outer should be rewound. Rewinding of outer was already
being done (seek to the beginning of the outer spill file is the
equivalent of "loading" it).

Currently, I am tracking a lot of state in the HashJoinState, which is
fiddly and error-prone.

New state machine (questions posed below):
To refactor the state machine, I am thinking of adding a new state
HJ_NEED_NEW_INNER_CHUNK which we would transition to when outer batch
is over. We would load the new chunk, rewind the outer, and transition
to HJ_NEED_NEW_OUTER. However, we would have to emit unmatched inner
tuples for that chunk (in case of ROJ) before that transition to
HJ_NEED_NEW_OUTER. This feels a little less clean because the
HJ_FILL_INNER_TUPLES state is transitioned into when the inner batch
is over as well. And, in the current flow I am sketching out, if the
inner batch is exhausted, we check if we should emit NULL-extended
inner tuples and then check if we should emit NULL-extended outer
tuples (since both batches are exhausted), whereas when a single inner
chunk is done being processed, we only want to emit NULL-extended
tuples for the inner side. Not to mention HJ_NEED_NEW_INNER_CHUNK
would transition to HJ_NEED_NEW_OUTER directly instead of first
advancing the batches. This can all be hacked around with if
statements, but, my point here is that if I am refactoring the state
machine to be more clear, ideally, it would be more clear.

A similar problem happens with HJ_FILL_OUTER_TUPLE and the
non-fallback case. For the fallback case, with this implementation,
you must wait until after exhausting the inner side to emit
NULL-extended outer tuples. In the non-fallback case -- a batch which
can fit in memory or, always, for batch 0 -- the unmatched outer
tuples are emitted as they are encountered.

It makes most sense in the context of the state machine, as far as I
can tell, after exhausting both outer and inner batch, to emit
NULL-extended inner tuples for that chunk and then emit NULL-extended
outer tuples for that batch.

So, requiring an additional read of the outer side to emit
NULL-extended tuples at the end of the inner batch would slow things
down for the non-fallback case, however, it seems like special casing
the fallback case would make the state machine much more confusing --
basically like mashing two totally different state machines together.

These questions will probably make a lot more sense with corresponding
code, so I will follow up with the second version of 

PL/Python fails on new NetBSD/PPC 8.0 install

2019-06-18 Thread Tom Lane
I noticed that the old NetBSD 5.1.5 installation I had on my G4 Mac
was no longer passing our regression tests, because it has a strtof()
that is sloppy about underflow.  Rather than fight with that I decided
to update it to something shinier (well, as shiny as you can get on
hardware that's old enough to apply for a driver's license).  I stuck in
NetBSD/macppc 8.0, and things seem to work, except that PL/Python
crashes on launch.  I see something like this in the postmaster log:

Traceback (most recent call last):
  File "", line 1162, in 
_install_external_importers
  File "", line 980, in _find_and_load
  File "", line 149, in __enter__
  File "", line 84, in acquire
RuntimeError: no current thread ident
Fatal Python error: initexternalimport: external importer setup failed

Current thread 0x (most recent call first):
2019-06-18 17:40:59.629 EDT [20764] LOG:  server process (PID 23714) was 
terminated by signal 6: Abort trap
2019-06-18 17:40:59.629 EDT [20764] DETAIL:  Failed process was running: CREATE 
FUNCTION stupid() RETURNS text AS 'return "zarkon"' LANGUAGE plpython3u;

and a stack trace like

#0  0xfddd383c in _lwp_kill () from /usr/lib/libc.so.12
#1  0xfddd3800 in raise () from /usr/lib/libc.so.12
#2  0xfddd2e38 in abort () from /usr/lib/libc.so.12
#3  0xf4c371dc in fatal_error () from /usr/pkg/lib/libpython3.7.so.1.0
#4  0xf4c38370 in _Py_FatalInitError () from /usr/pkg/lib/libpython3.7.so.1.0
#5  0xf4c38f7c in Py_InitializeEx () from /usr/pkg/lib/libpython3.7.so.1.0
#6  0xf4c38fc0 in Py_Initialize () from /usr/pkg/lib/libpython3.7.so.1.0
#7  0xfdc8d548 in PLy_initialize () at plpy_main.c:135
#8  0xfdc8da0c in plpython3_validator (fcinfo=)
at plpy_main.c:192
#9  0x01d4a904 in FunctionCall1Coll (flinfo=0xd608, 
collation=, arg1=) at fmgr.c:1140
#10 0x01d4b03c in OidFunctionCall1Coll (functionId=functionId@entry=16464, 
collation=collation@entry=0, arg1=arg1@entry=32774) at fmgr.c:1418
#11 0x0196a9d0 in ProcedureCreate (
procedureName=procedureName@entry=0xfdb0aac0 "transaction_test1", 
procNamespace=procNamespace@entry=2200, replace=replace@entry=false, 
returnsSet=returnsSet@entry=false, returnType=returnType@entry=2278, 
proowner=10, languageObjectId=languageObjectId@entry=16465, 
languageValidator=languageValidator@entry=16464, 
prosrc=prosrc@entry=0xfdb0abf8 "\nfor i in range(0, 10):\n
plpy.execute(\"INSERT INTO test1 (a) VALUES (%d)\" % i)\nif i % 2 == 0:\n   
 plpy.commit()\nelse:\nplpy.rollback()\n", 
probin=probin@entry=0x0, 
...

The "no current thread ident" error rings some vague bells, but I could
not find any previous discussion matching that in our archives.

This is with today's HEAD of our code and the python37-3.7.1 package from
NetBSD 8.0.

Any ideas?  I'm not so wedded to PL/Python that I'll spend a lot of time
making it go on this old box ... but seeing that 3.7 is still pretty
bleeding-edge Python, I wonder if other people will start getting this
too.

regards, tom lane




O(N^2) when building multi-column MCV lists

2019-06-18 Thread Tomas Vondra

Hi,

When experimenting with multi-column MCV lists with the statistics target set
to a high value (e.g. 10k), I've realized there's an O(N^2) issue in
statext_mcv_build() when computing base frequencies.

We do this:

    for (i = 0; i < nitems; i++)
    {
        ...
        item->base_frequency = 1.0;
        for (j = 0; j < numattrs; j++)
        {
            int     count = 0;
            int     k;

            for (k = 0; k < ngroups; k++)
            {
                if (multi_sort_compare_dim(j, &groups[i], &groups[k], mss) == 0)
                    count += groups[k].count;
            }

            ...
        }
    }


That is, for each item on the MCV list, we walk through all the groups
(for each dimension independently) to determine the total frequency of
the value.

With many groups (which can easily happen for high statistics target),
this can easily get very expensive.
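
For the record, it's easy to get into this situation with something like the
following (illustrative only, the exact numbers don't matter much):

CREATE TABLE t (a int, b int);
INSERT INTO t SELECT i % 10000, (i % 10000) / 10 FROM generate_series(1, 1000000) i;
CREATE STATISTICS t_mcv (mcv) ON a, b FROM t;
SET default_statistics_target = 10000;
ANALYZE t;  -- nitems and ngroups are both large here, so the base_frequency
            -- loops become the expensive part of ANALYZE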

I think the best solution here is to pre-compute frequencies for values
in all dimensions, and then just access that instead of looping through
the groups over and over.

IMHO this is something we should fix for PG12, so I'll put that on the
open items list, and produce a fix shortly.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: New EXPLAIN option: ALL

2019-06-18 Thread Peter Eisentraut
On 2019-05-15 19:58, Andres Freund wrote:
> On 2019-05-15 13:53:26 -0400, Tom Lane wrote:
>> FWIW, given the conflict against "EXPLAIN EXECUTE prepared_stmt_name",
>> we should probably just drop the whole idea.  It seemed like a great
>> idea at the time, but it's going to confuse people not just Bison.
> I'm not particularly invested in the idea of renaming ANALYZE - but I
> think we might be able to come up with something less ambiguous than
> EXECUTE. Even EXECUTION might be better.

The GQL draft uses PROFILE as a separate top-level command, so it would be

PROFILE SELECT ...

That seems nice and clear.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Choosing values for multivariate MCV lists

2019-06-18 Thread Tomas Vondra

Hi,

The current implementation of multi-column MCV lists (added in this
cycle) uses a fairly simple algorithm to pick combinations to include in
the MCV list. We just compute a minimum number of occurences, and then
include all entries sampled more often. See get_mincount_for_mcv_list().

By coincidence I received a real-world data set where this does not work
particularly well, so I'm wondering if we can improve this somehow. It
does not make the estimates worse than without the MCV lists, so I don't
think we have an issue, but I wonder if we could do better.

The data set is very simple - it's a single table of valid addresses
(city/street name and street number):

   CREATE TABLE addresses (
   city_name   text,
   street_name text,
   street_no   int
   ); 


Data with Czech addresses are available here (it's ~9.3MB, so I'm not
attaching it directly)

   https://drive.google.com/file/d/1EiZGsk6s5hqzZrL7t5KiaqByJnBsndfA

and attached is a SQL script I used to compute a "virtual" MCV list.

There are about 3M records, so let's query for a street name in one of
the cities:

 EXPLAIN ANALYZE
 SELECT * FROM addresses
  WHERE city_name = 'Most' AND street_name = 'Pionýrů'


 QUERY PLAN
 
  Seq Scan on addresses  (cost=0.00..62645.38 rows=1 width=25)
 (actual time=117.225..204.291 rows=779 loops=1)
Filter: ((city_name = 'Most'::text) AND (street_name = 'Pionýrů'::text))
Rows Removed by Filter: 2923846
  Planning Time: 0.065 ms
  Execution Time: 204.525 ms
 (5 rows)

It's true 779 rows is only a tiny fraction of the data set (~0.025%),
but this data set is a bit weird in one other aspect - about half of the
records have an empty (NULL) street_name, it's just city + number. Small
villages simply don't have streets at all, and large cities probably
started as small villages, so they have such addresses too.

This however means that by choosing the MCV entries solely based on the
number of occurrences in the sample, we end up with MCV lists where the vast
majority of entries have a NULL street name.

That's why we got such poor estimate in the example query, despite the
fact that the city/street combination is the most frequent in the data
set (with non-NULL street name).

The other weird thing is that frequency of NULL street names is fairly
uniform in the whole data set. In total about 50% addresses match that,
and for individual cities it's generally between 25% and 100%, so the
estimate is less than 2x off in those cases.

But for addresses with non-NULL street names, the situation is very
different. Some street names are unique to a single city, etc.

Overall, this means we end up with an MCV list with entries representing
the mostly-uniform part of the data set, instead of preferring the
entries that are truly skewed.

So I'm wondering if we could/should rethink the algorithm, so look at
the frequency and base_frequency, and maybe pick entries based on their
ratio, or something like that.

For example, we might sort the entries by

   abs(freq - base_freq) / freq

which seems like a reasonable "measure of mis-estimate". Or maybe we
might look just at abs(freq - base_freq)? I think the first option would
be better, because (1 row vs. 100,000 rows) is probably worse than (1M
rows vs. 2M rows).
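
(For the record, the measure is easy to eyeball on an existing MCV list with a
query like the one below; this assumes the current catalog layout with
pg_statistic_ext_data, and 'addresses_stats' stands for whatever the
statistics object on the table is called.)

SELECT m.*,
       abs(m.frequency - m.base_frequency) / m.frequency AS misestimate
  FROM pg_statistic_ext s
  JOIN pg_statistic_ext_data d ON d.stxoid = s.oid,
       pg_mcv_list_items(d.stxdmcv) m
 WHERE s.stxname = 'addresses_stats'
 ORDER BY misestimate DESC;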

Of course, this is a challenging problem, for a couple of reasons.

Firstly, picking simply the most frequent groups is very simple and it
gives us additional information about the largest group (which may be
useful elsewhere, e.g. the incremental sort patch).

Secondly, if the correlated combinations are less frequent, how reliably
can we even estimate the frequency from a sample? The combination in the
example query was ~0.02% of the data, so how likely is it to be sampled?
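
(As a back-of-the-envelope check, with the default statistics target of 100
the sample is 300 * 100 = 30,000 rows, so the expected number of sampled rows
for that combination is roughly

SELECT 30000 * (779.0 / 2924625) AS expected_in_sample;   -- ~8 rows

which may well fall below the mincount threshold.)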

Alternatively, it's possible to increase statistics target to make the
sample larger, but that also keeps more entries in the MCV list,
including the correlated ones. So maybe the best thing we can do is to
just rely on that?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 


addresses.sql
Description: application/sql


Re: initdb recommendations

2019-06-18 Thread Peter Eisentraut
On 2019-05-23 18:54, Peter Eisentraut wrote:
> To recap, the idea here was to change the default authentication methods
> that initdb sets up, in place of "trust".
> 
> I think the ideal scenario would be to use "peer" for local and some
> appropriate password method (being discussed elsewhere) for host.

Patch for that attached.

> Looking through the buildfarm, I gather that the only platforms that
> don't support peer are Windows, AIX, and HP-UX.

Note that with this change, running initdb without arguments will now
error on those platforms: You need to supply either a password or select
a different default authentication method.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 88de6f226233bb183dc75ee047501fae5051e287 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 18 Jun 2019 22:20:23 +0200
Subject: [PATCH v1] initdb: Change authentication defaults

Change the defaults for the pg_hba.conf generated by initdb to "peer"
for local (if supported, else "md5") and "md5" for host.

(Changing from "md5" to SCRAM is left as a separate exercise.)

"peer" is currently not supported on AIX, HP-UX, and Windows.  Users
on those operating systems will now either have to provide a password
to initdb or choose a different authentication method when running
initdb.

Discussion: 
https://www.postgresql.org/message-id/flat/bec17f0a-ddb1-8b95-5e69-368d9d0a3390%40postgresql.org
---
 doc/src/sgml/ref/initdb.sgml|  9 -
 doc/src/sgml/runtime.sgml   | 23 +
 doc/src/sgml/standalone-install.xml |  9 -
 src/bin/initdb/initdb.c | 31 ++---
 src/include/port.h  |  5 +
 src/test/regress/pg_regress.c   |  2 +-
 6 files changed, 33 insertions(+), 46 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 7fc3152c6d..c47b9139eb 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -136,9 +136,16 @@ Options
 replication connections.

 
+   
+The default is peer for Unix-domain socket
+connections on operating systems that support it, otherwise
+md5, and md5 for TCP/IP
+connections.
+   
+

 Do not use trust unless you trust all local users 
on your
-system.  trust is the default for ease of 
installation.
+system.

   
  
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 365ec75aad..305698aa0e 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -156,24 +156,19 @@ Creating a Database Cluster
   
 
   
-   However, while the directory contents are secure, the default
-   client authentication setup allows any local user to connect to the
-   database and even become the database superuser. If you do not
-   trust other local users, we recommend you use one of
+   The default client authentication setup is such that users can connect over
+   the Unix-domain socket to the same database user name as their operating
+   system user names (on operating systems that support this, which are most
+   modern Unix-like systems, but not Windows) and otherwise with a password.
+   To assign a password to the initial database superuser, use one of
initdb's -W, --pwprompt
-   or --pwfile options to assign a password to the
-   database superuser.
+   or --pwfile options.
  password
  of the superuser

-   Also, specify -A md5 or
-   -A password so that the default trust 
authentication
-   mode is not used; or modify the generated pg_hba.conf
-   file after running initdb, but
-   before you start the server for the first time. (Other
-   reasonable approaches include using peer authentication
-   or file system permissions to restrict connections. See  for more information.)
+   This configuration is secure and sufficient to get started.  Later, see
+for more information about setting
+   up client authentication.
   
 
   
diff --git a/doc/src/sgml/standalone-install.xml 
b/doc/src/sgml/standalone-install.xml
index f584789f9a..749a071061 100644
--- a/doc/src/sgml/standalone-install.xml
+++ b/doc/src/sgml/standalone-install.xml
@@ -63,15 +63,6 @@ Getting Started
 

 
-   
-
- At this point, if you did not use the initdb 
-A
- option, you might want to modify pg_hba.conf to 
control
- local access to the server before you start it.  The default is to
- trust all local users.
-
-   
-

 
  The previous initdb step should have told you how to
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ad5cd4194a..fd69c5c95e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -185,7 +185,6 @@ static const char *default_timezone = NULL;
 "# allows any local user to connect as any PostgreSQL user, including\n" \
 "# the database superuser.  If you do 

Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit

2019-06-18 Thread Andres Freund
Hi,

On 2018-12-17 15:35:01 -0800, Andres Freund wrote:
> On 2018-12-16 13:48:00 -0800, Andres Freund wrote:
> > On 2018-12-17 08:25:38 +1100, Thomas Munro wrote:
> > > On Mon, Dec 17, 2018 at 7:57 AM Andres Freund  wrote:
> > > > The interesting bit is that if I replace the _exit(2) in
> > > > bgworker_quickdie() with an exit(2) (i.e. processing atexit handlers),
> > > > or manully add an OPENSSL_cleanup() before the _exit(2), valgrind
> > > > doesn't find errors.
> > > 
> > > Weird.  Well I can see that there were bugs last year where OpenSSL
> > > failed to clean up its thread locals[1], and after they fixed that,
> > > cases where it bogusly cleaned up someone else's thread locals[2].
> > > Maybe there is some interference between pthread keys or something
> > > like that.
> > > 
> > > [1] https://github.com/openssl/openssl/issues/3033
> > > [2] https://github.com/openssl/openssl/issues/3584
> > 
> > What confuses the heck out of me is that it happens on _exit(). Those
> > issues ought to be only visible when doing exit(), no?
> > 
> > I guess there's also a good argument to make that valgrind running it's
> > intercept in the _exit() case is a bit dubious (given that's going to be
> > used in cases where e.g. a signal handler might have interrupted a
> > malloc), but given the stacktraces here I don't think that can be the
> > cause.
> 
> I've for now put --run-libc-freeres=no into skink's config. Locally that
> "fixes" the issue for me, but of course is not a proper solution. But I
> want to see whether that allows running all tests under valgrind.

Turns out to be caused by a glibc bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=24476

The reason it only fails if ssl is enabled, and only after the openssl
randomness was integrated, is that openssl randomness initialization
creates a TLS variable, which glibc then frees accidentally (as it tries
to free something not initialized).

Thus this can be "worked around" by doing something like
shared_preload_libraries=pg_stat_statements, as dlopening a library
initializes the relevant state.

Greetings,

Andres Freund




Re: idea: log_statement_sample_rate - bottom limit for sampling

2019-06-18 Thread Pavel Stehule
On Tue, 18 Jun 2019 at 14:03, Adrien Nayrat  wrote:

> Hi,
>
> I tried the patch, here my comment:
>
> > gettext_noop("Zero effective disables sampling. "
> >  "-1 use sampling every time (without limit)."),
>
> I do not agree with the zero case. In fact, sampling is disabled as soon as
> setting is less than log_min_duration_statements. Furthermore, I think we
> should
> provide a more straightforward description for users.
>

You are right, but I don't have an idea how to describe it in one line. In
this case zero is a corner case, and sampling is disabled without any
dependency on log_min_duration_statement.

I think this design has only a few useful values and ranges:

a) higher than log_min_duration_statement .. sampling is active, with a limit
b) 0 .. another way to effectively disable sampling, with no dependency on
other settings
c) -1 or any negative value .. sampling is applied every time.

Sure, there is the range (0..log_min_duration_statement), but in that range
the value makes no sense. I don't think this case can be covered in the short
description, but it should be mentioned in the documentation.
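
To make the semantics concrete, the whole decision in the patch boils down
to something like the sketch below (GUC values are passed as plain arguments
so it can be read in isolation; this is an illustration of the logic, not
the patch code itself):

#include <stdbool.h>
#include <stdlib.h>

#define MAX_RANDOM_VALUE 0x7FFFFFFF     /* same value PostgreSQL uses */

static bool
should_log_duration(long duration_ms,
                    long min_duration_ms,   /* log_min_duration_statement */
                    double sample_rate,     /* log_statement_sample_rate */
                    long sample_limit_ms)   /* log_statement_sample_limit */
{
    /* statement ran long enough to be considered at all */
    bool    exceeded = (min_duration_ms >= 0 &&
                        duration_ms >= min_duration_ms);

    /* statement won the sampling lottery (rate 0 disables logging) */
    bool    in_sample = (sample_rate != 0 &&
                         (sample_rate == 1 ||
                          random() <= sample_rate * MAX_RANDOM_VALUE));

    /* above the limit, sampling no longer applies (-1 means "never") */
    bool    every_time = (sample_limit_ms >= 0 &&
                          duration_ms >= sample_limit_ms);

    return exceeded && (in_sample || every_time);
}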


> I changed few comments and documentation:
>
>   * As we added much more logic in this function with statement and
> transaction
> sampling. And now with statement_sample_rate, it is not easy to understand
> the
> logic on first look. I reword comment in check_log_duration, I hope it is
> more
> straightforward.
>
>   * I am not sure if "every_time" is a good naming for the variable. In
> fact, if
> duration exceeds limit we disable sampling. Maybe sampling_disabled is
> more clear?
>

For me the important part is the following line:

(exceeded && (in_sample || every_time))

I think "every_time" or "always" or "every" is more illustrative in this
context than "sampling_disabled". But my opinion is not strong in this
case, and I have no problem accepting the common opinion.


>
>   * I propose to add some words in log_min_duration_statement and
> log_statement_sample_rate documentation.
>
>   * Rephrased log_statement_sample_limit documentation, I hope it help
> understanding.
>
> Patch attached.
>
> Regards,
>
> --
> Adrien
>


Re: pgsql: Avoid spurious deadlocks when upgrading a tuple lock

2019-06-18 Thread Oleksii Kliukin
Alvaro Herrera  wrote:

> On 2019-Jun-16, Oleksii Kliukin wrote:
> 
>> Alvaro Herrera  wrote:
>> 
>>> On 2019-Jun-14, Alvaro Herrera wrote:
>>> 
 I think there are worse problems here.  I tried the attached isolation
 spec.  Note that the only difference in the two permutations is that s0
 finishes earlier in one than the other; yet the first one works fine and
 the second one hangs until killed by the 180s timeout.  (s3 isn't
 released for a reason I'm not sure I understand.)
>>> 
>>> Actually, those behaviors both seem correct to me now that I look
>>> closer.  So this was a false alarm.  In the code before de87a084c0, the
>>> first permutation deadlocks, and the second permutation hangs.  The only
>>> behavior change is that the first one no longer deadlocks, which is the
>>> desired change.
>>> 
>>> I'm still trying to form a case to exercise the case of skip_tuple_lock
>>> having the wrong lifetime.
>> 
>> Hm… I think it was an oversight from my part not to give skip_lock_tuple the
>> same lifetime as have_tuple_lock or first_time (also initializing it to
>> false at the same time). Even if now it might not break anything in an
>> obvious way, a backward jump to l3 label will leave skip_lock_tuple
>> uninitialized, making it very dangerous for any future code that will rely
>> on its value.
> 
> But that's not the danger ... with the current coding, it's initialized
> to false every time through that block; that means the tuple lock will
> never be skipped if we jump back to l3.  So the danger is that the first
> iteration sets the variable, then jumps back; second iteration
> initializes the variable again, so instead of skipping the lock, it
> takes it, causing a spurious deadlock.

Sorry, I was confused, as I was looking only at
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=de87a084c0a5ac927017cd0834b33a932651cfc9

without taking your subsequent commit that silences compiler warnings at
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3da73d6839dc47f1f47ca57974bf28e5abd9b572
into consideration. With that commit, the danger is indeed in resetting the
skip mechanism on each jump and potentially causing deadlocks.
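
For anyone following along, the shape of the problem is just variable
lifetime across the backward jump to l3; schematically (this is not
heap_lock_tuple() itself, only an illustration):

#include <stdbool.h>

/* Stand-in for the per-iteration work done after the l3 label. */
extern bool examine_tuple(bool skip_tuple_lock);

static void
lock_tuple_schematic(void)
{
    /*
     * Declared once, alongside first_time: a decision made in one
     * iteration survives the jump back to l3.  Declaring it inside the
     * block below would reset it to false on every retry, which is the
     * spurious-deadlock scenario described above.
     */
    bool    skip_tuple_lock = false;

l3:
    if (!examine_tuple(skip_tuple_lock))
    {
        /* e.g. we noticed we already hold a suitable lock */
        skip_tuple_lock = true;
        goto l3;
    }
}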

Cheers,
Oleksii



Re: POC: Cleaning up orphaned files using undo logs

2019-06-18 Thread Robert Haas
On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila  wrote:
> [ new patches ]

I tried writing some code that throws an error from an undo log
handler and the results were not good.  It appears that the code will
retry in a tight loop:

2019-06-18 13:58:53.262 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.264 EDT [42803] ERROR:  robert_undo

It seems clear that the error-handling aspect of this patch has not
been given enough thought.  It's debatable what strategy should be
used when undo fails, but retrying 40 times per millisecond isn't the
right answer. I assume we want some kind of cool-down between retries.
10 seconds?  A minute?  Some kind of back-off algorithm that gradually
increases the retry time up to some maximum?  Should there be one or
more GUCs?
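
As one possible shape, a capped exponential back-off would be the obvious
candidate; here is a sketch (the variable names are invented for
illustration and are not something the patch provides):

#include <stdint.h>

static int  undo_retry_delay_ms = 10000;        /* hypothetical starting delay */
static int  undo_retry_delay_max_ms = 600000;   /* hypothetical ceiling */

/*
 * Given the delay used for the previous attempt (0 for the first
 * failure), return how long to wait before retrying the undo actions.
 */
static int
next_undo_retry_delay(int prev_delay_ms)
{
    int64_t     next;

    if (prev_delay_ms <= 0)
        next = undo_retry_delay_ms;
    else
        next = (int64_t) prev_delay_ms * 2;     /* double after each failure */

    if (next > undo_retry_delay_max_ms)
        next = undo_retry_delay_max_ms;         /* but never exceed the ceiling */

    return (int) next;
}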

Another thing that is not very nice is that when I tried to shut down
the server via 'pg_ctl stop' while the above was happening, it did not
shut down.  I had to use an immediate shutdown.  That's clearly not
OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Avoiding possible future conformance headaches in JSON work

2019-06-18 Thread Chapman Flack
On 6/18/19 12:51 PM, Oleg Bartunov wrote:
>> have 'pg lax $.map(x => x + 10)'.
> 
> This is exactly what we were thinking about !

Perfect!

>> specify a POSIX re. Or, as like_regex has a named-parameter-like
>> syntax--like_regex("abc" flag "i")--perhaps 'posix' should just be
>> an extra keyword in that grammar: like-regex("abc" posix). That would
>> be safe from the committee adding a P flag that means something else.
> 
> We didn't think about regex, I don't know anybody working on xquery.

I do. :)

But is that even the point? It's already noted in [1] that the standard
calls for one style of regexps and we're providing another.

It's relatively uncomplicated now to add some kind of distinguishing
syntax for our posix flavor of like_regex. Yes, it would be a change
between beta1 and final release, but that doesn't seem unheard-of.

In contrast, if such a distinction is not added now, we know that will
be a headache for any future effort to more closely conform to the
standard. Whether such a future effort seems near-term or far off, it
doesn't seem strategic to make current choices that avoidably make it
harder.

Aside: I just looked over the 12 doco to see if the note in [1] is
in there, and all I see is that 'like_regex' is documented as "Tests
pattern matching with POSIX regular expressions."

In my opinion, that ought to have a note flagging that as different
from the standard. The user experience is not so good if someone comes
assuming we conform to the standard, writes code, then has to learn
why it didn't work. The whole doc section [2] about XML is intended
to spare people from unwelcome discoveries of that sort, but it was
written after the fact. I think it's better to have it from the start.

[1]
https://github.com/obartunov/sqljsondoc/blob/master/jsonpath.md#sqljson-conformance

[2] https://www.postgresql.org/docs/12/xml-limits-conformance.html




Re: Avoiding possible future conformance headaches in JSON work

2019-06-18 Thread Oleg Bartunov
On Sat, 1 Jun 2019, 16:41 Chapman Flack,  wrote:

> Hi,
>
> We had a short conversation about this on Friday but I didn't have time
> to think of a constructive suggestion, and now I've had more time to
> think about it.
>
> Regarding the proposed PG 13 jsonpath extensions (array, map, and
> sequence constructors, lambdas, map/fold/reduce, user-defined
> functions), literally all this stuff is in XPath/XQuery 3.1, and
> clearly the SQL committee is imitating XPath/XQuery in the design
> of jsonpath.
>
> Therefore it would not be surprising at all if the committee eventually
> adds those features in jsonpath. At that point, if the syntax matches
> what we've added, we are happy, and if not, we have a multi-year,
> multi-release, standard_conforming_strings-style headache.
>
> So, a few ideas fall out
>
> First, with Peter being a participant, if there are any rumblings in the
> SQL committee about adding those features, we should know the proposed
> syntax as soon as we can and try to follow that.
>

AFAIK, there are rumours about a 'native json data type' and 'dot style syntax'
for json, but not about jsonpath.


> If such rumblings are entirely absent, we should see what we can do to
> start some, proposing the syntax we've got.
>
> In either case, perhaps we should immediately add a way to identify a
> jsonpath as being PostgreSQL-extended. Maybe a keyword 'pg' that can
> be accepted at the start in addition to any lax/strict, so you could
> have 'pg lax $.map(x => x + 10)'.
>

This is exactly what we were thinking about !

>
> If we initially /require/ 'pg' for the extensions to be recognized, then
> we can relax the requirement for whichever ones later appear in the spec
> using the same syntax. If they appear in the spec with a different
> syntax, then by requiring 'pg' already for our variant, we already have
> avoided the standard_conforming_strings kind of multi-release
> reconciliation effort.
>
> In the near term, there is already one such potential conflict in
> 12beta: the like_regex using POSIX REs instead of XQuery ones as the
> spec requires. Of course we don't currently have an XQuery regex
> engine, but if we ever have one, we then face a headache if we want to
> move jsonpath toward using it. (Ties in to conversation [1].)
>
> Maybe we could avoid that by recognizing now an extra P in flags, to
> specify a POSIX re. Or, as like_regex has a named-parameter-like
> syntax--like_regex("abc" flag "i")--perhaps 'posix' should just be
> an extra keyword in that grammar: like-regex("abc" posix). That would
> be safe from the committee adding a P flag that means something else.
>
> The conservative approach would be to simply require the 'posix' keyword
> in all cases now, simply because we don't have the XQuery regex engine.
>
> Alternatively, if there's a way to analyze a regex for the use of any
> constructs with different meanings in POSIX and XQuery REs (and if
> that's appreciably easier than writing an XQuery regex engine), then
> the 'posix' keyword could be required only when it matters. But the
> conservative approach sounds easier, and sufficient. The finer-grained
> analysis would have to catch not just constructs that are in one RE
> style and not the other, but any subtleties in semantics, and I
> certainly wouldn't trust myself to write that.
>

We didn't think about regex, I don't know anybody working on xquery.


> -Chap
>
>
> [1]
> https://www.postgresql.org/message-id/5CF2754F.7000702%40anastigmatix.net
>
>
>


Re: pgsql: Avoid spurious deadlocks when upgrading a tuple lock

2019-06-18 Thread Alvaro Herrera
On 2019-Jun-16, Alvaro Herrera wrote:

> So, I'm too lazy today to generate a case that fully reproduces the
> deadlock, because you need to stall 's2' a little bit using the
> well-known advisory lock trick, but this one hits the code that would
> re-initialize the variable.

Here's such a case.  I was unable to reproduce the condition with a
smaller sequence of commands.  This one does hit the deadlock when used
with the previous code, as expected; with the fixed code (i.e.
skip_tuple_lock in the outer scope, with the same lifetime as "first_time")
it works fine, with no deadlock.

I'm going to push the fixed commit this afternoon, including this as an
additional permutation in the spec file.

setup
{
drop table if exists tlu_job;
create table tlu_job (id integer primary key, name text);

insert into tlu_job values(1, 'a');
}

teardown
{
drop table tlu_job;
}

session "s0"
setup { begin; }
step "s0_keyshare" { select id from tlu_job where id = 1 for key share; }
step "s0_share" { select id from tlu_job where id = 1 for share; }
step "s0_rollback" { rollback; } 

session "s1"
setup { begin; }
step "s1_keyshare" { select id from tlu_job where id = 1 for key share; }
step "s1_savept_e" { savepoint s1_e; }
step "s1_share" { select id from tlu_job where id = 1 for share; }
step "s1_savept_f" { savepoint s1_f; }
step "s1_fornokeyupd" { select id from tlu_job where id = 1 for no key update; }
step "s1_rollback_f" { rollback to s1_f; }
step "s1_rollback_e" { rollback to s1_e; }
step "s1_rollback" { rollback; }

session "s2"
setup { begin; }
step "s2_keyshare" { select id from tlu_job where id = 1 for key share; }
step "s2_fornokeyupd" { select id from tlu_job where id = 1 for no key update; }
step "s2_rollback" { rollback; }

session "s3"
setup { begin; }
step "s3_for_update" { select id from tlu_job where id = 1 for update; }
step "s3_rollback" { rollback; }

permutation "s1_keyshare" "s3_for_update" "s2_keyshare" "s1_savept_e" 
"s1_share" "s1_savept_f" "s1_fornokeyupd" "s2_fornokeyupd" "s0_keyshare" 
"s1_rollback_f" "s0_share" "s1_rollback_e" "s1_rollback" "s2_rollback" 
"s0_rollback" "s3_rollback"

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Avoiding possible future conformance headaches in JSON work

2019-06-18 Thread Alvaro Herrera
On 2019-Jun-01, Chapman Flack wrote:

> In either case, perhaps we should immediately add a way to identify a
> jsonpath as being PostgreSQL-extended. Maybe a keyword 'pg' that can
> be accepted at the start in addition to any lax/strict, so you could
> have 'pg lax $.map(x => x + 10)'.
> 
> If we initially /require/ 'pg' for the extensions to be recognized, then
> we can relax the requirement for whichever ones later appear in the spec
> using the same syntax. If they appear in the spec with a different
> syntax, then by requiring 'pg' already for our variant, we already have
> avoided the standard_conforming_strings kind of multi-release
> reconciliation effort.

I agree we should do this (or something similar) now, to avoid future
pain.  It seems a similar problem to E'' strings vs. SQL-standard
''-ones, which was a painful transition.  We have an opportunity to do
better this time.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PG 12 beta 1 segfault during analyze

2019-06-18 Thread Tom Lane
Andres Freund  writes:
> On 2019-06-18 00:32:17 -0400, Tom Lane wrote:
>> Hmm, that's a pretty obvious mistake :-( but after some fooling around
>> I've not been able to cause a crash with it.  I wonder what test case
>> you were using, on what platform?

> I suspect that's because the bug is "only" in the
> HEAPTUPLE_DELETE_IN_PROGRESS case. And it's "harmless" as far as
> crashing goes in the !TransactionIdIsCurrentTransactionId() case,
> because as the tuple is sampled, we just return. And then there still
> needs to be an actually dead row afterwards, to actually trigger
> dereferencing the modified deadrows. And then acquire_sample_rows()'s
> deadrows actually needs to point to something that causes crashes when
> modified.

Right, I'd come to the same conclusions last night, but failed to make
a crasher example.  Not sure why though, because my first try today
blew up real good:

---
\set N 10

create table bug as select generate_series(1,:N) f1;
delete from bug where f1 = :N;

begin;
delete from bug;
analyze verbose bug;
rollback;

drop table bug;
---

On my machine, N smaller than 10 doesn't do it, but of course that
will be very platform-specific.

> Will fix tomorrow morning.

OK.  To save you the trouble of "git blame", it looks like
737a292b5de296615a715ddce2b2d83d1ee245c5 introduced this.

regards, tom lane




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-18 Thread Magnus Hagander
On Tue, Jun 18, 2019 at 3:37 PM Stephen Frost  wrote:

> Greetings,
>
> * Magnus Hagander (mag...@hagander.net) wrote:
> > On Mon, Jun 17, 2019 at 5:41 PM Stephen Frost 
> wrote:
> > > I'd further say something along the lines of 'utilities should not
> > > modify a postgresql.auto.conf that's in place under a running
> PostgreSQL
> > > cluster'.
> >
> > Do we need to differ between "external" and "internal" utilities here?
>
> I don't think so..?  Is there something there that you're thinking would
> be different between them?
>

Probably not. In general I was thinking that we could "allow" internal tools to
do things external ones shouldn't do, for example using internal APIs. But it's
probably a bad idea to go down that road.


> > I'd rather say that 'any duplicate items should be removed, and a
> > > WARNING emitted when detected', or something along those lines.  Same
> > > for comment lines...
> >
> > I think it's perfectly fine to silently drop comments (other than the one
> > at the very top which says don't touch this file).
>
> I'm not sure why that's different?  I don't really think that I agree
> with you on this one- anything showing up in that file that we're ending
> up removing must have gotten there because someone or something didn't
> realize the rules around managing the file, and that's a problem...
>

I'm not strongly against it, I just consider it unnecessary :)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-18 Thread Stephen Frost
Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> On Mon, Jun 17, 2019 at 5:41 PM Stephen Frost  wrote:
> > I'd further say something along the lines of 'utilities should not
> > modify a postgresql.auto.conf that's in place under a running PostgreSQL
> > cluster'.
> 
> Do we need to differ between "external" and "internal" utilities here?

I don't think so..?  Is there something there that you're thinking would
be different between them?

> > I'd rather say that 'any duplicate items should be removed, and a
> > WARNING emitted when detected', or something along those lines.  Same
> > for comment lines...
> 
> I think it's perfectly fine to silently drop comments (other than the one
> at the very top which says don't touch this file).

I'm not sure why that's different?  I don't really think that I agree
with you on this one- anything showing up in that file that we're ending
up removing must have gotten there because someone or something didn't
realize the rules around managing the file, and that's a problem...
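
To make the duplicate-handling suggestion concrete, the rewrite could keep
only the last occurrence of each parameter and warn about the earlier ones,
roughly like the sketch below (the ConfigLine type is invented for
illustration and is not PostgreSQL's actual parser structure):

#include <stdio.h>
#include <string.h>

typedef struct ConfigLine
{
    char       *name;
    char       *value;
    struct ConfigLine *next;
} ConfigLine;

/* Keep only the last occurrence of "name", warning about earlier ones. */
static ConfigLine *
drop_earlier_duplicates(ConfigLine *head, const char *name)
{
    ConfigLine *last = NULL;
    ConfigLine **link;

    for (ConfigLine *item = head; item != NULL; item = item->next)
    {
        if (strcmp(item->name, name) == 0)
            last = item;
    }

    link = &head;
    while (*link != NULL)
    {
        ConfigLine *item = *link;

        if (item != last && strcmp(item->name, name) == 0)
        {
            fprintf(stderr, "WARNING: duplicate \"%s\" entry removed\n", name);
            *link = item->next;     /* unlink the earlier duplicate */
        }
        else
            link = &item->next;
    }

    return head;
}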

Thanks,

Stephen


signature.asc
Description: PGP signature


fix "Success" error messages

2019-06-18 Thread Peter Eisentraut
In the case of a corrupted database, I saw an error message like

Could not read from file ...: Success.

from the SLRU module.  This is because it checks that it reads or writes
exactly BLCKSZ, and otherwise goes to the error path.  The attached patch
gives a different error message in this case.

Because of the structure of this code, we don't have the information to
do the usual "read %d of %zu", but at least this is better than
reporting a "success" error.
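
The underlying reason is that a short read() is not an error at the
operating system level, so errno is left at 0 and %m renders as "Success".
A minimal sketch of the pattern (illustration only, not the slru.c code):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
read_block(int fd, char *buf, size_t blcksz)
{
    errno = 0;
    if (read(fd, buf, blcksz) != (ssize_t) blcksz)
    {
        if (errno == 0)
            fprintf(stderr, "short read\n");    /* EOF: nothing useful in errno */
        else
            fprintf(stderr, "read failed: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}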

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 680651938e30b5208a73b9d76813a5880af4ab54 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 18 Jun 2019 14:27:53 +0200
Subject: [PATCH 1/2] Better error messages for short reads/writes in SLRU

This avoids getting a

Could not read from file ...: Success.

for a short read or write (since errno is not set in that case).
Instead, report more specific error messages.
---
 src/backend/access/transam/slru.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/slru.c 
b/src/backend/access/transam/slru.c
index 2dbc65b125..315c0612e6 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -923,15 +923,19 @@ SlruReportIOError(SlruCtl ctl, int pageno, TransactionId 
xid)
ereport(ERROR,
(errcode_for_file_access(),
 errmsg("could not access status of 
transaction %u", xid),
-errdetail("Could not read from file 
\"%s\" at offset %u: %m.",
-  path, offset)));
+errno == 0
+? errdetail("Short read from file 
\"%s\" at offset %u.", path, offset)
+: errdetail("Could not read from file 
\"%s\" at offset %u: %m.",
+path, 
offset)));
break;
case SLRU_WRITE_FAILED:
ereport(ERROR,
(errcode_for_file_access(),
 errmsg("could not access status of 
transaction %u", xid),
-errdetail("Could not write to file 
\"%s\" at offset %u: %m.",
-  path, offset)));
+errno == 0
+? errdetail("Short write to file 
\"%s\" at offset %u.", path, offset)
+: errdetail("Could not write to file 
\"%s\" at offset %u: %m.",
+path, 
offset)));
break;
case SLRU_FSYNC_FAILED:
ereport(data_sync_elevel(ERROR),
-- 
2.22.0

From 9d6d381993701fc696466e822d58289c480c358f Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 18 Jun 2019 14:29:31 +0200
Subject: [PATCH 2/2] Use consistent style for checking return from system
 calls

Use

if (something() < 0)
error ...

instead of just

if (something)
error ...

The latter is not incorrect, but it's a bit confusing and not the
common style.
---
 src/backend/access/transam/slru.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/slru.c 
b/src/backend/access/transam/slru.c
index 315c0612e6..dab5b6ffa0 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -697,7 +697,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
}
pgstat_report_wait_end();
 
-   if (CloseTransientFile(fd))
+   if (CloseTransientFile(fd) < 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
slru_errno = errno;
@@ -869,7 +869,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, 
SlruFlush fdata)
if (!fdata)
{
pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
-   if (ctl->do_fsync && pg_fsync(fd))
+   if (ctl->do_fsync && pg_fsync(fd) < 0)
{
pgstat_report_wait_end();
slru_errcause = SLRU_FSYNC_FAILED;
@@ -879,7 +879,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, 
SlruFlush fdata)
}
pgstat_report_wait_end();
 
-   if (CloseTransientFile(fd))
+   if (CloseTransientFile(fd) < 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
slru_errno = errno;
@@ -1150,7 +1150,7 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
for (i = 0; i < fdata.num_files; i++)
{
 

Replacing the EDH SKIP primes

2019-06-18 Thread Daniel Gustafsson
The current hardcoded EDH parameter fallback uses the old SKIP primes, for which
the source disappeared from the web a long time ago.  Referencing a known dead
source seems a bit silly, so I think we should either switch to a non-dead
source of MODP primes or use an archive.org link for SKIP.  Personally I prefer
the former.

This was touched upon, but never really discussed AFAICT, back when the EDH
parameters were reworked a few years ago.  Instead of replacing them with custom
ones, as suggested in [1], we might as well replace them with standardized ones,
as this is a fallback.  Custom ones won't make it more secure, they just add more
work for the project.  The attached patch replaces the SKIP prime with the 2048
bit MODP group from RFC 3526, which is the same change that OpenSSL did a few
years back [2].
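
For reference, a hardcoded PEM-encoded group like this ends up being handed
to OpenSSL roughly as below (an illustrative sketch, not the actual
be-secure-openssl.c code):

#include <openssl/bio.h>
#include <openssl/dh.h>
#include <openssl/pem.h>
#include <openssl/ssl.h>

/*
 * Install a PEM-encoded DH group (e.g. the RFC 3526 2048-bit MODP group)
 * as the ephemeral-DH fallback for a context.  Returns 1 on success.
 */
static int
set_fallback_dh(SSL_CTX *ctx, const char *pem, int len)
{
    BIO        *bio;
    DH         *dh;

    bio = BIO_new_mem_buf(pem, len);
    if (bio == NULL)
        return 0;

    dh = PEM_read_bio_DHparams(bio, NULL, NULL, NULL);
    BIO_free(bio);
    if (dh == NULL)
        return 0;

    SSL_CTX_set_tmp_dh(ctx, dh);    /* the context copies the parameters */
    DH_free(dh);
    return 1;
}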

cheers ./daniel

[1] 
https://www.postgresql.org/message-id/54f44984-2f09-8744-927f-140a90c379dc%40ohmu.fi
[2] 
https://github.com/openssl/openssl/commit/fb015ca6f05e09b11a3932f89d25bae697c8af1e



skip_primes.patch
Description: Binary data


Re: idea: log_statement_sample_rate - bottom limit for sampling

2019-06-18 Thread Adrien Nayrat
Hi,

I tried the patch, here my comment:

> gettext_noop("Zero effective disables sampling. "
>  "-1 use sampling every time (without limit)."),

I do not agree with the zero case. In fact, sampling is disabled as soon as
the setting is less than log_min_duration_statement. Furthermore, I think we
should provide a more straightforward description for users.

I changed a few comments and some documentation:

  * We have added much more logic to this function with statement and
transaction sampling, and now with statement_sample_rate it is not easy to
understand the logic at first glance. I reworded the comment in
check_log_duration; I hope it is more straightforward.

  * I am not sure "every_time" is a good name for the variable. In fact, if
the duration exceeds the limit we disable sampling. Maybe sampling_disabled
would be clearer?

  * I propose adding some words to the log_min_duration_statement and
log_statement_sample_rate documentation.

  * I rephrased the log_statement_sample_limit documentation; I hope it
helps understanding.

Patch attached.

Regards,

-- 
Adrien
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..ad965a084c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5850,7 +5850,8 @@ local0.*/var/log/postgresql
 
  Causes the duration of each completed statement to be logged
  if the statement ran for at least the specified number of
- milliseconds, modulated by log_statement_sample_rate.
+ milliseconds, modulated by log_statement_sample_rate
+ and log_statement_sample_limit.
  Setting this to zero prints all statement durations.
  -1 (the default) disables logging statements due to
  exceeding duration threshold; for example, if you set it to
@@ -5894,6 +5895,8 @@ local0.*/var/log/postgresql
   to be logged.
  The default is 1.0, meaning log all such
  statements.
+ log_statement_sample_limit allows to disable sampling
+ if duration exceeds the limit.
  Setting this to zero disables logging by duration, same as setting
  log_min_duration_statement to
  -1.
@@ -5903,6 +5906,26 @@ local0.*/var/log/postgresql

   
 
+ 
+  log_statement_sample_limit (integer)
+  
+   log_statement_sample_limit configuration parameter
+  
+  
+   
+
+ Disable effect of log_statement_sample_rate when
+ statements last longer than the specified limit in milliseconds.
+ Otherwise, statements that last longer than log_min_duration_statement
+ but less than log_statement_sample_limit are subject
+ to log sampling according to log_statement_sample_rate.
+ When log_statement_sample_limit is less or equal to
+ log_min_duration_statement then log sampling is
+ never applied.
+
+   
+  
+
  
   log_transaction_sample_rate (real)
   
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 44a59e1d4f..50b3f839da 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2192,10 +2192,15 @@ check_log_statement(List *stmt_list)
 
 /*
  * check_log_duration
- *		Determine whether current command's duration should be logged.
- *		If log_statement_sample_rate < 1.0, log only a sample.
- *		We also check if this statement in this transaction must be logged
- *		(regardless of its duration).
+ *
+ * Determine whether current command's duration should be logged. There is also
+ * sampling mechanism following these rules:
+ *		- If log_statement_sample_rate < 1.0, log only a sample.
+ *		- If log_statement_sample_limit >= 0 and duration exceeds this limit,
+ *		  sampling is ignored, so the command is logged.
+ *
+ * If xact_is_sampled is true (due to log_transaction_sample_rate), it means
+ * the whole transaction is logged (regardless of its duration).
  *
  * Returns:
  *		0 if no logging is needed
@@ -2217,6 +,7 @@ check_log_duration(char *msec_str, bool was_logged)
 		int			usecs;
 		int			msecs;
 		bool		exceeded;
+		bool		every_time;
 		bool		in_sample;
 
 		TimestampDifference(GetCurrentStatementStartTimestamp(),
@@ -2234,6 +2240,14 @@ check_log_duration(char *msec_str, bool was_logged)
 	 (secs > log_min_duration_statement / 1000 ||
 	  secs * 1000 + msecs >= log_min_duration_statement)));
 
+		/*
+		 * When log_statement_sample_limit is exceeded, then query is
+		 * logged every time.
+		 */
+		every_time = (log_statement_sample_limit >= 0 &&
+	   (secs > log_statement_sample_limit / 1000 ||
+		secs * 1000 + msecs >= log_statement_sample_limit));
+
 		/*
 		 * Do not log if log_statement_sample_rate = 0. Log a sample if
 		 * log_statement_sample_rate <= 1 and avoid unnecessary random() call
@@ -2245,7 +2259,7 @@ check_log_duration(char *msec_str, bool was_logged)
 			(log_statement_sample_rate == 1 ||
 			 random() <= 

Re: PG 12 beta 1 segfault during analyze

2019-06-18 Thread Steve Singer

On 6/18/19 12:32 AM, Tom Lane wrote:

Steve Singer  writes:

The attached patch fixes the issue.

Hmm, that's a pretty obvious mistake :-( but after some fooling around
I've not been able to cause a crash with it.  I wonder what test case
you were using, on what platform?

regards, tom lane




I was running the Slony regression tests.  The crash happened when it
tried to replicate a particular table that already has data in it on the
replica.  It doesn't happen with every table, and I haven't been able to
reproduce the crash in a self-contained test by manually doing similar
steps to just that table with psql.


This is on x64.






Re: Batch insert in CTAS/MatView code

2019-06-18 Thread Paul Guo
On Mon, Jun 17, 2019 at 8:53 PM Paul Guo  wrote:

> Hi all,
>
> I've been working other things until recently I restarted the work,
> profiling & refactoring the code.
> It's been a long time since the last patch was proposed. The new patch has
> now been firstly refactored due to
> 4da597edf1bae0cf0453b5ed6fc4347b6334dfe1 (Make TupleTableSlots extensible,
> finish split of existing slot type).
>
> Now that TupleTableSlot, instead of HeapTuple is one argument of
> intorel_receive() so we can not get the
> tuple length directly. This patch now gets the tuple length if we know all
> columns are with fixed widths, else
> we calculate an avg. tuple length using the first MAX_MULTI_INSERT_SAMPLES
> (defined as 1000) tuples
> and use for the total length of tuples in a batch.
>
> I noticed that to do batch insert, we might need additional memory copy
> sometimes comparing with "single insert"
> (that should be the reason that we previously saw a bit regressions) so a
> good solution seems to fall back
> to "single insert" if the tuple length is larger than a threshold. I set
> this as 2000 after quick testing.
>
> To make test stable and strict, I run checkpoint before each ctas, the
> test script looks like this:
>
> checkpoint;
> \timing
> create table tt as select a,b,c from t11;
> \timing
> drop table tt;
>
> Also previously I just tested the BufferHeapTupleTableSlot (i.e. create
> table tt as select * from t11),
> this time I test VirtualTupleTableSlot (i.e. create table tt as select
> a,b,c from t11) additionally.
> It seems that VirtualTupleTableSlot is very common in real cases.
>
> I tested four kinds of tables, see below SQLs.
>
> -- tuples with small size.
> create table t11 (a int, b int, c int, d int);
> insert into t11 select s,s,s,s from generate_series(1, 1000) s;
> analyze t11;
>
> -- tuples that are untoasted and tuple size is 1984 bytes.
> create table t12 (a name, b name, c name, d name, e name, f name, g name,
> h name, i name, j name, k name, l name, m name, n name, o name, p name, q
> name, r name, s name, t name, u name, v name, w name, x name, y name, z
> name, a1 name, a2 name, a3 name, a4 name, a5 name);
> insert into t12 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
> 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
> 'z', 'a', 'b', 'c', 'd', 'e' from generate_series(1, 50);
> analyze t12;
>
> -- tuples that are untoasted and tuple size is 2112 bytes.
> create table t13 (a name, b name, c name, d name, e name, f name, g name,
> h name, i name, j name, k name, l name, m name, n name, o name, p name, q
> name, r name, s name, t name, u name, v name, w name, x name, y name, z
> name, a1 name, a2 name, a3 name, a4 name, a5 name, a6 name, a7 name);
> insert into t13 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
> 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
> 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g' from generate_series(1, 50);
> analyze t13;
>
> -- tuples that are toastable and tuple compressed size is 1084.
> create table t14 (a text, b text, c text, d text, e text, f text, g text,
> h text, i text, j text, k text, l text, m text, n text, o text, p text, q
> text, r text, s text, t text, u text, v text, w text, x text, y text, z
> text);
> insert into t14 select i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i,
> i, i, i, i, i, i, i, i, i from (select repeat('123456789', 1) from
> generate_series(1,5000)) i;
> analyze t14;
>
>
> I also tested two scenarios for each testing.
>
> One is to clean up all kernel caches (page & inode & dentry on Linux)
> using the command below and then run the test,
> sync; echo 3 > /proc/sys/vm/drop_caches
> After running all tests all relation files will be in kernel cache (my
> test system memory is large enough to accommodate all relation files),
> then I run the tests again. I run like this because in real scenario the
> result of the test should be among the two results. Also I rerun
> each test and finally I calculate the average results as the experiment
> results. Below are some results:
>
>
> scenario 1: All related kernel caches are cleaned up (note the first two
> columns are times in seconds).
>
> baseline   patch   diff%    SQL
>
> 10.1       5.57    44.85%   create table tt as select * from t11;
>
> 10.7       5.52    48.41%   create table tt as select a,b,c from t11;
>
> 9.57       10.2    -6.58%   create table tt as select * from t12;
>
> 9.64       8.63    10.48%   create table tt as select
> a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
>
> 14.2       14.46   -1.83%   create table tt as select * from t13;
>
> 11.88      12.05   -1.43%   create table tt as select
> a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from
> t13;
>
> 3.17       3.25    -2.52%   create table tt as select * from t14;
>
> 2.93       3.12    -6.48%   create table tt as select
> 

Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-18 Thread Magnus Hagander
On Mon, Jun 17, 2019 at 5:41 PM Stephen Frost  wrote:

>
> * Ian Barwick (ian.barw...@2ndquadrant.com) wrote:
> > On 6/15/19 1:08 AM, Stephen Frost wrote:
> > > * Ian Barwick (ian.barw...@2ndquadrant.com) wrote:
>
> > >> Attached patch attempts to rectify this situation by having
> replace_auto_config_value()
> > >> deleting any duplicate entries first, before making any changes to
> the last entry.
> > >
> > > While this might be a good belt-and-suspenders kind of change to
> > > include, I don't think pg_basebackup should be causing us to have
> > > multiple entries in the file in the first place..
> > (...)
> > >> Also attached is a set of TAP tests to check ALTER SYSTEM works as
> expected (or
> > >> at least as seems correct to me).
> > >
> > > In my view, at least, we should have a similar test for pg_basebackup
> to
> > > make sure that it doesn't create an invalid .auto.conf file.
> >
> > Indeed... I'd be happy to create tests... but first we need a definition
> of what
> > constitutes a valid .auto.conf file.
> >
> > If that definition includes requiring that a parameter may occur only
> once, then
> > we need to provide a way for utilities such as pg_basebackup to write
> the replication
> > configuration to a configuration file in such a way that it doesn't
> somehow
> > render that file invalid.
>
> Yes, I think that we do need to require that a parameter only occur once
> and pg_basebackup and friends need to be able to manage that.
>

+1.


> > I agere that there should have been some effort put into making the way
>
> > ALTER SYSTEM is modified be consistent between the backend and utilities
> > > like pg_basebackup (which would also help third party tools understand
> > > how a non-backend application should be modifying the file).
> >
> > Did you mean to say "the way postgresql.auto.conf is modified"?
>
> Ah, yes, more-or-less.  I think I was going for 'the way ALTER SYSTEM
> modifies postgresql.auto.conf'.
>
> > I suggest explicitly documenting postgresql.auto.conf behaviour (and the
> circumstances
> > where it's acceptable to modify it outside of ALTER SYSTEM calls) in the
> documentation
> > (and possibly in the code), so anyone writing utilities which need to
> > append to postgresql.auto.conf knows what the situation is.
>
> Yeah, I would think that, ideally, we'd have some code in the common
> library that other utilities could leverage and which the backend would
> also use.
>
> > - postgresql.auto.conf is maintained by PostgreSQL and can be rewritten
> at will by the system
> >   at any time
>
> I'd further say something along the lines of 'utilities should not
> modify a postgresql.auto.conf that's in place under a running PostgreSQL
> cluster'.
>

Do we need to differ between "external" and "internal" utilities here?



> > - there is no guarantee that items in postgresql.auto.conf will be in a
> particular order
> > - it  must never be manually edited (though it may be viewed)
>
> 'must' is perhaps a bit strong...  I would say something like
> "shouldn't", as users may *have* to modify it, if PostgreSQL won't
> start due to some configuration in it.
>


+1.


> - postgresql.auto.conf may be appended to by utilities which need to
> write configuration
> >   items and which and need a guarantee that the items will be read by
> the server at startup
> >   (but only when the server is down of course)
>
> Well, I wouldn't support saying "append" since that's what got us into
> this mess. :)
>
> > - any duplicate items will be removed when ALTER SYSTEM is executed to
> change or reset
> >   an item (a WARNING will be emitted about duplicate items removed)
> > - comment lines (apart from the warning at the top of the file) will be
> silently removed
> >   (this is currently the case anyway)
>
> I'd rather say that 'any duplicate items should be removed, and a
> WARNING emitted when detected', or something along those lines.  Same
> for comment lines...
>

I think it's perfectly fine to silently drop comments (other than the one
at the very top which says don't touch this file).

//Magnus


Re: pg_upgrade: Improve invalid option handling

2019-06-18 Thread Daniel Gustafsson
> On 18 Jun 2019, at 10:15, Michael Paquier  wrote:
> 
> On Fri, Jun 14, 2019 at 12:34:36PM +0200, Daniel Gustafsson wrote:
>> +if ((log_opts.internal = fopen_priv(INTERNAL_LOG_FILE, "a")) == NULL)
>> +pg_fatal("could not write to log file \"%s\"\n", 
>> INTERNAL_LOG_FILE);
>> 
>> While we’re at it, should we change this to “could not open log file” to make
>> the messaging more consistent across the utilities (pg_basebackup and psql 
>> both
>> use “could not open”)?
> 
> I would suggest "could not open file \"%s\": %s" instead with a proper
> strerror().

Correct, that matches how pg_basebackup and psql do it.
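
For the record, with that wording the check would read roughly like this
(a fragment for illustration, reusing the pg_upgrade names quoted above):

if ((log_opts.internal = fopen_priv(INTERNAL_LOG_FILE, "a")) == NULL)
    pg_fatal("could not open file \"%s\": %s\n",
             INTERNAL_LOG_FILE, strerror(errno));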

cheers ./daniel



Re: pg_upgrade: Improve invalid option handling

2019-06-18 Thread Michael Paquier
On Fri, Jun 14, 2019 at 12:34:36PM +0200, Daniel Gustafsson wrote:
> + if ((log_opts.internal = fopen_priv(INTERNAL_LOG_FILE, "a")) == NULL)
> + pg_fatal("could not write to log file \"%s\"\n", 
> INTERNAL_LOG_FILE);
> 
> While we’re at it, should we change this to “could not open log file” to make
> the messaging more consistent across the utilities (pg_basebackup and psql 
> both
> use “could not open”)?

I would suggest "could not open file \"%s\": %s" instead with a proper
strerror().
--
Michael


signature.asc
Description: PGP signature


Re: Generating partitioning tuple conversion maps faster

2019-06-18 Thread Michael Paquier
On Mon, Jun 17, 2019 at 12:03:03PM +0200, didier wrote:
> cherry-pick apply cleanly and with a 100 columns table it improves
> performance nicely (20%).

42f70cd is a performance improvement, and not a bug fix.
--
Michael


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-06-18 Thread shawn wang
On Mon, Jun 17, 2019 at 8:30 PM, Masahiko Sawada  wrote:

> On Fri, Jun 14, 2019 at 7:41 AM Tomas Vondra
>  wrote:
> > I personally find the idea of encrypting tablespaces rather strange.
> > Tablespaces are meant to define hysical location for objects, but this
> > would also use them to "mark" objects as encrypted or not. That just
> > seems misguided and would make the life harder for many users.
> >
> > For example, what if I don't have any tablespaces (except for the
> > default one), but I want to encrypt only some objects? Suddenly I have
> > to create a tablespace, which will however cause various difficulties
> > down the road (during pg_basebackup, etc.).
>
> I guess that we can have an encrypted tabelspace by default (e.g.
> pg_default_enc). Or we encrypt per tables while having encryption keys
> per tablespaces.
>

Hi Sawada-san,
I do agree with it.

>
> On Mon, Jun 17, 2019 at 6:54 AM Tomas Vondra
>  wrote:
> >
> > On Sun, Jun 16, 2019 at 02:10:23PM -0400, Stephen Frost wrote:
> > >Greetings,
> > >
> > >* Joe Conway (m...@joeconway.com) wrote:
> > >> On 6/16/19 9:45 AM, Bruce Momjian wrote:
> > >> > On Sun, Jun 16, 2019 at 07:07:20AM -0400, Joe Conway wrote:
> > >> >> In any case it doesn't address my first point, which is limiting
> the
> > >> >> volume encrypted with the same key. Another valid reason is you
> might
> > >> >> have data at varying sensitivity levels and prefer different keys
> be
> > >> >> used for each level.
> > >> >
> > >> > That seems quite complex.
> > >>
> > >> How? It is no more complex than encrypting at the tablespace level
> > >> already gives you - in that case you get this property for free if you
> > >> care to use it.
> > >
> > >Perhaps not surprising, but I'm definitely in agreement with Joe
> > >regarding having multiple keys when possible and (reasonably)
> > >straight-forward to do so.  I also don't buy off on the OpenSSL
> > >argument; their more severe issues certainly haven't been due to key
> > >management issues such as what we're discussing here, so I don't think
> > >the argument applies.
> > >
> >
> > I'm not sure what exactly is the "OpenSSL argument" you're disagreeing
> > with? IMHO Bruce is quite right that the risk of vulnerabilities grows
> > with the complexity of the system (both due to implementation bugs and
> > general design weaknesses). I don't think it's tied to the key
> > management specifically, except that it's one of the parts that may
> > contribute to the complexity.
> >
> > (It's often claimed that key management is one of the weakest points of
> > current crypto systems - we have safe (a)symmetric algorithms, but safe
> > handling of keys is an issue. I don't have data / papers supporting this
> > claim, I kinda believe it.)
> >
> > Now, I'm not opposed to eventually implementing something more
> > elaborate, but I also think just encrypting the whole cluster (not
> > necessarily with a single key, but with one master key) would be enough
> > for vast majority of users. Plus it's less error prone and easier to
> > operate (backups, replicas, crash recovery, ...).
> >
> > But there's about 0% chance we'll get that in v1, of course, so we need
> > s "minimum viable product" to build on anyway.
> >
>
> I agree that we need minimum viable product first. But I'm not sure
> it's true that the implementing the cluster-wide TDE first could be
> the first step of per-tablespace/table TDE.
>

Yes, we could complete the per-tablespace/table TDE in version 13,
and we could do cluster-wide TDE in the next version.
But I remember you said there are many keys to manage at the table level.
Will we add table-level TDE in the first version?

And I have two questions.
1. Will we add hooks to support replacing the encryption algorithms?
2. Will we implement some encryption algorithms ourselves or use ones from
existing libraries?

Regards,

--
Shawn Wang
HIGHGO SOFTWARE


> The purpose of cluster-wide TDE and table/tablespace TDE are slightly
> different in terms of encryption target objects. The cluster-wide TDE
> would be a good solution for users who want to encrypt everything
> while the table/tabelspace TDE would help more severe use cases in
> terms of both of security and performance.
>
> The cluster-wide TDE eventually encrypts SLRU data and all WAL
> including non-user data related WAL while table/tablespace TDE doesn't
> unless we develop such functionality. In addition, the cluster-wide
> TDE also encrypts system catalogs but in table/tablespace TDE user
> would be able to control that somewhat. That is, if we developed the
> cluster-wide TDE first, when we develop table/tablespace TDE on top of
> that we would need to change TDE so that table/tablespace TDE can
> encrypt even non-user data related data while retaining its simple
> user interface, which would rather make the feature complex, I'm
> concerned. We can support them as different TDE features but I'm not
> sure it's a good choice for users.
>
> From perspective of  cryptographic, I think the fine grained 

Re: PG 12 beta 1 segfault during analyze

2019-06-18 Thread Andres Freund
Hi,

On 2019-06-18 00:32:17 -0400, Tom Lane wrote:
> Steve Singer  writes:
> > The attached patch fixes the issue.
>
> Hmm, that's a pretty obvious mistake :-( but after some fooling around
> I've not been able to cause a crash with it.  I wonder what test case
> you were using, on what platform?

I suspect that's because the bug is "only" in the
HEAPTUPLE_DELETE_IN_PROGRESS case. And it's "harmless" as far as
crashing goes in the !TransactionIdIsCurrentTransactionId() case,
because as the tuple is sampled, we just return. And then there still
needs to be an actually dead row afterwards, to actually trigger
dereferencing the modified deadrows. And then acquire_sample_rows()'s
deadrows actually needs to point to something that causes crashes when
modified.

I can definitely get it to do a "wild" pointer write:

Breakpoint 2, heapam_scan_analyze_next_tuple (scan=0x55f8fcb92728, 
OldestXmin=512, liverows=0x7fff56159850,
deadrows=0x7fff56159f50, slot=0x55f8fcb92b40) at 
/home/andres/src/postgresql/src/backend/access/heap/heapam_handler.c:1061
1061*deadrows += 1;
(gdb) p deadrows
$9 = (double *) 0x7fff56159f50
(gdb) up
#1  0x55f8fad922c5 in table_scan_analyze_next_tuple (scan=0x55f8fcb92728, 
OldestXmin=512, liverows=0x7fff56159850,
deadrows=0x7fff56159848, slot=0x55f8fcb92b40) at 
/home/andres/src/postgresql/src/include/access/tableam.h:1467
1467return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, 
OldestXmin,
(gdb) p deadrows
$10 = (double *) 0x7fff56159848

making a question of a crash just a question of the exact stack layout
and the number of deleted tuples.

Will fix tomorrow morning.

Greetings,

Andres Freund




Re: Hash join explain is broken

2019-06-18 Thread Andres Freund
Hi,

On 2019-06-13 16:23:34 -0700, Andres Freund wrote:
> On June 13, 2019 3:38:47 PM PDT, Tom Lane  wrote:
> >Andres Freund  writes:
> >> I am too tired to look further into this. I suspect the only reason
> >we
> >> didn't previously run into trouble with the executor stashing
> >hashkeys
> >> manually at a different tree level with:
> >> ((HashState *) innerPlanState(hjstate))->hashkeys
> >> is that hashkeys itself isn't printed...
> >
> >TBH, I think 5f32b29c is just wrong and should be reverted for now.
> >If there's a need to handle those expressions differently, it will
> >require some cooperation from the planner not merely a two-line hack
> >in executor startup.  That commit didn't include any test case or
> >other demonstration that it was solving a live problem, so I think
> >we can leave it for v13 to address the issue.
> 
> I'm pretty sure you'd get an assertion failure if you reverted it
> (that's why it was added). So it's a bit more complicated than that.
> Unfortunately I'll not get back to work until Monday, but I'll spend
> time on this then.

Indeed, there are assertion failures when initializing the expression
with HashJoinState as parent - that's because when computing the
hashvalue for nodeHash input, we expect the slot from the node below to
be of the type that HashState returns (as that's what INNER_VAR for an
expression at the HashJoin level refers to), rather than the type of the
input to HashState.  We could work around that by marking the slots from
underlying nodes as being of an unknown type, but that'd slow down
execution.

I briefly played with the dirty hack of set_deparse_planstate()
setting dpns->inner_planstate = ps for IsA(ps, HashState), but that
seems just too ugly.

I think the most straight-forward fix might just be to just properly
split the expression at plan time. Adding workarounds for things as
dirty as building an expression for a subsidiary node in the parent, and
then modifying the subsidiary node from the parent, doesn't seem like a
better way forward.

The attached *prototype* does so.

If we go that way, we probably need to:
- Add a test for the failure case at hand
- check a few of the comments around inner/outer in nodeHash.c
- consider moving the setrefs.c code into its own function?
- probably clean up the naming scheme in createplan.c

I think there's a few more things we could do, although it's not clear
that that needs to happen in v12:
- Consider not extracting hj_OuterHashKeys, hj_HashOperators,
  hj_Collations out of HashJoin->hashclauses, and instead just directly
  handing them individually in the planner.  create_mergejoin_plan()
  already partially does that.

Greetings,

Andres Freund

PS: If I were to write hashjoin today, it sure wouldn't be as two nodes
- it seems pretty clear that the boundaries are just too fuzzy. To the
point that I wonder if it'd not be worth merging them at some point.
>From ba351470f7d6837a4a86c24ce87ee33919314a77 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 17 Jun 2019 23:59:38 -0700
Subject: [PATCH v1] wip-fix-hash-key-computation

---
 src/backend/executor/nodeHash.c | 19 +++
 src/backend/executor/nodeHashjoin.c | 18 +++---
 src/backend/nodes/copyfuncs.c   |  1 +
 src/backend/nodes/outfuncs.c|  1 +
 src/backend/nodes/readfuncs.c   |  1 +
 src/backend/optimizer/plan/createplan.c | 13 +
 src/backend/optimizer/plan/setrefs.c| 16 
 src/include/nodes/execnodes.h   |  3 ---
 src/include/nodes/plannodes.h   |  1 +
 9 files changed, 47 insertions(+), 26 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index d16120b9c48..181f45a5b1c 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -157,7 +157,8 @@ MultiExecPrivateHash(HashState *node)
 	econtext = node->ps.ps_ExprContext;
 
 	/*
-	 * get all inner tuples and insert into the hash table (or temp files)
+	 * Get all tuples from subsidiary node and insert into the hash table (or
+	 * temp files).
 	 */
 	for (;;)
 	{
@@ -165,7 +166,7 @@ MultiExecPrivateHash(HashState *node)
 		if (TupIsNull(slot))
 			break;
 		/* We have to compute the hash value */
-		econtext->ecxt_innertuple = slot;
+		econtext->ecxt_outertuple = slot;
 		if (ExecHashGetHashValue(hashtable, econtext, hashkeys,
  false, hashtable->keepNulls,
  ))
@@ -281,7 +282,7 @@ MultiExecParallelHash(HashState *node)
 slot = ExecProcNode(outerNode);
 if (TupIsNull(slot))
 	break;
-econtext->ecxt_innertuple = slot;
+econtext->ecxt_outertuple = slot;
 if (ExecHashGetHashValue(hashtable, econtext, hashkeys,
 		 false, hashtable->keepNulls,
 		 ))
@@ -388,8 +389,8 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
 	/*
 	 * initialize child expressions
 	 */
-	hashstate->ps.qual =
-		ExecInitQual(node->plan.qual, (PlanState *) hashstate);
+