Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-01-24 Thread Michael Paquier
On Thu, Jan 25, 2018 at 06:14:41AM +, Tsunakawa, Takayuki wrote:
> I don't know why pg_temp_3.fetchchunks still exists.  Maybe the user
> ran pg_ctl stop -mi while pg_rewind was running.

Likely that was the case :(

As a superuser, DROP TABLE should work on the temporary schema of
another session. Have you tried that to solve the situation?
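
For example, as a superuser connected to the postgres database (using the
schema and table name reported in this thread), something like:

    DROP TABLE pg_temp_3.fetchchunks;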

> * I think temporary tables should not require vacuuming for XID
> wraparound.  Furtherover, should updates/deletes to temporary tables
> be in-place instead of creating garbage, so that any form of vacuum is
> unnecessary?  Other sessions do not need to read temporary tables.

Yeah, there are many possible improvements in this area. Temp tables
also generate WAL...

> * In this incident, autovacuum worker misjudged that
> pg_temp_3.fetchchunks can't be deleted, although the creator
> (pg_rewind) is no longer active.  How can we delete orphan temporary
> tables safely? 

As long as Postgres sees that its temporary schema is in use, it will
think that the table is not orphaned. Another possibility would be to
have the session now holding this schema space reuse fetchchunks so
that things get reset.
--
Michael




Re: [HACKERS] [PATCH] Lockable views

2018-01-24 Thread Thomas Munro
On Sun, Dec 31, 2017 at 11:57 PM, Yugo Nagata  wrote:
> On Fri, 29 Dec 2017 23:39:39 +0900 (JST)
> Tatsuo Ishii  wrote:
>> Your addition to the doc:
>> +   Automatically updatable views (see )
>> +   that do not have INSTEAD OF triggers or INSTEAD
>> +   rules are also lockable. When a view is locked, its base relations are
>> +   also locked recursively with the same lock mode.
>
> I added this point to the documentation.

+   Automatically updatable views (see )
+   that do not have INSTEAD OF triggers or INSTEAD

The documentation doesn't build: you now need to say
instead of , and you need to say .

-- 
Thomas Munro
http://www.enterprisedb.com



Re: ALTER TABLE ADD COLUMN fast default

2018-01-24 Thread Thomas Munro
On Wed, Jan 17, 2018 at 2:21 AM, Andrew Dunstan
 wrote:
> Yeah, got caught by the bki/pg_attribute changes on Friday. here's an
> updated version. Thanks for looking.

A boring semi-automated update: this no longer compiles, because
8561e4840c8 added a new call to heap_attisnull().  Pretty sure it just
needs NULL in the third argument.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [PATCH] Logical decoding of TRUNCATE

2018-01-24 Thread Thomas Munro
On Wed, Jan 24, 2018 at 6:47 AM, Marco Nenciarini
 wrote:
> Version 16 attached.

Hi Marco,

FYI this version doesn't compile:

worker.c: In function ‘apply_handle_truncate’:
worker.c:888:39: error: ‘cascade’ undeclared (first use in this function)
  relid = logicalrep_read_truncate(s, &cascade, &restart_seqs);
   ^

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-01-24 Thread Thomas Munro
On Thu, Jan 18, 2018 at 9:17 AM, Claudio Freire  wrote:
> Huh. That was simpler than I thought.
>
> Attached rebased versions.

Hi Claudio,

FYI the regression test seems to have some run-to-run variation.
Though it usually succeeds, recently I have seen a couple of failures
like this:

= Contents of ./src/test/regress/regression.diffs
*** 
/home/travis/build/postgresql-cfbot/postgresql/src/test/regress/expected/vacuum.out
2018-01-24 01:41:28.200454371 +
--- 
/home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/vacuum.out
2018-01-24 01:51:07.970049937 +
***
*** 128,134 
  SELECT pg_relation_size('vactst', 'main');
   pg_relation_size
  --
! 0
  (1 row)

  SELECT count(*) FROM vactst;
--- 128,134 
  SELECT pg_relation_size('vactst', 'main');
   pg_relation_size
  --
!  8192
  (1 row)

  SELECT count(*) FROM vactst;
==

-- 
Thomas Munro
http://www.enterprisedb.com



Re: JIT compiling with LLVM v9.0

2018-01-24 Thread Andres Freund
Hi!

On 2018-01-24 22:51:36 -0800, Jeff Davis wrote:
> A couple high-level questions:
> 
> 1. I notice a lot of use of the LLVM builder, for example, in
> slot_compile_deform(). Why can't you do the same thing you did with
> function code, where you create the ".bc" at build time from plain C
> code, and then load it at runtime?

Not entirely sure what you mean. You mean why I don't inline
slot_getsomeattrs() etc and instead generate code manually?  The reason
is that the generated code is a *lot* smarter due to knowing the
specific tupledesc.


> 2. I'm glad you considered extensions. How far can we go with this in
> the future?

> Can we have bitcode-only extensions that don't need a .so
> file?

Hm. I don't see a big problem introducing this. There'd be some
complexity in how to manage the lifetime of JITed functions generated
that way, but that should be solvable.


> Can we store the bitcode in pg_proc, simplifying deployment and
> allowing extensions to travel over replication?

Yes, we could. You'd need to be a bit careful that all the machines have
similar-ish cpu generations or compile with defensive settings, but that
seems okay.

Greetings,

Andres Freund



Re: JIT compiling with LLVM v9.0

2018-01-24 Thread Jeff Davis
On Tue, Jan 23, 2018 at 11:20 PM, Andres Freund  wrote:
> Hi,
>
> I've spent the last weeks working on my LLVM compilation patchset. In
> the course of that I *heavily* revised it. While still a good bit away
> from committable, it's IMO definitely not a prototype anymore.

Great!

A couple high-level questions:

1. I notice a lot of use of the LLVM builder, for example, in
slot_compile_deform(). Why can't you do the same thing you did with
function code, where you create the ".bc" at build time from plain C
code, and then load it at runtime?
2. I'm glad you considered extensions. How far can we go with this in
the future? Can we have bitcode-only extensions that don't need a .so
file? Can we store the bitcode in pg_proc, simplifying deployment and
allowing extensions to travel over replication? I am not asking for
this now, of course, but I'd like to get the idea out there so we
leave room.

Regards,
 Jeff Davis



Re: JIT compiling with LLVM v9.0

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 22:33:30 -0800, Jeff Davis wrote:
> On Wed, Jan 24, 2018 at 1:35 PM, Pierre Ducroquet  wrote:
> > In LLVM 5.0, it looks like DebugInfo.h is not available in llvm-c, only as
> > a C++ API in llvm/IR/DebugInfo.h.
> 
> The LLVM APIs don't seem to be very stable; won't there just be a
> continuous stream of similar issues?

There'll be some of that, yes. But the entire difference between 5 and
what will be 6 was not including one header, and not calling one unneeded
function. That doesn't seem like a crazy amount of adaptation that needs
to be done.  From a quick look at porting to 4, it'll be a bit more
effort, but not much.

The reason I'm using the C-API where possible is that it's largely
forward compatible (i.e. new features are added, but things are seldom
removed). The C++ code changes a bit more, but it's not that much code
we're interfacing with either.

I think we'll have to make do with a number of ifdefs - I don't really
see an alternative. Unless you've a better idea?


Greetings,

Andres Freund



Re: JIT compiling with LLVM v9.0

2018-01-24 Thread Jeff Davis
On Wed, Jan 24, 2018 at 1:35 PM, Pierre Ducroquet  wrote:
> In LLVM 5.0, it looks like DebugInfo.h is not available in llvm-c, only as a
> C++ API in llvm/IR/DebugInfo.h.

The LLVM APIs don't seem to be very stable; won't there just be a
continuous stream of similar issues?

Pinning major postgresql versions to specific LLVM versions doesn't
seem very appealing. Even if you aren't interested in the latest
changes in LLVM, trying to get the right version on your machine will
be annoying.

Regards,
Jeff Davis



Re: Rangejoin rebased

2018-01-24 Thread Jeff Davis
On Tue, Jan 23, 2018 at 2:04 AM, Simon Riggs  wrote:
> Perhaps we are misunderstanding each other
>
> TIMESTAMP <@ RANGE1 doesn't match if RANGE1 is empty
> and that is the most important case

When <@ is supported, that case should be fine if range1 is on the
outer side. The case I was concerned about is an R1 <@ R2 join where R1
is on the inner side and could have empty ranges.

One option would be to try to force R2 to be on the inner. But that
doesn't quite solve other related issues, like if R2 has a few large
ranges that contain almost everything.

Right now I'm considering an approach where we use some counters to
determine that a few ranges are preventing us from moving the mark
forward, and then move those few ranges into a separate tuplestore so
that we can move the mark forward.

> RANGE OP RANGE is important also. It would be useful for OP to be more
> than just &&

I agree that contains/contained-by are useful; do you have other
operators in mind as well?

>
> It's certainly weird that R1 @> EMPTY is true, but R1 && EMPTY is not.

This was discussed back in 9.2, and there were no obviously better
semantics available. I chose to follow set semantics: X contains Y
means that Y is a subset of X; X overlaps Y means that the
intersection of X and Y is nonempty.

I understand it can be surprising, but other definitions can be surprising, too.
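
To illustrate those set semantics with concrete (made-up) values:

    SELECT int4range(1, 10) @> 'empty'::int4range;   -- true: the empty range is a subset
    SELECT int4range(1, 10) && 'empty'::int4range;   -- false: the intersection is empty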

Regards,
 Jeff Davis



Temporary tables prevent autovacuum, leading to XID wraparound

2018-01-24 Thread Tsunakawa, Takayuki
Hello,

I've found a problem where an orphaned temporary table can cause XID
wraparound.  Our customer encountered this problem with PG 9.5.2, but I think
it can also happen with the latest PG.

I'm willing to fix this, but I'd like to ask you what approach we should take.


PROBLEM


The customer has a database for application data, which I call user_db here.
They don't store application data in the postgres database.

No tables in user_db were autovacuumed for more than a month, leading to user
table bloat.  Those tables are eligible for autovacuum according to
pg_stat_all_tables and the autovacuum settings.

age(datfrozenxid) of both the user_db and postgres databases is greater than
autovacuum_freeze_max_age, so they are eligible for autovacuuming for XID
wraparound.

There are user tables in user_db whose age(relfrozenxid) is greater than 
autovacuum_freeze_max_age, so those tables should get autovacuum treatment.
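
For reference, the checks described above can be reproduced with queries along
these lines (illustrative only; the exact column choices are mine):

    SHOW autovacuum_freeze_max_age;
    SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;
    SELECT oid::regclass, age(relfrozenxid) FROM pg_class
     WHERE relkind = 'r' ORDER BY 2 DESC LIMIT 10;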


CAUSE


The postgres database has a table named pg_temp_3.fetchchunks, whose
age(relfrozenxid) is greater than autovacuum_freeze_max_age.  This temporary
table is the culprit.  pg_temp_3.fetchchunks is created by pg_rewind, and the
customer says they ran pg_rewind.

The autovacuum launcher always chooses postgres, because do_start_worker()
scans pg_database and finds that the postgres database needs vacuuming for XID
wraparound.  user_db is never chosen for vacuuming, although it also needs
vacuuming for XID wraparound.

The autovacuum worker doesn't delete pg_temp_3.fetchchunks, because backend ID
3 is in use by some application, so the autovacuum worker thinks that the
backend is active and the temporary table cannot be dropped.

I don't know why pg_temp_3.fetchchunks still exists.  Maybe the user ran
pg_ctl stop -mi while pg_rewind was running.


FIX


I have the following questions.  Along which line should I proceed to fix the 
problem?

* Why does the autovacuum launcher always choose only one database when that
database needs vacuuming for XID wraparound?  Shouldn't it also choose other
databases?

* I think temporary tables should not require vacuuming for XID wraparound.
Furthermore, should updates/deletes to temporary tables be done in place
instead of creating garbage, so that any form of vacuum is unnecessary?  Other
sessions do not need to read temporary tables.

* In this incident, the autovacuum worker misjudged that pg_temp_3.fetchchunks
can't be deleted, although its creator (pg_rewind) is no longer active.  How
can we delete orphaned temporary tables safely?
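
For reference, leftover temporary tables and their XID age can be spotted with
a query along these lines (illustrative only, not part of the proposed fix):

    SELECT n.nspname, c.relname, age(c.relfrozenxid)
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE c.relpersistence = 't';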


Regards
Takayuki Tsunakawa






Re: PATCH: Exclude unlogged tables from base backups

2018-01-24 Thread Masahiko Sawada
On Thu, Jan 25, 2018 at 3:25 AM, David Steele  wrote:
> Hi Masahiko,
>
> Thanks for the review!
>
> On 1/22/18 3:14 AM, Masahiko Sawada wrote:
>> On Thu, Dec 14, 2017 at 11:58 PM, Robert Haas  wrote:
>>>
>>> We would also have a problem if the missing file caused something in
>>> recovery to croak on the grounds that the file was expected to be
>>> there, but I don't think anything works that way; I think we just
>>> assume missing files are an expected failure mode and silently do
>>> nothing if asked to remove them.
>>
>> I also couldn't see a problem in this approach.
>>
>> Here is the first review comments.
>>
>> +   unloggedDelim = strrchr(path, '/');
>>
>> I think it doesn't work fine on windows. How about using
>> last_dir_separator() instead?
>
> I think this function is OK on Windows -- we use it quite a bit.
> However, last_dir_separator() is clearer so I have changed it.

Thank you for updating this. I was concerned that the separator
character '/' might not work on Windows.

>
>> 
>> + * Find all unlogged relations in the specified directory and return
>> their OIDs.
>>
>> What the ResetUnloggedrelationsHash() actually returns is a hash
>> table. The comment of this function seems not appropriate.
>
> Fixed.
>
>> +   /* Part of path that contains the parent directory. */
>> +   int parentPathLen = unloggedDelim - path;
>> +
>> +   /*
>> +* Build the unlogged relation hash if the parent path is 
>> either
>> +* $PGDATA/base or a tablespace version path.
>> +*/
>> +   if (strncmp(path, "./base", parentPathLen) == 0 ||
>> +   (parentPathLen >=
>> (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) &&
>> +strncmp(unloggedDelim -
>> (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
>> +TABLESPACE_VERSION_DIRECTORY,
>> +
>> sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
>> +   unloggedHash = ResetUnloggedRelationsHash(path);
>> +   }
>>
>> How about using get_parent_directory() to get parent directory name?
>
> get_parent_directory() munges the string that is passed to it, which I
> was trying to avoid (we'd need a copy) - and I don't think it makes the
> rest of the logic any simpler without constructing yet another string to
> hold the tablespace path.

Agreed.

>
> I know performance isn't the most important thing here, so if the
> argument is for clarity perhaps it makes sense. Otherwise I don't know
> if it's worth it.
>
>> Also, I think it's better to destroy the unloggedHash after use.
>
> Whoops! Fixed.
>
>> +   /* Exclude all forks for unlogged tables except the
>> init fork. */
>> +   if (unloggedHash && ResetUnloggedRelationsMatch(
>> +   unloggedHash, de->d_name) == unloggedOther)
>> +   {
>> +   elog(DEBUG2, "unlogged relation file \"%s\"
>> excluded from backup",
>> +de->d_name);
>> +   continue;
>> +   }
>>
>> I think it's better to log this debug message at DEBUG2 level for
>> consistency with other messages.
>
> I think you mean DEBUG1?  It's already at DEBUG2.

Oops, yes I meant DEBUG1.

>
> I considered using DEBUG1 but decided against it.  The other exclusions
> will produce a limited amount of output because there are only a few of
> them.  In the case of unlogged tables there could be any number of
> exclusions and I thought that was too noisy for DEBUG1.

IMO it's okay to output many unlogged tables for debugging purposes, but I
see your point.

>
>> +   ok(!-f "$tempdir/tbackup/tblspc1/$tblspc1UnloggedBackupPath",
>> +   'unlogged imain fork not in tablespace backup');
>>
>> s/imain/main/
>
> Fixed.
>
>> If a new unlogged relation is created after the unloggedHash is
>> constructed but before the files are sent, we cannot exclude that
>> relation. It would not be a problem if taking the backup is quick,
>> because the new unlogged relation is unlikely to become very large.
>> However, if taking a backup takes a long time, we could include a large
>> main fork in the backup.
>
> This is a good point.  It's per database directory which makes it a
> little better, but maybe not by much.
>
> Three options here:
>
> 1) Leave it as is knowing that unlogged relations created during the
> backup may be copied and document it that way.
>
> 2) Construct a list for SendDir() to work against so the gap between
> creating that and creating the unlogged hash is as small as possible.
> The downside here is that the list may be very large and take up a lot
> of memory.
>
> 3) Check each file that looks like a relation in the loop to see if it
> has an init fork.  This might affect performance since an
> opendir/readdir loop would be required for every relation.
>
> Personally, I'm in favor of #1, at least for the time being.  I've
> updated the docs as indic

Re: Further cleanup of pg_dump/pg_restore item selection code

2018-01-24 Thread David G. Johnston
On Wednesday, January 24, 2018, Tom Lane  wrote:
>
> I think you might be missing one of the main points here, which is
> that --create is specified as causing both a CREATE DATABASE and a
> reconnect to that database.
>

I missed the implication, though I read and even thought about questioning
that aspect (in pg_dump though, not pg_restore).

Should pg_restore fail if asked to --create without a database entry in the
TOC?

David J.


Re: Further cleanup of pg_dump/pg_restore item selection code

2018-01-24 Thread David G. Johnston
On Wednesday, January 24, 2018, Tom Lane  wrote:
>
>
> This does not go all the way towards making pg_restore's item selection
> switches dump subsidiary objects the same as pg_dump would, because
> pg_restore doesn't really have enough info to deal with indexes and
> table constraints the way pg_dump does.  Perhaps we could intuit the
> parent table by examining object dependencies, but that would be a
> bunch of new logic that seems like fit material for a different patch.
> In the meantime I just added a disclaimer to pg_restore.sgml explaining
> this.
>

Unless we really wanted to keep prior-version compatibility, it would seem
more reliable to have pg_dump generate a new TOC entry type describing
these dependencies and have pg_restore read and interpret them during
selective restore mode.  Basically a cross-reference saying "if you restore A,
also restore B", where A can only be tables and B indexes and constraints
(if we can get away with it being that limited initially).

David J.


Re: CREATE ROUTINE MAPPING

2018-01-24 Thread David Fetter
On Thu, Jan 18, 2018 at 04:09:13PM -0500, Corey Huinker wrote:
> >
> >
> > >
> > > But other situations seem un-handle-able to me:
> > >
> > > SELECT remote_func1(l.x) FROM local_table l WHERE l.active = true;
> >
> > Do we have any way, or any plan to make a way, to push the set (SELECT
> > x FROM local_table WHERE active = true) to the remote side for
> > execution there?  Obviously, there are foreign DBs that couldn't
> > support this, but I'm guessing they wouldn't have much by way of UDFs
> > either.
> >
> 
> No. The remote query has to be generated at planning time, so it can't make
> predicates out of anything that can't be resolved into constants by the
> planner itself. The complexities of doing so would be excessive; far better
> to let the application developer split the queries up, because they know
> better which parts have to resolve first.

So Corey and I, with lots of inputs from Andrew Gierth and Matheus
Oliveira, have come up with a sketch of how to do this, to wit:

- Extend CREATE FUNCTION to take either FOREIGN and SERVER or AS and
  LANGUAGE as parameters, but not both. This seems simpler, at least
  in a proof of concept, than creating SQL standard compliant grammar
  out of whole cloth.  The SQL standard grammar could be layered in
  later via the rewriter if this turns out to work.

- In pg_proc, store foreign functions as having a new language,
  sql_med, which doesn't actually exist.  This "language" would
  function as a hint to the planner.

- Add a new system catalog for foreign functions that references
  pg_proc and pg_foreign_server. Writing to it would also do the usual
  stuff with pg_depend.

- During planning, at least to start, we'd ensure that foreign
  functions can only take arguments on the same server.

- Once it's established that the combinations could actually work,
  execution gets pushed to the foreign server(s)

What say?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Robert Haas
On Wed, Jan 24, 2018 at 11:02 PM, Tom Lane  wrote:
> I may be wasting my breath here, but in one more attempt to convince
> you that "time make check" on your laptop is not the only number that
> anyone should be interested in, ...

Now that is not what I said, or at least not what I intended to say.
I'm taking the position that what happens on a developer laptop is
more relevant than what happens on an ancient buildfarm critter.  I am
NOT taking the position that my particular laptop or even developer
laptops generally are the *only* thing that matters.  I gave the
numbers from my laptop because it's the one I can test.  I cannot
easily test yours.

> What I take away here is that there's been a pretty steep cost increase
> for the regression tests since v10, and that is *not* in line with the
> historical average.  In fact, in most years we've bought enough speedup
> through performance improvements to pay for the test cases we added.
> This is masked if you just eyeball "make check" compared to several years
> ago.  But to do that, you have to ignore the fact that we made substantial
> improvements in the runtime of initdb as well as the regression tests
> proper circa v10, and we've now thrown that away and more.

OK, I can see some increase there.  It's definitely more for you than
it is for me, since you see something like a 10% slowdown between 10
and master and I see basically no difference.  I don't know why that
should be, but I'm not doubting you.

> So I remain dissatisfied with these results, particularly because in
> my own work habits, the time for "make installcheck-parallel" is way
> more interesting than "make check".  I avoid redoing installs and
> initdbs if I don't need them.

I'm not that efficient, but noted.

>> ... Even if that meant that you had
>> to wait 1 extra second every time you run 'make check', I would judge
>> that worthwhile.
>
> I think this is a bad way of looking at it.  Sure, in terms of
> one developer doing one test run, a second or two means nothing.
> But for instance, if you want to do 100 test runs in hope of catching
> a seldom-reproduced bug, it adds up.  It also adds up when you consider
> the aggregate effort expended by the buildfarm, or the time you have
> to wait to see buildfarm results.

Sure, but as Andres said, you also have to consider how much developer
time it takes to recoup the savings.  If it takes a day of development
time to save a second of regression test time, that might be worth it;
if it takes a month, color me doubtful, especially if the result is a
more fragile test that happens to pass on all of the buildfarm
critters we have now but might fail spuriously when somebody adds a
new one.

> Yeah.  What I thought this argument was about was convincing *you*
> that that would be a reasonable patch to apply.  It seems from my
> experiment on gaur that that patch makes the results unstable, so
> if we can do it at all it will need more work.  But I do think
> it's worth putting in some more sweat here.  In the long run the
> time savings will add up.

Why me?  Thomas wrote the patch, Andres committed it, and you
complained about it.  I'm just along for the ride.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Further cleanup of pg_dump/pg_restore item selection code

2018-01-24 Thread Tom Lane
"David G. Johnston"  writes:
> On Wednesday, January 24, 2018, Tom Lane  wrote:
>> The same behaviors occur if you do e.g.
>>  pg_dump -Fc -t sometable somedb | pg_restore --create
>> which is another undocumented misbehavior: the docs for pg_restore do not
>> suggest that --create might fail if the source archive was selective.

> pg_restore -t:

> "When -t is specified, pg_restore makes no attempt to restore any other
> database objects that the selected table(s) might depend upon. Therefore,
> there is no guarantee that a specific-table restore into a clean database
> will succeed."

I think you might be missing one of the main points here, which is
that --create is specified as causing both a CREATE DATABASE and a
reconnect to that database.  If it silently becomes a no-op, which
is what happens today, the restore is likely to go into the wrong
database entirely (or at least not the DB the user expected).

I won't deny the possibility that some people are depending on the
existing wrong behavior, but that's a hazard with any bug fix.

regards, tom lane



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Tom Lane
Robert Haas  writes:
> There is no need to collect years of data in order to tell whether or
> not the time to run the tests has increased by as much on developer
> machines as it has on prairiedog.  You showed the time going from 3:36
> to 8:09 between 2014 and the present.  That is a 2.26x increase.  It
> is obvious from the numbers I posted before that no such increase has
> taken place in the time it takes to run 'make check' on my relatively
> modern laptop.  Whatever difference exists is measured in
> milliseconds.

I may be wasting my breath here, but in one more attempt to convince
you that "time make check" on your laptop is not the only number that
anyone should be interested in, here are some timings off my development
workstation.  These are timings off current tip of each release branch,
all the same build options etc, so very comparable:

9.4
[ don't remember the equivalent to top-level make temp-install here ]
top-level make check 24.725s
installcheck-parallel 15.383s
installcheck 27.560s

9.5
make temp-install 3.702s
initdb 2.328s
top-level make check 24.709s
installcheck-parallel 16.632s
installcheck 32.427s

9.6
make temp-install 3.971s
initdb 2.178s
top-level make check 24.048s
installcheck-parallel 15.889s
installcheck 32.775s

10
make temp-install 4.051s
initdb 1.363s
top-level make check 21.784s
installcheck-parallel 15.209s
installcheck 31.938s

HEAD
make temp-install 4.048s
initdb 1.361s
top-level make check 24.027s
installcheck-parallel 16.914s
installcheck 35.745s

I copied-and-pasted the "real time" results of time(1) for each of these,
not bothering to round them off; but the numbers are only reproducible
to half a second or so, so there's no significance in the last couple
digits.  Most numbers above are the minimum of 2 or more runs.

What I take away here is that there's been a pretty steep cost increase
for the regression tests since v10, and that is *not* in line with the
historical average.  In fact, in most years we've bought enough speedup
through performance improvements to pay for the test cases we added.
This is masked if you just eyeball "make check" compared to several years
ago.  But to do that, you have to ignore the fact that we made substantial
improvements in the runtime of initdb as well as the regression tests
proper circa v10, and we've now thrown that away and more.

So I remain dissatisfied with these results, particularly because in
my own work habits, the time for "make installcheck-parallel" is way
more interesting than "make check".  I avoid redoing installs and
initdbs if I don't need them.

> ... Even if that meant that you had
> to wait 1 extra second every time you run 'make check', I would judge
> that worthwhile.

I think this is a bad way of looking at it.  Sure, in terms of
one developer doing one test run, a second or two means nothing.
But for instance, if you want to do 100 test runs in hope of catching
a seldom-reproduced bug, it adds up.  It also adds up when you consider
the aggregate effort expended by the buildfarm, or the time you have
to wait to see buildfarm results.

> Another thing you could do is consider applying the patch Thomas
> already posted to reduce the size of the tables involved.

Yeah.  What I thought this argument was about was convincing *you*
that that would be a reasonable patch to apply.  It seems from my
experiment on gaur that that patch makes the results unstable, so
if we can do it at all it will need more work.  But I do think
it's worth putting in some more sweat here.  In the long run the
time savings will add up.

regards, tom lane



Further cleanup of pg_dump/pg_restore item selection code

2018-01-24 Thread David G. Johnston
On Wednesday, January 24, 2018, Tom Lane  wrote:

> and it has a bunch of strange
> behaviors that can really only be characterized as bugs.  An example is
> that
>
> pg_dump --create -t sometable somedb
>
>
pg_dump -t:

"The -n and -N switches have no effect when -t is used, because tables
selected by -t will be dumped regardless of those switches, and non-table
objects will not be dumped."

A database is a non-table object... though the proposed behavior seems
reasonable, it is maybe worth pointing out explicitly, just like the -n
switches are.

(This has probably been asked before, but why 'no effect' instead of 'will
cause the dump to fail before it begins'?)


> The same behaviors occur if you do e.g.
>
> pg_dump -Fc -t sometable somedb | pg_restore --create
>
> which is another undocumented misbehavior: the docs for pg_restore do not
> suggest that --create might fail if the source archive was selective.
>

pg_restore -t:

"When -t is specified, pg_restore makes no attempt to restore any other
database objects that the selected table(s) might depend upon. Therefore,
there is no guarantee that a specific-table restore into a clean database
will succeed."

That -t was specified on the dump instead of the restore could be clarified,
but it applies nonetheless.

Do we document anywhere what is a "database object" and what are "property
objects" that are restored even when -t is specified?


> Attached is a patch that proposes to clean this up.  It arranges
> things so that CREATE DATABASE (and, if --clean, DROP DATABASE)
> is emitted if and only if you said --create, regardless of other
> selectivity switches.


Makes sense; the bit about "regardless of other switches" probably should
find its way into the docs.

Adding a DROP DATABASE where there never was one before is a bit
disconcerting though... even if the switches imply the user should be
expecting one.


>
>  And it fixes selective pg_restore to include
> subsidiary ACLs, comments, and security labels.


+1

David J.


RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-24 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> I think open_datasync will be worse on systems where fsync() is expensive
> -- it forces the data out to disk immediately, even if the data doesn't
> need to be flushed immediately.  That's bad, because we wait immediately
> when we could have deferred the wait until later and maybe gotten the WAL
> writer to do the work in the background.  But it might be better on systems
> where fsync() is basically free, because there you might as well just get
> it out of the way immediately and not leave something left to be done later.
> 
> This is just a guess, of course.  You didn't mention what the underlying
> storage for your test was?

Uh, your guess was correct.  My file system was ext3, where fsync() writes all 
dirty buffers in page cache.

As you said, open_datasync was 20% faster than fdatasync on RHEL 7.2, on an
LVM volume with ext4 (mounted with options noatime, nobarrier) on PCIe flash
memory.

5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync             50829.597 ops/sec      20 usecs/op
        fdatasync                 42094.381 ops/sec      24 usecs/op
        fsync                     42209.972 ops/sec      24 usecs/op
        fsync_writethrough                      n/a
        open_sync                 48669.605 ops/sec      21 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync             26366.373 ops/sec      38 usecs/op
        fdatasync                 33922.725 ops/sec      29 usecs/op
        fsync                     32990.209 ops/sec      30 usecs/op
        fsync_writethrough                      n/a
        open_sync                 24326.249 ops/sec      41 usecs/op

What do you think about changing the default value of wal_sync_method on Linux
in PG 11?  I can understand the concern that users might hit performance
degradation if they are using PostgreSQL on older systems.  But it's also a
shame that many users don't notice the benefits of wal_sync_method =
open_datasync on new systems.
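
For reference, users who want to try this today don't need a new default; a
minimal sketch (wal_sync_method should be reloadable, so no restart needed):

    ALTER SYSTEM SET wal_sync_method = 'open_datasync';
    SELECT pg_reload_conf();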

Regards
Takayuki Tsunakawa




Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> So are we at a consensus yet?

> You had me at "make public less special", I was just trying to make sure
> we all understand what that means.
> +1 from me for moving forward.

Applying this patch will leave us with the original pg_dump misbehavior
removed, AFAICS.  So now we are hard up against the question of whether
--no-comments can support its weight as a pure feature, rather than a
workaround for a misbehavior/bug.

I've been pretty much on the negative side of that question, but
I just posted a patch here:
https://postgr.es/m/32668.1516848...@sss.pgh.pa.us
that will have the effect of causing pg_restore to print ACLs,
comments, and seclabels in cases where it formerly didn't.
If you want to get back to the old behavior in those cases,
we have you covered with --no-acl and --no-security-labels ...
but not so much for comments.

Between that and Robert's point that we have --no-foo for every
other subsidiary object property, so why not comments, I'm prepared
to change my vote.

This isn't a positive review of the contents of the patch,
because I've not read it.  But I'm on board with the goal now.

regards, tom lane



Further cleanup of pg_dump/pg_restore item selection code

2018-01-24 Thread Tom Lane
As I've been hacking on the pg_dump code recently, I got annoyed at the
ugliness of its code for deciding whether or not to emit database-related
TOC entries.  That logic is implemented in a completely different place
from other TOC entry selection decisions, and it has a bunch of strange
behaviors that can really only be characterized as bugs.  An example is
that

pg_dump --create -t sometable somedb

will not emit a CREATE DATABASE command as you might expect, because
the use of -t prevents the DATABASE TOC entry from being made in the
first place.  Also,

pg_dump --clean --create -t sometable somedb

will not emit any DROP commands: the code expects that with the
combination of --clean and --create, it should emit only DROP
DATABASE, but nothing happens because there's no DATABASE entry.

The same behaviors occur if you do e.g.

pg_dump -Fc -t sometable somedb | pg_restore --create

which is another undocumented misbehavior: the docs for pg_restore do not
suggest that --create might fail if the source archive was selective.

Another set of problems is that if you use pg_restore's item
selection switches (-t etc), those do not restore ACLs, comments,
or security labels associated with the selected object(s).  This
is strange, and unlike the behavior of the comparable switches in
pg_dump, and certainly undocumented.

Attached is a patch that proposes to clean this up.  It arranges
things so that CREATE DATABASE (and, if --clean, DROP DATABASE)
is emitted if and only if you said --create, regardless of other
selectivity switches.  And it fixes selective pg_restore to include
subsidiary ACLs, comments, and security labels.

This does not go all the way towards making pg_restore's item selection
switches dump subsidiary objects the same as pg_dump would, because
pg_restore doesn't really have enough info to deal with indexes and
table constraints the way pg_dump does.  Perhaps we could intuit the
parent table by examining object dependencies, but that would be a
bunch of new logic that seems like fit material for a different patch.
In the meantime I just added a disclaimer to pg_restore.sgml explaining
this.

I also made an effort to make _tocEntryRequired() a little better
organized and better documented.  It's grown a lot of different
behaviors over the years without much thought about layout or
keeping related checks together.  (There's a chunk in the middle
of the function that now needs to be indented one stop to the right,
but in the interests of keeping the patch reviewable, I've not done
so yet.)

Comments?

regards, tom lane

diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index a2ebf75..b1b507f 100644
*** a/doc/src/sgml/ref/pg_restore.sgml
--- b/doc/src/sgml/ref/pg_restore.sgml
***
*** 446,451 
--- 446,456 
   flag of pg_dump.  There is not currently
   any provision for wild-card matching in pg_restore,
   nor can you include a schema name within its -t.
+  And, while pg_dump's -t
+  flag will also dump subsidiary objects (such as indexes) of the
+  selected table(s),
+  pg_restore's -t
+  flag does not include such subsidiary objects.
  
 
  
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index fb03793..051df9e 100644
*** a/src/bin/pg_dump/pg_backup_archiver.c
--- b/src/bin/pg_dump/pg_backup_archiver.c
*** static void _selectOutputSchema(ArchiveH
*** 70,76 
  static void _selectTablespace(ArchiveHandle *AH, const char *tablespace);
  static void processEncodingEntry(ArchiveHandle *AH, TocEntry *te);
  static void processStdStringsEntry(ArchiveHandle *AH, TocEntry *te);
! static teReqs _tocEntryRequired(TocEntry *te, teSection curSection, RestoreOptions *ropt);
  static RestorePass _tocEntryRestorePass(TocEntry *te);
  static bool _tocEntryIsACL(TocEntry *te);
  static void _disableTriggersIfNecessary(ArchiveHandle *AH, TocEntry *te);
--- 70,76 
  static void _selectTablespace(ArchiveHandle *AH, const char *tablespace);
  static void processEncodingEntry(ArchiveHandle *AH, TocEntry *te);
  static void processStdStringsEntry(ArchiveHandle *AH, TocEntry *te);
! static teReqs _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH);
  static RestorePass _tocEntryRestorePass(TocEntry *te);
  static bool _tocEntryIsACL(TocEntry *te);
  static void _disableTriggersIfNecessary(ArchiveHandle *AH, TocEntry *te);
*** ProcessArchiveRestoreOptions(Archive *AH
*** 312,318 
  		if (te->section != SECTION_NONE)
  			curSection = te->section;
  
! 		te->reqs = _tocEntryRequired(te, curSection, ropt);
  	}
  
  	/* Enforce strict names checking */
--- 312,318 
  		if (te->section != SECTION_NONE)
  			curSection = te->section;
  
! 		te->reqs = _tocEntryRequired(te, curSection, AH);
  	}
  
  	/* Enforce strict names checking */
***

Re: [HACKERS] Support to COMMENT ON DATABASE CURRENT_DATABASE

2018-01-24 Thread Jing Wang
>Not surprisingly, this patch no longer applies in the wake of commit
>b3f840120.  Rather than rebasing the pg_dump portions, I would suggest
>you just drop them.

It has been removed from the pg_dump code.

>I notice some other patch application failures in dbcommands.c,
>objectaddress.c, and user.c, so there's more work to do besides
>this.
Yes, fixed it.

Please find enclosed the latest patch with the above fixes.


Regards,
Jing Wang
Fujitsu Australia


support_CURRENT_DATABASE_keyword_v4.7.patch
Description: Binary data


Re: non-bulk inserts and tuple routing

2018-01-24 Thread Amit Langote
On 2018/01/24 17:25, Amit Langote wrote:
> On 2018/01/20 7:07, Robert Haas wrote:
>> On Fri, Jan 19, 2018 at 3:56 AM, Amit Langote wrote:
>>> I rebased the patches, since they started conflicting with a recently
>>> committed patch [1].
>>
>> I think that my latest commit has managed to break this pretty thoroughly.
> 
> I rebased it.  Here are the performance numbers again.
> 
> * Uses following hash-partitioned table:
> 
> create table t1 (a int, b int) partition by hash (a);
> create table t1_x partition of t1 for values with (modulus M, remainder R)
> ...
> 
> 
> * Non-bulk insert uses the following code (insert 100,000 rows one-by-one):
> 
> do $$
> begin
>   for i in 1..100000 loop
>     insert into t1 values (i, i+1);
>   end loop;
> end; $$;
> 
> Times in milliseconds:
> 
> #parts       HEAD      Patched
>      8    6148.313     4938.775
>     16    8882.420     6203.911
>     32   14251.072     8595.068
>     64   24465.691    13718.161
>    128   45099.435    23898.026
>    256   87307.332    44428.126
> 
> * Bulk-inserting 100,000 rows using COPY:
> 
> copy t1 from '/tmp/t1.csv' csv;
> 
> Times in milliseconds:
> 
> #parts       HEAD      Patched
>
>      8     466.170      446.865
>     16     445.341      444.990
>     32     443.544      487.713
>     64     460.579      435.412
>    128     469.953      422.403
>    256     463.592      431.118

Rebased again.

Thanks,
Amit
From 250fe77e304a0f0841f2d7978405c2a32fdeb7a2 Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 19 Dec 2017 10:43:45 +0900
Subject: [PATCH v4 1/3] Teach CopyFrom to use ModifyTableState for
 tuple-routing

This removes all fields of CopyStateData that were meant for
tuple routing and instead uses ModifyTableState that has all those
fields, including transition_tupconv_maps.  In COPY's case,
transition_tupconv_maps is only required if tuple routing is being
used, so it's safe.
---
 src/backend/commands/copy.c | 26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 04a24c6082..251676b321 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -166,9 +166,6 @@ typedef struct CopyStateData
boolvolatile_defexprs;  /* is any of defexprs volatile? 
*/
List   *range_table;
 
-   /* Tuple-routing support info */
-   PartitionTupleRouting *partition_tuple_routing;
-
TransitionCaptureState *transition_capture;
 
/*
@@ -2285,6 +2282,7 @@ CopyFrom(CopyState cstate)
ResultRelInfo *resultRelInfo;
ResultRelInfo *saved_resultRelInfo = NULL;
EState *estate = CreateExecutorState(); /* for ExecConstraints() */
+   ModifyTableState *mtstate = makeNode(ModifyTableState);
ExprContext *econtext;
TupleTableSlot *myslot;
MemoryContext oldcontext = CurrentMemoryContext;
@@ -2303,6 +2301,8 @@ CopyFrom(CopyState cstate)
SizebufferedTuplesSize = 0;
int firstBufferedLineNo = 0;
 
+   PartitionTupleRouting *proute = NULL;
+
Assert(cstate->rel);
 
/*
@@ -2468,10 +2468,15 @@ CopyFrom(CopyState cstate)
 */
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
-   PartitionTupleRouting *proute;
+   ModifyTable *node = makeNode(ModifyTable);
 
-   proute = cstate->partition_tuple_routing =
-   ExecSetupPartitionTupleRouting(NULL, cstate->rel, 1, 
estate);
+   /* Just need make this field appear valid. */
+   node->nominalRelation = 1;
+   mtstate->ps.plan = (Plan *) node;
+   mtstate->ps.state = estate;
+   mtstate->resultRelInfo = resultRelInfo;
+   proute = mtstate->mt_partition_tuple_routing =
+   ExecSetupPartitionTupleRouting(mtstate, cstate->rel, 1, 
estate);
 
/*
 * If we are capturing transition tuples, they may need to be
@@ -2496,7 +2501,7 @@ CopyFrom(CopyState cstate)
if ((resultRelInfo->ri_TrigDesc != NULL &&
 (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
  resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
-   cstate->partition_tuple_routing != NULL ||
+   mtstate->mt_partition_tuple_routing != NULL ||
cstate->volatile_defexprs)
{
useHeapMultiInsert = false;
@@ -2571,10 +2576,9 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 
/* Determine the partition to heap_insert the tuple into */
-   if (cstate->partition_tuple_routing)
+   if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
int

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-24 Thread Peter Geoghegan
On Wed, Jan 24, 2018 at 5:31 PM, Thomas Munro
 wrote:
> Here's a version that works, and a minimal repro test module thing.
> Without 0003 applied, it hangs.

I can confirm that this version does in fact fix the problem with
parallel CREATE INDEX hanging in the event of (simulated) worker
fork() failure. And, it seems to have at least one tiny advantage over
the other approaches I was talking about that you didn't mention,
which is that we never have to wait until the leader stops
participating as a worker before an error is raised. IOW, either the
whole parallel CREATE INDEX operation throws an error at an early
point in the CREATE INDEX, or the CREATE INDEX completely succeeds.

Obviously, the other, stated advantage is more relevant: this way, *nobody*
has to worry about nworkers_launched being inaccurate, including code that
gets away with this today only due to using a tuple queue, such as
nodeGather.c, but may not always get away with it in the future.

I've run out of time to assess what you've done here in any real
depth. For now, I will say that this approach seems interesting to me.
I'll take a closer look tomorrow.

-- 
Peter Geoghegan



Re: Failed to request an autovacuum work-item in silence

2018-01-24 Thread Masahiko Sawada
On Thu, Jan 25, 2018 at 12:14 AM, Fabrízio de Royes Mello
 wrote:
>
>
> On Wed, Jan 24, 2018 at 12:31 PM, Fabrízio de Royes Mello
>  wrote:
>>
>>
>>
>> On Tue, Jan 23, 2018 at 11:44 PM, Masahiko Sawada 
>> wrote:
>> >
>> > On Tue, Jan 23, 2018 at 8:03 PM, Fabrízio de Royes Mello
>> >  wrote:
>> > >
>> > > Em ter, 23 de jan de 2018 às 03:36, Masahiko Sawada
>> > > 
>> > > escreveu:
>> > >>
>> > >> Hi all,
>> > >>
>> > >> While reading the code, I realized that requesting an autovacuum
>> > >> work-item could fail silently if the work-item array is full, so
>> > >> users cannot tell that the work-item will never be performed.
>> > >> AutoVacuumRequestWork() seems to behave so from the initial
>> > >> implementation but is there any reason of such behavior? It seems to
>> > >> me that it can be a problem even now that there is only one kind of
>> > >> work-item. Attached patch for fixing it.
>> > >
>> > >
>> > > Seems reasonable but maybe you can use the word "worker" instead of
>> > > "work
>> > > item" for report message.
>> > >
>> >
>> > Thank you for the comment.
>> > Or can we use the word "work-item" since the commit log and source
>> > code use this word?
>> >
>>
>> You're correct, I misunderstood it thinking about autovacuum workers and
>> not the internal workitem array.
>>
>> As NUM_WORKITEMS is fixed in 256 we don't have any real feedback if in a
>> real workload this can send a lot of messages to log, so:
>> 1) maybe invent a new GUC to control if we need or not to send this info
>> to log
>> 2) change elevel for DEBUG1
>>

Hmm, I'd rather log the message at LOG level and have a new GUC to control
the maximum size of the work-items array. I think we should always notify
users when a work-item couldn't be added, because that can become a potential
performance problem. It might also lead to other problems once we have other
types of work-items. In the log message, we can hint to the user that they
might want to increase the maximum size of the work-items array.

>
> Looking more closely at the autovacuum code, your patch will work just for
> the BRIN request "brin_summarize_range", correct?
>

Correct.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-24 Thread Thomas Munro
On Thu, Jan 25, 2018 at 9:28 AM, Peter Geoghegan  wrote:
> On Wed, Jan 24, 2018 at 12:13 PM, Thomas Munro
>  wrote:
>> On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan  wrote:
>>> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>>>
>>> 1. The problem of fork failure causing nbtsort.c to wait forever is a
>>> real problem. Sure enough, the coding pattern within
>>> _bt_leader_heapscan() can cause us to wait forever even with commit
>>> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
>>> consequence of the patch not using tuple queues (it uses the new
>>> tuplesort sharing thing instead).
>>
>> Just curious: does the attached also help?
>
> I can still reproduce the problem without the fix I described (which
> does work), using your patch instead.
>
> Offhand, I suspect that the way you set ParallelMessagePending may not
> always leave it set when it should be.

Here's a version that works, and a minimal repro test module thing.
Without 0003 applied, it hangs.  With 0003 applied, it does this:

postgres=# call test_fork_failure();
CALL
postgres=# call test_fork_failure();
CALL
postgres=# call test_fork_failure();
ERROR:  lost connection to parallel worker
postgres=# call test_fork_failure();
ERROR:  lost connection to parallel worker

I won't be surprised if 0003 is judged to be a horrendous abuse of the
interrupt system, but these patches might at least be useful for
understanding the problem.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Chaos-monkey-fork-failure.patch
Description: Binary data


0002-A-simple-test-module-that-hangs-on-fork-failure.patch
Description: Binary data


0003-Pessimistic-fork-failure-detector.patch
Description: Binary data


Re: [HACKERS] Optional message to user when terminating/cancelling backend

2018-01-24 Thread Michael Paquier
On Wed, Jan 24, 2018 at 12:45:48PM -0300, Alvaro Herrera wrote:
> /me wonders if there's anything that would suggest to make this
> extensive to processes other than backends ...

That's a good thought. Right now ProcessInterrupts() is not used by
non-backend processes. For example, the WAL receiver has its own logic to
handle interrupts, but I guess that we could have an interface which can
be plugged into any process, and which is enabled by default for
bgworkers.
--
Michael




Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

2018-01-24 Thread Michael Paquier
On Wed, Jan 24, 2018 at 12:43:51PM -0500, Stephen Frost wrote:
> * chenhj (chjis...@163.com) wrote:
>> At 2018-01-23 09:56:48, "Stephen Frost"  wrote:
>>> I've only read through the thread to try and understand what's going on
>>> and the first thing that comes to mind is that you're changing
>>> pg_rewind to not remove the WAL from before the divergence (split)
>>> point, but I'm not sure why.  As noted, that WAL isn't needed for
>>> anything (it's from before the split, after all), so why keep it?  Is
>>> there something in this optimization that depends on the old WAL being
>>> there and, if so, what and why?
>> 
>> After running pg_rewind, the first startup of postgres will do crash
>> recovery, and crash recovery will begin from the redo point preceding the
>> divergence. So the WAL after that redo point and before the divergence is
>> needed.
> 
> Right.

Most of the time, and particularly since v11 has removed the need to
retain more past segments than one completed checkpoint, those segments
are less likely to be on the source server, which further limits the impact
of the patch discussed on this thread.

>> Of course, the WAL before the redo point is not needed, but in my view,
>> recycling unwanted WAL does not have to be done by pg_rewind.
> 
> That's what pg_rewind has been doing though, isn't it?  And it's not
> like that WAL is useful for anything, is it?  That's also how
> pg_basebackup works.

As of HEAD, pg_rewind handles data in pg_wal similarly to other paths which
are not relation files: the files from the source are just blindly copied to
the target. After the rewind, and once recovery begins, we just let the
startup process do the cleanup instead of pg_rewind. pg_basebackup is
different in that respect: you get the choice of what to do using
--wal-method, and you get only the segments you need for a self-contained
base backup.

>>> That's also different from how pg_basebackup works, which I don't think
>>> is good (seems like pg_rewind should operate in a pretty similar manner
>>> to pg_basebackup).
>> 
>> Thanks for your comments!
>> I also considered copying WAL just like pg_basebackup does, but an
>> implementation similar to pg_basebackup's may not be so simple.
> 
> Using the replication protocol to fetch WAL would be a good thing to do
> (actually, making pg_rewind entirely work through a connection to the
> current primary would be great) but that's independent of what I'm
> asking for here.  Here I'm just suggesting that we not change what
> pg_rewind is doing today when it comes to the existing WAL on the
> old-primary.

Yes, superuser is necessary now; if we could get to a point where only
replication permission is needed, that would be nice. We could also do
things differently: we could have a system role dedicated to pg_rewind
which works only on the functions from genfile.c that pg_rewind needs,
in order to remove the need for a superuser.

>> And the WAL which contains the redo point preceding the divergence may
>> exist only on the target server and may already have been recycled on the
>> source. That's a difference between pg_rewind and pg_basebackup.
> 
> Hm, pg_rewind was removing that and expecting it to be on the new
> primary?  If that's the case then I could see an argument for keeping
> WAL that's from the divergence point onward, but I still don't think
> we should have pg_rewind just leave all of the prior WAL in place.

Another thing that we could do as well is simply not fetch any WAL files at
all during a rewind, and then let the startup process of the rewound server
decide by itself what it needs. This would reduce the amount of data
transferred in all cases. It is easy to define the start point of the WAL
segments needed for a rewound server, because the last checkpoint record
before WAL forked is calculated before transferring any data. The finish
point cannot be exact though, because you don't know up to which point you
should transfer. In some ways, this is close to a base backup. We could also
define an end point, to minimize the amount of WAL, as the last completed
segment before data transfer begins, but then you need to worry about WAL
segment holes and such. At the end of the day, just not transferring any
data from pg_wal looks more solid to me as a first step, if we need to worry
about data that is transferred but ends up being useless.
--
Michael




Re: copy.c allocation constant

2018-01-24 Thread Tom Lane
Bruce Momjian  writes:
> On Thu, Jan 25, 2018 at 09:30:54AM +1300, Thomas Munro wrote:
>> On Thu, Jan 25, 2018 at 7:19 AM, Robert Haas  wrote:
>>> My guess is that a fairly common pattern for larger chunks will be to
>>> round the size up to a multiple of 4kB, the usual memory page size.
>> 
>> See also this discussion:
>> https://www.postgresql.org/message-id/flat/CAEepm%3D1bRyd%2B_W9eW-QmP1RGP03ti48zgd%3DK11Q6o4edQLgkcg%40mail.gmail.com#CAEepm=1bRyd+_W9eW-QmP1RGP03ti48zgd=k11q6o4edqlg...@mail.gmail.com
>> TL;DR glibc doesn't actually round up like that below 128kB, but many
>> others including FreeBSD, macOS etc round up to various page sizes or
>> size classes including 8kB (!), 512 bytes.  I find this a bit
>> frustrating because it means that the most popular libc implementation
>> doesn't have the problem so this kind of thing probably isn't a high
>> priority, but probably on most other Unices (and I have no clue for
>> Windows) including my current favourite we waste a bunch of memory.

> The BSD memory allocator used to allocate in powers of two, and keep the
> header in a separate location.  They did this so they could combine two
> free, identically-sized memory blocks into a single one that was double
> the size.  I have no idea how it works now.

It seems like there's fairly good reason to suppose that most versions
of malloc are efficient for power-of-2 request sizes.  Our big problem
is the overhead that we add ourselves.  That, however, we could compensate
for by adjusting request sizes in cases where the request is flexible.
I'd definitely support some sort of memory context API to allow such
adjustment, as we speculated about in the thread Thomas points to above.

regards, tom lane



Re: [HACKERS] SERIALIZABLE with parallel query

2018-01-24 Thread Thomas Munro
On Wed, Dec 13, 2017 at 5:30 PM, Thomas Munro
 wrote:
> On Wed, Dec 13, 2017 at 2:09 PM, Haribabu Kommi
>  wrote:
>> Thanks for explaining the problem in generating an isolation test to
>> test the serialize parallel query.
>>
>> Committer can decide whether existing test is fine to part of the test suite
>> or remove it, other than that everything is fine. so I am moving the patch
>> into "ready for committer" state.
>
> Thank you!  I will try to find a good benchmark that will really
> exercise parallel query + SSI.

This started crashing some time yesterday with an assertion failure in
the isolation tests after commit 2badb5af landed.  Reordering of code
in parallel.c confused patch's fuzz heuristics leading
SetSerializableXact() to be called too soon.  Here's a fix for that.

-- 
Thomas Munro
http://www.enterprisedb.com


ssi-parallel-v10.patch
Description: Binary data


Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Peter Geoghegan
On Wed, Jan 24, 2018 at 3:37 PM, Robert Haas  wrote:
> Well, I've been resisting that approach from the very beginning of
> parallel query.  Eventually, I hope that we're going to go in the
> direction of changing our mind about how many workers parallel
> operations use "on the fly".  For example, if there are 8 parallel
> workers available and 4 of them are in use, and you start a query (or
> index build) that wants 6 but only gets 4, it would be nice if the
> other 2 could join later after the other operation finishes and frees
> some up.

That seems like a worthwhile high-level goal.

I remember looking into Intel Threading Building Blocks many years
ago, and seeing some interesting ideas there. According to Wikipedia,
"TBB implements work stealing to balance a parallel workload across
available processing cores in order to increase core utilization and
therefore scaling". The programmer does not operate in terms of an
explicit number of threads, and there are probably certain types of
problems that this has an advantage with.

That model also has its costs, though, and I don't think it's ever
going to supplant a lower level approach. In an ideal world, you have
both things, because TBB's approach apparently has high coordination
overhead on many core systems.

> That, of course, won't work very well if parallel operations
> are coded in such a way that the number of workers must be nailed down
> at the very beginning.

But my whole approach to sorting is based on the idea that each worker
produces a roughly even amount of output to merge. I don't see any
scope to do better for parallel CREATE INDEX. (Other uses for parallel
sort are another matter, though.)

> Now maybe all that seems like pie in the sky, and perhaps it is, but I
> hold out hope.  For queries, there is another consideration, which is
> that some queries may run with parallelism but actually finish quite
> quickly - it's not desirable to make the leader wait for workers to
> start when it could be busy computing.  That's a lesser consideration
> for bulk operations like parallel CREATE INDEX, but even there I don't
> think it's totally negligible.

Since I don't have to start this until the leader stops participating
as a worker, there is no wait in the leader. In the vast majority of
cases, a call to something like WaitForParallelWorkersToAttach() ends
up looking at state in shared memory, immediately determining that
every launched process initialized successfully. The overhead should
be negligible in the real world.
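
To make that concrete, here is a hand-wavy sketch of what such a check could
look like; the function name and overall shape are assumptions on my part
(put together from existing bgworker/shm_mq primitives), not actual parallel.c
code, and a real version would have to wait for workers that are still
starting up rather than skip them:

#include "postgres.h"

#include "access/parallel.h"
#include "postmaster/bgworker.h"
#include "storage/shm_mq.h"

/*
 * Sketch: a worker that has attached as sender to its error queue is known
 * to have initialized; a worker that is already stopped without ever
 * attaching must have failed to start.
 */
static void
CheckParallelWorkersAttached(ParallelContext *pcxt)
{
	int			i;

	for (i = 0; i < pcxt->nworkers_launched; i++)
	{
		shm_mq	   *mq;
		pid_t		pid;

		if (pcxt->worker[i].error_mqh == NULL)
			continue;			/* already cleaned up */

		mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
		if (shm_mq_get_sender(mq) != NULL)
			continue;			/* attached; nworkers_launched can be trusted */

		if (GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) == BGWH_STOPPED)
			ereport(ERROR,
					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
					 errmsg("parallel worker failed to initialize")));

		/* otherwise the worker is still coming up; real code would wait here */
	}
}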

> For both reasons, it's much better, or so it seems to me, if parallel
> operations are coded to work with the number of workers that show up,
> rather than being inflexibly tied to a particular worker count.

I've been clear from day one that my approach to parallel tuplesort
isn't going to be that useful to parallel query in its first version.
You need some kind of partitioning (a distribution sort of some kind)
for that, and probably plenty of cooperation from within the executor.
I've also said that I don't think we can do much better for parallel
CREATE INDEX even *with* support for partitioning, which is something
borne out by comparisons with other systems. My patch was always
presented as an 80/20 solution.

I have given you specific technical reasons why I think that using a
barrier is at least a bad idea for nbtsort.c, and probably for
nodeGather.c, too. Those problems will need to be worked through if
you're not going to concede the point on using a barrier. Your
aspirations around not assuming that workers cannot join later seem
like good ones, broadly speaking, but they are not particularly
applicable to how *anything* happens to work now.

Besides all this, I'm not even suggesting that I need to know the
number of workers up front for parallel CREATE INDEX. Perhaps
nworkers_launched can be incremented after the fact following some
later enhancement to the parallel infrastructure, in which case
parallel CREATE INDEX will theoretically be prepared to take advantage
right away (though other parallel sort operations seem more likely to
*actually* benefit). That will be a job for the parallel
infrastructure, though, not for each and every parallel operation --
how else could we possibly hope to add more workers that become
available half way through, as part of a future enhancement to the
parallel infrastructure? Surely every caller to
CreateParallelContext() should not need to invent their own way of
doing this.

All I want is to be able to rely on nworkers_launched. That's not in
tension with this other goal/aspiration, and actually seems to
complement it.

-- 
Peter Geoghegan



Re: reducing isolation tests runtime

2018-01-24 Thread Tom Lane
Alvaro Herrera  writes:
> On the subject of test total time, we could parallelize isolation tests.
> Right now "make check" in src/test/isolation takes 1:16 on my machine.
> Test "timeouts" takes full 40s of that, with nothing running in parallel
> -- the machine is completely idle.

BTW, one small issue there is that the reason the timeouts test is so
slow is that we have to use multi-second timeouts to be sure slower
buildfarm critters (eg valgrind animals) will get the expected results.
So I'm worried that if the machine isn't otherwise idle, we will get
random failures.

We could parallelize the rest of those tests and leave timeouts in its own
group.  That cuts the payback a lot :-( but might still be worth doing.
Or maybe tweak things so that the buildfarm runs a serial schedule but
manual testing doesn't.  Or we could debate how important the timeout
tests really are ... or think harder about how to make them reproducible.

regards, tom lane



Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Robert Haas
On Wed, Jan 24, 2018 at 5:52 PM, Peter Geoghegan  wrote:
>> If we made the Gather node wait only for workers that actually reached
>> the Gather -- either by using a Barrier or by some other technique --
>> then this would be a lot less fragile.  And the same kind of technique
>> would work for parallel CREATE INDEX.
>
> The use of a barrier has problems of its own [1], though, of which one
> is unique to the parallel_leader_participation=off case. I thought
> that you yourself agreed with this [2] -- do you?
>
> Another argument against using a barrier in this way is that it seems
> like way too much mechanism to solve a simple problem. Why should a
> client of parallel.h not be able to rely on nworkers_launched (perhaps
> only after "verifying it can be trusted")? That seems like a pretty
> reasonable requirement for clients to have for any framework for
> parallel imperative programming.

Well, I've been resisting that approach from the very beginning of
parallel query.  Eventually, I hope that we're going to go in the
direction of changing our mind about how many workers parallel
operations use "on the fly".  For example, if there are 8 parallel
workers available and 4 of them are in use, and you start a query (or
index build) that wants 6 but only gets 4, it would be nice if the
other 2 could join later after the other operation finishes and frees
some up.  That, of course, won't work very well if parallel operations
are coded in such a way that the number of workers must be nailed down
at the very beginning.

Now maybe all that seems like pie in the sky, and perhaps it is, but I
hold out hope.  For queries, there is another consideration, which is
that some queries may run with parallelism but actually finish quite
quickly - it's not desirable to make the leader wait for workers to
start when it could be busy computing.  That's a lesser consideration
for bulk operations like parallel CREATE INDEX, but even there I don't
think it's totally negligible.

For both reasons, it's much better, or so it seems to me, if parallel
operations are coded to work with the number of workers that show up,
rather than being inflexibly tied to a particular worker count.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: reducing isolation tests runtime

2018-01-24 Thread Tom Lane
Robert Haas  writes:
> On Wed, Jan 24, 2018 at 6:10 PM, Alvaro Herrera  
> wrote:
>> On the subject of test total time, we could parallelize isolation tests.

> Oh, cool.  Yes, the time the isolation tests take to run is quite
> annoying.  I didn't realize it would be so easy to run it in parallel.

+1 to both --- I hadn't realized we had enough infrastructure to do this
in parallel, either.

regards, tom lane



Re: CONSTANT/NOT NULL/initializer properties for plpgsql record variables

2018-01-24 Thread Tom Lane
I wrote:
> I said a couple of times in recent threads that it wouldn't be too hard
> to implement $SUBJECT given the other patches I've been working on.

Here's a version rebased up to HEAD, with a trivial merge conflict fixed.

This now needs to be applied over the patches in
https://postgr.es/m/833.1516834...@sss.pgh.pa.us
and
https://postgr.es/m/8376.1516835...@sss.pgh.pa.us

regards, tom lane

diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile
index 2190eab..3ac64e2 100644
*** a/src/pl/plpgsql/src/Makefile
--- b/src/pl/plpgsql/src/Makefile
*** DATA = plpgsql.control plpgsql--1.0.sql 
*** 26,32 
  
  REGRESS_OPTS = --dbname=$(PL_TESTDB)
  
! REGRESS = plpgsql_call plpgsql_control plpgsql_record plpgsql_transaction
  
  all: all-lib
  
--- 26,33 
  
  REGRESS_OPTS = --dbname=$(PL_TESTDB)
  
! REGRESS = plpgsql_call plpgsql_control plpgsql_record \
! 	plpgsql_transaction plpgsql_varprops
  
  all: all-lib
  
diff --git a/src/pl/plpgsql/src/expected/plpgsql_varprops.out b/src/pl/plpgsql/src/expected/plpgsql_varprops.out
index ...109056c .
*** a/src/pl/plpgsql/src/expected/plpgsql_varprops.out
--- b/src/pl/plpgsql/src/expected/plpgsql_varprops.out
***
*** 0 
--- 1,300 
+ --
+ -- Tests for PL/pgSQL variable properties: CONSTANT, NOT NULL, initializers
+ --
+ create type var_record as (f1 int4, f2 int4);
+ create domain int_nn as int not null;
+ create domain var_record_nn as var_record not null;
+ create domain var_record_colnn as var_record check((value).f2 is not null);
+ -- CONSTANT
+ do $$
+ declare x constant int := 42;
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ NOTICE:  x = 42
+ do $$
+ declare x constant int;
+ begin
+   x := 42;  -- fail
+ end$$;
+ ERROR:  variable "x" is declared CONSTANT
+ LINE 4:   x := 42;  -- fail
+   ^
+ do $$
+ declare x constant int; y int;
+ begin
+   for x, y in select 1, 2 loop  -- fail
+   end loop;
+ end$$;
+ ERROR:  variable "x" is declared CONSTANT
+ LINE 4:   for x, y in select 1, 2 loop  -- fail
+   ^
+ do $$
+ declare x constant int[];
+ begin
+   x[1] := 42;  -- fail
+ end$$;
+ ERROR:  variable "x" is declared CONSTANT
+ LINE 4:   x[1] := 42;  -- fail
+   ^
+ do $$
+ declare x constant int[]; y int;
+ begin
+   for x[1], y in select 1, 2 loop  -- fail (currently, unsupported syntax)
+   end loop;
+ end$$;
+ ERROR:  syntax error at or near "["
+ LINE 4:   for x[1], y in select 1, 2 loop  -- fail (currently, unsup...
+^
+ do $$
+ declare x constant var_record;
+ begin
+   x.f1 := 42;  -- fail
+ end$$;
+ ERROR:  variable "x" is declared CONSTANT
+ LINE 4:   x.f1 := 42;  -- fail
+   ^
+ do $$
+ declare x constant var_record; y int;
+ begin
+   for x.f1, y in select 1, 2 loop  -- fail
+   end loop;
+ end$$;
+ ERROR:  variable "x" is declared CONSTANT
+ LINE 4:   for x.f1, y in select 1, 2 loop  -- fail
+   ^
+ -- initializer expressions
+ do $$
+ declare x int := sin(0);
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ NOTICE:  x = 0
+ do $$
+ declare x int := 1/0;  -- fail
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ ERROR:  division by zero
+ CONTEXT:  SQL statement "SELECT 1/0"
+ PL/pgSQL function inline_code_block line 3 during statement block local variable initialization
+ do $$
+ declare x bigint[] := array[1,3,5];
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ NOTICE:  x = {1,3,5}
+ do $$
+ declare x record := row(1,2,3);
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ NOTICE:  x = (1,2,3)
+ do $$
+ declare x var_record := row(1,2);
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ NOTICE:  x = (1,2)
+ -- NOT NULL
+ do $$
+ declare x int not null;  -- fail
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ ERROR:  variable "x" must have a default value, since it's declared NOT NULL
+ LINE 2: declare x int not null;  -- fail
+   ^
+ do $$
+ declare x int not null := 42;
+ begin
+   raise notice 'x = %', x;
+   x := null;  -- fail
+ end$$;
+ NOTICE:  x = 42
+ ERROR:  null value cannot be assigned to variable "x" declared NOT NULL
+ CONTEXT:  PL/pgSQL function inline_code_block line 5 at assignment
+ do $$
+ declare x int not null := null;  -- fail
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ ERROR:  null value cannot be assigned to variable "x" declared NOT NULL
+ CONTEXT:  PL/pgSQL function inline_code_block line 3 during statement block local variable initialization
+ do $$
+ declare x record not null;  -- fail
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ ERROR:  variable "x" must have a default value, since it's declared NOT NULL
+ LINE 2: declare x record not null;  -- fail
+  ^
+ do $$
+ declare x record not null := row(42);
+ begin
+   raise notice 'x = %', x;
+   x := row(null);  -- ok
+   raise notice 'x = %', x;
+   x := null;  -- fail
+ end$$;
+ NOTICE:  x = (42)
+ NOTICE:  x = ()
+ ERROR:  null value cannot be assigned to variable "x" declared NOT NULL
+ C

Re: reducing isolation tests runtime

2018-01-24 Thread Robert Haas
On Wed, Jan 24, 2018 at 6:10 PM, Alvaro Herrera  wrote:
> On the subject of test total time, we could parallelize isolation tests.
> Right now "make check" in src/test/isolation takes 1:16 on my machine.
> Test "timeouts" takes full 40s of that, with nothing running in parallel
> -- the machine is completely idle.
>
> Seems like we can have a lot of time back just by changing the schedule
> to use multiple tests per line (in particular, put the other slow tests
> together with timeouts), per the attached; with this new schedule,
> isolation takes 44 seconds on my machine -- a win of 32 seconds.  We can
> win a couple of additional seconds by grouping a few other lines, but
> this is the biggest win.
>
> (This needs to be adjusted because some table names in the specs
> conflict.)

Oh, cool.  Yes, the time the isolation tests take to run is quite
annoying.  I didn't realize it would be so easy to run it in parallel.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: plpgsql function startup-time improvements

2018-01-24 Thread Tom Lane
Pavel Stehule  writes:
> please, can you rebase all three patches necessary for patching?

Done.  These now need to be applied over
https://www.postgresql.org/message-id/833.1516834...@sss.pgh.pa.us

regards, tom lane

diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
index 09ecaec..97cb763 100644
*** a/src/pl/plpgsql/src/pl_comp.c
--- b/src/pl/plpgsql/src/pl_comp.c
*** plpgsql_adddatum(PLpgSQL_datum *new)
*** 2183,2194 
--- 2183,2211 
  static void
  plpgsql_finish_datums(PLpgSQL_function *function)
  {
+ 	Size		copiable_size = 0;
  	int			i;
  
  	function->ndatums = plpgsql_nDatums;
  	function->datums = palloc(sizeof(PLpgSQL_datum *) * plpgsql_nDatums);
  	for (i = 0; i < plpgsql_nDatums; i++)
+ 	{
  		function->datums[i] = plpgsql_Datums[i];
+ 
+ 		/* This must agree with copy_plpgsql_datums on what is copiable */
+ 		switch (function->datums[i]->dtype)
+ 		{
+ 			case PLPGSQL_DTYPE_VAR:
+ copiable_size += MAXALIGN(sizeof(PLpgSQL_var));
+ break;
+ 			case PLPGSQL_DTYPE_REC:
+ copiable_size += MAXALIGN(sizeof(PLpgSQL_rec));
+ break;
+ 			default:
+ break;
+ 		}
+ 	}
+ 	function->copiable_size = copiable_size;
  }
  
  
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index e119744..e2d315c 100644
*** a/src/pl/plpgsql/src/pl_exec.c
--- b/src/pl/plpgsql/src/pl_exec.c
*** static HTAB *shared_cast_hash = NULL;
*** 235,241 
  static void coerce_function_result_tuple(PLpgSQL_execstate *estate,
  			 TupleDesc tupdesc);
  static void plpgsql_exec_error_callback(void *arg);
! static PLpgSQL_datum *copy_plpgsql_datum(PLpgSQL_datum *datum);
  static MemoryContext get_stmt_mcontext(PLpgSQL_execstate *estate);
  static void push_stmt_mcontext(PLpgSQL_execstate *estate);
  static void pop_stmt_mcontext(PLpgSQL_execstate *estate);
--- 235,242 
  static void coerce_function_result_tuple(PLpgSQL_execstate *estate,
  			 TupleDesc tupdesc);
  static void plpgsql_exec_error_callback(void *arg);
! static void copy_plpgsql_datums(PLpgSQL_execstate *estate,
! 	PLpgSQL_function *func);
  static MemoryContext get_stmt_mcontext(PLpgSQL_execstate *estate);
  static void push_stmt_mcontext(PLpgSQL_execstate *estate);
  static void pop_stmt_mcontext(PLpgSQL_execstate *estate);
*** plpgsql_exec_function(PLpgSQL_function *
*** 458,465 
  	 * Make local execution copies of all the datums
  	 */
  	estate.err_text = gettext_noop("during initialization of execution state");
! 	for (i = 0; i < estate.ndatums; i++)
! 		estate.datums[i] = copy_plpgsql_datum(func->datums[i]);
  
  	/*
  	 * Store the actual call argument values into the appropriate variables
--- 459,465 
  	 * Make local execution copies of all the datums
  	 */
  	estate.err_text = gettext_noop("during initialization of execution state");
! 	copy_plpgsql_datums(&estate, func);
  
  	/*
  	 * Store the actual call argument values into the appropriate variables
*** plpgsql_exec_trigger(PLpgSQL_function *f
*** 859,866 
  	 * Make local execution copies of all the datums
  	 */
  	estate.err_text = gettext_noop("during initialization of execution state");
! 	for (i = 0; i < estate.ndatums; i++)
! 		estate.datums[i] = copy_plpgsql_datum(func->datums[i]);
  
  	/*
  	 * Put the OLD and NEW tuples into record variables
--- 859,865 
  	 * Make local execution copies of all the datums
  	 */
  	estate.err_text = gettext_noop("during initialization of execution state");
! 	copy_plpgsql_datums(&estate, func);
  
  	/*
  	 * Put the OLD and NEW tuples into record variables
*** plpgsql_exec_event_trigger(PLpgSQL_funct
*** 1153,1159 
  {
  	PLpgSQL_execstate estate;
  	ErrorContextCallback plerrcontext;
- 	int			i;
  	int			rc;
  	PLpgSQL_var *var;
  
--- 1152,1157 
*** plpgsql_exec_event_trigger(PLpgSQL_funct
*** 1174,1181 
  	 * Make local execution copies of all the datums
  	 */
  	estate.err_text = gettext_noop("during initialization of execution state");
! 	for (i = 0; i < estate.ndatums; i++)
! 		estate.datums[i] = copy_plpgsql_datum(func->datums[i]);
  
  	/*
  	 * Assign the special tg_ variables
--- 1172,1178 
  	 * Make local execution copies of all the datums
  	 */
  	estate.err_text = gettext_noop("during initialization of execution state");
! 	copy_plpgsql_datums(&estate, func);
  
  	/*
  	 * Assign the special tg_ variables
*** plpgsql_exec_error_callback(void *arg)
*** 1290,1346 
   * Support function for initializing local execution variables
   * --
   */
! static PLpgSQL_datum *
! copy_plpgsql_datum(PLpgSQL_datum *datum)
  {
! 	PLpgSQL_datum *result;
  
! 	switch (datum->dtype)
! 	{
! 		case PLPGSQL_DTYPE_VAR:
! 			{
! PLpgSQL_var *new = palloc(sizeof(PLpgSQL_var));
  
! memcpy(new, datum, sizeof(PLpgSQL_var));
! /* should be preset to null/non-freeable */
! Assert(new->isnull);
! Assert(!new->freeval

reducing isolation tests runtime

2018-01-24 Thread Alvaro Herrera
On the subject of test total time, we could parallelize isolation tests.
Right now "make check" in src/test/isolation takes 1:16 on my machine.
Test "timeouts" takes full 40s of that, with nothing running in parallel
-- the machine is completely idle.

Seems like we can have a lot of time back just by changing the schedule
to use multiple tests per line (in particular, put the other slow tests
together with timeouts), per the attached; with this new schedule,
isolation takes 44 seconds on my machine -- a win of 32 seconds.  We can
win a couple of additional seconds by grouping a few other lines, but
this is the biggest win.

(This needs to be adjusted because some table names in the specs
conflict.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 74d7d59546..f8dbd018e7 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -1,3 +1,4 @@
+test: timeouts tuplelock-update deadlock-hard deadlock-soft-2 alter-table-1 receipt-report multiple-row-versions
 test: read-only-anomaly
 test: read-only-anomaly-2
 test: read-only-anomaly-3
@@ -6,26 +7,18 @@ test: read-write-unique-2
 test: read-write-unique-3
 test: read-write-unique-4
 test: simple-write-skew
-test: receipt-report
 test: temporal-range-integrity
 test: project-manager
 test: classroom-scheduling
 test: total-cash
 test: referential-integrity
 test: ri-trigger
-test: partial-index
-test: two-ids
-test: multiple-row-versions
-test: index-only-scan
+test: partial-index two-ids index-only-scan eval-plan-qual lock-update-delete lock-committed-update lock-committed-keyupdate tuplelock-conflict alter-table-2 alter-table-3 create-trigger
 test: deadlock-simple
-test: deadlock-hard
 test: deadlock-soft
-test: deadlock-soft-2
 test: fk-contention
 test: fk-deadlock
 test: fk-deadlock2
-test: eval-plan-qual
-test: lock-update-delete
 test: lock-update-traversal
 test: insert-conflict-do-nothing
 test: insert-conflict-do-nothing-2
@@ -38,12 +31,8 @@ test: delete-abort-savept-2
 test: aborted-keyrevoke
 test: multixact-no-deadlock
 test: multixact-no-forget
-test: lock-committed-update
-test: lock-committed-keyupdate
 test: update-locked-tuple
 test: propagate-lock-delete
-test: tuplelock-conflict
-test: tuplelock-update
 test: freeze-the-dead
 test: nowait
 test: nowait-2
@@ -56,13 +45,8 @@ test: skip-locked-3
 test: skip-locked-4
 test: drop-index-concurrently-1
 test: multiple-cic
-test: alter-table-1
-test: alter-table-2
-test: alter-table-3
 test: alter-table-4
-test: create-trigger
 test: sequence-ddl
 test: async-notify
 test: vacuum-reltuples
-test: timeouts
 test: vacuum-concurrent-drop


Re: Possible performance regression with pg_dump of a large number of relations

2018-01-24 Thread Stephen Frost
Hi there!

* Luke Cowell (lcow...@gmail.com) wrote:
> Hi Stephen, thank you for putting this together.

Yeah, it needs more work, which I figured out after actually hacking
together a patch for it and I've just not gotten back to it yet.

> > If folks get a chance to take a look at the query and/or test, that'd be
> > great.  I'll try to work up an actual patch to pg_dump this weekend to
> > run it through the regression tests and see if anything breaks.
> 
> I'm not sure how I can help other than testing that this runs. I can confirm 
> that it runs on 10.1. It does not run on 9.5 or 9.6 and gives this error:

Thanks for offering to help!  Once I've got an actual patch together
that works well enough to get through the regression tests, I'll
definitely let you know.  The general premise still looks viable, but
the actual query needs to be reworked.

> > ERROR:  relation "pg_init_privs" does not exist
> > LINE 139: LEFT JOIN pg_init_privs pip

I certainly hope that works on 9.6, since that's when pg_init_privs was
added..

> I'm guessing that the error is not surprising and that the query is intended 
> for an upcoming release of postgres and wouldn't be backported to 9.6.x

Presuming I can make it work, the idea would be to back-port it to 9.6
and 10, since pg_init_privs and this code was added in 9.6.

Thanks again!

Stephen


signature.asc
Description: PGP signature


Re: Converting plpgsql to use DTYPE_REC for named composite types

2018-01-24 Thread Tom Lane
Here's a version of this rebased up to HEAD, fixing a couple of trivial
merge conflicts and incorporating the docs delta I posted separately.

(I'd supposed this patch was still OK because the patch tester said so,
but I now see that the tester was only testing the docs delta :-(.)

regards, tom lane



use-dtype-rec-for-all-composites-3.patch.gz
Description: use-dtype-rec-for-all-composites-3.patch.gz


Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Peter Geoghegan
On Wed, Jan 24, 2018 at 12:05 PM, Robert Haas  wrote:
> In Thomas's test case, he's using 4 workers, and if even one of those
> manages to start, then it'll probably do so after any fork failures
> have already occurred, and the error will be noticed when that process
> sends its first message to the leader through the error queue, because
> it'll send PROCSIG_PARALLEL_MESSAGE via all the queues.  If all of the
> workers fail to start, then that doesn't help.  But it still manages
> to detect the failure in that case because it reaches
> WaitForParallelWorkersToFinish, which we just patched.
>
> But how does that happen, anyway?  Shouldn't it get stuck waiting for
> the tuple queues to which nobody's attached yet?  The reason it
> doesn't is because
> ExecParallelCreateReaders() calls shm_mq_set_handle() for each queue,
> which causes the tuple queues to be regarded as detached if the worker
> fails to start.  A detached tuple queue, in general, is not an error
> condition: it just means that worker has no more tuples.

This explains all the confusion. Amit told me that using a tuple queue
made all the difference here. Even till, it seemed surprising that
we'd rely on that from such a long distance from within nodeGather.c.

> I guess that works, but it seems more like blind luck than good
> design.  Parallel CREATE INDEX fails to be as "lucky" as Gather
> because there's nothing in parallel CREATE INDEX that lets it skip
> waiting for a worker which doesn't start -- and in general it seems
> unacceptable to impose a coding requirement that all future parallel
> operations must fail to notice the difference between a worker that
> completed successfully and one that never ran at all.

+1.

> If we made the Gather node wait only for workers that actually reached
> the Gather -- either by using a Barrier or by some other technique --
> then this would be a lot less fragile.  And the same kind of technique
> would work for parallel CREATE INDEX.

The use of a barrier has problems of its own [1], though, of which one
is unique to the parallel_leader_participation=off case. I thought
that you yourself agreed with this [2] -- do you?

Another argument against using a barrier in this way is that it seems
like way too much mechanism to solve a simple problem. Why should a
client of parallel.h not be able to rely on nworkers_launched (perhaps
only after "verifying it can be trusted")? That seem like a pretty
reasonable requirement for clients to have for any framework for
parallel imperative programming.

I think that we should implement "some other technique", instead of
using a barrier. As I've said, Amit's WaitForParallelWorkersToAttach()
idea seems like a promising "other technique".

[1] 
https://www.postgresql.org/message-id/caa4ek1+a0of4m231vbgpr_0ygg_bnmrgzlib7wqde-fybsy...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CA+TgmoaF8UA8v8hP=ccoquc50pucpc8abj-_yyc++ygggjw...@mail.gmail.com
-- 
Peter Geoghegan



Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> Fair point, but doesn't it apply equally to non-default ACLs on any
> >> other system objects?  If you tweaked the permissions on say pg_ls_dir(),
> >> then dump, then tweak them some more, you're going to get uncertain
> >> results if you load that dump back into this database ... with or without
> >> --clean, because we certainly aren't going to drop pinned objects.
> 
> > Yes, that's certainly true, the public schema is the only "special"
> > animal in this regard and making it less special (and more like
> > pg_ls_dir()) would definitely be nice.
> 
> I wonder if it'd be worth the trouble to invent a variant of REVOKE
> that means "restore this object's permissions to default" --- that is,
> either the ACL recorded in pg_init_privs if there is one, or NULL if
> there's no pg_init_privs entry.  Then you could imagine pg_dump emitting
> that command before trying to assign an ACL to any object it hadn't
> created earlier in the run, rather than guessing about the current state
> of the object's ACL.  (I'm not volunteering.)

I actually like that idea quite a bit..  Not really high on my priority
list though.

> >> I think we could jigger things so that we dump the definition of these
> >> special quasi-system objects only if their ACLs are not default, but
> >> it's not clear to me that that's really an improvement in the long run.
> >> Seems like it's just making them even wartier.
> 
> > Yeah, that would be worse, I agree.
> 
> So are we at a consensus yet?

You had me at "make public less special", I was just trying to make sure
we all understand what that means.

+1 from me for moving forward.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: JIT compiling with LLVM v9.0

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 14:06:30 -0800, Andres Freund wrote:
> > In LLVM 5.0, it looks like DebugInfo.h is not available in llvm-c, only
> > as a C++ API in llvm/IR/DebugInfo.h.
> 
> Hm, I compiled against 5.0 quite recently, but added the stripping of
> debuginfo later on.  I'll add a fallback method, thanks for pointing that
> out!

Went more with your fix, there's not much point in using the C API
here. Should probably remove the use of it nearly entirely from the .cpp
file (save for wrap/unwrap() use). But man, the 'class Error' usage is
one major ugly pain.


> > But I still could not build because the LLVM API changed between 5.0 and 
> > 6.0 
> > regarding value info SummaryList. 
> 
> Hm, thought these changes were from before my 5.0 test. But the code
> evolved heavily, so I might misremember. Let me see.

Ah, that one was actually easier to fix. There's no need to get the base
object at all, so it's just a one-line change.


> Thanks, I'll try to push fixes into the tree soon-ish..

Pushed.

Thanks again for looking!

- Andres



Re: JIT compiling with LLVM v9.0

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 22:35:08 +0100, Pierre Ducroquet wrote:
> I tried to build on Debian sid, using GCC 7 and LLVM 5. I used the following 
> to compile, using your branch @3195c2821d :

Thanks!

> $ export LLVM_CONFIG=/usr/bin/llvm-config-5.0
> $ ./configure --with-llvm
> $ make
> 
> And I had the following build error :
> llvmjit_wrap.cpp:32:10: fatal error: llvm-c/DebugInfo.h: No such file or 
> directory
>  #include "llvm-c/DebugInfo.h"
>   ^~~~
> compilation terminated.
> 
> In LLVM 5.0, it looks like DebugInfo.h is not available in llvm-c, only as
> a C++ API in llvm/IR/DebugInfo.h.

Hm, I compiled against 5.0 quite recently, but added the stripping of
debuginfo later on.  I'll add a fallback method, thanks for pointing that
out!


> But I still could not build because the LLVM API changed between 5.0 and 6.0 
> regarding value info SummaryList. 

Hm, thought these changes were from before my 5.0 test. But the code
evolved heavily, so I might misremember. Let me see.


Thanks, I'll try to push fixes into the tree soon-ish..


> I'm not sure how to test everything properly, so the patch is attached for
> both these issues, do as you wish with it… :)

What I do for testing is running postgres' tests against a started
server that has all cost based behaviour turned off (which makes no
sense from a runtime optimization perspective, but increases
coverage...).

The flags I pass to the server are:
  -c jit_expressions=1 -c jit_tuple_deforming=1 -c jit_perform_inlining=1 -c jit_above_cost=0 -c jit_optimize_above_cost=0
then I run
  make -s installcheck-parallel
to see whether things pass.  The flags makes the tests slow-ish, but
tests everything under jit. In particular errors.sql's recursion check
takes a while...

Obviously none of the standard tests are interesting from a performance
perspective...

FWIW, here's an shortened excerpt of the debugging output of TPCH query:

DEBUG:  checking inlinability of ExecAggInitGroup
DEBUG:  considering extern function datumCopy at 75 for inlining
DEBUG:  inline top function ExecAggInitGroup total_instcount: 24, partial: 21

so the inliner found a reference to ExecAggInitGroup, inlined it, and
scheduled to checkout datumCopy, externally referenced from
ExecAggInitGroup, later.

DEBUG:  uneligible to import errstart due to early threshold: 150 vs 37

elog stuff wasn't inlined because errstart has 150 insn, but at this
point the limit was 37 (aka 150 / 2 / 2). Early means this was decided
based on the summary.  There's also 'late' checks preventing inlining if
dependencies of the inlined variable (local static functions, constant
static global variables) make it bigger than the summary knows about.

Then we get to execute the importing:
DEBUG:  performing import of postgres/utils/fmgr/fmgr.bc pg_detoast_datum, 
pg_detoast_datum_packed
DEBUG:  performing import of postgres/utils/adt/arrayfuncs.bc construct_array
DEBUG:  performing import of postgres/utils/error/assert.bc 
ExceptionalCondition, .str.1, .str
DEBUG:  performing import of postgres/utils/adt/expandeddatum.bc 
EOH_flatten_into, DeleteExpandedObject, .str.1, .str.2, .str.4, 
EOH_get_flat_size
DEBUG:  performing import of postgres/utils/adt/int8.bc __func__.overflowerr, 
.str, .str.12, int8inc, overflowerr, pg_add_s64_overflow
...
DEBUG:  performing import of postgres/utils/adt/date.bc date_le_timestamp,  
date2timestamp,  .str,  __func__.date2timestamp,  .str.26

And there's a timing summary (debugging build)
DEBUG:  time to inline: 0.145s
DEBUG:  time to opt: 0.156s
DEBUG:  time to emit: 0.078s


Same debugging build:

tpch_10[6930][1]=# set jit_expressions = 1;
tpch_10[6930][1]=# \i ~/tmp/tpch/pg-tpch/queries/q01.sql 
...
Time: 28442.870 ms (00:28.443)

tpch_10[6930][1]=# set jit_expressions = 0;
tpch_10[6930][1]=# \i ~/tmp/tpch/pg-tpch/queries/q01.sql 
...
Time: 70357.830 ms (01:10.358)
tpch_10[6930][1]=# show max_parallel_workers_per_gather;
┌─────────────────────────────────┐
│ max_parallel_workers_per_gather │
├─────────────────────────────────┤
│                               0 │
└─────────────────────────────────┘

Now admittedly a debugging/assertion enabled build isn't quite a fair
fight, but it's not that much smaller a win without that.

- Andres



Re: [HACKERS] UPDATE of partition key

2018-01-24 Thread Thomas Munro
On Thu, Jan 25, 2018 at 10:39 AM, Robert Haas  wrote:
> On Mon, Jan 22, 2018 at 9:46 AM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> Tom, do you want to double-check that this fixes it for you?
>>
>> I can confirm that a valgrind run succeeded for me with the patch
>> in place.
>
> Committed.  Sorry for the delay.

FYI I'm planning to look into adding a valgrind check to the
commitfest CI thing I run so we can catch these earlier without
committer involvement.  It's super slow because of all those pesky
regression tests so I'll probably need to improve the scheduling logic
a bit to make it useful first (prioritising new patches or something,
since otherwise it'll take up to multiple days to get around to
valgrind-testing any given patch...).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Possible performance regression with pg_dump of a large number of relations

2018-01-24 Thread Luke Cowell
Hi Stephen, thank you for putting this together.

> If folks get a chance to take a look at the query and/or test, that'd be
> great.  I'll try to work up an actual patch to pg_dump this weekend to
> run it through the regression tests and see if anything breaks.

I'm not sure how I can help other than testing that this runs. I can confirm 
that it runs on 10.1. It does not run on 9.5 or 9.6 and gives this error:
> ERROR:  relation "pg_init_privs" does not exist
> LINE 139: LEFT JOIN pg_init_privs pip

I'm guessing that the error is not surprising and that the query is intended 
for an upcoming release of postgres and wouldn't be backported to 9.6.x

Luke


Re: documentation is now XML

2018-01-24 Thread Peter Eisentraut
On 1/24/18 09:45, Bruce Momjian wrote:
> So we are not
> using TeX anymore for PG11+ docs?

as of PG10

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] UPDATE of partition key

2018-01-24 Thread Robert Haas
On Mon, Jan 22, 2018 at 9:46 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> Tom, do you want to double-check that this fixes it for you?
>
> I can confirm that a valgrind run succeeded for me with the patch
> in place.

Committed.  Sorry for the delay.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: JIT compiling with LLVM v9.0

2018-01-24 Thread Pierre Ducroquet
On Wednesday, January 24, 2018 8:20:38 AM CET Andres Freund wrote:
> As the patchset is large (500kb) and I'm still quickly evolving it, I do
> not yet want to attach it. The git tree is at
>   https://git.postgresql.org/git/users/andresfreund/postgres.git
> in the jit branch
>  
> https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shor
> tlog;h=refs/heads/jit
> 
> to build --with-llvm has to be passed to configure, llvm-config either
> needs to be in PATH or provided with LLVM_CONFIG to make. A c++ compiler
> and clang need to be available under common names or provided via CXX /
> CLANG respectively.
> 
> Regards,
> 
> Andres Freund

Hi

I tried to build on Debian sid, using GCC 7 and LLVM 5. I used the following 
to compile, using your branch @3195c2821d :

$ export LLVM_CONFIG=/usr/bin/llvm-config-5.0
$ ./configure --with-llvm
$ make

And I had the following build error :
llvmjit_wrap.cpp:32:10: fatal error: llvm-c/DebugInfo.h: No such file or 
directory
 #include "llvm-c/DebugInfo.h"
  ^~~~
compilation terminated.

In LLVM 5.0, it looks like DebugInfo.h is not available in llvm-c, only as a
C++ API in llvm/IR/DebugInfo.h.

For 'sport' (I have not played with the LLVM API in more than a year), I
tried to fix it by changing it to the C++ include.

The DebugInfo related one was easy, only one function was used.
But I still could not build because the LLVM API changed between 5.0 and 6.0 
regarding value info SummaryList. 

llvmjit_wrap.cpp: In function 
‘std::unique_ptr > > 
llvm_build_inline_plan(llvm::Module*)’:
llvmjit_wrap.cpp:285:48: error: ‘class llvm::GlobalValueSummary’ has no member 
named ‘getBaseObject’
fs = llvm::cast(gvs->getBaseObject());
^

That one was a bit uglier.

I'm not sure how to test everything properly, so the patch is attached for 
both these issues, do as you wish with it… :)

Regards

 Pierre Ducroquet

From fdfea09dd7410d6ed7ad54df1ba3092bd0eecb92 Mon Sep 17 00:00:00 2001
From: Pierre Ducroquet 
Date: Wed, 24 Jan 2018 22:28:34 +0100
Subject: [PATCH] Allow building with LLVM 5.0

---
 src/backend/lib/llvmjit_wrap.cpp | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/backend/lib/llvmjit_wrap.cpp b/src/backend/lib/llvmjit_wrap.cpp
index b745aec4fe..7961148a85 100644
--- a/src/backend/lib/llvmjit_wrap.cpp
+++ b/src/backend/lib/llvmjit_wrap.cpp
@@ -29,7 +29,6 @@ extern "C"
 
 #include "llvm-c/Core.h"
 #include "llvm-c/BitReader.h"
-#include "llvm-c/DebugInfo.h"
 
 #include 
 #include 
@@ -50,6 +49,7 @@ extern "C"
 #include "llvm/Analysis/ModuleSummaryAnalysis.h"
 #include "llvm/Bitcode/BitcodeReader.h"
 #include "llvm/IR/CallSite.h"
+#include "llvm/IR/DebugInfo.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/ModuleSummaryIndex.h"
 #include "llvm/Linker/IRMover.h"
@@ -218,6 +218,13 @@ llvm_inline(LLVMModuleRef M)
 	llvm_execute_inline_plan(mod, globalsToInline.get());
 }
 
+
+inline llvm::GlobalValueSummary *GlobalValueSummary__getBaseObject(llvm::GlobalValueSummary *gvs) {
+  if (auto *AS = llvm::dyn_cast(gvs))
+return &AS->getAliasee();
+  return gvs;
+}
+
 /*
  * Build information necessary for inlining external function references in
  * mod.
@@ -282,7 +289,7 @@ llvm_build_inline_plan(llvm::Module *mod)
 			const llvm::Module *defMod;
 			llvm::Function *funcDef;
 
-			fs = llvm::cast(gvs->getBaseObject());
+			fs = llvm::cast(GlobalValueSummary__getBaseObject(gvs.get()));
 			elog(DEBUG2, "func %s might be in %s",
  funcName.data(),
  modPath.data());
@@ -476,7 +483,7 @@ load_module(llvm::StringRef Identifier)
 	 * code. Until that changes, not much point in wasting memory and cycles
 	 * on processing debuginfo.
 	 */
-	LLVMStripModuleDebugInfo(mod);
+	llvm::StripDebugInfo(*llvm::unwrap(mod));
 
 	return std::unique_ptr(llvm::unwrap(mod));
 }
-- 
2.15.1



Re: PATCH: Exclude unlogged tables from base backups

2018-01-24 Thread David Steele
On 1/24/18 4:02 PM, Adam Brightwell wrote:
>>> If a new unlogged relation is created after the unloggedHash is
>>> constructed but before the files are sent, we cannot exclude such a
>>> relation. That would not be a problem if taking the backup does not
>>> take long, because the new unlogged relation is unlikely to become
>>> very large. However, if taking a backup takes a long time, we could
>>> include a large main fork in the backup.
>>
>> This is a good point.  It's per database directory which makes it a
>> little better, but maybe not by much.
>>
>> Three options here:
>>
>> 1) Leave it as is knowing that unlogged relations created during the
>> backup may be copied and document it that way.
>>
>> 2) Construct a list for SendDir() to work against so the gap between
>> creating that and creating the unlogged hash is as small as possible.
>> The downside here is that the list may be very large and take up a lot
>> of memory.
>>
>> 3) Check each file that looks like a relation in the loop to see if it
>> has an init fork.  This might affect performance since an
>> opendir/readdir loop would be required for every relation.
>>
>> Personally, I'm in favor of #1, at least for the time being.  I've
>> updated the docs as indicated in case you and Adam agree.
> 
> I agree with #1 and feel the updated docs are reasonable and
> sufficient to address this case for now.
> 
> I have retested these patches against master at d6ab720360.
> 
> All test succeed.
> 
> Marking "Ready for Committer".

Thanks, Adam!

Actually, I was talking to Stephen about this and it seems like #3 would be
more practical if we just stat'd the init fork for each relation file
found.  I doubt the stat would add a lot of overhead and we can track
each unlogged relation in a hash table to reduce overhead even more.
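
For what it's worth, a minimal sketch of that stat()-based check -- the path
handling is simplified and the helper is made up for illustration, not what
the basebackup.c code would actually look like:

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * An unlogged relation's main fork "NNNN" sits next to an init fork named
 * "NNNN_init"; if the init fork exists, the main fork can be skipped.
 */
static bool
has_init_fork(const char *dbdir, const char *relfilename)
{
	char		initpath[1024];
	struct stat st;

	snprintf(initpath, sizeof(initpath), "%s/%s_init", dbdir, relfilename);
	return stat(initpath, &st) == 0;
}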

I'll look at that tomorrow and see if I can work out something practical.

-- 
-David
da...@pgmasters.net



Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Fair point, but doesn't it apply equally to non-default ACLs on any
>> other system objects?  If you tweaked the permissions on say pg_ls_dir(),
>> then dump, then tweak them some more, you're going to get uncertain
>> results if you load that dump back into this database ... with or without
>> --clean, because we certainly aren't going to drop pinned objects.

> Yes, that's certainly true, the public schema is the only "special"
> animal in this regard and making it less special (and more like
> pg_ls_dir()) would definitely be nice.

I wonder if it'd be worth the trouble to invent a variant of REVOKE
that means "restore this object's permissions to default" --- that is,
either the ACL recorded in pg_init_privs if there is one, or NULL if
there's no pg_init_privs entry.  Then you could imagine pg_dump emitting
that command before trying to assign an ACL to any object it hadn't
created earlier in the run, rather than guessing about the current state
of the object's ACL.  (I'm not volunteering.)

>> I think we could jigger things so that we dump the definition of these
>> special quasi-system objects only if their ACLs are not default, but
>> it's not clear to me that that's really an improvement in the long run.
>> Seems like it's just making them even wartier.

> Yeah, that would be worse, I agree.

So are we at a consensus yet?

regards, tom lane



Re: PATCH: Exclude unlogged tables from base backups

2018-01-24 Thread Adam Brightwell
> I agree with #1 and feel the updated docs are reasonable and
> sufficient to address this case for now.
>
> I have retested these patches against master at d6ab720360.
>
> All test succeed.
>
> Marking "Ready for Committer".

Actually, I marked it "Ready for Review" to wait for Masahiko to comment/agree.

Masahiko,

If you agree with the above, would you mind updating the status accordingly?

-Adam



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Robert Haas
On Wed, Jan 24, 2018 at 4:01 PM, Tom Lane  wrote:
> The progress-display output of pg_regress would need a complete rethink
> anyhow.  First thought is to emit two lines per test, one when we
> launch it and one when it finishes and we check the results:
>
> foreign_data: launched
> ...
> foreign_data: ok   (or FAILED)

I suspect that first line would only be interesting on the very rare
occasions when a test hangs.  I'm not sure it would be good to add that
much chatter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Robert Haas
On Wed, Jan 24, 2018 at 2:31 PM, Tom Lane  wrote:
> I find that to be a completely bogus straw-man argument.  The point of
> looking at the prairiedog time series is just to see a data series in
> which the noise level is small enough to discern the signal.  If anyone's
> got years worth of data off a more modern machine, and they can extract
> a signal from that, by all means let's consider that data instead.  But
> there's no clear argument (or at least you have not made one) that says
> that prairiedog's relative timings don't match what we'd get on more
> modern machines.

There is no need to collect years of data in order to tell whether or
not the time to run the tests has increased by as much on developer
machines as it has on prairiedog.  You showed the time going from 3:36
to 8:09 between 2014 and the present.  That is a 2.26x increase.  It
is obvious from the numbers I posted before that no such increase has
taken place in the time it takes to run 'make check' on my relatively
modern laptop.  Whatever difference exists is measured in
milliseconds.

> so join has gotten about 1 second slower since v10, and that time is
> coming entirely out of developers' hides despite parallelism because
> it was already the slowest in its group.
>
> So I continue to maintain that an unreasonable fraction of the total
> resources devoted to the regular regression tests is going into these
> new hashjoin tests.

I think there is an affirmative desire on the part of many
contributors to have newer features tested more thoroughly than old
ones were.  That will tend to mean that features added more recently
have test suites that are longer-running compared to the value of the
feature they test than what we had in the past.  When this has been
discussed at developer meetings, everyone except you (and to a lesser
extent me) has been in favor of this.  Even if that meant that you had
to wait 1 extra second every time you run 'make check', I would judge
that worthwhile.  But it probably doesn't, because there are a lot of
things that can be done to improve this situation, such as...

> Based on these numbers, it seems like one easy thing we could do to
> reduce parallel check time is to split the plpgsql test into several
> scripts that could run in parallel.  But independently of that,
> I think we need to make an effort to push hashjoin's time back down.

...this.  Also, the same technique could probably be applied to the
join test itself.  I think Thomas just added the tests to that file
because it already existed, but there's nothing to say that the file
couldn't be split into several chunks.  On a quick look, it looks to
me as though that file is testing a lot of pretty different things,
and it's one of the largest test case files, accounting for ~3% of the
total test suite by itself.

Another thing you could do is consider applying the patch Thomas
already posted to reduce the size of the tables involved.  The problem
is that, for you and the buildfarm to be happy, the tests have to (1)
run near-instantaneously even on thoroughly obsolete hardware, (2)
give exactly the same answers on 32-bit systems, 64-bit systems,
Linux, Windows, AIX, HP-UX, etc., and (3) give those same exact
answers 100% deterministically on all of those platforms.  Parallel
query is inherently non-deterministic about things like how much work
goes to each worker, and I think that really small tests will tend to
show more edge cases like one worker not doing anything.  So it might
be that if we cut down the sizes of the test cases we'll spend more
time troubleshooting the resulting instability than any developer time
we would've saved by reducing the runtime.  But we can try it.

>> One caveat is that old machines also
>> somewhat approximate testing with more instrumentation / debugging
>> enabled (say valgrind, CLOBBER_CACHE_ALWAYS, etc). So removing excessive
>> test overhead has still quite some benefits. But I definitely do not
>> want to lower coverage to achieve it.
>
> I don't want to lower coverage either.  I do want some effort to be
> spent on achieving test coverage intelligently, rather than just throwing
> large test cases at the code without consideration of the costs.

I don't believe that any such thing is occurring, and I think it's
wrong of you to imply that these test cases were added
unintelligently.  To me, that seems like an ad hominem attack on both
Thomas (who spent a year or more developing the feature those test
cases exercise) and Andres (who committed them).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 15:58:16 -0500, Tom Lane wrote:
> Yeah.  We already have topo sort code in pg_dump, maybe we could push that
> into someplace like src/common or src/fe_utils?  Although pg_dump hasn't
> got any need for edge weights, so maybe sharing code isn't worth it.

I suspect it may be more work to share than worth it, but either way, it
shouldn't be too hard.  Hm, isn't dbObjectTypePriority kinda an edge
weight? Seems like we could properly implement it as that.


> We could flush the existing schedule files and use a simple format like
>   testname: list of earlier tests it depends on
> (I guess there would be more properties than just the dependencies,
> but still not hard to parse.)

Yea, I think there'd need to be a few more. There are some tests that use
multiple connections, and I suspect it'll be useful to have "implicit"
ordering dependencies for a few tests, like a "barrier". Otherwise
e.g. the tablespace test will be annoying to order.


> > If we keep the timings from an earlier
> > run somwhere, we can use the timing of runs as edge weights, making the
> > schedule better.
> 
> I think we could just use constant values hand-coded in the schedule file.
> It might occasionally be worth updating them, but realistically it's not
> going to matter that they be very accurate.  Probably weights like 1, 2,
> and 3 would be plenty ;-)

The reason I like the idea of using timings from prior runs as scheduling
input is that the slowness actually depends a lot on the type of machine it's
run on, and more importantly on things like valgrind, CCA, fsync=on/off,
jit=on/off (by far the most expensive test is e.g. the recursion test in
errors.sql :)).

Greetings,

Andres Freund



Re: PATCH: Exclude unlogged tables from base backups

2018-01-24 Thread Adam Brightwell
>> If a new unlogged relation is created after the unloggedHash is
>> constructed but before the files are sent, we cannot exclude such a
>> relation. That would not be a problem if taking the backup does not
>> take long, because the new unlogged relation is unlikely to become
>> very large. However, if taking a backup takes a long time, we could
>> include a large main fork in the backup.
>
> This is a good point.  It's per database directory which makes it a
> little better, but maybe not by much.
>
> Three options here:
>
> 1) Leave it as is knowing that unlogged relations created during the
> backup may be copied and document it that way.
>
> 2) Construct a list for SendDir() to work against so the gap between
> creating that and creating the unlogged hash is as small as possible.
> The downside here is that the list may be very large and take up a lot
> of memory.
>
> 3) Check each file that looks like a relation in the loop to see if it
> has an init fork.  This might affect performance since an
> opendir/readdir loop would be required for every relation.
>
> Personally, I'm in favor of #1, at least for the time being.  I've
> updated the docs as indicated in case you and Adam agree.

I agree with #1 and feel the updated docs are reasonable and
sufficient to address this case for now.

I have retested these patches against master at d6ab720360.

All test succeed.

Marking "Ready for Committer".

-Adam



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Tom Lane
Andres Freund  writes:
> On 2018-01-24 15:36:35 -0500, Tom Lane wrote:
>> I'm concerned that we'd end up with a higher number of irreproducible
>> test failures with no good way to investigate them.

> Hm. We probably should dump the used ordering of tests somewhere upon
> failure, to make it easier to debug.

The progress-display output of pg_regress would need a complete rethink
anyhow.  First thought is to emit two lines per test, one when we
launch it and one when it finishes and we check the results:

foreign_data: launched
...
foreign_data: ok   (or FAILED)

regards, tom lane



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Tom Lane
Andres Freund  writes:
> On 2018-01-24 17:18:26 -0300, Alvaro Herrera wrote:
>> Yeah, I proposed this a decade ago but never had the wits to write the
>> code.

> It shouldn't be too hard, right? Leaving defining the file format,
> parsing it, creating the new schedule with depencencies and adapting
> tests aside (hah), it mostly seems a relatively simple graph ordering /
> topological sort problem, right?

Yeah.  We already have topo sort code in pg_dump, maybe we could push that
into someplace like src/common or src/fe_utils?  Although pg_dump hasn't
got any need for edge weights, so maybe sharing code isn't worth it.

We could flush the existing schedule files and use a simple format like
testname: list of earlier tests it depends on
(I guess there would be more properties than just the dependencies,
but still not hard to parse.)
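
Purely as an illustration of that format (the test names, dependencies, and
comments below are invented, not a proposal for actual contents):

# testname: earlier tests it depends on
test_setup:
create_index: test_setup
join: test_setup create_index
timeouts:                       # long-running, no dependencies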

> If we keep the timings from an earlier
> run somewhere, we can use the timing of runs as edge weights, making the
> schedule better.

I think we could just use constant values hand-coded in the schedule file.
It might occasionally be worth updating them, but realistically it's not
going to matter that they be very accurate.  Probably weights like 1, 2,
and 3 would be plenty ;-)

regards, tom lane



Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> No, if you have a nondefault ACL, that will still get applied.  This
> >> arrangement would drop comment changes, but I can't get excited about
> >> that; it's certainly far less of an inconvenience in that scenario
> >> than dumping the comment is in non-superuser-restore scenarios.
> 
> > That nondefault ACL from the system the pg_dump was run on will get
> > applied *over-top* of whatever the current ACL on the system that the
> > restore is being run on, which may or may not be what's expected.
> 
> Fair point, but doesn't it apply equally to non-default ACLs on any
> other system objects?  If you tweaked the permissions on say pg_ls_dir(),
> then dump, then tweak them some more, you're going to get uncertain
> results if you load that dump back into this database ... with or without
> --clean, because we certainly aren't going to drop pinned objects.

Yes, that's certainly true, the public schema is the only "special"
animal in this regard and making it less special (and more like
pg_ls_dir()) would definitely be nice.

> I think we could jigger things so that we dump the definition of these
> special quasi-system objects only if their ACLs are not default, but
> it's not clear to me that that's really an improvement in the long run.
> Seems like it's just making them even wartier.

Yeah, that would be worse, I agree.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> No, if you have a nondefault ACL, that will still get applied.  This
>> arrangement would drop comment changes, but I can't get excited about
>> that; it's certainly far less of an inconvenience in that scenario
>> than dumping the comment is in non-superuser-restore scenarios.

> That nondefault ACL from the system the pg_dump was run on will get
> applied *over-top* of whatever the current ACL on the system that the
> restore is being run on, which may or may not be what's expected.

Fair point, but doesn't it apply equally to non-default ACLs on any
other system objects?  If you tweaked the permissions on say pg_ls_dir(),
then dump, then tweak them some more, you're going to get uncertain
results if you load that dump back into this database ... with or without
--clean, because we certainly aren't going to drop pinned objects.

I think we could jigger things so that we dump the definition of these
special quasi-system objects only if their ACLs are not default, but
it's not clear to me that that's really an improvement in the long run.
Seems like it's just making them even wartier.

regards, tom lane



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 15:36:35 -0500, Tom Lane wrote:
> There'd be a lot of followup work to sanitize the tests better.  For
> instance, if two tests transiently create tables named "foo", it doesn't
> matter as long as they're not in the same group.  It would matter with
> this.

Right. I suspect we'd initially end up with a schedule that'd had
dependencies pretty similar to what we have now as groups. I suspect
that even with a very small number of changes we'd get a lot better
timings.


> There are things we could do to mitigate that --- one attractive idea
> is to have each test create its own schema for transient objects.
> The limiting factor is that we don't want the test scripts to change
> so much that back-patching tests becomes impossible.  (Or at least,
> I'd not like that.)

I think a lot of temporary, potentially conflicting, objects are already
created as TEMPORARY. Adding a few more temporary markers shouldn't be
too hard.

> I'm concerned that we'd end up with a higher number of irreproducible
> test failures with no good way to investigate them.

Hm. We probably should dump the used ordering of tests somewhere upon
failure, to make it easier to debug.

Greetings,

Andres Freund



Re: copy.c allocation constant

2018-01-24 Thread Thomas Munro
On Thu, Jan 25, 2018 at 9:35 AM, Bruce Momjian  wrote:
> The BSD memory allocator used to allocate in powers of two, and keep the
> header in a separate location.  They did this so they could combine two
> free, identically-sized memory blocks into a single one that was double
> the size.  I have no idea how it works now.

According to "man malloc" on my FreeBSD 11.1 box (which uses jemalloc,
I think NetBSD and OpenBSD use something else), the code Andrew showed
at the top of this thread will waste ~16kB (because it'll round up to
80kB) and nodeHash.c will waste ~8kB for every ~32kB chunk of tuples
as described in that other thread.

   +--+
   |Category   Spacing   Size |
   |Small   lg   [8]  |
   |16   [16, 32, 48, 64, 80, 96, 112, 128]   |
   |32   [160, 192, 224, 256] |
   |64   [320, 384, 448, 512] |
   |   128   [640, 768, 896, 1024]|
   |   256   [1280, 1536, 1792, 2048] |
   |   512   [2560, 3072, 3584, 4096] |
   | 1 KiB   [5 KiB, 6 KiB, 7 KiB, 8 KiB] |
   | 2 KiB   [10 KiB, 12 KiB, 14 KiB] |
   |Large2 KiB   [16 KiB] |
   | 4 KiB   [20 KiB, 24 KiB, 28 KiB, 32 KiB] |
   | 8 KiB   [40 KiB, 48 KiB, 56 KiB, 64 KiB] |
   |16 KiB   [80 KiB, 96 KiB, 112 KiB, 128 KiB]   |
   |32 KiB   [160 KiB, 192 KiB, 224 KiB, 256 KiB] |
   |64 KiB   [320 KiB, 384 KiB, 448 KiB, 512 KiB] |
   |   128 KiB   [640 KiB, 768 KiB, 896 KiB, 1 MiB]   |
   |   256 KiB   [1280 KiB, 1536 KiB, 1792 KiB]   |
   |Huge   256 KiB   [2 MiB]  |
   |   512 KiB   [2560 KiB, 3 MiB, 3584 KiB, 4 MiB]   |
   | 1 MiB   [5 MiB, 6 MiB, 7 MiB, 8 MiB] |
   | 2 MiB   [10 MiB, 12 MiB, 14 MiB, 16 MiB] |
   | 4 MiB   [20 MiB, 24 MiB, 28 MiB, 32 MiB] |
   | 8 MiB   [40 MiB, 48 MiB, 56 MiB, 64 MiB] |
   |   ...   ...  |
   |   512 PiB   [2560 PiB, 3 EiB, 3584 PiB, 4 EiB]   |
   | 1 EiB   [5 EiB, 6 EiB, 7 EiB]|
   +--+

-- 
Thomas Munro
http://www.enterprisedb.com



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 17:18:26 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > Besides larger groups, starting the next test(s) earlier, another way to
> > gain pretty large improvements would be a test schedule feature that
> > allowed to state dependencies between tests. So instead of manually
> > grouping the schedule, have 'numerology' state that it depends on int2,
> > int4, int8, float4, float8, which means it can actually be started
> > earlier than it currently can in many cases.
> 
> Yeah, I proposed this a decade ago but never had the wits to write the
> code.

It shouldn't be too hard, right? Leaving defining the file format,
parsing it, creating the new schedule with dependencies and adapting
tests aside (hah), it mostly seems a relatively simple graph ordering /
topological sort problem, right?  If we keep the timings from an earlier
run somewhere, we can use the timing of runs as edge weights, making the
schedule better.


> It would be very useful for running tests standalone, too -- much as I
> dislike 'make installcheck' taking half a minute, I dislike much more
> having to take 5 minutes each time to figure out that create_table
> depends on box, polygon, create_function, yadda yadda.

Oh, that's a good point.

- Andres



Re: WIP Patch: Precalculate stable functions, infrastructure v1

2018-01-24 Thread Andres Freund
On 2018-01-24 15:10:56 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-01-16 17:05:01 -0500, Tom Lane wrote:
> >> I'm curious to know whether Andres has some other ideas, or whether he
> >> feels this is all a big wart on the compiled-expression concept.
>
> > I don't have too many "artistic" concerns from the compiled expression
> > POV. The biggest issue I see is that it'll make it a bit harder to
> > separate out the expression compilation phase from the expression
> > instantiation phase - something I think we definitely want.
>
> Hmm, there's no such distinction now, so could you explain what you
> have in mind there?

There's a few related concerns I have:
- For OLTP workloads using prepared statements, we often spend the
  majority of the time doing ExecInitExpr() and related tasks like
  computing tupledescs.
- For OLTP workloads the low allocation density for things hanging off
  ExprState's and PlanState nodes is a significant concern. The number
  of allocations causes overhead, the overhead wastes memory and lowers
  cache hit ratios.
- For JIT we currently end up encoding specific pointer values into the
  generated code. As these obviously prevent reuse of the generated
  function, this noticeably reduces the applicability of JITing to fewer
  usecases. JITing is actually quite beneficial for a lot of OLTP
  workloads too, but it's too expensive to do every query.

To address these, I think we may want to split the the division of labor
a bit. Expression instantiation (i.e. ExecReadyExpr()) should happen at
executor startup, but in a lot of cases "compiling" the steps itself
should happen at plan time. Obviously that means the steps themselves
can't contain plain pointers, as the per-execution memory will be
located in different places.  So I think what we should have is that
expression initialization just computes the size of required memory for
all steps and puts *offsets* into that in the steps. After that
expression instantiation either leaves them alone and evaluation uses
relative pointers (cheap-ish e.g. on x86 due to lea), or just turn the
relative pointers into absolute ones.
That means that all the memory for all steps of an ExprState would be
allocated in one chunk, reducing allocation overhead and increasing
cache hit ratios considerably.
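
As a purely illustrative sketch of the offset idea (none of these names
match the real ExprState/ExprEvalStep layout):

#include "postgres.h"

typedef struct HypotheticalStep
{
    int         opcode;
    Size        d_off;      /* offset of this step's private data within one
                             * per-execution allocation, instead of a raw
                             * pointer baked in at compile time */
} HypotheticalStep;

static inline void *
step_data(char *per_exec_base, HypotheticalStep *step)
{
    /* relative -> absolute; roughly one lea on x86 */
    return (void *) (per_exec_base + step->d_off);
}

Compilation would then only sum up MAXALIGN'd sizes to hand out d_off
values, and instantiation would do a single allocation of that total.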

I've experimented a bit with a rough rough hack of the above (purely at
execution time), and it doesn't seem too hard.


> Keeping the stored value of a CachedExpr in a Param slot is an
> interesting idea indeed.

We keep coming back to this, IIRC we had a pretty similar discussion
around redesigning caseValue_datum/isNull domainValue_datum/isNull to be
less ugly. There also was
https://www.postgresql.org/message-id/20171116182208.kcvf75nfaldv3...@alap3.anarazel.de
where we discussed using something similar to PARAM_EXEC Param nodes to
allow inlining of volatile functions.

ISTM, there might be some value to consider all of them in the design of
the new mechanism.

Greetings,

Andres Freund



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Tom Lane
Alvaro Herrera  writes:
> Andres Freund wrote:
>> Besides larger groups, starting the next test(s) earlier, another way to
>> gain pretty large improvements would be a test schedule feature that
>> allowed to state dependencies between tests. So instead of manually
>> grouping the schedule, have 'numerology' state that it depends on int2,
>> int4, int8, float4, float8, which means it can actually be started
>> earlier than it currently can in many cases.

> Yeah, I proposed this a decade ago but never had the wits to write the
> code.

There'd be a lot of followup work to sanitize the tests better.  For
instance, if two tests transiently create tables named "foo", it doesn't
matter as long as they're not in the same group.  It would matter with
this.

There are things we could do to mitigate that --- one attractive idea
is to have each test create its own schema for transient objects.
The limiting factor is that we don't want the test scripts to change
so much that back-patching tests becomes impossible.  (Or at least,
I'd not like that.)

I'm concerned that we'd end up with a higher number of irreproducible
test failures with no good way to investigate them.

> It would be very useful for running tests standalone, too -- much as I
> dislike 'make installcheck' taking half a minute, I dislike much more
> having to take 5 minutes each time to figure out that create_table
> depends on box, polygon, create_function, yadda yadda.

Yeah, if we could say "run just this test and its needed precursors",
that'd be a huge win in a lot of situations.

Definitely seems like an idea worth pursuing.

regards, tom lane



Re: copy.c allocation constant

2018-01-24 Thread Bruce Momjian
On Thu, Jan 25, 2018 at 09:30:54AM +1300, Thomas Munro wrote:
> On Thu, Jan 25, 2018 at 7:19 AM, Robert Haas  wrote:
> > On Wed, Jan 24, 2018 at 12:25 PM, Tomas Vondra
> >  wrote:
> >> At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
> >> with similar ideas (freelists, ...) so hopefully it's fine too.
> >>
> >> And then there are the systems without glibc, or with other libc
> >> implementations. No idea about those.
> >
> > My guess is that a fairly common pattern for larger chunks will be to
> > round the size up to a multiple of 4kB, the usual memory page size.
> 
> See also this discussion:
> 
> https://www.postgresql.org/message-id/flat/CAEepm%3D1bRyd%2B_W9eW-QmP1RGP03ti48zgd%3DK11Q6o4edQLgkcg%40mail.gmail.com#CAEepm=1bRyd+_W9eW-QmP1RGP03ti48zgd=k11q6o4edqlg...@mail.gmail.com
> 
> TL;DR glibc doesn't actually round up like that below 128kB, but many
> others including FreeBSD, macOS etc round up to various page sizes or
> size classes including 8kB (!), 512 bytes.  I find this a bit
> frustrating because it means that the most popular libc implementation
> doesn't have the problem so this kind of thing probably isn't a high
> priority, but probably on most other Unices (and I have no clue for
> Windows) including my current favourite we waste a bunch of memory.

The BSD memory allocator used to allocate in powers of two, and keep the
header in a separate location.  They did this so they could combine two
free, identically-sized memory blocks into a single one that was double
the size.  I have no idea how it works now.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: copy.c allocation constant

2018-01-24 Thread Thomas Munro
On Thu, Jan 25, 2018 at 7:19 AM, Robert Haas  wrote:
> On Wed, Jan 24, 2018 at 12:25 PM, Tomas Vondra
>  wrote:
>> At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
>> with similar ideas (freelists, ...) so hopefully it's fine too.
>>
>> And then there are the systems without glibc, or with other libc
>> implementations. No idea about those.
>
> My guess is that a fairly common pattern for larger chunks will be to
> round the size up to a multiple of 4kB, the usual memory page size.

See also this discussion:

https://www.postgresql.org/message-id/flat/CAEepm%3D1bRyd%2B_W9eW-QmP1RGP03ti48zgd%3DK11Q6o4edQLgkcg%40mail.gmail.com#CAEepm=1bRyd+_W9eW-QmP1RGP03ti48zgd=k11q6o4edqlg...@mail.gmail.com

TL;DR glibc doesn't actually round up like that below 128kB, but many
others including FreeBSD, macOS etc round up to various page sizes or
size classes including 8kB (!), 512 bytes.  I find this a bit
frustrating because it means that the most popular libc implementation
doesn't have the problem so this kind of thing probably isn't a high
priority, but probably on most other Unices (and I have no clue for
Windows) including my current favourite we waste a bunch of memory.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-24 Thread Peter Geoghegan
On Wed, Jan 24, 2018 at 12:13 PM, Thomas Munro
 wrote:
> On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan  wrote:
>> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>>
>> 1. The problem of fork failure causing nbtsort.c to wait forever is a
>> real problem. Sure enough, the coding pattern within
>> _bt_leader_heapscan() can cause us to wait forever even with commit
>> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
>> consequence of the patch not using tuple queues (it uses the new
>> tuplesort sharing thing instead).
>
> Just curious: does the attached also help?

I can still reproduce the problem without the fix I described (which
does work), using your patch instead.

Offhand, I suspect that the way you set ParallelMessagePending may not
always leave it set when it should be.

>> 2. Simply adding a single call to WaitForParallelWorkersToFinish()
>> within _bt_leader_heapscan() before waiting on our condition variable
>> fixes the problem -- errors are reliably propagated, and we never end
>> up waiting forever.
>
> That does seem like a nice, simple solution and I am not against it.
> The niggling thing that bothers me about it, though, is that it
> requires the client of parallel.c to follow a slightly complicated
> protocol or risk a rare obscure failure mode, and recognise the cases
> where that's necessary.  Specifically, if you're not blocking in a
> shm_mq wait loop, then you must make a call to this new interface
> before you do any other kind of latch wait, but if you get that wrong
> you'll probably not notice since fork failure is rare!  It seems like
> it'd be nicer if we could figure out a way to make it so that any
> latch/CFI loop would automatically be safe against fork failure.

It would certainly be nicer, but I don't see much risk if we add a
comment next to nworkers_launched that said: Don't trust this until
you've called (Amit's proposed) WaitForParallelWorkersToAttach()
function, unless you're using the tuple queue infrastructure, which
lets you not need to directly care about the distinction between a
launched worker never starting, and a launched worker successfully
completing.

While I agree with what Robert said on the other thread -- "I guess
that works, but it seems more like blind luck than good design.
Parallel CREATE INDEX fails to be as "lucky" as Gather" -- that
doesn't mean that that situation cannot be formalized. And even if it
isn't formalized, then I think that that will probably be because
Gather ends up doing almost the same thing.

-- 
Peter Geoghegan



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Alvaro Herrera
Thomas Munro wrote:
> On Wed, Jan 24, 2018 at 12:10 PM, Tom Lane  wrote:

> > However, the trend over the last two months is very bad, and I do
> > not think that we can point to any large improvement in test
> > coverage that someone committed since November.
> 
> I'm not sure if coverage.postgresql.org has a way to view historical
> reports so we can see the actual percentage change,

It does not.  I have had that in my to-do list for a while, but haven't
gotten around to it.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Alvaro Herrera
Andres Freund wrote:

> Besides larger groups, starting the next test(s) earlier, another way to
> gain pretty large improvements would be a test schedule feature that
> allowed to state dependencies between tests. So instead of manually
> grouping the schedule, have 'numerology' state that it depends on int2,
> int4, int8, float4, float8, which means it can actually be started
> earlier than it currently can in many cases.

Yeah, I proposed this a decade ago but never had the wits to write the
code.

It would be very useful for running tests standalone, too -- much as I
dislike 'make installcheck' taking half a minute, I dislike much more
having to take 5 minutes each time to figure out that create_table
depends on box, polygon, create_function, yadda yadda.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-24 Thread Thomas Munro
On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan  wrote:
> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>
> 1. The problem of fork failure causing nbtsort.c to wait forever is a
> real problem. Sure enough, the coding pattern within
> _bt_leader_heapscan() can cause us to wait forever even with commit
> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
> consequence of the patch not using tuple queues (it uses the new
> tuplesort sharing thing instead).

Just curious: does the attached also help?

> 2. Simply adding a single call to WaitForParallelWorkersToFinish()
> within _bt_leader_heapscan() before waiting on our condition variable
> fixes the problem -- errors are reliably propagated, and we never end
> up waiting forever.

That does seem like a nice, simple solution and I am not against it.
The niggling thing that bothers me about it, though, is that it
requires the client of parallel.c to follow a slightly complicated
protocol or risk a rare obscure failure mode, and recognise the cases
where that's necessary.  Specifically, if you're not blocking in a
shm_mq wait loop, then you must make a call to this new interface
before you do any other kind of latch wait, but if you get that wrong
you'll probably not notice since fork failure is rare!  It seems like
it'd be nicer if we could figure out a way to make it so that any
latch/CFI loop would automatically be safe against fork failure.  The
attached (if it actually works, I dunno) is the worst way, but I
wonder if there is some way to traffic just a teensy bit more
information from postmaster to leader so that it could be efficient...
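
For comparison, a pessimistic leader-side check, sketched against the
ParallelContext/bgworker API as I understand it -- illustration only,
and by itself it still can't distinguish a worker that exited after
finishing from one that never started, which is exactly the missing bit
of state:

#include "postgres.h"
#include "access/parallel.h"
#include "postmaster/bgworker.h"

static bool
any_worker_known_stopped(ParallelContext *pcxt)
{
    int     i;

    for (i = 0; i < pcxt->nworkers_launched; ++i)
    {
        pid_t   pid;

        if (pcxt->worker[i].bgwhandle == NULL)
            continue;
        if (GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid) ==
            BGWH_STOPPED)
            return true;    /* stopped -- but finished, or never started? */
    }
    return false;
}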

-- 
Thomas Munro
http://www.enterprisedb.com


pessimistic-fork-failure-detector-v2.patch
Description: Binary data


Re: copy.c allocation constant

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 17:07:01 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > glibc's malloc does add a header. My half-informed suspicion is that
> > most newer malloc backing allocators will have a header, because
> > maintaining a shared lookup-by-address table is pretty expensive to
> > maintain. A bit of metadata indicating size and/or source of the
> > allocation makes using thread-local information a lot easier.
>
> Sounds like it'd be smart to allocate something closer to
> M_MMAP_THRESHOLD (which with typical values would be about double the
> amount of memory the current RAW_BUF_SIZE value), minus a few dozen
> bytes to allow for palloc's and malloc's respective headers.  That way
> we bet for a mmap()ed allocation with minimum space wastage across all
> layers.

In general there might be cases where that's worthwhile, although
M_MMAP_THRESHOLD IIRC isn't a constant anymore, complicating things. The
specific case this thread is discussing doesn't seem worthy of such
attention though, there's afaict no actual practical problem here.


> Not sure whether we want to try to minimize wastage through
> clever use of malloc_usable_size() on each backend's first allocation --
> that'd probably just be overengineering.

Likely also dangerous; I don't think this is a constant.

Greetings,

Andres Freund



Re: WIP Patch: Precalculate stable functions, infrastructure v1

2018-01-24 Thread Tom Lane
Andres Freund  writes:
> On 2018-01-16 17:05:01 -0500, Tom Lane wrote:
>> I'm curious to know whether Andres has some other ideas, or whether he
>> feels this is all a big wart on the compiled-expression concept.

> I don't have too many "artistic" concerns from the compiled expression
> POV. The biggest issue I see is that it'll make it a bit harder to
> separate out the expression compilation phase from the expression
> instantiation phase - something I think we definitely want.

Hmm, there's no such distinction now, so could you explain what you
have in mind there?

>> I don't think there are any existing cases where we keep any
>> meaningful state across executions of a compiled-expression data
>> structure; maybe that's a bad idea in itself.

> Storing all cached values in an EState or ExprContext (the
> latter referring to the former) somewhat alike the values for Param's
> sounds a lot more reasonable to me.
> ...
> This all reminds me a lot of the infrastructure for Params...

Yeah, one thing I was thinking about in connection with this is the
stuff associated with propagating changes in outer-reference Params
(the extParam/allParam/chgParam mess).  I wonder if we could find
a way to unify that with this feature.

Keeping the stored value of a CachedExpr in a Param slot is an
interesting idea indeed.

regards, tom lane



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 14:31:47 -0500, Tom Lane wrote:
> However ... if you spend any time looking at the behavior of that,
> the hashjoin tests are still problematic.

I think my main problem with your arguments is that you basically seem
to say that one of the more complex features in postgres can't increase
the test time. And I just don't agree with that.  If we can reduce some
unnecessary overhead (as Thomas iirc has done somewhere nearby) - great,
if we can hide the overhead by scheduling the test better or breaking it
up - also great. But if that's a good chunk of work I think it's
entirely reasonable to not necessarily consider that the best use of
time.

It doesn't seem too surprising that a test that relies on starting
multiple background processes in multiple places will be among the more
expensive ones.  We clearly would e.g. benefit from being able to reuse
workers, to avoid constantly starting/stopping them.


> (The overall runtime for "make installcheck-parallel" on this machine
> is about 17.3 seconds right now.)  The next slowest test script in
> the join test's group is "update", at 0.373 seconds; so over 1.5 sec
> of the total 17.3 sec runtime is being spent solely in the join script.

Might be worth breaking up join a bit, that won't get rid of all the
wall time overhead, but should reduce it. Reordering to run parallel to
other slow tests might also be worthwhile.


> So I continue to maintain that an unreasonable fraction of the total
> resources devoted to the regular regression tests is going into these
> new hashjoin tests.

> > One caveat is that old machines also
> > somewhat approximate testing with more instrumentation / debugging
> > enabled (say valgrind, CLOBBER_CACHE_ALWAYS, etc). So removing excessive
> > test overhead has still quite some benefits. But I definitely do not
> > want to lower coverage to achieve it.
> 
> I don't want to lower coverage either.  I do want some effort to be
> spent on achieving test coverage intelligently, rather than just throwing
> large test cases at the code without consideration of the costs.

I think this accusation is unfair. Are you really suggesting that nobody
else cares about the runtime of the new tests? Just because other
people's tradeoffs come down at a somewhat different place, doesn't mean
they add tests "without consideration of the costs".


> Based on these numbers, it seems like one easy thing we could do to
> reduce parallel check time is to split the plpgsql test into several
> scripts that could run in parallel.  But independently of that,
> I think we need to make an effort to push hashjoin's time back down.

If we had a dependency based system as I suggested nearby, we could have
pg_regress order the tests so that the slowest ones that have
dependencies fulfilled are started first...

Greetings,

Andres Freund



Re: copy.c allocation constant

2018-01-24 Thread Alvaro Herrera
Andres Freund wrote:
> On 2018-01-24 14:25:37 -0500, Robert Haas wrote:
> > On Wed, Jan 24, 2018 at 1:43 PM, Andres Freund  wrote:
> > > Indeed. Don't think RAW_BUF_SIZE is quite big enough for that on most
> > > platforms though. From man mallopt:
> > >  Balancing  these  factors  leads  to a default setting of 128*1024 for 
> > > the M_MMAP_THRESHOLD parameter.
> > > Additionally, even when malloc() chooses to use mmap() to back an
> > allocation, it'll still need a header to know the size of the
> > > allocation and such. So exactly using a size of a multiple of 4KB will
> > > still leave you with wasted space.  Due to the latter I can't see it
> > > mattering whether or not we add +1 to a power-of-two size.
> > 
> > Well, it depends on how it works.  dsa_allocate, for example, never
> > adds a header to the size of the allocation.
> 
> glibc's malloc does add a header. My half-informed suspicion is that
> most newer malloc backing allocators will have a header, because
> maintaining a shared lookup-by-address table is pretty expensive to
> maintain. A bit of metadata indicating size and/or source of the
> allocation makes using thread-local information a lot easier.

Sounds like it'd be smart to allocate something closer to
M_MMAP_THRESHOLD (which with typical values would be about double the
amount of memory the current RAW_BUF_SIZE value), minus a few dozen
bytes to allow for palloc's and malloc's respective headers.  That way
we bet for a mmap()ed allocation with minimum space wastage across all
layers.  Not sure whether we want to try to minimize wastage through
clever use of malloc_usable_size() on each backend's first allocation --
that'd probably just be overengineering.
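
A back-of-the-envelope illustration of that sizing idea (standalone C,
glibc-specific; the 64-byte header allowance is a guess, not a measured
number):

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

#define MMAP_THRESHOLD_DEFAULT  (128 * 1024)    /* glibc default, per mallopt(3) */
#define HEADER_ALLOWANCE        64              /* palloc + malloc headers, guessed */
#define RAW_BUF_SIZE_GUESS      (MMAP_THRESHOLD_DEFAULT - HEADER_ALLOWANCE)

int
main(void)
{
    char   *buf = malloc(RAW_BUF_SIZE_GUESS);

    /* report how much the allocator actually handed back for this request */
    printf("requested %d, usable %zu\n",
           RAW_BUF_SIZE_GUESS, malloc_usable_size(buf));
    free(buf);
    return 0;
}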

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Robert Haas
On Tue, Jan 23, 2018 at 9:45 PM, Amit Kapila  wrote:
> Hmm, I think that case will be addressed because tuple queues can
> detect if the leader is not attached.  It does in code path
> shm_mq_receive->shm_mq_counterparty_gone.  In
> shm_mq_counterparty_gone, it can detect if the worker is gone by using
> GetBackgroundWorkerPid.  Moreover, I have manually tested this
> particular case before saying your patch is fine.  Do you have some
> other case in mind which I am missing?

Oh, right.  I had forgotten that I taught shm_mq_attach() to take an
optional BackgroundWorkerHandle precisely to solve this kind of
problem.  I can't decide whether to feel smart because I did that or
dumb because I forgot that I did it.

But I think there's still a problem: given recent commits, that will
cause us to ERROR out if we tried to read from a tuple queue for a
worker which failed to start, or that starts and then dies before
attaching to the queue.  But there's no guarantee we'll try to read
from that queue in the first place.  Normally, that gets triggered
when we get PROCSIG_PARALLEL_MESSAGE, but if the worker never started
up in the first place, it can't send that, and the postmaster won't.

In Thomas's test case, he's using 4 workers, and if even one of those
manages to start, then it'll probably do so after any fork failures
have already occurred, and the error will be noticed when that process
sends its first message to the leader through the error queue, because
it'll send PROCSIG_PARALLEL_MESSAGE via all the queues.  If all of the
workers fail to start, then that doesn't help.  But it still manages
to detect the failure in that case because it reaches
WaitForParallelWorkersToFinish, which we just patched.

But how does that happen, anyway?  Shouldn't it get stuck waiting for
the tuple queues to which nobody's attached yet?  The reason it
doesn't is because
ExecParallelCreateReaders() calls shm_mq_set_handle() for each queue,
which causes the tuple queues to be regarded as detached if the worker
fails to start.  A detached tuple queue, in general, is not an error
condition: it just means that worker has no more tuples.  So I think
what happens is as follows.  If parallel_leader_participation = on,
then the leader runs the whole plan, producing however many tuples the
plan should have generated, and then afterwards waits for the workers
to finish, tripping an error.  If parallel_leader_participation = off,
the leader doesn't try to run the plan and is perfectly happy with the
fact that every worker has produced no tuples; since, as far as it
knows, the plan is now fully executed, it waits for them to shut down,
tripping an error.

I guess that works, but it seems more like blind luck than good
design.  Parallel CREATE INDEX fails to be as "lucky" as Gather
because there's nothing in parallel CREATE INDEX that lets it skip
waiting for a worker which doesn't start -- and in general it seems
unacceptable to impose a coding requirement that all future parallel
operations must fail to notice the difference between a worker that
completed successfully and one that never ran at all.

If we made the Gather node wait only for workers that actually reached
the Gather -- either by using a Barrier or by some other technique --
then this would be a lot less fragile.  And the same kind of technique
would work for parallel CREATE INDEX.
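
To sketch the shape of that idea (illustration only -- this is not how
Gather is coded today, and it deliberately leaves open the hard part,
namely deciding when to stop waiting for workers that never attach at
all):

#include "postgres.h"
#include "storage/barrier.h"

typedef struct SharedState
{
    Barrier     attach_barrier;     /* lives in the DSM segment */
} SharedState;

/* leader, at setup: it is the only participant known for certain */
static void
leader_init(SharedState *shared)
{
    BarrierInit(&shared->attach_barrier, 1);
}

/* each worker that actually starts announces itself ... */
static void
worker_attach(SharedState *shared)
{
    BarrierAttach(&shared->attach_barrier);
}

/* ... and the wait then covers only participants that really attached */
static void
rendezvous(SharedState *shared)
{
    BarrierArriveAndWait(&shared->attach_barrier, 0 /* wait event */ );
}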

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Peter Geoghegan
On Wed, Jan 24, 2018 at 1:57 AM, Thomas Munro
 wrote:
> On Wed, Jan 24, 2018 at 5:25 PM, Thomas Munro
>  wrote:
>> If there were some way for the postmaster to cause reason
>> PROCSIG_PARALLEL_MESSAGE to be set in the leader process instead of
>> just notification via kill(SIGUSR1) when it fails to fork a parallel
>> worker, we'd get (1) for free in any latch/CFI loop code.  But I
>> understand that we can't do that by project edict.
>
> Based on the above observation, here is a terrible idea you'll all
> hate.  It is pessimistic and expensive: it thinks that every latch
> wake might be the postmaster telling us it's failed to fork() a
> parallel worker, until we've seen a sign of life on every worker's
> error queue.  Untested illustration code only.  This is the only way
> I've come up with to discover fork failure in any latch/CFI loop (ie
> without requiring client code to explicitly try to read either error
> or tuple queues).

The question, I suppose, is how expensive this is in the real world.
If it's actually not a cost that anybody is likely to notice, then I
think we should pursue this approach. I wouldn't put too much weight
on keeping this simple for users of the parallel infrastructure,
though, because something like Amit's WaitForParallelWorkersToAttach()
idea still seems acceptable. "Call this function before trusting the
finality of nworkers_launched" isn't too onerous a rule to have to
follow.

-- 
Peter Geoghegan



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Tom Lane
I wrote:
> I find that to be a completely bogus straw-man argument.  The point of
> looking at the prairiedog time series is just to see a data series in
> which the noise level is small enough to discern the signal.  If anyone's
> got years worth of data off a more modern machine, and they can extract
> a signal from that, by all means let's consider that data instead.

Just to make the point, I scraped the numbers for skink's "installcheck"
and "check" steps, which are data series I imagine at least Andres will
concede are worth paying attention to.  I made no attempt to clean
outliers, so these curves are pretty noisy, but I think there is a very
clear upward bump since mid-December.

regards, tom lane



Re: copy.c allocation constant

2018-01-24 Thread Andres Freund
On 2018-01-24 14:25:37 -0500, Robert Haas wrote:
> On Wed, Jan 24, 2018 at 1:43 PM, Andres Freund  wrote:
> > Indeed. Don't think RAW_BUF_SIZE is quite big enough for that on most
> > platforms though. From man mallopt:
> >  Balancing  these  factors  leads  to a default setting of 128*1024 for the 
> > M_MMAP_THRESHOLD parameter.
> > Additionally, even when malloc() chooses to use mmap() to back an
> > allocation, it'll still need a header to know the size of the
> > allocation and such. So exactly using a size of a multiple of 4KB will
> > still leave you with wasted space.  Due to the latter I can't see it
> > mattering whether or not we add +1 to a power-of-two size.
> 
> Well, it depends on how it works.  dsa_allocate, for example, never
> adds a header to the size of the allocation.

glibc's malloc does add a header. My half-informed suspicion is that
most newer malloc backing allocators will have a header, because
maintaining a shared lookup-by-address table is pretty expensive to
maintain. A bit of metadata indicating size and/or source of the
allocation makes using thread-local information a lot easier.


> Allocations < 8kB are
> bucketed by size class and stored in superblocks carved up into
> equal-sized chunks.  Allocations > 8kB are rounded to a multiple of
> the 4kB page size and we grab that many consecutive free pages.  I
> didn't make those behaviors up; I copied them from elsewhere.  Some
> other allocator I read about did small-medium-large allocations: large
> with mmap(), medium with multiples of the page size, small with
> closely-spaced size classes.

Sure - all I'm trying to say is that it likely won't matter whether we use
power-of-two or power-of-two + 1, because it seems likely that due to
overhead considerations we'll likely not quite fit into a size class
anyway.


> It doesn't seem like a particularly good idea to take a 64kB+1 byte
> allocation, stick a header on it, and pack it tightly up against other
> allocations on both sides.  Seems like that could lead to
> fragmentation problems.  Is that really what it does?

No, I'm fairly sure it's not.

Greetings,

Andres Freund



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-24 Thread Peter Geoghegan
On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila  wrote:
> Right, but what if the worker dies due to something like proc_exit(1)
> before calling BarrierArriveAndWait?  I think this
> is part of the problem we have solved in
> WaitForParallelWorkersToFinish such that if the worker exits abruptly
> at any point due to some reason, the system should not hang.

I have used Thomas' chaos-monkey-fork-process.patch to verify:

1. The problem of fork failure causing nbtsort.c to wait forever is a
real problem. Sure enough, the coding pattern within
_bt_leader_heapscan() can cause us to wait forever even with commit
2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
consequence of the patch not using tuple queues (it uses the new
tuplesort sharing thing instead).

2. Simply adding a single call to WaitForParallelWorkersToFinish()
within _bt_leader_heapscan() before waiting on our condition variable
fixes the problem -- errors are reliably propagated, and we never end
up waiting forever.

3. This short term fix works just as well with
parallel_leader_participation=off.

At this point, my preferred solution is for someone to go implement
Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
like the logical person for the job). Once that's committed, I can
post a new version of the patch that uses that new infrastructure --
I'll add a call to the new function, without changing anything else.
Failing that, we could actually just use
WaitForParallelWorkersToFinish(). I still don't want to use a barrier,
mostly because it complicates  parallel_leader_participation=off,
something that Amit is in agreement with [2][3].

For now, I am waiting for feedback from Robert on next steps.

[1] 
https://postgr.es/m/CAH2-Wzm6dF=g9lywthgcqzrc4dzbe-8tv28yvg0xj8q6e4+...@mail.gmail.com
[2] 
https://postgr.es/m/CAA4eK1LEFd28p1kw2Fst9LzgBgfMbDEq9wPh9jWFC0ye6ce62A%40mail.gmail.com
[3] 
https://postgr.es/m/caa4ek1+a0of4m231vbgpr_0ygg_bnmrgzlib7wqde-fybsy...@mail.gmail.com
-- 
Peter Geoghegan



Re: pgindent run?

2018-01-24 Thread Andres Freund


On January 24, 2018 11:34:07 AM PST, Tom Lane  wrote:
>Andres Freund  writes:
>> There'd be one or two edge cases of bad formatting, but the
>> end result would be far less painful than what we have today, were
>> basically nobody can format their patches without a lot of manual
>> cherry-picking or large scale unrelated whitespace changes.
>
>IMO, the big problem here is that people commit large changes without
>having pgindent'd them first.

Well, I think it'd really have to be every patch that's indented properly. And 
that's hard given the way typedefs.list is maintained.  Without most new 
typedefs included one continually reindents with a lot of damage.

It'd be less bad if we automated the maintenance of the lists so
a) patch authors can automatically add to the list and include that in commits
b) the committed list gets updated automatically every few days based on bf 
results
c) there's automated whole tree pgindent runs every few days.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: pgindent run?

2018-01-24 Thread Tom Lane
Andres Freund  writes:
> FWIW, I think this problem could just as well be addressed with a few
> printing heuristics instead of actually needing an actual list of
> typedefs.

Step right up and implement that, and we'd all be happier.  Certainly
the typedefs list is a pain in the rear.

> There'd be one or two edge cases of bad formatting, but the
> end result would be far less painful than what we have today, were
> basically nobody can format their patches without a lot of manual
> cherry-picking or large scale unrelated whitespace changes.

IMO, the big problem here is that people commit large changes without
having pgindent'd them first.

regards, tom lane



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Tom Lane
Andres Freund  writes:
> On 2018-01-24 13:11:22 -0500, Robert Haas wrote:
>> Now, how much should we care about the performance of software with a
>> planned release date of 2018 on hardware discontinued in 2001,
>> hardware that is apparently about 20 times slower than a modern
>> laptop?  Some, perhaps, but maybe not a whole lot.  Removing tests
>> that have found actual bugs because they cost runtime on ancient
>> systems that nobody uses for serious work doesn't make sense to me.

> I again agree with the sentiment.

I find that to be a completely bogus straw-man argument.  The point of
looking at the prairiedog time series is just to see a data series in
which the noise level is small enough to discern the signal.  If anyone's
got years worth of data off a more modern machine, and they can extract
a signal from that, by all means let's consider that data instead.  But
there's no clear argument (or at least you have not made one) that says
that prairiedog's relative timings don't match what we'd get on more
modern machines.

Now, what *is* a relevant objection is that most of us care more about
the runtime of the parallelized regression tests than serial tests.
I did not use prairiedog's "make check" timings in these graphs because
that would include "make install" and initdb timings, adding noise and
overhead and probably making it harder to see what's going on.  But
it's perfectly fair to want to optimize that case not the serial case.

However ... if you spend any time looking at the behavior of that,
the hashjoin tests are still problematic.  I instrumented the parallel
tests by turning on log_disconnections so as to get per-test-script
timings, and what I find to be the slowest steps on my development
workstation are

[pg_regress/rowsecurity] 0.517
[pg_regress/partition_join] 0.535
[pg_regress/updatable_views] 0.546
[pg_regress/stats] 0.566
[pg_regress/triggers] 0.618
[pg_regress/foreign_data] 0.642
[pg_regress/stats_ext] 0.670
[pg_regress/select_parallel] 0.828
[pg_regress/create_index] 0.916
[pg_regress/alter_table] 1.187
[pg_regress/gist] 1.283
[pg_regress/join] 1.923
[pg_regress/plpgsql] 3.100

(The overall runtime for "make installcheck-parallel" on this machine
is about 17.3 seconds right now.)  The next slowest test script in
the join test's group is "update", at 0.373 seconds; so over 1.5 sec
of the total 17.3 sec runtime is being spent solely in the join script.

Running the same test on the v10 branch, the slowest steps are

[pg_regress/join] 0.521
[pg_regress/rowsecurity] 0.521
[pg_regress/updatable_views] 0.554
[pg_regress/triggers] 0.624
[pg_regress/foreign_data] 0.647
[pg_regress/stats_ext] 0.675
[pg_regress/select_parallel] 0.690
[pg_regress/create_index] 0.928
[pg_regress/gist] 1.020
[pg_regress/alter_table] 1.120
[pg_regress/plpgsql] 3.217

so join has gotten about 1 second slower since v10, and that time is
coming entirely out of developers' hides despite parallelism because
it was already the slowest in its group.

So I continue to maintain that an unreasonable fraction of the total
resources devoted to the regular regression tests is going into these
new hashjoin tests.

Based on these numbers, it seems like one easy thing we could do to
reduce parallel check time is to split the plpgsql test into several
scripts that could run in parallel.  But independently of that,
I think we need to make an effort to push hashjoin's time back down.
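
For the plpgsql idea, a split could look roughly like this in
parallel_schedule (the sub-script names are invented; the real work is
carving the existing plpgsql.sql up along those lines):

    # instead of one three-second serial member of its group ...
    test: plpgsql
    # ... several smaller pieces that can run alongside each other
    test: plpgsql_control plpgsql_cursor plpgsql_trigger plpgsql_misc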

> One caveat is that old machines also
> somewhat approximate testing with more instrumentation / debugging
> enabled (say valgrind, CLOBBER_CACHE_ALWAYS, etc). So removing excessive
> test overhead has still quite some benefits. But I definitely do not
> want to lower coverage to achieve it.

I don't want to lower coverage either.  I do want some effort to be
spent on achieving test coverage intelligently, rather than just throwing
large test cases at the code without consideration of the costs.

regards, tom lane



Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > I'm afraid we may still get some push-back from existing users of
> > --clean since, with the change you're proposing, we wouldn't be cleaning
> > up anything that's been done to the public schema when it comes to
> > comment changes or ACL changes, right?
> 
> No, if you have a nondefault ACL, that will still get applied.  This
> arrangement would drop comment changes, but I can't get excited about
> that; it's certainly far less of an inconvenience in that scenario
> than dumping the comment is in non-superuser-restore scenarios.

That nondefault ACL from the system the pg_dump was run on will get
applied *over-top* of whatever the current ACL on the system that the
restore is being run on, which may or may not be what's expected.

That's different from the case where the public schema is dropped and
recreated because, in that case, the schema will start out with an
empty ACL and the resulting ACL should end up matching what was on the
source system (ignoring the current issue with a non-clean pg_dump into
a custom format dump being used with a pg_restore --clean).

Just to be clear, I'm not suggesting that's a huge deal, but it's
certainly a change and something we should at least recognize and
contemplate.

I'm also not really worried about losing the comment.  I definitely
don't think it's worth putting in the kind of infrastructure that we put
in for init ACLs to handle comments being changed on initdb-time
objects.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [HACKERS] Proposal: Local indexes for partitioned table

2018-01-24 Thread Alvaro Herrera
I think this is the right fix for this problem.  I wonder about
exploring other callers of RelationGetIndexList to see who else could be
confused ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 53506fd3a9f0cb81334024bf5a0e8856fd8e5e82 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 24 Jan 2018 14:40:26 -0300
Subject: [PATCH] Ignore partitioned indexes in get_relation_info

---
 src/backend/optimizer/util/plancat.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8c60b35068..60f21711f4 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -208,6 +208,16 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			}
 
 			/*
+			 * Ignore partitioned indexes, since they are not usable for
+			 * queries.
+			 */
+			if (indexRelation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+			{
+				index_close(indexRelation, NoLock);
+				continue;
+			}
+
+			/*
 			 * If the index is valid, but cannot yet be used, ignore it; but
 			 * mark the plan we are generating as transient. See
 			 * src/backend/access/heap/README.HOT for discussion.
-- 
2.11.0



Re: Foreign keys and partitioned tables

2018-01-24 Thread Alvaro Herrera
Alvaro Herrera wrote:
> This patch enables foreign key constraints to and from partitioned
> tables.

This version is rebased on current master.

0001: fix for a get_relation_info bug in current master.
  Posted in <20180124174134.ma4ui2kczmqwb4um@alvherre.pgsql>
0002: Allows local partitioned index to be unique;
  Posted in <2018015559.7pbzomvgp5iwmath@alvherre.pgsql>
0003: Allows FOR EACH ROW triggers on partitioned tables;
  Posted in <20180123221027.2qenwwpvgplrrx3d@alvherre.pgsql>

0004: the actual matter of this thread.
0005: bugfix for 0004, after recent changes I introduced in 0004.
  It's separate because I am undecided about it being the best
  approach; maybe further changes in 0003 are a better approach.

No further changes from the version I posted upthread.  Tests pass.  I'm
going to review this code now to see what further changes are needed (at
the very least, I think some dependency changes are in order; plus need
to add a few more tests for various ri_triggers.c code paths.)

Thanks

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 53506fd3a9f0cb81334024bf5a0e8856fd8e5e82 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 24 Jan 2018 14:40:26 -0300
Subject: [PATCH v2 1/5] Ignore partitioned indexes in get_relation_info

---
 src/backend/optimizer/util/plancat.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8c60b35068..60f21711f4 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -208,6 +208,16 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			}
 
 			/*
+			 * Ignore partitioned indexes, since they are not usable for
+			 * queries.
+			 */
+			if (indexRelation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+			{
+				index_close(indexRelation, NoLock);
+				continue;
+			}
+
+			/*
 			 * If the index is valid, but cannot yet be used, ignore it; but
 			 * mark the plan we are generating as transient. See
 			 * src/backend/access/heap/README.HOT for discussion.
-- 
2.11.0

>From f67c07baa0a211a42e42b1480c4ffd47b0a7fb06 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 6 Nov 2017 17:04:55 +0100
Subject: [PATCH v2 2/5] allow indexes on partitioned tables to be unique

---
 doc/src/sgml/ref/alter_table.sgml |   9 +-
 doc/src/sgml/ref/create_table.sgml|  16 +-
 src/backend/bootstrap/bootparse.y |   2 +
 src/backend/catalog/index.c   |  45 -
 src/backend/catalog/pg_constraint.c   |  76 
 src/backend/catalog/toasting.c|   4 +-
 src/backend/commands/indexcmds.c  | 125 +++--
 src/backend/commands/tablecmds.c  |  62 ++-
 src/backend/parser/analyze.c  |   7 +
 src/backend/parser/parse_utilcmd.c|  31 +---
 src/backend/tcop/utility.c|   1 +
 src/include/catalog/index.h   |   5 +-
 src/include/catalog/pg_constraint_fn.h|   4 +-
 src/include/commands/defrem.h |   1 +
 src/include/parser/parse_utilcmd.h|   3 +-
 src/test/regress/expected/alter_table.out |   8 -
 src/test/regress/expected/create_table.out|  12 --
 src/test/regress/expected/indexing.out| 254 +-
 src/test/regress/expected/insert_conflict.out |   2 +-
 src/test/regress/sql/alter_table.sql  |   2 -
 src/test/regress/sql/create_table.sql |   8 -
 src/test/regress/sql/indexing.sql | 151 ++-
 22 files changed, 740 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/ref/alter_table.sgml 
b/doc/src/sgml/ref/alter_table.sgml
index 286c7a8589..c00fd09fe1 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -804,8 +804,9 @@ ALTER TABLE [ IF EXISTS ] name
   This form attaches an existing table (which might itself be partitioned)
   as a partition of the target table. The table can be attached
   as a partition for specific values using FOR VALUES
-   or as a default partition by using DEFAULT
-  .  For each index in the target table, a corresponding
+   or as a default partition by using
+  DEFAULT.
+  For each index in the target table, a corresponding
   one will be created in the attached table; or, if an equivalent
   index already exists, will be attached to the target table's index,
   as if ALTER INDEX ATTACH PARTITION had been executed.
@@ -820,8 +821,10 @@ ALTE

Re: copy.c allocation constant

2018-01-24 Thread Robert Haas
On Wed, Jan 24, 2018 at 1:43 PM, Andres Freund  wrote:
> Indeed. Don't think RAW_BUF_SIZE is quite big enough for that on most
> platforms though. From man mallopt:
>  Balancing  these  factors  leads  to a default setting of 128*1024 for the 
> M_MMAP_THRESHOLD parameter.
> Additionally, even when malloc() chooses to use mmap() to back an
> allocation, it'll still need a header to know the size of the
> allocation and such. So exactly using a size of a multiple of 4KB will
> still leave you with wasted space.  Due to the latter I can't see it
> mattering whether or not we add +1 to a power-of-two size.

Well, it depends on how it works.  dsa_allocate, for example, never
adds a header to the size of the allocation.  Allocations < 8kB are
bucketed by size class and stored in superblocks carved up into
equal-sized chunks.  Allocations > 8kB are rounded to a multiple of
the 4kB page size and we grab that many consecutive free pages.  I
didn't make those behaviors up; I copied them from elsewhere.  Some
other allocator I read about did small-medium-large allocations: large
with mmap(), medium with multiples of the page size, small with
closely-spaced size classes.
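
(As a worked example under those rules: a 64kB + 1 byte request, 65537
bytes, is over the 8kB cutoff, so it rounds up to 17 consecutive 4kB
pages, i.e. 68kB, wasting 4095 bytes; a plain 64kB request fits exactly
in 16 pages with no waste.  So for that style of allocator the +1 does
matter, even if it may not for glibc.)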

It doesn't seem like a particularly good idea to take a 64kB+1 byte
allocation, stick a header on it, and pack it tightly up against other
allocations on both sides.  Seems like that could lead to
fragmentation problems.  Is that really what it does?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Would a BGW need shmem_access or database_connection to enumerate databases?

2018-01-24 Thread Bruce Momjian
On Wed, Jan 24, 2018 at 01:48:05PM -0500, Chapman Flack wrote:
> Thanks! I had actually registered that one (with a related one)
> for CF 2018-03, having missed the deadline for -01:
> 
> https://commitfest.postgresql.org/17/1467/

OK, thanks.  I added a commitfest item annotiation to say that the doc
part of this entry has been applied.

---


> 
> -Chap
> 
> On 01/24/2018 01:20 PM, Bruce Momjian wrote:
> > On Thu, Dec 14, 2017 at 06:12:35PM -0500, Chapman Flack wrote:
> >> On 12/04/2017 09:13 AM, Craig Ringer wrote:
> >>> On 1 December 2017 at 23:04, Chapman Flack  wrote:
>  Can I call RegisterDynamicBackgroundWorker when not in the postmaster,
>  but also not in a "regular backend", but rather another BGW?
> 
> >>> Yes. BDR does it a lot.
> >>
> >> Would this doc patch be acceptable to clarify that, in case
> >> I'm not the last person who might wonder?
> > 
> > Thanks, patch applied to head.
> > 
> > ---
> > 
> >> >From 3308ef5647e8ce4a84855b4d0ca09ba6aeb7 Mon Sep 17 00:00:00 2001
> >> From: Chapman Flack 
> >> Date: Thu, 14 Dec 2017 18:09:14 -0500
> >> Subject: [PATCH] Clarify that a BGW can register a dynamic BGW.
> >>
> >> ---
> >>  doc/src/sgml/bgworker.sgml | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
> >> index 4bc2b69..e490bb8 100644
> >> --- a/doc/src/sgml/bgworker.sgml
> >> +++ b/doc/src/sgml/bgworker.sgml
> >> @@ -41,7 +41,7 @@
> >>*worker, BackgroundWorkerHandle **handle).  Unlike
> >>RegisterBackgroundWorker, which can only be called 
> >> from within
> >>the postmaster, RegisterDynamicBackgroundWorker 
> >> must be
> >> -  called from a regular backend.
> >> +  called from a regular backend, possibly another background worker.
> >>   
> >>  
> >>   
> >> -- 
> >> 1.8.3.1
> >>
> > 
> > 
> 
> 

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: WIP Patch: Precalculate stable functions, infrastructure v1

2018-01-24 Thread Andres Freund
Hi,


On 2018-01-16 17:05:01 -0500, Tom Lane wrote:
> [ I'm sending this comment separately because I think it's an issue
> Andres might take an interest in. ]

Thanks for that. I indeed am interested. Sorry for the late response,
was very deep into the JIT patch.


> Marina Polyakova  writes:
> > [ v7-0001-Precalculate-stable-and-immutable-functions.patch ]
> 
> Another thing that's bothering me is that the execution semantics
> you're proposing for CachedExpr seem rather inflexible.  AFAICS, once a
> CachedExpr has run once, it will hang on to the result value and keep
> returning that for the entire lifespan of the compiled expression.
> We already noted that that breaks plpgsql's "simple expression"
> logic, and it seems inevitable to me that it will be an issue for
> other places as well.  I think it'd be a better design if we had some
> provision for resetting the cached values, short of recompiling the
> expression from scratch.

Hm. Yes, that makes me uncomfortable as well.


> One way that occurs to me to do this is to replace the simple boolean
> isExecuted flags with a generation counter, and add a master generation
> counter to ExprState.  The rule for executing CachedExpr would be "if my
> generation counter is different from the ExprState's counter, then
> evaluate the subexpression and copy the ExprState's counter into mine".
> Then the procedure for forcing recalculation of cached values is just to
> increment the ExprState's counter.  There are other ways one could imagine
> doing this --- for instance, I initially thought of keeping the master
> counter in the ExprContext being used to run the expression.  But you need
> some way to remember what counter value was used last with a particular
> expression, so probably keeping it in ExprState is better.
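
To make the rule concrete, here is a minimal standalone sketch of that
counter logic (every type, field, and function name below is an
illustrative assumption, not taken from the patch):

#include "postgres.h"

typedef struct CachedExprResult
{
    Datum       value;          /* last computed value of the subexpression */
    bool        isnull;
    uint64      generation;     /* master counter value when "value" was computed */
} CachedExprResult;

typedef struct ExprCacheMaster
{
    uint64      generation;     /* starts at 1, so a zeroed cache is always stale */
} ExprCacheMaster;

static Datum
eval_cached(CachedExprResult *cache, ExprCacheMaster *master,
            Datum (*compute) (bool *isnull), bool *isnull)
{
    if (cache->generation != master->generation)
    {
        /* stale or never computed: evaluate, then adopt the master counter */
        cache->value = compute(&cache->isnull);
        cache->generation = master->generation;
    }
    *isnull = cache->isnull;
    return cache->value;
}

Forcing recalculation of every cached value is then just
master->generation++, without walking the step list.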

I'm not a big fan of this solution. We seem to be inventing more and
more places where we keep state, rather than the contrary.


> Or we could just brute-force it by providing a function that runs through
> a compiled expression step list and resets the isExecuted flag for each
> EEOP_CACHEDEXPR_IF_CACHED step it finds.  A slightly less brute-force
> way is to link those steps together in a list, so that the function
> doesn't have to visit irrelevant steps.  If the reset function were seldom
> used then the extra cycles for this wouldn't be very expensive.  But I'm
> not sure it will be seldom used --- it seems like plpgsql simple
> expressions will be doing this every time --- so I think the counter
> approach might be a better idea.

Hm, that sounds like it'd not be cheap.


> I'm curious to know whether Andres has some other ideas, or whether he
> feels this is all a big wart on the compiled-expression concept.

I don't have too many "artistic" concerns from the compiled expression
POV. The biggest issue I see is that it'll make it a bit harder to
separate out the expression compilation phase from the expression
instantiation phase - something I think we definitely want.


> I don't think there are any existing cases where we keep any
> meaningful state across executions of a compiled-expression data
> structure; maybe that's a bad idea in itself.

To me, who has *not* followed the thread in detail, it sounds like the
relevant data shouldn't be stored inside the expression itself.  For
one, we do not want to have to visit every single simple expression and
reset them, for another it architecturally doesn't seem the right place
to me.  Storing all cached values in an EState or ExprContext (the
latter referring to the former), somewhat like the values for Params,
sounds a lot more reasonable to me.
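
By analogy with es_param_exec_vals, something like this (again, all of
these names are made up here, just to illustrate the shape):

#include "postgres.h"

/* one slot per cached subexpression; slot numbers assigned at compile time */
typedef struct CachedExprData
{
    Datum       value;
    bool        isnull;
    bool        isvalid;        /* false until computed, or after a reset */
} CachedExprData;

/*
 * Hypothetical additions to EState:
 *     CachedExprData *es_cached_expr_vals;
 *     int             es_num_cached_exprs;
 *
 * Invalidating every cache used by the query is then one sweep over the
 * EState, rather than a walk over each compiled expression:
 */
static void
reset_cached_exprs(CachedExprData *vals, int nvals)
{
    int         i;

    for (i = 0; i < nvals; i++)
        vals[i].isvalid = false;
}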

Besides that it seems to make it a lot easier to reset the values, it
also seems like it makes it a lot cleaner to cache stable functions
across multiple expressions in different places in a query? ISTM having
expression steps to actually compute the expression value in every
referencing expression is quite the waste.

This all reminds me a lot of the infrastructure for Params...

Greetings,

Andres Freund



Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Tom Lane
Stephen Frost  writes:
> I'm afraid we may still get some push-back from existing users of
> --clean since, with the change you're proposing, we wouldn't be cleaning
> up anything that's been done to the public schema when it comes to
> comment changes or ACL changes, right?

No, if you have a nondefault ACL, that will still get applied.  This
arrangement would drop comment changes, but I can't get excited about
that; it's certainly far less of an inconvenience in that scenario
than dumping the comment is in non-superuser-restore scenarios.

regards, tom lane



Re: Would a BGW need shmem_access or database_connection to enumerate databases?

2018-01-24 Thread Chapman Flack
Thanks! I had actually registered that one (with a related one)
for CF 2018-03, having missed the deadline for -01:

https://commitfest.postgresql.org/17/1467/

-Chap

On 01/24/2018 01:20 PM, Bruce Momjian wrote:
> On Thu, Dec 14, 2017 at 06:12:35PM -0500, Chapman Flack wrote:
>> On 12/04/2017 09:13 AM, Craig Ringer wrote:
>>> On 1 December 2017 at 23:04, Chapman Flack  wrote:
 Can I call RegisterDynamicBackgroundWorker when not in the postmaster,
 but also not in a "regular backend", but rather another BGW?

>>> Yes. BDR does it a lot.
>>
>> Would this doc patch be acceptable to clarify that, in case
>> I'm not the last person who might wonder?
> 
> Thanks, patch applied to head.
> 
> ---
> 
>> >From 3308ef5647e8ce4a84855b4d0ca09ba6aeb7 Mon Sep 17 00:00:00 2001
>> From: Chapman Flack 
>> Date: Thu, 14 Dec 2017 18:09:14 -0500
>> Subject: [PATCH] Clarify that a BGW can register a dynamic BGW.
>>
>> ---
>>  doc/src/sgml/bgworker.sgml | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
>> index 4bc2b69..e490bb8 100644
>> --- a/doc/src/sgml/bgworker.sgml
>> +++ b/doc/src/sgml/bgworker.sgml
>> @@ -41,7 +41,7 @@
>>*worker, BackgroundWorkerHandle **handle).  Unlike
>>RegisterBackgroundWorker, which can only be called from within
>>the postmaster, RegisterDynamicBackgroundWorker must be
>> -  called from a regular backend.
>> +  called from a regular backend, possibly another background worker.
>>   
>>  
>>   
>> -- 
>> 1.8.3.1
>>
> 
> 




Re: copy.c allocation constant

2018-01-24 Thread Andres Freund
On 2018-01-24 13:19:19 -0500, Robert Haas wrote:
> On Wed, Jan 24, 2018 at 12:25 PM, Tomas Vondra
>  wrote:
> > At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
> > with similar ideas (freelists, ...) so hopefully it's fine too.
> >
> > And then there are the systems without glibc, or with other libc
> > implementations. No idea about those.
> 
> My guess is that a fairly common pattern for larger chunks will be to
> round the size up to a multiple of 4kB, the usual memory page size.

Indeed. Don't think RAW_BUF_SIZE is quite big enough for that on most
platforms though. From man mallopt:

    Balancing these factors leads to a default setting of 128*1024 for
    the M_MMAP_THRESHOLD parameter.

Additionally, even when malloc() chooses to use mmap() to back an
allocation, it'll still need a header to know the size of the
allocation and such. So using a size that is exactly a multiple of 4KB
will still leave you with wasted space.  Due to the latter I can't see
it mattering whether or not we add +1 to a power-of-two size.

There's malloc_usable_size() returning the actual size of an
allocation. It'd probably be worthwhile to use that in a few of our own
allocator routines. If not available on a platform we can just have a
stub that returns the requested size...
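
Roughly like this -- HAVE_MALLOC_USABLE_SIZE is an assumed configure
symbol here, and malloc_usable_size() is the glibc spelling (other libcs
have their own variants):

#include <stdlib.h>
#ifdef HAVE_MALLOC_USABLE_SIZE
#include <malloc.h>             /* glibc: size_t malloc_usable_size(void *) */
#endif

static size_t
actual_allocation_size(void *ptr, size_t requested)
{
#ifdef HAVE_MALLOC_USABLE_SIZE
    return malloc_usable_size(ptr);     /* what malloc really handed us */
#else
    (void) ptr;
    return requested;           /* stub: assume we got exactly what we asked for */
#endif
}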

Greetings,

Andres Freund



Re: [PROPOSAL] Shared Ispell dictionaries

2018-01-24 Thread Arthur Zakirov
2018-01-24 20:57 GMT+03:00 Tomas Vondra :
>
> Thanks. I don't have time to review/test this before FOSDEM, but a
> couple of comments regarding some of the points you mentioned.
>

Thank you for your thoughts.


> > I thought about it. And it seems to me that we can use functions
> > ts_unload() and ts_reload() instead of new syntax. We already have
> > text search functions like ts_lexize() and ts_debug(), and it is
> > better to keep consistency.
>
> This argument seems a bit strange. Both ts_lexize() and ts_debug() are
> operating on text values, and are meant to be executed as functions from
> SQL - particularly ts_lexize(). It's hard to imagine this implemented as
> DDL commands.
>
> The unload/reload is something that operates on a database object
> (dictionary), which already has create/drop/alter DDL. So it seems
> somewhat natural to treat unload/reload as another DDL action.
>
> Taken to an extreme, this argument would essentially mean we should not
> have any DDL commands because we have SQL functions.
>
> That being said, I'm not particularly attached to having this DDL now.
> Implementing it seems straight-forward (particularly when we already
> have the stuff implemented as functions), and some of the other open
> questions seem more important to tackle now.
>

I understand your point, but I don't have a strong opinion on the subject yet.
And I agree that these commands can be added as a future improvement for
shared dictionaries.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: pgindent run?

2018-01-24 Thread Andres Freund
On 2018-01-23 22:22:47 -0500, Bruce Momjian wrote:
> On Tue, Nov 28, 2017 at 04:38:12PM -0500, Tom Lane wrote:
> > Thomas Munro  writes:
> > > On Wed, Nov 29, 2017 at 9:47 AM, Tom Lane  wrote:
> > >> I think that'd be taking it too far, especially given that the dependency
> > >> on a typedefs list means that the git hook might have a different idea
> > >> of what's correctly indented than the committer does.  It'd be very hard
> > >> to debug such discrepancies and figure out what would satisfy the hook.
> > 
> > > Is there no way to get reasonable indentation that doesn't depend on
> > > that typedefs list?
> > 
> > Perhaps, but not with the tool we've got.
> > 
> > It's well known that C is unparseable without knowing which identifiers
> > are typedefs, so it doesn't exactly surprise me that it might not be
> > sanely indentable without knowing that.  But I've not thought hard about
> > it, nor looked for alternate tools.
> 
> For people curious about the C typedef parsing details:
> 
>   http://calculist.blogspot.com/2009/02/c-typedef-parsing-problem.html
>   
> https://stackoverflow.com/questions/243383/why-cant-c-be-parsed-with-a-lr1-parser
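
For the impatient, the whole problem boils down to two token-identical
statements:

typedef int T1;     /* T1 names a type */
int T2 = 2;         /* T2 is an ordinary variable */
int x = 3;

void f(void) { T1 * x; }    /* declares a local "x" of type pointer-to-T1 */
void g(void) { T2 * x; }    /* multiplies T2 by the global x, discards the result */

Without knowing which identifiers are typedefs, an indenter cannot tell a
declaration from an expression here.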

FWIW, I think this problem could just as well be addressed with a few
printing heuristics instead of actually needing an actual list of
typedefs. There'd be one or two edge cases of bad formatting, but the
end result would be far less painful than what we have today, where
basically nobody can format their patches without a lot of manual
cherry-picking or large scale unrelated whitespace changes.

Greetings,

Andres Freund



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 13:11:22 -0500, Robert Haas wrote:
> So for me, the additional hash index tests don't cost anything
> measurable and the additional hash join tests cost about a second.  I
> think this probably accounts for why committers other than you keep
> "adding so much time to the regression tests".  On modern hardware,
> the costs just don't matter.

I very much agree with the general sentiment, but a second of a 25s test
certainly isn't nothing.  As I've just written a few messages upthread,
I think we can hide the overall timing costs to a much larger degree
than we're doing, but I don't think we can ignore the cost entirely.


> Now, how much should we care about the performance of software with a
> planned release date of 2018 on hardware discontinued in 2001,
> hardware that is apparently about 20 times slower than a modern
> laptop?  Some, perhaps, but maybe not a whole lot.  Removing tests
> that have found actual bugs because they cost runtime on ancient
> systems that nobody uses for serious work doesn't make sense to me.

I again agree with the sentiment. One caveat is that old machines also
somewhat approximate testing with more instrumentation / debugging
enabled (say valgrind, CLOBBER_CACHE_ALWAYS, etc). So removing excessive
test overhead still has quite some benefit. But I definitely do not
want to lower coverage to achieve it.

Greetings,

Andres Freund



Re: PATCH: Exclude unlogged tables from base backups

2018-01-24 Thread David Steele
Hi Masahiko,

Thanks for the review!

On 1/22/18 3:14 AM, Masahiko Sawada wrote:
> On Thu, Dec 14, 2017 at 11:58 PM, Robert Haas  wrote:
>>
>> We would also have a problem if the missing file caused something in
>> recovery to croak on the grounds that the file was expected to be
>> there, but I don't think anything works that way; I think we just
>> assume missing files are an expected failure mode and silently do
>> nothing if asked to remove them.
> 
> I also couldn't see a problem in this approach.
> 
> Here is the first review comments.
> 
> +   unloggedDelim = strrchr(path, '/');
> 
> I don't think it works correctly on Windows. How about using
> last_dir_separator() instead?

I think this function is OK on Windows -- we use it quite a bit.
However, last_dir_separator() is clearer so I have changed it.

> 
> + * Find all unlogged relations in the specified directory and return their OIDs.
> 
> What ResetUnloggedRelationsHash() actually returns is a hash
> table, so the comment on this function does not seem appropriate.

Fixed.

> +   /* Part of path that contains the parent directory. */
> +   int parentPathLen = unloggedDelim - path;
> +
> +   /*
> +    * Build the unlogged relation hash if the parent path is either
> +    * $PGDATA/base or a tablespace version path.
> +    */
> +   if (strncmp(path, "./base", parentPathLen) == 0 ||
> +       (parentPathLen >= (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) &&
> +        strncmp(unloggedDelim - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
> +                TABLESPACE_VERSION_DIRECTORY,
> +                sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
> +       unloggedHash = ResetUnloggedRelationsHash(path);
> +   }
> 
> How about using get_parent_directory() to get parent directory name?

get_parent_directory() munges the string that is passed to it which I
was trying to avoid (we'd need a copy) - and I don't think it makes the
rest of the logic any simpler without constructing yet another string to
hold the tablespace path.

I know performance isn't the most important thing here, so if the
argument is for clarity perhaps it makes sense. Otherwise I don't know
if it's worth it.

> Also, I think it's better to destroy the unloggedHash after use.

Whoops! Fixed.

> +   /* Exclude all forks for unlogged tables except the init fork. */
> +   if (unloggedHash && ResetUnloggedRelationsMatch(
> +           unloggedHash, de->d_name) == unloggedOther)
> +   {
> +       elog(DEBUG2, "unlogged relation file \"%s\" excluded from backup",
> +            de->d_name);
> +       continue;
> +   }
> 
> I think it's better to log this debug message at DEBUG2 level for
> consistency with other messages.

I think you mean DEBUG1?  It's already at DEBUG2.

I considered using DEBUG1 but decided against it.  The other exclusions
will produce a limited amount of output because there are only a few of
them.  In the case of unlogged tables there could be any number of
exclusions and I thought that was too noisy for DEBUG1.

> +   ok(!-f "$tempdir/tbackup/tblspc1/$tblspc1UnloggedBackupPath",
> +   'unlogged imain fork not in tablespace backup');
> 
> s/imain/main/

Fixed.

> If a new unlogged relation is created after the unloggedHash is
> constructed but before the files are sent, we cannot exclude that
> relation. It would not be a problem if the backup is short, because the
> new unlogged relation is unlikely to become very large. However, if
> taking a backup takes a long time, we could end up including a large
> main fork in the backup.

This is a good point.  It's per database directory which makes it a
little better, but maybe not by much.

Three options here:

1) Leave it as is knowing that unlogged relations created during the
backup may be copied and document it that way.

2) Construct a list for SendDir() to work against so the gap between
creating that and creating the unlogged hash is as small as possible.
The downside here is that the list may be very large and take up a lot
of memory.

3) Check each file that looks like a relation in the loop to see if it
has an init fork.  This might affect performance since an
opendir/readdir loop would be required for every relation.
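
For reference, a rough sketch of what the per-file check in #3 could look
like for a bare main-fork file name (illustrative only: it ignores
fork-suffixed names such as "16384_fsm", and a stat() probe is just one
way to do it -- the point is that some extra filesystem access per
relation is needed either way):

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* Does "dir/mainfork_init" exist, i.e. is the relation unlogged? */
static bool
has_init_fork(const char *dir, const char *mainfork)
{
    char        initpath[1024];
    struct stat st;

    snprintf(initpath, sizeof(initpath), "%s/%s_init", dir, mainfork);
    return stat(initpath, &st) == 0;
}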

Personally, I'm in favor of #1, at least for the time being.  I've
updated the docs as indicated in case you and Adam agree.

New patches attached.

Thanks!
-- 
-David
da...@pgmasters.net
diff --git a/src/test/recovery/t/014_unlogged_reinit.pl 
b/src/test/recovery/t/014_unlogged_reinit.pl
new file mode 100644
index 00..ac2e251158
--- /dev/null
+++ b/src/test/recovery/t/014_unlogged_reinit.pl
@@ -0,0 +1,117 @@
+# Tests that unlogged tables are properly reinitialized after a crash.
+#
+# The behavior should be the same w

Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> In further testing of that, I noticed that it made the behavior of our
> >> other bugaboo, the public schema, rather inconsistent.  With this
> >> builtin-extensions hack, the plpgsql extension doesn't get dumped,
> >> whether or not you say "clean".  But the public schema does get
> >> dropped and recreated if you say "clean".  That's not helpful for
> >> non-superuser users of pg_dump, so I think we should try to fix it.
> 
> > I'm not entirely sure about trying to also support --clean for
> > non-superusers..  We've long had that the public schema is dropped and
> > recreated with --clean and it seems likely that at least some users are
> > depending on us doing that.  In any case, it's certainly not a change
> > that I think we could backpatch.  Perhaps we could just change it moving
> > forward (which would make me happier, really, since what I think we do
> > with these initdb-time things currently is a bit bizarre).
> 
> Sure, I was not proposing this for back-patch --- it depends on the
> other stuff we've committed recently, anyway.
> 
> > Yes, having that in getNamespaces() isn't correct but we need to do
> > something there, and I've been trying to figure out what.
> 
> I claim this is what ;-)

Well, that would make things in v11 better, certainly.  I suppose for
back-patching, I can just go make the change to pg_restore that I
outlined and that would deal with the complaints we've gotten there.

I'm afraid we may still get some push-back from existing users of
--clean since, with the change you're proposing, we wouldn't be cleaning
up anything that's been done to the public schema when it comes to
comment changes or ACL changes, right?

A typical use-case of pg_dump with --clean is to 'reset' a system
after, say, dumping out some subset of a production system and then
loading it into the development environment that all of the devs have
full access to, and where there might have been changes to the 'public'
that you want to revert (to get it back to looking like how 'prod' was).
In current versions this should result in at least the ACLs on public
matching what they are on prod, along with the comment and any other
changes done to it, but we would lose that if we stop doing drop/create
of the public schema on --clean.

Then again, the DROP SCHEMA will fail if any objects end up created in
the public schema anyway, so it isn't like that's a complete solution
regardless.  I suppose it's a bit surprising that we haven't been asked
for a --clean-cascade option.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: FOR EACH ROW triggers on partitioned tables

2018-01-24 Thread Robert Haas
On Tue, Jan 23, 2018 at 5:10 PM, Alvaro Herrera  wrote:
> The main question is this: when running the trigger function, it is
> going to look as it is running in the context of the partition, not in
> the context of the parent partitioned table (TG_RELNAME etc).  That
> seems mildly ugly: some users may be expecting that the partitioning
> stuff is invisible to the rest of the system, so if you have triggers on
> a regular table and later on decide to partition that table, the
> behavior of triggers will change, which is maybe unexpected.  Maybe this
> is not really a problem, but I'm not sure and would like further
> opinions.

It doesn't seem either great or horrible.

Also, what about logical replication?  Amit just raised this issue for
the UPDATE row movement patch, and it seems like the issues are
similar here.  If somebody's counting on the same kinds of per-row
triggers to fire during logical replication as we do during the
original operation, they will be disappointed.

>> Also, does ALTER TABLE ... ENABLE/DISABLE TRIGGER do the right things on
>> partitioned tables?
>
> Haven't done this yet, either.  I like Simon's suggestion of outright
> disallowing this.

Why not just make it work?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Would a BGW need shmem_access or database_connection to enumerate databases?

2018-01-24 Thread Bruce Momjian
On Thu, Dec 14, 2017 at 06:12:35PM -0500, Chapman Flack wrote:
> On 12/04/2017 09:13 AM, Craig Ringer wrote:
> > On 1 December 2017 at 23:04, Chapman Flack  wrote:
> >> Can I call RegisterDynamicBackgroundWorker when not in the postmaster,
> >> but also not in a "regular backend", but rather another BGW?
> >>
> > Yes. BDR does it a lot.
> 
> Would this doc patch be acceptable to clarify that, in case
> I'm not the last person who might wonder?

Thanks, patch applied to head.

---

> >From 3308ef5647e8ce4a84855b4d0ca09ba6aeb7 Mon Sep 17 00:00:00 2001
> From: Chapman Flack 
> Date: Thu, 14 Dec 2017 18:09:14 -0500
> Subject: [PATCH] Clarify that a BGW can register a dynamic BGW.
> 
> ---
>  doc/src/sgml/bgworker.sgml | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
> index 4bc2b69..e490bb8 100644
> --- a/doc/src/sgml/bgworker.sgml
> +++ b/doc/src/sgml/bgworker.sgml
> @@ -41,7 +41,7 @@
>*worker, BackgroundWorkerHandle **handle).  Unlike
>RegisterBackgroundWorker, which can only be called from within
>the postmaster, RegisterDynamicBackgroundWorker must be
> -  called from a regular backend.
> +  called from a regular backend, possibly another background worker.
>   
>  
>   
> -- 
> 1.8.3.1
> 


-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


