Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-01-24 Thread Michael Paquier
On Thu, Jan 25, 2018 at 06:14:41AM +, Tsunakawa, Takayuki wrote:
> I don't know why pg_temp_3.fetchchunks still exists.  Maybe the user
> ran pg_ctl stop -mi while pg_rewind was running.

Likely that was the case :(

As a superuser, DROP TABLE should work on the temporary schema of
another session. Have you tried that to solve the situation?
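
For the record, a rough sketch of that cleanup (the query just lists
temporary tables by relfrozenxid age; the schema and table names in the
DROP are the ones from this report):

select n.nspname, c.relname, age(c.relfrozenxid) as xid_age
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
where c.relpersistence = 't'
order by xid_age desc;

drop table pg_temp_3.fetchchunks;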

> * I think temporary tables should not require vacuuming for XID
> wraparound.  Furthermore, should updates/deletes to temporary tables
> be in-place instead of creating garbage, so that any form of vacuum is
> unnecessary?  Other sessions do not need to read temporary tables.

Yeah, there are many possible improvements in this area. Temp tables
also generate WAL...

> * In this incident, autovacuum worker misjudged that
> pg_temp_3.fetchchunks can't be deleted, although the creator
> (pg_rewind) is no longer active.  How can we delete orphan temporary
> tables safely? 

As long as Postgres sees that the temporary schema is in use, it will
think that the table is not orphaned. Another possibility would be to
have the session now holding this schema reuse fetchchunks so that it
gets reset.
--
Michael




Re: [HACKERS] [PATCH] Lockable views

2018-01-24 Thread Thomas Munro
On Sun, Dec 31, 2017 at 11:57 PM, Yugo Nagata  wrote:
> On Fri, 29 Dec 2017 23:39:39 +0900 (JST)
> Tatsuo Ishii  wrote:
>> Your addition to the doc:
>> +   Automatically updatable views (see )
>> +   that do not have INSTEAD OF triggers or INSTEAD
>> +   rules are also lockable. When a view is locked, its base relations are
>> +   also locked recursively with the same lock mode.
>
> I added this point to the documentation.

+   Automatically updatable views (see )
+   that do not have INSTEAD OF triggers or INSTEAD

The documentation doesn't build: you now need to say 
instead of , and you need to say .
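
For reference, the behavior described in the quoted documentation lines
would look like this (a sketch with made-up names, per the proposed
patch, not existing behavior):

create table base_t (id int);
create view v as select id from base_t;  -- auto-updatable, no INSTEAD OF trigger/rule
begin;
lock table v in share mode;  -- with the patch, base_t also gets locked in SHARE MODE
commit;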

-- 
Thomas Munro
http://www.enterprisedb.com



Re: ALTER TABLE ADD COLUMN fast default

2018-01-24 Thread Thomas Munro
On Wed, Jan 17, 2018 at 2:21 AM, Andrew Dunstan
 wrote:
> Yeah, got caught by the bki/pg_attribute changes on Friday. here's an
> updated version. Thanks for looking.

A boring semi-automated update:  this no longer compiles, because
8561e4840c8 added a new call to heap_attisnull().  Pretty sure it just
needs NULL in the third argument.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: JIT compiling with LLVM v9.0

2018-01-24 Thread Jeff Davis
On Tue, Jan 23, 2018 at 11:20 PM, Andres Freund  wrote:
> Hi,
>
> I've spent the last weeks working on my LLVM compilation patchset. In
> the course of that I *heavily* revised it. While still a good bit away
> from committable, it's IMO definitely not a prototype anymore.

Great!

A couple high-level questions:

1. I notice a lot of use of the LLVM builder, for example, in
slot_compile_deform(). Why can't you do the same thing you did with
function code, where you create the ".bc" at build time from plain C
code, and then load it at runtime?
2. I'm glad you considered extensions. How far can we go with this in
the future? Can we have bitcode-only extensions that don't need a .so
file? Can we store the bitcode in pg_proc, simplifying deployment and
allowing extensions to travel over replication? I am not asking for
this now, of course, but I'd like to get the idea out there so we
leave room.

Regards,
 Jeff Davis



Re: JIT compiling with LLVM v9.0

2018-01-24 Thread Jeff Davis
On Wed, Jan 24, 2018 at 1:35 PM, Pierre Ducroquet  wrote:
> In LLVM 5.0, it looks like DebugInfo.h is not available in llvm-c, only as a
> C++ API in llvm/IR/DebugInfo.h.

The LLVM APIs don't seem to be very stable; won't there just be a
continuous stream of similar issues?

Pinning major postgresql versions to specific LLVM versions doesn't
seem very appealing. Even if you aren't interested in the latest
changes in LLVM, trying to get the right version on your machine will
be annoying.

Regards,
Jeff Davis



Re: Rangejoin rebased

2018-01-24 Thread Jeff Davis
On Tue, Jan 23, 2018 at 2:04 AM, Simon Riggs  wrote:
> Perhaps we are misunderstanding each other
>
> TIMESTAMP <@ RANGE1 doesn't match if RANGE1 is empty
> and that is the most important case

When <@ is supported, that case should be fine if range1 is on the
outer. The case I was concerned about is with a R1 <@ R2 join where R1
is on the inner side and could have empty ranges.

One option would be to try to force R2 to be on the inner. But that
doesn't quite solve other related issues, like if R2 has a few large
ranges that contain almost everything.

Right now I'm considering an approach where we use some counters to
determine that a few ranges are preventing us from moving the mark
forward, and then move those few ranges into a separate tuplestore so
that we can move the mark forward.

> RANGE OP RANGE is important also. It would be useful for OP to be more
> than just &&

I agree that contains/contained-by are useful; do you have other
operators in mind as well?

>
> It's certainly weird that R1 @> EMPTY is true, but R1 && EMPTY is not.

This was discussed back in 9.2, and there were no obviously better
semantics available. I chose to follow set semantics: X contains Y
means that Y is a subset of X; X overlaps Y means that the
intersection of X and Y is nonempty.

I understand it can be surprising, but other definitions can be surprising, too.
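
To spell out those set semantics with the built-in range type (a quick
illustration, not part of the patch):

select int4range(1,10) @> 'empty'::int4range;  -- true: the empty set is a subset
select int4range(1,10) && 'empty'::int4range;  -- false: the intersection is empty
select 5 <@ 'empty'::int4range;                -- false: the empty range contains no element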

Regards,
 Jeff Davis



Temporary tables prevent autovacuum, leading to XID wraparound

2018-01-24 Thread Tsunakawa, Takayuki
Hello,

I've found a problem where an orphaned temporary table can cause XID 
wraparound.  Our customer encountered this with PG 9.5.2, but I think it 
will also happen with the latest PG.

I'm willing to fix this, but I'd like to ask you what approach we should take.


PROBLEM


The customer has a database for application data, which I call user_db here. 
They don't store application data in the postgres database.

No tables in user_db were autovacuumed for more than a month, leading to user 
table bloat.  Those tables are eligible for autovacuum according to 
pg_stat_all_tables and the autovacuum settings.
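
For example, this is the kind of check we used (standard monitoring view,
nothing special):

select relname, last_autovacuum, n_dead_tup
from pg_stat_all_tables
order by last_autovacuum nulls first;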

age(datfrozenxid) of both the user_db and postgres databases is greater than 
autovacuum_freeze_max_age, so both are eligible for autovacuuming for XID 
wraparound.

There are user tables in user_db whose age(relfrozenxid) is greater than 
autovacuum_freeze_max_age, so those tables should get autovacuum treatment.
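
The wraparound criteria above can be checked with something like:

-- databases past autovacuum_freeze_max_age
select datname, age(datfrozenxid) from pg_database order by 2 desc;

-- oldest tables (including matviews and toast) in the current database
select relname, age(relfrozenxid)
from pg_class
where relkind in ('r', 'm', 't')
order by 2 desc
limit 10;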


CAUSE


The postgres database has a table named pg_temp_3.fetchchunks, whose 
age(relfrozenxid) is greater than autovacuum_freeze_max_age.  This temporary 
table is the culprit.  pg_temp_3.fetchchunks is created by pg_rewind, and the 
customer says they ran pg_rewind.

The autovacuum launcher always chooses postgres, because do_start_worker() 
scans pg_database and finds that the postgres database needs vacuuming for XID 
wraparound.  user_db is never chosen for vacuuming, although it also needs 
vacuuming for XID wraparound.

The autovacuum worker doesn't delete pg_temp_3.fetchchunks, because backend ID 
3 is used by some application session, so the autovacuum worker thinks that 
the backend is active and the temporary table cannot be dropped.
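
For reference, the temporary schemas that exist can be listed like this
(whether a given pg_temp_N counts as in use depends on whether backend
slot N is occupied, which is what the worker checks and is not directly
visible from SQL):

select oid, nspname
from pg_namespace
where nspname ~ '^pg_temp_[0-9]+$';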

I don't know why pg_temp_3.fetchchunks still exists.  Maybe the user ran pg_ctl 
stop -mi while pg_rewind was running.


FIX


I have the following questions.  Along which lines should I proceed to fix the 
problem?

* Why does the autovacuum launcher always choose only one database when that 
database needs vacuuming for XID wraparound?  Shouldn't it also choose other 
databases?

* I think temporary tables should not require vacuuming for XID wraparound.  
Furthermore, should updates/deletes to temporary tables be in-place instead of 
creating garbage, so that any form of vacuum is unnecessary?  Other sessions do 
not need to read temporary tables.

* In this incident, autovacuum worker misjudged that pg_temp_3.fetchchunks 
can't be deleted, although the creator (pg_rewind) is no longer active.  How 
can we delete orphan temporary tables safely?


Regards
Takayuki Tsunakawa






Re: PATCH: Exclude unlogged tables from base backups

2018-01-24 Thread Masahiko Sawada
On Thu, Jan 25, 2018 at 3:25 AM, David Steele  wrote:
> Hi Masahiko,
>
> Thanks for the review!
>
> On 1/22/18 3:14 AM, Masahiko Sawada wrote:
>> On Thu, Dec 14, 2017 at 11:58 PM, Robert Haas  wrote:
>>>
>>> We would also have a problem if the missing file caused something in
>>> recovery to croak on the grounds that the file was expected to be
>>> there, but I don't think anything works that way; I think we just
>>> assume missing files are an expected failure mode and silently do
>>> nothing if asked to remove them.
>>
>> I also couldn't see a problem in this approach.
>>
>> Here is the first review comments.
>>
>> +   unloggedDelim = strrchr(path, '/');
>>
>> I think it doesn't work fine on windows. How about using
>> last_dir_separator() instead?
>
> I think this function is OK on Windows -- we use it quite a bit.
> However, last_dir_separator() is clearer so I have changed it.

Thank you for updating this. I was concerned that the separator
character '/' might not work on Windows.

>
>> 
>> + * Find all unlogged relations in the specified directory and return
>> their OIDs.
>>
>> What the ResetUnloggedrelationsHash() actually returns is a hash
>> table. The comment of this function seems not appropriate.
>
> Fixed.
>
>> +   /* Part of path that contains the parent directory. */
>> +   int parentPathLen = unloggedDelim - path;
>> +
>> +   /*
>> +* Build the unlogged relation hash if the parent path is 
>> either
>> +* $PGDATA/base or a tablespace version path.
>> +*/
>> +   if (strncmp(path, "./base", parentPathLen) == 0 ||
>> +   (parentPathLen >=
>> (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) &&
>> +strncmp(unloggedDelim -
>> (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
>> +TABLESPACE_VERSION_DIRECTORY,
>> +
>> sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
>> +   unloggedHash = ResetUnloggedRelationsHash(path);
>> +   }
>>
>> How about using get_parent_directory() to get parent directory name?
>
> get_parent_directory() munges the string that is passed to it which I
> was trying to avoid (we'd need a copy) - and I don't think it makes the
> rest of the logic any simpler without constructing yet another string to
> hold the tablespace path.

Agreed.

>
> I know performance isn't the most important thing here, so if the
> argument is for clarity perhaps it makes sense. Otherwise I don't know
> if it's worth it.
>
>> Also, I think it's better to destroy the unloggedHash after use.
>
> Whoops! Fixed.
>
>> +   /* Exclude all forks for unlogged tables except the
>> init fork. */
>> +   if (unloggedHash && ResetUnloggedRelationsMatch(
>> +   unloggedHash, de->d_name) == unloggedOther)
>> +   {
>> +   elog(DEBUG2, "unlogged relation file \"%s\"
>> excluded from backup",
>> +de->d_name);
>> +   continue;
>> +   }
>>
>> I think it's better to log this debug message at DEBUG2 level for
>> consistency with other messages.
>
> I think you mean DEBUG1?  It's already at DEBUG2.

Oops, yes I meant DEBUG1.

>
> I considered using DEBUG1 but decided against it.  The other exclusions
> will produce a limited amount of output because there are only a few of
> them.  In the case of unlogged tables there could be any number of
> exclusions and I thought that was too noisy for DEBUG1.

IMO it's okay to output many unlogged tables for debugging purposes, but I
see your point.

>
>> +   ok(!-f "$tempdir/tbackup/tblspc1/$tblspc1UnloggedBackupPath",
>> +   'unlogged imain fork not in tablespace backup');
>>
>> s/imain/main/
>
> Fixed.
>
>> If a new unlogged relation is created after the unloggedHash is constructed
>> but before the files are sent, we cannot exclude such a relation. That would
>> not be a problem if taking the backup does not take long, because the new
>> unlogged relation is unlikely to become very large. However, if taking a
>> backup takes a long time, we could include a large main fork in the
>> backup.
>
> This is a good point.  It's per database directory which makes it a
> little better, but maybe not by much.
>
> Three options here:
>
> 1) Leave it as is knowing that unlogged relations created during the
> backup may be copied and document it that way.
>
> 2) Construct a list for SendDir() to work against so the gap between
> creating that and creating the unlogged hash is as small as possible.
> The downside here is that the list may be very large and take up a lot
> of memory.
>
> 3) Check each file that looks like a relation in the loop to see if it
> has an init fork.  This might affect performance since an
> opendir/readdir loop would be required for every relation.
>
> Personally, I'm in favor of #1, at least for the 

Re: Further cleanup of pg_dump/pg_restore item selection code

2018-01-24 Thread David G. Johnston
On Wednesday, January 24, 2018, Tom Lane  wrote:
>
> I think you might be missing one of the main points here, which is
> that --create is specified as causing both a CREATE DATABASE and a
> reconnect to that database.
>

I missed that implication, though I read it and even thought about questioning
that aspect (in pg_dump, though, not pg_restore).

Should pg_restore fail if asked to --create without a database entry in the
TOC?

David J.


Re: Further cleanup of pg_dump/pg_restore item selection code

2018-01-24 Thread David G. Johnston
On Wednesday, January 24, 2018, Tom Lane  wrote:
>
>
> This does not go all the way towards making pg_restore's item selection
> switches dump subsidiary objects the same as pg_dump would, because
> pg_restore doesn't really have enough info to deal with indexes and
> table constraints the way pg_dump does.  Perhaps we could intuit the
> parent table by examining object dependencies, but that would be a
> bunch of new logic that seems like fit material for a different patch.
> In the meantime I just added a disclaimer to pg_restore.sgml explaining
> this.
>

Unless we really wanted to keep prior-version compatibility, it would seem
more reliable to have pg_dump generate a new TOC entry type describing
these dependencies and have pg_restore read and interpret them during
selective restore mode: basically a cross-reference saying "if you restore A,
also restore B", where A can only be tables and B indexes and constraints
(if we can get away with it being that limited initially).

David J.


Re: CREATE ROUTINE MAPPING

2018-01-24 Thread David Fetter
On Thu, Jan 18, 2018 at 04:09:13PM -0500, Corey Huinker wrote:
> >
> >
> > >
> > > But other situations seem un-handle-able to me:
> > >
> > > SELECT remote_func1(l.x) FROM local_table l WHERE l.active = true;
> >
> > Do we have any way, or any plan to make a way, to push the set (SELECT
> > x FROM local_table WHERE active = true) to the remote side for
> > execution there?  Obviously, there are foreign DBs that couldn't
> > support this, but I'm guessing they wouldn't have much by way of UDFs
> > either.
> >
> 
> No. The remote query has to be generated at planning time, so it can't make
> predicates out of anything that can't be resolved into constants by the
> planner itself. The complexities of doing so would be excessive; it is far
> better to let the application developer split the queries up, because they
> know better which parts have to resolve first.

So Corey and I, with lots of inputs from Andrew Gierth and Matheus
Oliveira, have come up with a sketch of how to do this, to wit:

- Extend CREATE FUNCTION to take either FOREIGN and SERVER or AS and
  LANGUAGE as parameters, but not both. This seems simpler, at least
  in a proof of concept, than creating SQL standard compliant grammar
  out of whole cloth.  The SQL standard grammar could be layered in
  later via the rewriter if this turns out to work.

- In pg_proc, store foreign functions as having a new language,
  sql_med, which doesn't actually exist.  This "language" would
  function as a hint to the planner.

- Add a new system catalog for foreign functions that references
  pg_proc and pg_foreign_server. Writing to it would also do the usual
  stuff with pg_depend.

- During planning, at least to start, we'd ensure that foreign
  functions can only take arguments on the same server.

- Once it's established that the combinations could actually work,
  execution gets pushed to the foreign server(s)

What say?
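
To make the first point above concrete, a rough sketch of what such a
declaration might look like (the keyword placement is only a guess, not
settled grammar; remote_srv and remote_func1 are made-up names):

-- Hypothetical syntax sketch only; nothing here is committed grammar.
create server remote_srv foreign data wrapper postgres_fdw
  options (host 'remote.example.com', dbname 'appdb');

-- Proposed form: FOREIGN ... SERVER instead of AS ... LANGUAGE.
create function remote_func1(x integer) returns integer
  foreign server remote_srv;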

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Robert Haas
On Wed, Jan 24, 2018 at 11:02 PM, Tom Lane  wrote:
> I may be wasting my breath here, but in one more attempt to convince
> you that "time make check" on your laptop is not the only number that
> anyone should be interested in, ...

Now that is not what I said, or at least not what I intended to say.
I'm taking the position that what happens on a developer laptop is
more relevant than what happens on an ancient buildfarm critter.  I am
NOT taking the position that my particular laptop or even developer
laptops generally are the *only* thing that matters.  I gave the
numbers from my laptop because it's the one I can test.  I cannot
easily test yours.

> What I take away here is that there's been a pretty steep cost increase
> for the regression tests since v10, and that is *not* in line with the
> historical average.  In fact, in most years we've bought enough speedup
> through performance improvements to pay for the test cases we added.
> This is masked if you just eyeball "make check" compared to several years
> ago.  But to do that, you have to ignore the fact that we made substantial
> improvements in the runtime of initdb as well as the regression tests
> proper circa v10, and we've now thrown that away and more.

OK, I can see some increase there.  It's definitely more for you than
it is for me, since you see something like a 10% slowdown between 10
and master and I see basically no difference.  I don't know why that
should be, but I'm not doubting you.

> So I remain dissatisfied with these results, particularly because in
> my own work habits, the time for "make installcheck-parallel" is way
> more interesting than "make check".  I avoid redoing installs and
> initdbs if I don't need them.

I'm not that efficient, but noted.

>> ... Even if that meant that you had
>> to wait 1 extra second every time you run 'make check', I would judge
>> that worthwhile.
>
> I think this is a bad way of looking at it.  Sure, in terms of
> one developer doing one test run, a second or two means nothing.
> But for instance, if you want to do 100 test runs in hope of catching
> a seldom-reproduced bug, it adds up.  It also adds up when you consider
> the aggregate effort expended by the buildfarm, or the time you have
> to wait to see buildfarm results.

Sure, but as Andres said, you also have to consider how much developer
time it takes to recoup the savings.  If it takes a day of development
time to save a second of regression test time, that might be worth it;
if it takes a month, color me doubtful, especially if the result is a
more fragile test that happens to pass on all of the buildfarm
critters we have now but might fail spuriously when somebody adds a
new one.

> Yeah.  What I thought this argument was about was convincing *you*
> that that would be a reasonable patch to apply.  It seems from my
> experiment on gaur that that patch makes the results unstable, so
> if we can do it at all it will need more work.  But I do think
> it's worth putting in some more sweat here.  In the long run the
> time savings will add up.

Why me?  Thomas wrote the patch, Andres committed it, and you
complained about it.  I'm just along for the ride.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Further cleanup of pg_dump/pg_restore item selection code

2018-01-24 Thread Tom Lane
"David G. Johnston"  writes:
> On Wednesday, January 24, 2018, Tom Lane  wrote:
>> The same behaviors occur if you do e.g.
>>  pg_dump -Fc -t sometable somedb | pg_restore --create
>> which is another undocumented misbehavior: the docs for pg_restore do not
>> suggest that --create might fail if the source archive was selective.

> pg_restore -t:

> "When -t is specified, pg_restore makes no attempt to restore any other
> database objects that the selected table(s) might depend upon. Therefore,
> there is no guarantee that a specific-table restore into a clean database
> will succeed."

I think you might be missing one of the main points here, which is
that --create is specified as causing both a CREATE DATABASE and a
reconnect to that database.  If it silently becomes a no-op, which
is what happens today, the restore is likely to go into the wrong
database entirely (or at least not the DB the user expected).

I won't deny the possibility that some people are depending on the
existing wrong behavior, but that's a hazard with any bug fix.

regards, tom lane



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Tom Lane
Robert Haas  writes:
> There is no need to collect years of data in order to tell whether or
> not the time to run the tests has increased by as much on developer
> machines as it has on prairiedog.  You showed the time going from 3:36
> to 8:09 between 2014 and the present.  That is a 2.26x increase.  It
> is obvious from the numbers I posted before that no such increase has
> taken place in the time it takes to run 'make check' on my relatively
> modern laptop.  Whatever difference exists is measured in
> milliseconds.

I may be wasting my breath here, but in one more attempt to convince
you that "time make check" on your laptop is not the only number that
anyone should be interested in, here are some timings off my development
workstation.  These are timings off current tip of each release branch,
all the same build options etc, so very comparable:

9.4
[ don't remember the equivalent to top-level make temp-install here ]
top-level make check 24.725s
installcheck-parallel 15.383s
installcheck 27.560s

9.5
make temp-install 3.702s
initdb 2.328s
top-level make check 24.709s
installcheck-parallel 16.632s
installcheck 32.427s

9.6
make temp-install 3.971s
initdb 2.178s
top-level make check 24.048s
installcheck-parallel 15.889s
installcheck 32.775s

10
make temp-install 4.051s
initdb 1.363s
top-level make check 21.784s
installcheck-parallel 15.209s
installcheck 31.938s

HEAD
make temp-install 4.048s
initdb 1.361s
top-level make check 24.027s
installcheck-parallel 16.914s
installcheck 35.745s

I copied-and-pasted the "real time" results of time(1) for each of these,
not bothering to round them off; but the numbers are only reproducible
to half a second or so, so there's no significance in the last couple
digits.  Most numbers above are the minimum of 2 or more runs.

What I take away here is that there's been a pretty steep cost increase
for the regression tests since v10, and that is *not* in line with the
historical average.  In fact, in most years we've bought enough speedup
through performance improvements to pay for the test cases we added.
This is masked if you just eyeball "make check" compared to several years
ago.  But to do that, you have to ignore the fact that we made substantial
improvements in the runtime of initdb as well as the regression tests
proper circa v10, and we've now thrown that away and more.

So I remain dissatisfied with these results, particularly because in
my own work habits, the time for "make installcheck-parallel" is way
more interesting than "make check".  I avoid redoing installs and
initdbs if I don't need them.

> ... Even if that meant that you had
> to wait 1 extra second every time you run 'make check', I would judge
> that worthwhile.

I think this is a bad way of looking at it.  Sure, in terms of
one developer doing one test run, a second or two means nothing.
But for instance, if you want to do 100 test runs in hope of catching
a seldom-reproduced bug, it adds up.  It also adds up when you consider
the aggregate effort expended by the buildfarm, or the time you have
to wait to see buildfarm results.

> Another thing you could do is consider applying the patch Thomas
> already posted to reduce the size of the tables involved.

Yeah.  What I thought this argument was about was convincing *you*
that that would be a reasonable patch to apply.  It seems from my
experiment on gaur that that patch makes the results unstable, so
if we can do it at all it will need more work.  But I do think
it's worth putting in some more sweat here.  In the long run the
time savings will add up.

regards, tom lane



Further cleanup of pg_dump/pg_restore item selection code

2018-01-24 Thread David G. Johnston
On Wednesday, January 24, 2018, Tom Lane  wrote:

> and it has a bunch of strange
> behaviors that can really only be characterized as bugs.  An example is
> that
>
> pg_dump --create -t sometable somedb
>
>
pg_dump -t:

"The -n and -N switches have no effect when -t is used, because tables
selected by -t will be dumped regardless of those switches, and non-table
objects will not be dumped."

A database is a non-table object... though the proposed behavior seems
reasonable, it is probably worth pointing out explicitly, just as the -n
switches are.

(This has probably been asked before, but why 'no effect' instead of 'will
cause the dump to fail before it begins'?)


> The same behaviors occur if you do e.g.
>
> pg_dump -Fc -t sometable somedb | pg_restore --create
>
> which is another undocumented misbehavior: the docs for pg_restore do not
> suggest that --create might fail if the source archive was selective.
>

pg_restore -t:

"When -t is specified, pg_restore makes no attempt to restore any other
database objects that the selected table(s) might depend upon. Therefore,
there is no guarantee that a specific-table restore into a clean database
will succeed."

That -t was specified on the dump instead of the restore could be clarified,
but the caveat applies nonetheless.

Do we document anywhere what is a "database object" and what are "property
objects" that are restored even when -t is specified?


> Attached is a patch that proposes to clean this up.  It arranges
> things so that CREATE DATABASE (and, if --clean, DROP DATABASE)
> is emitted if and only if you said --create, regardless of other
> selectivity switches.


Makes sense; the bit about it applying regardless of other switches should
probably find its way into the docs.

Adding a DROP DATABASE where there never was one before is a bit
disconcerting though... even if the switches imply the user should be
expecting one.


>
>  And it fixes selective pg_restore to include
> subsidiary ACLs, comments, and security labels.


+1

David J.


RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-24 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> I think open_datasync will be worse on systems where fsync() is expensive
> -- it forces the data out to disk immediately, even if the data doesn't
> need to be flushed immediately.  That's bad, because we wait immediately
> when we could have deferred the wait until later and maybe gotten the WAL
> writer to do the work in the background.  But it might be better on systems
> where fsync() is basically free, because there you might as well just get
> it out of the way immediately and not leave something left to be done later.
> 
> This is just a guess, of course.  You didn't mention what the underlying
> storage for your test was?

Uh, your guess was correct.  My file system was ext3, where fsync() writes out 
all dirty buffers in the page cache.

As you said, open_datasync was 20% faster than fdatasync on RHEL 7.2, on an LVM 
volume with ext4 (mounted with noatime, nobarrier) on PCIe flash memory.

5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync             50829.597 ops/sec      20 usecs/op
        fdatasync                 42094.381 ops/sec      24 usecs/op
        fsync                     42209.972 ops/sec      24 usecs/op
        fsync_writethrough                      n/a
        open_sync                 48669.605 ops/sec      21 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync             26366.373 ops/sec      38 usecs/op
        fdatasync                 33922.725 ops/sec      29 usecs/op
        fsync                     32990.209 ops/sec      30 usecs/op
        fsync_writethrough                      n/a
        open_sync                 24326.249 ops/sec      41 usecs/op

What do you think about changing the default value of wal_sync_method on Linux 
in PG 11?  I can understand the concern that users might hit performance 
degradation if they are using PostgreSQL on older systems.  But it's also a 
waste that many users don't notice the benefits of wal_sync_method = 
open_datasync on new systems.
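
For anyone wanting to try it on a suitable system (wal_sync_method only
needs a configuration reload; benchmark with pg_test_fsync first, as
above):

alter system set wal_sync_method = 'open_datasync';
select pg_reload_conf();
show wal_sync_method;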

Regards
Takayuki Tsunakawa




Re: [HACKERS] Support to COMMENT ON DATABASE CURRENT_DATABASE

2018-01-24 Thread Jing Wang
>Not surprisingly, this patch no longer applies in the wake of commit
>b3f840120.  Rather than rebasing the pg_dump portions, I would suggest
>you just drop them.

It has been removed from the pg_dump code.

>I notice some other patch application failures in dbcommands.c,
>objectaddress.c, and user.c, so there's more work to do besides
>this
Yes, fixed.

Please find enclosed the latest patch with the above fixes.


Regards,
Jing Wang
Fujitsu Australia


support_CURRENT_DATABASE_keyword_v4.7.patch
Description: Binary data


Re: non-bulk inserts and tuple routing

2018-01-24 Thread Amit Langote
On 2018/01/24 17:25, Amit Langote wrote:
> On 2018/01/20 7:07, Robert Haas wrote:
>> On Fri, Jan 19, 2018 at 3:56 AM, Amit Langote wrote:
>>> I rebased the patches, since they started conflicting with a recently
>>> committed patch [1].
>>
>> I think that my latest commit has managed to break this pretty thoroughly.
> 
> I rebased it.  Here are the performance numbers again.
> 
> * Uses following hash-partitioned table:
> 
> create table t1 (a int, b int) partition by hash (a);
> create table t1_x partition of t1 for values with (modulus M, remainder R)
> ...
> 
> 
> * Non-bulk insert uses the following code (insert 100,000 rows one-by-one):
> 
> do $$
> begin
>   for i in 1..100000 loop
> insert into t1 values (i, i+1);
>   end loop;
> end; $$;
> 
> Times in milliseconds:
> 
> #parts        HEAD    Patched
>  8   6148.313   4938.775
> 16   8882.420   6203.911
> 32  14251.072   8595.068
> 64  24465.691  13718.161
>128  45099.435  23898.026
>256  87307.332  44428.126
> 
> * Bulk-inserting 100,000 rows using COPY:
> 
> copy t1 from '/tmp/t1.csv' csv;
> 
> Times in milliseconds:
> 
> #parts        HEAD    Patched
> 
>      8     466.170    446.865
>     16     445.341    444.990
>     32     443.544    487.713
>     64     460.579    435.412
>    128     469.953    422.403
>    256     463.592    431.118

Rebased again.

Thanks,
Amit
From 250fe77e304a0f0841f2d7978405c2a32fdeb7a2 Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 19 Dec 2017 10:43:45 +0900
Subject: [PATCH v4 1/3] Teach CopyFrom to use ModifyTableState for
 tuple-routing

This removes all fields of CopyStateData that were meant for
tuple routing and instead uses ModifyTableState that has all those
fields, including transition_tupconv_maps.  In COPY's case,
transition_tupconv_maps is only required if tuple routing is being
used, so it's safe.
---
 src/backend/commands/copy.c | 26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 04a24c6082..251676b321 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -166,9 +166,6 @@ typedef struct CopyStateData
boolvolatile_defexprs;  /* is any of defexprs volatile? 
*/
List   *range_table;
 
-   /* Tuple-routing support info */
-   PartitionTupleRouting *partition_tuple_routing;
-
TransitionCaptureState *transition_capture;
 
/*
@@ -2285,6 +2282,7 @@ CopyFrom(CopyState cstate)
ResultRelInfo *resultRelInfo;
ResultRelInfo *saved_resultRelInfo = NULL;
EState *estate = CreateExecutorState(); /* for ExecConstraints() */
+   ModifyTableState *mtstate = makeNode(ModifyTableState);
ExprContext *econtext;
TupleTableSlot *myslot;
MemoryContext oldcontext = CurrentMemoryContext;
@@ -2303,6 +2301,8 @@ CopyFrom(CopyState cstate)
SizebufferedTuplesSize = 0;
int firstBufferedLineNo = 0;
 
+   PartitionTupleRouting *proute = NULL;
+
Assert(cstate->rel);
 
/*
@@ -2468,10 +2468,15 @@ CopyFrom(CopyState cstate)
 */
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
-   PartitionTupleRouting *proute;
+   ModifyTable *node = makeNode(ModifyTable);
 
-   proute = cstate->partition_tuple_routing =
-   ExecSetupPartitionTupleRouting(NULL, cstate->rel, 1, 
estate);
+   /* Just need make this field appear valid. */
+   node->nominalRelation = 1;
+   mtstate->ps.plan = (Plan *) node;
+   mtstate->ps.state = estate;
+   mtstate->resultRelInfo = resultRelInfo;
+   proute = mtstate->mt_partition_tuple_routing =
+   ExecSetupPartitionTupleRouting(mtstate, cstate->rel, 1, 
estate);
 
/*
 * If we are capturing transition tuples, they may need to be
@@ -2496,7 +2501,7 @@ CopyFrom(CopyState cstate)
if ((resultRelInfo->ri_TrigDesc != NULL &&
 (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
  resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
-   cstate->partition_tuple_routing != NULL ||
+   mtstate->mt_partition_tuple_routing != NULL ||
cstate->volatile_defexprs)
{
useHeapMultiInsert = false;
@@ -2571,10 +2576,9 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 
/* Determine the partition to heap_insert the tuple into */
-   if (cstate->partition_tuple_routing)
+   if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
  

Re: Failed to request an autovacuum work-item in silence

2018-01-24 Thread Masahiko Sawada
On Thu, Jan 25, 2018 at 12:14 AM, Fabrízio de Royes Mello
 wrote:
>
>
> On Wed, Jan 24, 2018 at 12:31 PM, Fabrízio de Royes Mello
>  wrote:
>>
>>
>>
>> On Tue, Jan 23, 2018 at 11:44 PM, Masahiko Sawada 
>> wrote:
>> >
>> > On Tue, Jan 23, 2018 at 8:03 PM, Fabrízio de Royes Mello
>> >  wrote:
>> > >
>> > > Em ter, 23 de jan de 2018 às 03:36, Masahiko Sawada
>> > > 
>> > > escreveu:
>> > >>
>> > >> Hi all,
>> > >>
>> > >> While reading the code, I realized that requesting an autovacuum
>> > >> work-item could fail silently if the work-item array is full, so the
>> > >> users cannot tell that the work-item is never performed.
>> > >> AutoVacuumRequestWork() seems to have behaved this way from the initial
>> > >> implementation, but is there any reason for such behavior? It seems to
>> > >> me that it can be a problem even now that there is only one kind of
>> > >> work-item. Attached is a patch fixing it.
>> > >
>> > >
>> > > Seems reasonable but maybe you can use the word "worker" instead of
>> > > "work
>> > > item" for report message.
>> > >
>> >
>> > Thank you for the comment.
>> > Or can we use the word "work-item" since the commit log and source
>> > code use this word?
>> >
>>
>> You're correct, I misunderstood it thinking about autovacuum workers and
>> not the internal workitem array.
>>
>> As NUM_WORKITEMS is fixed at 256, we don't have any real feedback on whether
>> in a real workload this could send a lot of messages to the log, so:
>> 1) maybe invent a new GUC to control whether we need to send this info to
>> the log
>> 2) change the elevel to DEBUG1
>>

Hmm, I'd rather log the message at LOG level and have a new GUC to
control the maximum size of the work-items array.  I think we should
always notify users when a work-item couldn't be added, because it can
become a potential performance problem.  It might also lead to other
problems once we have other types of work-items.  In the log message, we
can hint to the user that they might want to increase the maximum size
of the work-items array.

>
> Looking more closely at the autovacuum code, your patch will work just for the
> BRIN request "brin_summarize_range", correct?
>

Correct.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-24 Thread Thomas Munro
On Thu, Jan 25, 2018 at 9:28 AM, Peter Geoghegan  wrote:
> On Wed, Jan 24, 2018 at 12:13 PM, Thomas Munro
>  wrote:
>> On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan  wrote:
>>> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>>>
>>> 1. The problem of fork failure causing nbtsort.c to wait forever is a
>>> real problem. Sure enough, the coding pattern within
>>> _bt_leader_heapscan() can cause us to wait forever even with commit
>>> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
>>> consequence of the patch not using tuple queues (it uses the new
>>> tuplesort sharing thing instead).
>>
>> Just curious: does the attached also help?
>
> I can still reproduce the problem without the fix I described (which
> does work), using your patch instead.
>
> Offhand, I suspect that the way you set ParallelMessagePending may not
> always leave it set when it should be.

Here's a version that works, and a minimal repro test module thing.
Without 0003 applied, it hangs.  With 0003 applied, it does this:

postgres=# call test_fork_failure();
CALL
postgres=# call test_fork_failure();
CALL
postgres=# call test_fork_failure();
ERROR:  lost connection to parallel worker
postgres=# call test_fork_failure();
ERROR:  lost connection to parallel worker

I won't be surprised if 0003 is judged to be a horrendous abuse of the
interrupt system, but these patches might at least be useful for
understanding the problem.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Chaos-monkey-fork-failure.patch
Description: Binary data


0002-A-simple-test-module-that-hangs-on-fork-failure.patch
Description: Binary data


0003-Pessimistic-fork-failure-detector.patch
Description: Binary data


Re: [HACKERS] Optional message to user when terminating/cancelling backend

2018-01-24 Thread Michael Paquier
On Wed, Jan 24, 2018 at 12:45:48PM -0300, Alvaro Herrera wrote:
> /me wonders if there's anything that would suggest to make this
> extensive to processes other than backends ...

That's a good thought. Right now ProcessInterrupts() is not used by
non-backend processes. For example, the WAL receiver has its own logic to
handle interrupts, but I guess that we could have an interface that can
be plugged into any process, enabled by default for bgworkers.
--
Michael




Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

2018-01-24 Thread Michael Paquier
On Wed, Jan 24, 2018 at 12:43:51PM -0500, Stephen Frost wrote:
> * chenhj (chjis...@163.com) wrote:
>> At 2018-01-23 09:56:48, "Stephen Frost"  wrote:
>>> I've only read through the thread to try and understand what's going on
>>> and the first thing that comes to mind is that you're changing
>>> pg_rewind to not remove the WAL from before the divergence (split)
>>> point, but I'm not sure why.  As noted, that WAL isn't needed for
>>> anything (it's from before the split, after all), so why keep it?  Is
>>> there something in this optimization that depends on the old WAL being
>>> there and, if so, what and why?
>> 
>> After run pg_rewind, the first startup of postgres will do crash recovery.
>> And crash recovery will begin from the previous redo point preceding the 
>> divergence.
>> So, the WAL after the redo point and before the divergence is needed.
> 
> Right.

Most of the time, and particularly since v11 removed the need to
retain more past segments than one completed checkpoint, those segments
are less likely to be on the source server, which further limits the
impact of the patch discussed on this thread.

>> Of course, the WAL before the redo point is not needed, but in my point of 
>> view,
>> recycling unwanted WAL does not have to be done by pg_rewind.
> 
> That's what pg_rewind has been doing though, isn't it?  And it's not
> like that WAL is useful for anything, is it?  That's also how
> pg_basebackup works.

As of HEAD, pg_rewind handles data in pg_wal like other paths which are
not relation files: the files from the source are just blindly copied to
the target. After the rewind, once recovery begins, we just let the
startup process do the cleanup instead of pg_rewind. As for pg_basebackup
being different: you get to choose what you want using --wal-method, and
you get only the segments needed for a self-contained base backup.

>>> That's also different from how pg_basebackup works, which I don't think
>>> is good (seems like pg_rewind should operate in a pretty similar manner
>>> to pg_basebackup).
>> 
>> Thanks for your comments!
>> I also considered copy WAL just like how pg_basebackup does,but a
>> implement similar to pg_basebackup's manner may be not so simple. 
> 
> Using the replication protocol to fetch WAL would be a good thing to do
> (actually, making pg_rewind entirely work through a connection to the
> current primary would be great) but that's independent of what I'm
> asking for here.  Here I'm just suggesting that we not change what
> pg_rewind is doing today when it comes to the existing WAL on the
> old-primary.

Yes, superuser is necessary now; if we could get to a point where only
replication permission is needed, that would be nice. We could also do
things differently: have a system role dedicated to pg_rewind which works
only with the functions from genfile.c that pg_rewind needs, in order to
remove the need for a superuser.

>> And the WAL which contains the previous redo point preceding the
>> divergence may be only exists in target server and had been recycled
>> in source. That's different between pg_rewind and pg_basebackup. 
> 
> Hm, pg_rewind was removing that and expecting it to be on the new
> primary?  If that's the case then I could see an argument for keeping
> WAL that's from the divergence point onward, but I still don't think
> we should have pg_rewind just leave all of the prior WAL in place.

Another thing we could do is simply not fetch any WAL files at all
during a rewind, and then let the startup process of the rewound server
decide by itself what it needs. This would reduce the data transferred in
all cases. It is easy to define the start point of the WAL segments
needed by a rewound server, because the last checkpoint record before the
WAL forked is calculated before transferring any data. The end point
cannot be exact, though, because you don't know up to which point you
should transfer. In some ways, this is close to a base backup. We could
also define an end point to minimize the amount of WAL, such as the last
segment completed before the data transfer begins, but then you need to
worry about holes in the WAL segments and such. At the end of the day,
just not transferring any data from pg_wal looks more solid to me as a
first step, if we need to worry about data that is transferred but ends
up being useless.
--
Michael




Re: [HACKERS] SERIALIZABLE with parallel query

2018-01-24 Thread Thomas Munro
On Wed, Dec 13, 2017 at 5:30 PM, Thomas Munro
 wrote:
> On Wed, Dec 13, 2017 at 2:09 PM, Haribabu Kommi
>  wrote:
>> Thanks for explaining the problem in generating an isolation test to
>> test the serialize parallel query.
>>
>> Committer can decide whether existing test is fine to part of the test suite
>> or remove it, other than that everything is fine. so I am moving the patch
>> into "ready for committer" state.
>
> Thank you!  I will try to find a good benchmark that will really
> exercise parallel query + SSI.

This started crashing some time yesterday with an assertion failure in
the isolation tests after commit 2badb5af landed.  Reordering of code
in parallel.c confused patch's fuzz heuristics, leading to
SetSerializableXact() being called too soon.  Here's a fix for that.

-- 
Thomas Munro
http://www.enterprisedb.com


ssi-parallel-v10.patch
Description: Binary data


Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Peter Geoghegan
On Wed, Jan 24, 2018 at 3:37 PM, Robert Haas  wrote:
> Well, I've been resisting that approach from the very beginning of
> parallel query.  Eventually, I hope that we're going to go in the
> direction of changing our mind about how many workers parallel
> operations use "on the fly".  For example, if there are 8 parallel
> workers available and 4 of them are in use, and you start a query (or
> index build) that wants 6 but only gets 4, it would be nice if the
> other 2 could join later after the other operation finishes and frees
> some up.

That seems like a worthwhile high-level goal.

I remember looking into Intel Threading Building Blocks many years
ago, and seeing some interesting ideas there. According to Wikipedia,
"TBB implements work stealing to balance a parallel workload across
available processing cores in order to increase core utilization and
therefore scaling". The programmer does not operate in terms of an
explicit number of threads, and there are probably certain types of
problems that this has an advantage with.

That model also has its costs, though, and I don't think it's ever
going to supplant a lower-level approach. In an ideal world, you have
both things, because TBB's approach apparently has high coordination
overhead on many-core systems.

> That, of course, won't work very well if parallel operations
> are coded in such a way that the number of workers must be nailed down
> at the very beginning.

But my whole approach to sorting is based on the idea that each worker
produces a roughly even amount of output to merge. I don't see any
scope to do better for parallel CREATE INDEX. (Other uses for parallel
sort are another matter, though.)

> Now maybe all that seems like pie in the sky, and perhaps it is, but I
> hold out hope.  For queries, there is another consideration, which is
> that some queries may run with parallelism but actually finish quite
> quickly - it's not desirable to make the leader wait for workers to
> start when it could be busy computing.  That's a lesser consideration
> for bulk operations like parallel CREATE INDEX, but even there I don't
> think it's totally negligible.

Since I don't have to start this until the leader stops participating
as a worker, there is no wait in the leader. In the vast majority of
cases, a call to something like WaitForParallelWorkersToAttach() ends
up looking at state in shared memory, immediately determining that
every launched process initialized successfully. The overhead should
be negligible in the real world.

> For both reasons, it's much better, or so it seems to me, if parallel
> operations are coded to work with the number of workers that show up,
> rather than being inflexibly tied to a particular worker count.

I've been clear from day one that my approach to parallel tuplesort
isn't going to be that useful to parallel query in its first version.
You need some kind of partitioning (a distribution sort of some kind)
for that, and probably plenty of cooperation from within the executor.
I've also said that I don't think we can do much better for parallel
CREATE INDEX even *with* support for partitioning, which is something
borne out by comparisons with other systems. My patch was always
presented as an 80/20 solution.

I have given you specific technical reasons why I think that using a
barrier is at least a bad idea for nbtsort.c, and probably for
nodeGather.c, too. Those problems will need to be worked through if
you're not going to concede the point on using a barrier. Your
aspirations around not assuming that workers cannot join later seem
like good ones, broadly speaking, but they are not particularly
applicable to how *anything* happens to work now.

Besides all this, I'm not even suggesting that I need to know the
number of workers up front for parallel CREATE INDEX. Perhaps
nworkers_launched can be incremented after the fact following some
later enhancement to the parallel infrastructure, in which case
parallel CREATE INDEX will theoretically be prepared to take advantage
right away (though other parallel sort operations seem more likely to
*actually* benefit). That will be a job for the parallel
infrastructure, though, not for each and every parallel operation --
how else could we possibly hope to add more workers that become
available half way through, as part of a future enhancement to the
parallel infrastructure? Surely every caller to
CreateParallelContext() should not need to invent their own way of
doing this.

All I want is to be able to rely on nworkers_launched. That's not in
tension with this other goal/aspiration, and actually seems to
complement it.

-- 
Peter Geoghegan



Re: reducing isolation tests runtime

2018-01-24 Thread Tom Lane
Alvaro Herrera  writes:
> On the subject of test total time, we could paralelize isolation tests.
> Right now "make check" in src/test/isolation takes 1:16 on my machine.
> Test "timeouts" takes full 40s of that, with nothing running in parallel
> -- the machine is completely idle.

BTW, one small issue there is that the reason the timeouts test is so
slow is that we have to use multi-second timeouts to be sure slower
buildfarm critters (eg valgrind animals) will get the expected results.
So I'm worried that if the machine isn't otherwise idle, we will get
random failures.

We could parallelize the rest of those tests and leave timeouts in its own
group.  That cuts the payback a lot :-( but might still be worth doing.
Or maybe tweak things so that the buildfarm runs a serial schedule but
manual testing doesn't.  Or we could debate how important the timeout
tests really are ... or think harder about how to make them reproducible.

regards, tom lane



Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Robert Haas
On Wed, Jan 24, 2018 at 5:52 PM, Peter Geoghegan  wrote:
>> If we made the Gather node wait only for workers that actually reached
>> the Gather -- either by using a Barrier or by some other technique --
>> then this would be a lot less fragile.  And the same kind of technique
>> would work for parallel CREATE INDEX.
>
> The use of a barrier has problems of its own [1], though, of which one
> is unique to the parallel_leader_participation=off case. I thought
> that you yourself agreed with this [2] -- do you?
>
> Another argument against using a barrier in this way is that it seems
> like way too much mechanism to solve a simple problem. Why should a
> client of parallel.h not be able to rely on nworkers_launched (perhaps
> only after "verifying it can be trusted")? That seem like a pretty
> reasonable requirement for clients to have for any framework for
> parallel imperative programming.

Well, I've been resisting that approach from the very beginning of
parallel query.  Eventually, I hope that we're going to go in the
direction of changing our mind about how many workers parallel
operations use "on the fly".  For example, if there are 8 parallel
workers available and 4 of them are in use, and you start a query (or
index build) that wants 6 but only gets 4, it would be nice if the
other 2 could join later after the other operation finishes and frees
some up.  That, of course, won't work very well if parallel operations
are coded in such a way that the number of workers must be nailed down
at the very beginning.

Now maybe all that seems like pie in the sky, and perhaps it is, but I
hold out hope.  For queries, there is another consideration, which is
that some queries may run with parallelism but actually finish quite
quickly - it's not desirable to make the leader wait for workers to
start when it could be busy computing.  That's a lesser consideration
for bulk operations like parallel CREATE INDEX, but even there I don't
think it's totally negligible.

For both reasons, it's much better, or so it seems to me, if parallel
operations are coded to work with the number of workers that show up,
rather than being inflexibly tied to a particular worker count.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: reducing isolation tests runtime

2018-01-24 Thread Tom Lane
Robert Haas  writes:
> On Wed, Jan 24, 2018 at 6:10 PM, Alvaro Herrera  
> wrote:
>> On the subject of test total time, we could paralelize isolation tests.

> Oh, cool.  Yes, the time the isolation tests take to run is quite
> annoying.  I didn't realize it would be so easy to run it in parallel.

+1 to both --- I hadn't realized we had enough infrastructure to do this
in parallel, either.

regards, tom lane



Re: CONSTANT/NOT NULL/initializer properties for plpgsql record variables

2018-01-24 Thread Tom Lane
I wrote:
> I said a couple of times in recent threads that it wouldn't be too hard
> to implement $SUBJECT given the other patches I've been working on.

Here's a version rebased up to HEAD, with a trivial merge conflict fixed.

This now needs to be applied over the patches in
https://postgr.es/m/833.1516834...@sss.pgh.pa.us
and
https://postgr.es/m/8376.1516835...@sss.pgh.pa.us

regards, tom lane

diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile
index 2190eab..3ac64e2 100644
*** a/src/pl/plpgsql/src/Makefile
--- b/src/pl/plpgsql/src/Makefile
*** DATA = plpgsql.control plpgsql--1.0.sql 
*** 26,32 
  
  REGRESS_OPTS = --dbname=$(PL_TESTDB)
  
! REGRESS = plpgsql_call plpgsql_control plpgsql_record plpgsql_transaction
  
  all: all-lib
  
--- 26,33 
  
  REGRESS_OPTS = --dbname=$(PL_TESTDB)
  
! REGRESS = plpgsql_call plpgsql_control plpgsql_record \
! 	plpgsql_transaction plpgsql_varprops
  
  all: all-lib
  
diff --git a/src/pl/plpgsql/src/expected/plpgsql_varprops.out b/src/pl/plpgsql/src/expected/plpgsql_varprops.out
index ...109056c .
*** a/src/pl/plpgsql/src/expected/plpgsql_varprops.out
--- b/src/pl/plpgsql/src/expected/plpgsql_varprops.out
***
*** 0 
--- 1,300 
+ --
+ -- Tests for PL/pgSQL variable properties: CONSTANT, NOT NULL, initializers
+ --
+ create type var_record as (f1 int4, f2 int4);
+ create domain int_nn as int not null;
+ create domain var_record_nn as var_record not null;
+ create domain var_record_colnn as var_record check((value).f2 is not null);
+ -- CONSTANT
+ do $$
+ declare x constant int := 42;
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ NOTICE:  x = 42
+ do $$
+ declare x constant int;
+ begin
+   x := 42;  -- fail
+ end$$;
+ ERROR:  variable "x" is declared CONSTANT
+ LINE 4:   x := 42;  -- fail
+   ^
+ do $$
+ declare x constant int; y int;
+ begin
+   for x, y in select 1, 2 loop  -- fail
+   end loop;
+ end$$;
+ ERROR:  variable "x" is declared CONSTANT
+ LINE 4:   for x, y in select 1, 2 loop  -- fail
+   ^
+ do $$
+ declare x constant int[];
+ begin
+   x[1] := 42;  -- fail
+ end$$;
+ ERROR:  variable "x" is declared CONSTANT
+ LINE 4:   x[1] := 42;  -- fail
+   ^
+ do $$
+ declare x constant int[]; y int;
+ begin
+   for x[1], y in select 1, 2 loop  -- fail (currently, unsupported syntax)
+   end loop;
+ end$$;
+ ERROR:  syntax error at or near "["
+ LINE 4:   for x[1], y in select 1, 2 loop  -- fail (currently, unsup...
+^
+ do $$
+ declare x constant var_record;
+ begin
+   x.f1 := 42;  -- fail
+ end$$;
+ ERROR:  variable "x" is declared CONSTANT
+ LINE 4:   x.f1 := 42;  -- fail
+   ^
+ do $$
+ declare x constant var_record; y int;
+ begin
+   for x.f1, y in select 1, 2 loop  -- fail
+   end loop;
+ end$$;
+ ERROR:  variable "x" is declared CONSTANT
+ LINE 4:   for x.f1, y in select 1, 2 loop  -- fail
+   ^
+ -- initializer expressions
+ do $$
+ declare x int := sin(0);
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ NOTICE:  x = 0
+ do $$
+ declare x int := 1/0;  -- fail
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ ERROR:  division by zero
+ CONTEXT:  SQL statement "SELECT 1/0"
+ PL/pgSQL function inline_code_block line 3 during statement block local variable initialization
+ do $$
+ declare x bigint[] := array[1,3,5];
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ NOTICE:  x = {1,3,5}
+ do $$
+ declare x record := row(1,2,3);
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ NOTICE:  x = (1,2,3)
+ do $$
+ declare x var_record := row(1,2);
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ NOTICE:  x = (1,2)
+ -- NOT NULL
+ do $$
+ declare x int not null;  -- fail
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ ERROR:  variable "x" must have a default value, since it's declared NOT NULL
+ LINE 2: declare x int not null;  -- fail
+   ^
+ do $$
+ declare x int not null := 42;
+ begin
+   raise notice 'x = %', x;
+   x := null;  -- fail
+ end$$;
+ NOTICE:  x = 42
+ ERROR:  null value cannot be assigned to variable "x" declared NOT NULL
+ CONTEXT:  PL/pgSQL function inline_code_block line 5 at assignment
+ do $$
+ declare x int not null := null;  -- fail
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ ERROR:  null value cannot be assigned to variable "x" declared NOT NULL
+ CONTEXT:  PL/pgSQL function inline_code_block line 3 during statement block local variable initialization
+ do $$
+ declare x record not null;  -- fail
+ begin
+   raise notice 'x = %', x;
+ end$$;
+ ERROR:  variable "x" must have a default value, since it's declared NOT NULL
+ LINE 2: declare x record not null;  -- fail
+  ^
+ do $$
+ declare x record not null := row(42);
+ begin
+   raise notice 'x = %', x;
+   x := row(null);  -- ok
+   raise notice 'x = %', x;
+   x := null;  -- fail
+ end$$;
+ NOTICE:  x = (42)
+ NOTICE:  x = ()
+ ERROR:  null value cannot be assigned to variable "x" declared NOT NULL
+ 

Re: reducing isolation tests runtime

2018-01-24 Thread Robert Haas
On Wed, Jan 24, 2018 at 6:10 PM, Alvaro Herrera  wrote:
> On the subject of test total time, we could parallelize isolation tests.
> Right now "make check" in src/test/isolation takes 1:16 on my machine.
> Test "timeouts" takes full 40s of that, with nothing running in parallel
> -- the machine is completely idle.
>
> Seems like we can have a lot of time back just by changing the schedule
> to use multiple tests per line (in particular, put the other slow tests
> together with timeouts), per the attached; with this new schedule,
> isolation takes 44 seconds on my machine -- a win of 32 seconds.  We can
> win a couple of additional seconds by grouping a few other lines, but
> this is the biggest win.
>
> (This needs to be adjusted because some table names in the specs
> conflict.)

Oh, cool.  Yes, the time the isolation tests take to run is quite
annoying.  I didn't realize it would be so easy to run it in parallel.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: plpgsql function startup-time improvements

2018-01-24 Thread Tom Lane
Pavel Stehule  writes:
> please, can you rebase all three patches necessary for patching?

Done.  These now need to be applied over
https://www.postgresql.org/message-id/833.1516834...@sss.pgh.pa.us

regards, tom lane

diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
index 09ecaec..97cb763 100644
*** a/src/pl/plpgsql/src/pl_comp.c
--- b/src/pl/plpgsql/src/pl_comp.c
*** plpgsql_adddatum(PLpgSQL_datum *new)
*** 2183,2194 
--- 2183,2211 
  static void
  plpgsql_finish_datums(PLpgSQL_function *function)
  {
+ 	Size		copiable_size = 0;
  	int			i;
  
  	function->ndatums = plpgsql_nDatums;
  	function->datums = palloc(sizeof(PLpgSQL_datum *) * plpgsql_nDatums);
  	for (i = 0; i < plpgsql_nDatums; i++)
+ 	{
  		function->datums[i] = plpgsql_Datums[i];
+ 
+ 		/* This must agree with copy_plpgsql_datums on what is copiable */
+ 		switch (function->datums[i]->dtype)
+ 		{
+ 			case PLPGSQL_DTYPE_VAR:
+ copiable_size += MAXALIGN(sizeof(PLpgSQL_var));
+ break;
+ 			case PLPGSQL_DTYPE_REC:
+ copiable_size += MAXALIGN(sizeof(PLpgSQL_rec));
+ break;
+ 			default:
+ break;
+ 		}
+ 	}
+ 	function->copiable_size = copiable_size;
  }
  
  
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index e119744..e2d315c 100644
*** a/src/pl/plpgsql/src/pl_exec.c
--- b/src/pl/plpgsql/src/pl_exec.c
*** static HTAB *shared_cast_hash = NULL;
*** 235,241 
  static void coerce_function_result_tuple(PLpgSQL_execstate *estate,
  			 TupleDesc tupdesc);
  static void plpgsql_exec_error_callback(void *arg);
! static PLpgSQL_datum *copy_plpgsql_datum(PLpgSQL_datum *datum);
  static MemoryContext get_stmt_mcontext(PLpgSQL_execstate *estate);
  static void push_stmt_mcontext(PLpgSQL_execstate *estate);
  static void pop_stmt_mcontext(PLpgSQL_execstate *estate);
--- 235,242 
  static void coerce_function_result_tuple(PLpgSQL_execstate *estate,
  			 TupleDesc tupdesc);
  static void plpgsql_exec_error_callback(void *arg);
! static void copy_plpgsql_datums(PLpgSQL_execstate *estate,
! 	PLpgSQL_function *func);
  static MemoryContext get_stmt_mcontext(PLpgSQL_execstate *estate);
  static void push_stmt_mcontext(PLpgSQL_execstate *estate);
  static void pop_stmt_mcontext(PLpgSQL_execstate *estate);
*** plpgsql_exec_function(PLpgSQL_function *
*** 458,465 
  	 * Make local execution copies of all the datums
  	 */
  	estate.err_text = gettext_noop("during initialization of execution state");
! 	for (i = 0; i < estate.ndatums; i++)
! 		estate.datums[i] = copy_plpgsql_datum(func->datums[i]);
  
  	/*
  	 * Store the actual call argument values into the appropriate variables
--- 459,465 
  	 * Make local execution copies of all the datums
  	 */
  	estate.err_text = gettext_noop("during initialization of execution state");
! 	copy_plpgsql_datums(, func);
  
  	/*
  	 * Store the actual call argument values into the appropriate variables
*** plpgsql_exec_trigger(PLpgSQL_function *f
*** 859,866 
  	 * Make local execution copies of all the datums
  	 */
  	estate.err_text = gettext_noop("during initialization of execution state");
! 	for (i = 0; i < estate.ndatums; i++)
! 		estate.datums[i] = copy_plpgsql_datum(func->datums[i]);
  
  	/*
  	 * Put the OLD and NEW tuples into record variables
--- 859,865 
  	 * Make local execution copies of all the datums
  	 */
  	estate.err_text = gettext_noop("during initialization of execution state");
! 	copy_plpgsql_datums(, func);
  
  	/*
  	 * Put the OLD and NEW tuples into record variables
*** plpgsql_exec_event_trigger(PLpgSQL_funct
*** 1153,1159 
  {
  	PLpgSQL_execstate estate;
  	ErrorContextCallback plerrcontext;
- 	int			i;
  	int			rc;
  	PLpgSQL_var *var;
  
--- 1152,1157 
*** plpgsql_exec_event_trigger(PLpgSQL_funct
*** 1174,1181 
  	 * Make local execution copies of all the datums
  	 */
  	estate.err_text = gettext_noop("during initialization of execution state");
! 	for (i = 0; i < estate.ndatums; i++)
! 		estate.datums[i] = copy_plpgsql_datum(func->datums[i]);
  
  	/*
  	 * Assign the special tg_ variables
--- 1172,1178 
  	 * Make local execution copies of all the datums
  	 */
  	estate.err_text = gettext_noop("during initialization of execution state");
! 	copy_plpgsql_datums(, func);
  
  	/*
  	 * Assign the special tg_ variables
*** plpgsql_exec_error_callback(void *arg)
*** 1290,1346 
   * Support function for initializing local execution variables
   * --
   */
! static PLpgSQL_datum *
! copy_plpgsql_datum(PLpgSQL_datum *datum)
  {
! 	PLpgSQL_datum *result;
  
! 	switch (datum->dtype)
! 	{
! 		case PLPGSQL_DTYPE_VAR:
! 			{
! PLpgSQL_var *new = palloc(sizeof(PLpgSQL_var));
  
! memcpy(new, datum, sizeof(PLpgSQL_var));
! /* should be preset to null/non-freeable */
! Assert(new->isnull);
! 

reducing isolation tests runtime

2018-01-24 Thread Alvaro Herrera
On the subject of test total time, we could parallelize isolation tests.
Right now "make check" in src/test/isolation takes 1:16 on my machine.
Test "timeouts" takes full 40s of that, with nothing running in parallel
-- the machine is completely idle.

Seems like we can have a lot of time back just by changing the schedule
to use multiple tests per line (in particular, put the other slow tests
together with timeouts), per the attached; with this new schedule,
isolation takes 44 seconds on my machine -- a win of 32 seconds.  We can
win a couple of additional seconds by grouping a few other lines, but
this is the biggest win.

(This needs to be adjusted because some table names in the specs
conflict.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 74d7d59546..f8dbd018e7 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -1,3 +1,4 @@
+test: timeouts tuplelock-update deadlock-hard deadlock-soft-2 alter-table-1 receipt-report multiple-row-versions
 test: read-only-anomaly
 test: read-only-anomaly-2
 test: read-only-anomaly-3
@@ -6,26 +7,18 @@ test: read-write-unique-2
 test: read-write-unique-3
 test: read-write-unique-4
 test: simple-write-skew
-test: receipt-report
 test: temporal-range-integrity
 test: project-manager
 test: classroom-scheduling
 test: total-cash
 test: referential-integrity
 test: ri-trigger
-test: partial-index
-test: two-ids
-test: multiple-row-versions
-test: index-only-scan
+test: partial-index two-ids index-only-scan eval-plan-qual lock-update-delete lock-committed-update lock-committed-keyupdate tuplelock-conflict alter-table-2 alter-table-3 create-trigger
 test: deadlock-simple
-test: deadlock-hard
 test: deadlock-soft
-test: deadlock-soft-2
 test: fk-contention
 test: fk-deadlock
 test: fk-deadlock2
-test: eval-plan-qual
-test: lock-update-delete
 test: lock-update-traversal
 test: insert-conflict-do-nothing
 test: insert-conflict-do-nothing-2
@@ -38,12 +31,8 @@ test: delete-abort-savept-2
 test: aborted-keyrevoke
 test: multixact-no-deadlock
 test: multixact-no-forget
-test: lock-committed-update
-test: lock-committed-keyupdate
 test: update-locked-tuple
 test: propagate-lock-delete
-test: tuplelock-conflict
-test: tuplelock-update
 test: freeze-the-dead
 test: nowait
 test: nowait-2
@@ -56,13 +45,8 @@ test: skip-locked-3
 test: skip-locked-4
 test: drop-index-concurrently-1
 test: multiple-cic
-test: alter-table-1
-test: alter-table-2
-test: alter-table-3
 test: alter-table-4
-test: create-trigger
 test: sequence-ddl
 test: async-notify
 test: vacuum-reltuples
-test: timeouts
 test: vacuum-concurrent-drop


Re: Possible performance regression with pg_dump of a large number of relations

2018-01-24 Thread Stephen Frost
Hi there!

* Luke Cowell (lcow...@gmail.com) wrote:
> Hi Stephen, thank you for putting this together.

Yeah, it needs more work, which I figured out after actually hacking
together a patch for it and I've just not gotten back to it yet.

> > If folks get a chance to take a look at the query and/or test, that'd be
> > great.  I'll try to work up an actual patch to pg_dump this weekend to
> > run it through the regression tests and see if anything breaks.
> 
> I'm not sure how I can help other than testing that this runs. I can confirm 
> that it runs on 10.1. It does not run on 9.5 or 9.6 and gives this error:

Thanks for offering to help!  Once I've got an actual patch together
that works well enough to get through the regression tests, I'll
definitely let you know.  The general premise still looks viable, but
the actual query needs to be reworked.

> > ERROR:  relation "pg_init_privs" does not exist
> > LINE 139: LEFT JOIN pg_init_privs pip

I certainly hope that works on 9.6, since that's when pg_init_privs was
added..

> I'm guessing that the error is not surprising and that the query is intended 
> for an upcoming release of postgres and wouldn't be backported to 9.6.x

Presuming I can make it work, the idea would be to back-port it to 9.6
and 10, since pg_init_privs and this code was added in 9.6.

Thanks again!

Stephen


signature.asc
Description: PGP signature


Re: Converting plpgsql to use DTYPE_REC for named composite types

2018-01-24 Thread Tom Lane
Here's a version of this rebased up to HEAD, fixing a couple of trivial
merge conflicts and incorporating the docs delta I posted separately.

(I'd supposed this patch was still OK because the patch tester said so,
but I now see that the tester was only testing the docs delta :-(.)

regards, tom lane



use-dtype-rec-for-all-composites-3.patch.gz
Description: use-dtype-rec-for-all-composites-3.patch.gz


Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> Fair point, but doesn't it apply equally to non-default ACLs on any
> >> other system objects?  If you tweaked the permissions on say pg_ls_dir(),
> >> then dump, then tweak them some more, you're going to get uncertain
> >> results if you load that dump back into this database ... with or without
> >> --clean, because we certainly aren't going to drop pinned objects.
> 
> > Yes, that's certainly true, the public schema is the only "special"
> > animal in this regard and making it less special (and more like
> > pg_ls_dir()) would definitely be nice.
> 
> I wonder if it'd be worth the trouble to invent a variant of REVOKE
> that means "restore this object's permissions to default" --- that is,
> either the ACL recorded in pg_init_privs if there is one, or NULL if
> there's no pg_init_privs entry.  Then you could imagine pg_dump emitting
> that command before trying to assign an ACL to any object it hadn't
> created earlier in the run, rather than guessing about the current state
> of the object's ACL.  (I'm not volunteering.)

I actually like that idea quite a bit..  Not really high on my priority
list though.

> >> I think we could jigger things so that we dump the definition of these
> >> special quasi-system objects only if their ACLs are not default, but
> >> it's not clear to me that that's really an improvement in the long run.
> >> Seems like it's just making them even wartier.
> 
> > Yeah, that would be worse, I agree.
> 
> So are we at a consensus yet?

You had me at "make public less special", I was just trying to make sure
we all understand what that means.

+1 from me for moving forward.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: JIT compiling with LLVM v9.0

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 14:06:30 -0800, Andres Freund wrote:
> > In LLVM 5.0, it looks like DebugInfo.h is not available in llvm-c, only as a C++ API in llvm/IR/DebugInfo.h.
> 
> Hm, I compiled against 5.0 quite recently, but added the stripping of
> debuginfo later on.  I'll add a fallback method, thanks for pointing that
> out!

Went more with your fix; there's not much point in using the C API
here. Should probably remove the use of it nearly entirely from the .cpp
file (save for wrap/unwrap() use). But man, the 'class Error' usage is
one major ugly pain.


> > But I still could not build because the LLVM API changed between 5.0 and 
> > 6.0 
> > regarding value info SummaryList. 
> 
> Hm, thought these changes were from before my 5.0 test. But the code
> evolved heavily, so I might misremember. Let me see.

Ah, that one was actually easier to fix. There's no need to get the base
object at all, so it's just a one-line change.


> Thanks, I'll try to push fixes into the tree soon-ish..

Pushed.

Thanks again for looking!

- Andres



Re: [HACKERS] UPDATE of partition key

2018-01-24 Thread Thomas Munro
On Thu, Jan 25, 2018 at 10:39 AM, Robert Haas  wrote:
> On Mon, Jan 22, 2018 at 9:46 AM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> Tom, do you want to double-check that this fixes it for you?
>>
>> I can confirm that a valgrind run succeeded for me with the patch
>> in place.
>
> Committed.  Sorry for the delay.

FYI I'm planning to look into adding a valgrind check to the
commitfest CI thing I run so we can catch these earlier without
committer involvement.  It's super slow because of all those pesky
regression tests so I'll probably need to improve the scheduling logic
a bit to make it useful first (prioritising new patches or something,
since otherwise it'll take up to multiple days to get around to
valgrind-testing any given patch...).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Possible performance regression with pg_dump of a large number of relations

2018-01-24 Thread Luke Cowell
Hi Stephen, thank you for putting this together.

> If folks get a chance to take a look at the query and/or test, that'd be
> great.  I'll try to work up an actual patch to pg_dump this weekend to
> run it through the regression tests and see if anything breaks.

I'm not sure how I can help other than testing that this runs. I can confirm 
that it runs on 10.1. It does not run on 9.5 or 9.6 and gives this error:
> ERROR:  relation "pg_init_privs" does not exist
> LINE 139: LEFT JOIN pg_init_privs pip

I'm guessing that the error is not surprising and that the query is intended 
for an upcoming release of postgres and wouldn't be backported to 9.6.x

Luke


Re: documentation is now XML

2018-01-24 Thread Peter Eisentraut
On 1/24/18 09:45, Bruce Momjian wrote:
> So we are not
> using TeX anymore for PG11+ docs?

as of PG10

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] UPDATE of partition key

2018-01-24 Thread Robert Haas
On Mon, Jan 22, 2018 at 9:46 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> Tom, do you want to double-check that this fixes it for you?
>
> I can confirm that a valgrind run succeeded for me with the patch
> in place.

Committed.  Sorry for the delay.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: JIT compiling with LLVM v9.0

2018-01-24 Thread Pierre Ducroquet
On Wednesday, January 24, 2018 8:20:38 AM CET Andres Freund wrote:
> As the patchset is large (500kb) and I'm still quickly evolving it, I do
> not yet want to attach it. The git tree is at
>   https://git.postgresql.org/git/users/andresfreund/postgres.git
> in the jit branch
>  
> https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shor
> tlog;h=refs/heads/jit
> 
> to build --with-llvm has to be passed to configure, llvm-config either
> needs to be in PATH or provided with LLVM_CONFIG to make. A c++ compiler
> and clang need to be available under common names or provided via CXX /
> CLANG respectively.
> 
> Regards,
> 
> Andres Freund

Hi

I tried to build on Debian sid, using GCC 7 and LLVM 5. I used the following 
to compile, using your branch @3195c2821d:

$ export LLVM_CONFIG=/usr/bin/llvm-config-5.0
$ ./configure --with-llvm
$ make

And I had the following build error :
llvmjit_wrap.cpp:32:10: fatal error: llvm-c/DebugInfo.h: No such file or 
directory
 #include "llvm-c/DebugInfo.h"
  ^~~~
compilation terminated.

In LLVM 5.0, it looks like DebugInfo.h is not available in llvm-c, only as a C++ API in llvm/IR/DebugInfo.h.

For 'sport' (I have not played with the LLVM API in more than a year), I
tried to fix it, changing it to the C++ include.

The DebugInfo-related one was easy; only one function was used.
But I still could not build because the LLVM API changed between 5.0 and 6.0 
regarding value info SummaryList. 

llvmjit_wrap.cpp: In function 
‘std::unique_ptr > > 
llvm_build_inline_plan(llvm::Module*)’:
llvmjit_wrap.cpp:285:48: error: ‘class llvm::GlobalValueSummary’ has no member 
named ‘getBaseObject’
fs = llvm::cast<llvm::FunctionSummary>(gvs->getBaseObject());
^

That one was a bit uglier.

I'm not sure how to test everything properly, so the patch is attached for 
both these issues, do as you wish with it… :)

Regards

 Pierre Ducroquet

>From fdfea09dd7410d6ed7ad54df1ba3092bd0eecb92 Mon Sep 17 00:00:00 2001
From: Pierre Ducroquet 
Date: Wed, 24 Jan 2018 22:28:34 +0100
Subject: [PATCH] Allow building with LLVM 5.0

---
 src/backend/lib/llvmjit_wrap.cpp | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/backend/lib/llvmjit_wrap.cpp b/src/backend/lib/llvmjit_wrap.cpp
index b745aec4fe..7961148a85 100644
--- a/src/backend/lib/llvmjit_wrap.cpp
+++ b/src/backend/lib/llvmjit_wrap.cpp
@@ -29,7 +29,6 @@ extern "C"
 
 #include "llvm-c/Core.h"
 #include "llvm-c/BitReader.h"
-#include "llvm-c/DebugInfo.h"
 
 #include 
 #include 
@@ -50,6 +49,7 @@ extern "C"
 #include "llvm/Analysis/ModuleSummaryAnalysis.h"
 #include "llvm/Bitcode/BitcodeReader.h"
 #include "llvm/IR/CallSite.h"
+#include "llvm/IR/DebugInfo.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/ModuleSummaryIndex.h"
 #include "llvm/Linker/IRMover.h"
@@ -218,6 +218,13 @@ llvm_inline(LLVMModuleRef M)
 	llvm_execute_inline_plan(mod, globalsToInline.get());
 }
 
+
+inline llvm::GlobalValueSummary *GlobalValueSummary__getBaseObject(llvm::GlobalValueSummary *gvs) {
+  if (auto *AS = llvm::dyn_cast<llvm::AliasSummary>(gvs))
+    return &AS->getAliasee();
+  return gvs;
+}
+
 /*
  * Build information necessary for inlining external function references in
  * mod.
@@ -282,7 +289,7 @@ llvm_build_inline_plan(llvm::Module *mod)
 			const llvm::Module *defMod;
 			llvm::Function *funcDef;
 
-			fs = llvm::cast<llvm::FunctionSummary>(gvs->getBaseObject());
+			fs = llvm::cast<llvm::FunctionSummary>(GlobalValueSummary__getBaseObject(gvs.get()));
 			elog(DEBUG2, "func %s might be in %s",
  funcName.data(),
  modPath.data());
@@ -476,7 +483,7 @@ load_module(llvm::StringRef Identifier)
 	 * code. Until that changes, not much point in wasting memory and cycles
 	 * on processing debuginfo.
 	 */
-	LLVMStripModuleDebugInfo(mod);
+	llvm::StripDebugInfo(*llvm::unwrap(mod));
 
 	return std::unique_ptr(llvm::unwrap(mod));
 }
-- 
2.15.1



Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Fair point, but doesn't it apply equally to non-default ACLs on any
>> other system objects?  If you tweaked the permissions on say pg_ls_dir(),
>> then dump, then tweak them some more, you're going to get uncertain
>> results if you load that dump back into this database ... with or without
>> --clean, because we certainly aren't going to drop pinned objects.

> Yes, that's certainly true, the public schema is the only "special"
> animal in this regard and making it less special (and more like
> pg_ls_dir()) would definitely be nice.

I wonder if it'd be worth the trouble to invent a variant of REVOKE
that means "restore this object's permissions to default" --- that is,
either the ACL recorded in pg_init_privs if there is one, or NULL if
there's no pg_init_privs entry.  Then you could imagine pg_dump emitting
that command before trying to assign an ACL to any object it hadn't
created earlier in the run, rather than guessing about the current state
of the object's ACL.  (I'm not volunteering.)

>> I think we could jigger things so that we dump the definition of these
>> special quasi-system objects only if their ACLs are not default, but
>> it's not clear to me that that's really an improvement in the long run.
>> Seems like it's just making them even wartier.

> Yeah, that would be worse, I agree.

So are we at a consensus yet?

regards, tom lane



Re: PATCH: Exclude unlogged tables from base backups

2018-01-24 Thread Adam Brightwell
> I agree with #1 and feel the updated docs are reasonable and
> sufficient to address this case for now.
>
> I have retested these patches against master at d6ab720360.
>
> All test succeed.
>
> Marking "Ready for Committer".

Actually, marked it "Ready for Review" to wait for Masahiko to comment/agree.

Masahiko,

If you agree with the above, would you mind updating the status accordingly?

-Adam



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Robert Haas
On Wed, Jan 24, 2018 at 4:01 PM, Tom Lane  wrote:
> The progress-display output of pg_regress would need a complete rethink
> anyhow.  First thought is to emit two lines per test, one when we
> launch it and one when it finishes and we check the results:
>
> foreign_data: launched
> ...
> foreign_data: ok   (or FAILED)

I suspect that first line would only be interesting on the very rare
occasions when a test hangs.  I'm not sure it would be good to add that
much chatter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 15:58:16 -0500, Tom Lane wrote:
> Yeah.  We already have topo sort code in pg_dump, maybe we could push that
> into someplace like src/common or src/fe_utils?  Although pg_dump hasn't
> got any need for edge weights, so maybe sharing code isn't worth it.

I suspect it may be more work to share than worth it, but either way, it
shouldn't be too hard.  Hm, isn't dbObjectTypePriority kinda an edge
weight? Seems like we properly could implement it as that.


> We could flush the existing schedule files and use a simple format like
>   testname: list of earlier tests it depends on
> (I guess there would be more properties than just the dependencies,
> but still not hard to parse.)

Yea, I think there'd need to be a few more. There are some tests that use
multiple connections, and I suspect it'll be useful to have "implicit"
ordering dependencies for a few tests, like a "barrier". Otherwise
e.g. the tablespace test will be annoying to order.


> > If we keep the timings from an earlier
> > run somewhere, we can use the timing of runs as edge weights, making the
> > schedule better.
> 
> I think we could just use constant values hand-coded in the schedule file.
> It might occasionally be worth updating them, but realistically it's not
> going to matter that they be very accurate.  Probably weights like 1, 2,
> and 3 would be plenty ;-)

The reason I like the idea of using prior tests as scheduling input is
that the slowness actually depends a lot on the type of machine it's run
on, and more importantly on things like valgrind, CCA, fsync=on/off,
jit=on/off (by far the most expensive test is e.g. the recursion test in
errors.sql :)).

Greetings,

Andres Freund



Re: PATCH: Exclude unlogged tables from base backups

2018-01-24 Thread Adam Brightwell
>> If a new unlogged relation is created after the unloggedHash is
>> constructed but before the files are sent, we cannot exclude that
>> relation. It would not be a problem if taking the backup does not take
>> long, because the new unlogged relation is unlikely to become very
>> large. However, if taking a backup takes a long time, we could include
>> a large main fork in the backup.
>
> This is a good point.  It's per database directory which makes it a
> little better, but maybe not by much.
>
> Three options here:
>
> 1) Leave it as is knowing that unlogged relations created during the
> backup may be copied and document it that way.
>
> 2) Construct a list for SendDir() to work against so the gap between
> creating that and creating the unlogged hash is as small as possible.
> The downside here is that the list may be very large and take up a lot
> of memory.
>
> 3) Check each file that looks like a relation in the loop to see if it
> has an init fork.  This might affect performance since an
> opendir/readdir loop would be required for every relation.
>
> Personally, I'm in favor of #1, at least for the time being.  I've
> updated the docs as indicated in case you and Adam agree.

I agree with #1 and feel the updated docs are reasonable and
sufficient to address this case for now.

I have retested these patches against master at d6ab720360.

All test succeed.

Marking "Ready for Committer".

-Adam



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Tom Lane
Andres Freund  writes:
> On 2018-01-24 15:36:35 -0500, Tom Lane wrote:
>> I'm concerned that we'd end up with a higher number of irreproducible
>> test failures with no good way to investigate them.

> Hm. We probably should dump the used ordering of tests somewhere upon
> failure, to make it easier to debug.

The progress-display output of pg_regress would need a complete rethink
anyhow.  First thought is to emit two lines per test, one when we
launch it and one when it finishes and we check the results:

foreign_data: launched
...
foreign_data: ok   (or FAILED)

regards, tom lane



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Tom Lane
Andres Freund  writes:
> On 2018-01-24 17:18:26 -0300, Alvaro Herrera wrote:
>> Yeah, I proposed this a decade ago but never had the wits to write the
>> code.

> It shouldn't be too hard, right? Leaving defining the file format,
> parsing it, creating the new schedule with depencencies and adapting
> tests aside (hah), it mostly seems a relatively simple graph ordering /
> topological sort problem, right?

Yeah.  We already have topo sort code in pg_dump, maybe we could push that
into someplace like src/common or src/fe_utils?  Although pg_dump hasn't
got any need for edge weights, so maybe sharing code isn't worth it.

We could flush the existing schedule files and use a simple format like
testname: list of earlier tests it depends on
(I guess there would be more properties than just the dependencies,
but still not hard to parse.)

> If we keep the timings from an earlier
> run somewhere, we can use the timing of runs as edge weights, making the
> schedule better.

I think we could just use constant values hand-coded in the schedule file.
It might occasionally be worth updating them, but realistically it's not
going to matter that they be very accurate.  Probably weights like 1, 2,
and 3 would be plenty ;-)
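
As a rough illustration of what such a weighted, dependency-aware ordering
could look like (purely a sketch: the test names, dependencies and weights
below are made up, and this is not pg_regress code), a greedy
"heaviest ready test first" pass over a small dependency table might be:

/*
 * Sketch only: order tests so that dependencies run first, picking the
 * heaviest ready test at each step (longest-processing-time heuristic).
 * A real scheduler would launch several ready tests concurrently; this
 * just shows the ordering part.
 */
#include <stdio.h>
#include <stdbool.h>

#define NTESTS 5

static const char *testname[NTESTS] = {"int4", "int8", "float8", "numerology", "timeouts"};
static const int testweight[NTESTS] = {1, 1, 1, 2, 3};
/* deps[i][j] means test i depends on test j */
static const bool deps[NTESTS][NTESTS] = {
	{false, false, false, false, false},
	{false, false, false, false, false},
	{false, false, false, false, false},
	{true, true, true, false, false},	/* numerology needs int4, int8, float8 */
	{false, false, false, false, false},
};

int
main(void)
{
	bool		done[NTESTS] = {false};

	for (int emitted = 0; emitted < NTESTS; emitted++)
	{
		int			best = -1;

		for (int i = 0; i < NTESTS; i++)
		{
			bool		ready = !done[i];

			for (int j = 0; ready && j < NTESTS; j++)
				if (deps[i][j] && !done[j])
					ready = false;
			if (ready && (best < 0 || testweight[i] > testweight[best]))
				best = i;
		}
		done[best] = true;
		printf("launch %s (weight %d)\n", testname[best], testweight[best]);
	}
	return 0;
}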

regards, tom lane



Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> No, if you have a nondefault ACL, that will still get applied.  This
> >> arrangement would drop comment changes, but I can't get excited about
> >> that; it's certainly far less of an inconvenience in that scenario
> >> than dumping the comment is in non-superuser-restore scenarios.
> 
> > That nondefault ACL from the system the pg_dump was run on will get
> > applied *over-top* of whatever the current ACL on the system that the
> > restore is being run on, which may or may not be what's expected.
> 
> Fair point, but doesn't it apply equally to non-default ACLs on any
> other system objects?  If you tweaked the permissions on say pg_ls_dir(),
> then dump, then tweak them some more, you're going to get uncertain
> results if you load that dump back into this database ... with or without
> --clean, because we certainly aren't going to drop pinned objects.

Yes, that's certainly true, the public schema is the only "special"
animal in this regard and making it less special (and more like
pg_ls_dir()) would definitely be nice.

> I think we could jigger things so that we dump the definition of these
> special quasi-system objects only if their ACLs are not default, but
> it's not clear to me that that's really an improvement in the long run.
> Seems like it's just making them even wartier.

Yeah, that would be worse, I agree.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> No, if you have a nondefault ACL, that will still get applied.  This
>> arrangement would drop comment changes, but I can't get excited about
>> that; it's certainly far less of an inconvenience in that scenario
>> than dumping the comment is in non-superuser-restore scenarios.

> That nondefault ACL from the system the pg_dump was run on will get
> applied *over-top* of whatever the current ACL on the system that the
> restore is being run on, which may or may not be what's expected.

Fair point, but doesn't it apply equally to non-default ACLs on any
other system objects?  If you tweaked the permissions on say pg_ls_dir(),
then dump, then tweak them some more, you're going to get uncertain
results if you load that dump back into this database ... with or without
--clean, because we certainly aren't going to drop pinned objects.

I think we could jigger things so that we dump the definition of these
special quasi-system objects only if their ACLs are not default, but
it's not clear to me that that's really an improvement in the long run.
Seems like it's just making them even wartier.

regards, tom lane



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 15:36:35 -0500, Tom Lane wrote:
> There'd be a lot of followup work to sanitize the tests better.  For
> instance, if two tests transiently create tables named "foo", it doesn't
> matter as long as they're not in the same group.  It would matter with
> this.

Right. I suspect we'd initially end up with a schedule that had
dependencies pretty similar to what we have now as groups. I suspect
that even with a very small number of changes we'd get a lot better
timings.


> There are things we could do to mitigate that --- one attractive idea
> is to have each test create its own schema for transient objects.
> The limiting factor is that we don't want the test scripts to change
> so much that back-patching tests becomes impossible.  (Or at least,
> I'd not like that.)

I think a lot of temporary, potentially conflicting, objects are already
created as TEMPORARY. Adding a few more temporary markers shouldn't be
too hard.

> I'm concerned that we'd end up with a higher number of irreproducible
> test failures with no good way to investigate them.

Hm. We probably should dump the used ordering of tests somewhere upon
failure, to make it easier to debug.

Greetings,

Andres Freund



Re: copy.c allocation constant

2018-01-24 Thread Thomas Munro
On Thu, Jan 25, 2018 at 9:35 AM, Bruce Momjian  wrote:
> The BSD memory allocator used to allocate in powers of two, and keep the
> header in a separate location.  They did this so they could combine two
> free, identically-sized memory blocks into a single one that was double
> the size.  I have no idea how it works now.

According to "man malloc" on my FreeBSD 11.1 box (which uses jemalloc,
I think NetBSD and OpenBSD use something else), the code Andrew showed
at the top of this thread will waste ~16kB (because it'll round up to
80kB) and nodeHash.c will waste ~8kB for every ~32kB chunk of tuples
as described in that other thread.

   +--+
   |Category   Spacing   Size |
   |Small   lg   [8]  |
   |16   [16, 32, 48, 64, 80, 96, 112, 128]   |
   |32   [160, 192, 224, 256] |
   |64   [320, 384, 448, 512] |
   |   128   [640, 768, 896, 1024]|
   |   256   [1280, 1536, 1792, 2048] |
   |   512   [2560, 3072, 3584, 4096] |
   | 1 KiB   [5 KiB, 6 KiB, 7 KiB, 8 KiB] |
   | 2 KiB   [10 KiB, 12 KiB, 14 KiB] |
   |Large2 KiB   [16 KiB] |
   | 4 KiB   [20 KiB, 24 KiB, 28 KiB, 32 KiB] |
   | 8 KiB   [40 KiB, 48 KiB, 54 KiB, 64 KiB] |
   |16 KiB   [80 KiB, 96 KiB, 112 KiB, 128 KiB]   |
   |32 KiB   [160 KiB, 192 KiB, 224 KiB, 256 KiB] |
   |64 KiB   [320 KiB, 384 KiB, 448 KiB, 512 KiB] |
   |   128 KiB   [640 KiB, 768 KiB, 896 KiB, 1 MiB]   |
   |   256 KiB   [1280 KiB, 1536 KiB, 1792 KiB]   |
   |Huge   256 KiB   [2 MiB]  |
   |   512 KiB   [2560 KiB, 3 MiB, 3584 KiB, 4 MiB]   |
   | 1 MiB   [5 MiB, 6 MiB, 7 MiB, 8 MiB] |
   | 2 MiB   [10 MiB, 12 MiB, 14 MiB, 16 MiB] |
   | 4 MiB   [20 MiB, 24 MiB, 28 MiB, 32 MiB] |
   | 8 MiB   [40 MiB, 48 MiB, 56 MiB, 64 MiB] |
   |   ...   ...  |
   |   512 PiB   [2560 PiB, 3 EiB, 3584 PiB, 4 EiB]   |
   | 1 EiB   [5 EiB, 6 EiB, 7 EiB]|
   +--+

-- 
Thomas Munro
http://www.enterprisedb.com



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 17:18:26 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > Besides larger groups, starting the next test(s) earlier, another way to
> > gain pretty large improvements would be a test schedule feature that
> > allowed to stat dependencies between tests. So instead of manually
> > grouping the schedule, have 'numerology' state that it depends on int2,
> > int4, int8, float4, float8, which means it can actually be started
> > earlier than it currently can in many cases.
> 
> Yeah, I proposed this a decade ago but never had the wits to write the
> code.

It shouldn't be too hard, right? Leaving defining the file format,
parsing it, creating the new schedule with dependencies and adapting
tests aside (hah), it mostly seems a relatively simple graph ordering /
topological sort problem, right?  If we keep the timings from an earlier
run somewhere, we can use the timing of runs as edge weights, making the
schedule better.


> It would be very useful for running tests standalone, too -- much as I
> dislike 'make installcheck' taking half a minute, I dislike much more
> having to take 5 minutes each time to figure out that create_table
> depends on box, polygon, create_function, yadda yadda.

Oh, that's a good point.

- Andres



Re: WIP Patch: Precalculate stable functions, infrastructure v1

2018-01-24 Thread Andres Freund
On 2018-01-24 15:10:56 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-01-16 17:05:01 -0500, Tom Lane wrote:
> >> I'm curious to know whether Andres has some other ideas, or whether he
> >> feels this is all a big wart on the compiled-expression concept.
>
> > I don't have too many "artistic" concerns from the compiled expression
> > POV. The biggest issue I see is that it'll make it a bit harder to
> > separate out the expression compilation phase from the expression
> > instantiation phase - something I think we definitely want.
>
> Hmm, there's no such distinction now, so could you explain what you
> have in mind there?

There's a few related concerns I have:
- For OLTP workloads using prepared statements, we often spend the
  majority of the time doing ExecInitExpr() and related tasks like
  computing tupledescs.
- For OLTP workloads the low allocation density for things hanging off
  ExprState's and PlanState nodes is a significant concern. The number
  of allocations cause overhead, the overhead wastes memory and lowers
  cache hit ratios.
- For JIT we currently end up encoding specific pointer values into the
  generated code. As these obviously prevent reuse of the generated
  function, this noticeably reduces the applicability of JITing to fewer
  usecases. JITing is actually quite beneficial for a lot of OLTP
  workloads too, but it's too expensive to do every query.

To address these, I think we may want to split the division of labor
a bit. Expression instantiation (i.e. ExecReadyExpr()) should happen at
executor startup, but in a lot of cases "compiling" the steps itself
should happen at plan time. Obviously that means the steps themselves
can't contain plain pointers, as the per-execution memory will be
located in different places.  So I think what we should have is that
expression initialization just computes the size of required memory for
all steps and puts *offsets* into that in the steps. After that
expression instantiation either leaves them alone and evaluation uses
relative pointers (cheap-ish e.g. on x86 due to lea), or just turns the
relative pointers into absolute ones.
That means that all the memory for all steps of an ExprState would be
allocated in one chunk, reducing allocation overhead and increasing
cache hit ratios considerably.

I've experimented a bit with a rough hack of the above (purely at
execution time), and it doesn't seem too hard.


> Keeping the stored value of a CachedExpr in a Param slot is an
> interesting idea indeed.

We keep coming back to this, IIRC we had a pretty similar discussion
around redesigning caseValue_datum/isNull domainValue_datum/isNull to be
less ugly. There also was
https://www.postgresql.org/message-id/20171116182208.kcvf75nfaldv3...@alap3.anarazel.de
where we discussed using something similar to PARAM_EXEC Param nodes to
allow inlining of volatile functions.

ISTM, there might be some value to consider all of them in the design of
the new mechanism.

Greetings,

Andres Freund



Re: copy.c allocation constant

2018-01-24 Thread Bruce Momjian
On Thu, Jan 25, 2018 at 09:30:54AM +1300, Thomas Munro wrote:
> On Thu, Jan 25, 2018 at 7:19 AM, Robert Haas  wrote:
> > On Wed, Jan 24, 2018 at 12:25 PM, Tomas Vondra
> >  wrote:
> >> At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
> >> with similar ideas (freelists, ...) so hopefully it's fine too.
> >>
> >> And then there are the systems without glibc, or with other libc
> >> implementations. No idea about those.
> >
> > My guess is that a fairly common pattern for larger chunks will be to
> > round the size up to a multiple of 4kB, the usual memory page size.
> 
> See also this discussion:
> 
> https://www.postgresql.org/message-id/flat/CAEepm%3D1bRyd%2B_W9eW-QmP1RGP03ti48zgd%3DK11Q6o4edQLgkcg%40mail.gmail.com#CAEepm=1bRyd+_W9eW-QmP1RGP03ti48zgd=k11q6o4edqlg...@mail.gmail.com
> 
> TL;DR glibc doesn't actually round up like that below 128kB, but many
> others including FreeBSD, macOS etc round up to various page sizes or
> size classes including 8kB (!), 512 bytes.  I find this a bit
> frustrating because it means that the most popular libc implementation
> doesn't have the problem so this kind of thing probably isn't a high
> priority, but probably on most other Unices (and I have no clue for
> Windows) including my current favourite we waste a bunch of memory.

The BSD memory allocator used to allocate in powers of two, and keep the
header in a separate location.  They did this so they could combine two
free, identically-sized memory blocks into a single one that was double
the size.  I have no idea how it works now.
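
(For illustration only, and not the actual historical BSD code: the classic
trick behind that kind of coalescing is that a power-of-two block's "buddy"
differs from it only in the bit corresponding to the block size, so it can be
located with an XOR and the two merged into a block of double the size when
both are free.)

/* Sketch of the classic power-of-two buddy computation; illustrative only. */
#include <stdio.h>
#include <stdint.h>

static uintptr_t
buddy_of(uintptr_t arena_base, uintptr_t block, size_t block_size)
{
	uintptr_t	offset = block - arena_base;

	/* flipping the size bit yields the neighbouring block of the same size */
	return arena_base + (offset ^ (uintptr_t) block_size);
}

int
main(void)
{
	uintptr_t	base = 0x100000;

	/* the 4kB block at offset 0x1000 pairs with the 4kB block at offset 0 */
	printf("buddy is at offset 0x%lx\n",
		   (unsigned long) (buddy_of(base, base + 0x1000, 0x1000) - base));
	return 0;
}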

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: copy.c allocation constant

2018-01-24 Thread Thomas Munro
On Thu, Jan 25, 2018 at 7:19 AM, Robert Haas  wrote:
> On Wed, Jan 24, 2018 at 12:25 PM, Tomas Vondra
>  wrote:
>> At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
>> with similar ideas (freelists, ...) so hopefully it's fine too.
>>
>> And then there are the systems without glibc, or with other libc
>> implementations. No idea about those.
>
> My guess is that a fairly common pattern for larger chunks will be to
> round the size up to a multiple of 4kB, the usual memory page size.

See also this discussion:

https://www.postgresql.org/message-id/flat/CAEepm%3D1bRyd%2B_W9eW-QmP1RGP03ti48zgd%3DK11Q6o4edQLgkcg%40mail.gmail.com#CAEepm=1bRyd+_W9eW-QmP1RGP03ti48zgd=k11q6o4edqlg...@mail.gmail.com

TL;DR glibc doesn't actually round up like that below 128kB, but many
others including FreeBSD, macOS etc round up to various page sizes or
size classes including 8kB (!), 512 bytes.  I find this a bit
frustrating because it means that the most popular libc implementation
doesn't have the problem so this kind of thing probably isn't a high
priority, but probably on most other Unices (and I have no clue for
Windows) including my current favourite we waste a bunch of memory.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-24 Thread Peter Geoghegan
On Wed, Jan 24, 2018 at 12:13 PM, Thomas Munro
 wrote:
> On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan  wrote:
>> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>>
>> 1. The problem of fork failure causing nbtsort.c to wait forever is a
>> real problem. Sure enough, the coding pattern within
>> _bt_leader_heapscan() can cause us to wait forever even with commit
>> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
>> consequence of the patch not using tuple queues (it uses the new
>> tuplesort sharing thing instead).
>
> Just curious: does the attached also help?

I can still reproduce the problem without the fix I described (which
does work), using your patch instead.

Offhand, I suspect that the way you set ParallelMessagePending may not
always leave it set when it should be.

>> 2. Simply adding a single call to WaitForParallelWorkersToFinish()
>> within _bt_leader_heapscan() before waiting on our condition variable
>> fixes the problem -- errors are reliably propagated, and we never end
>> up waiting forever.
>
> That does seem like a nice, simple solution and I am not against it.
> The niggling thing that bothers me about it, though, is that it
> requires the client of parallel.c to follow a slightly complicated
> protocol or risk a rare obscure failure mode, and recognise the cases
> where that's necessary.  Specifically, if you're not blocking in a
> shm_mq wait loop, then you must make a call to this new interface
> before you do any other kind of latch wait, but if you get that wrong
> you'll probably not notice since fork failure is rare!  It seems like
> it'd be nicer if we could figure out a way to make it so that any
> latch/CFI loop would automatically be safe against fork failure.

It would certainly be nicer, but I don't see much risk if we add a
comment next to nworkers_launched that said: Don't trust this until
you've called (Amit's proposed) WaitForParallelWorkersToAttach()
function, unless you're using the tuple queue infrastructure, which
lets you not need to directly care about the distinction between a
launched worker never starting, and a launched worker successfully
completing.

While I agree with what Robert said on the other thread -- "I guess
that works, but it seems more like blind luck than good design.
Parallel CREATE INDEX fails to be as "lucky" as Gather" -- that
doesn't mean that that situation cannot be formalized. And even if it
isn't formalized, then I think that that will probably be because
Gather ends up doing almost the same thing.
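
For concreteness, a rough sketch of the wait-loop shape being discussed
(struct and field names are modeled loosely on the patch, not on committed
code, and the wait event is just a stand-in).  The
WaitForParallelWorkersToFinish() call at the top is the one-line fix described
above; the proposed WaitForParallelWorkersToAttach() would slot into the same
place:

/* Sketch only -- not actual nbtsort.c code. */
#include "postgres.h"
#include "access/parallel.h"
#include "pgstat.h"
#include "storage/condition_variable.h"
#include "storage/spin.h"

typedef struct SketchShared
{
	slock_t		mutex;
	ConditionVariable workersdonecv;
	int			nparticipantsdone;
} SketchShared;

static void
sketch_leader_wait(ParallelContext *pcxt, SketchShared *shared, int nworkers)
{
	/*
	 * Report errors from workers that failed to launch (or died early)
	 * before blocking; otherwise nobody would ever broadcast on
	 * workersdonecv and the sleep below could last forever.
	 */
	WaitForParallelWorkersToFinish(pcxt);

	for (;;)
	{
		bool		done;

		SpinLockAcquire(&shared->mutex);
		done = (shared->nparticipantsdone == nworkers);
		SpinLockRelease(&shared->mutex);
		if (done)
			break;
		ConditionVariableSleep(&shared->workersdonecv,
							   WAIT_EVENT_PARALLEL_FINISH);
	}
	ConditionVariableCancelSleep();
}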

-- 
Peter Geoghegan



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Alvaro Herrera
Thomas Munro wrote:
> On Wed, Jan 24, 2018 at 12:10 PM, Tom Lane  wrote:

> > However, the trend over the last two months is very bad, and I do
> > not think that we can point to any large improvement in test
> > coverage that someone committed since November.
> 
> I'm not sure if coverage.postgresql.org has a way to view historical
> reports so we can see the actual percentage change,

It does not.  I have had that in my to-do list for a while, but haven't
gotten around to it.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Alvaro Herrera
Andres Freund wrote:

> Besides larger groups, starting the next test(s) earlier, another way to
> gain pretty large improvements would be a test schedule feature that
> allowed stating dependencies between tests. So instead of manually
> grouping the schedule, have 'numerology' state that it depends on int2,
> int4, int8, float4, float8, which means it can actually be started
> earlier than it currently can in many cases.

Yeah, I proposed this a decade ago but never had the wits to write the
code.

It would be very useful for running tests standalone, too -- much as I
dislike 'make installcheck' taking half a minute, I dislike much more
having to take 5 minutes each time to figure out that create_table
depends on box, polygon, create_function, yadda yadda.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-24 Thread Thomas Munro
On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan  wrote:
> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>
> 1. The problem of fork failure causing nbtsort.c to wait forever is a
> real problem. Sure enough, the coding pattern within
> _bt_leader_heapscan() can cause us to wait forever even with commit
> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
> consequence of the patch not using tuple queues (it uses the new
> tuplesort sharing thing instead).

Just curious: does the attached also help?

> 2. Simply adding a single call to WaitForParallelWorkersToFinish()
> within _bt_leader_heapscan() before waiting on our condition variable
> fixes the problem -- errors are reliably propagated, and we never end
> up waiting forever.

That does seem like a nice, simple solution and I am not against it.
The niggling thing that bothers me about it, though, is that it
requires the client of parallel.c to follow a slightly complicated
protocol or risk a rare obscure failure mode, and recognise the cases
where that's necessary.  Specifically, if you're not blocking in a
shm_mq wait loop, then you must make a call to this new interface
before you do any other kind of latch wait, but if you get that wrong
you'll probably not notice since fork failure is rare!  It seems like
it'd be nicer if we could figure out a way to make it so that any
latch/CFI loop would automatically be safe against fork failure.  The
attached (if it actually works, I dunno) is the worst way, but I
wonder if there is some way to traffic just a teensy bit more
information from postmaster to leader so that it could be efficient...

-- 
Thomas Munro
http://www.enterprisedb.com


pessimistic-fork-failure-detector-v2.patch
Description: Binary data


Re: copy.c allocation constant

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 17:07:01 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > glibc's malloc does add a header. My half-informed suspicion is that
> > most newer malloc backing allocators will have a header, because
> > maintaining a shared lookup-by-address table is pretty expensive to
> > maintain. A bit of metadata indicating size and/or source of the
> > allocation makes using thread-local information a lot easier.
>
> Sounds like it'd be smart to allocate something closer to
> M_MMAP_THRESHOLD (which with typical values would be about double the
> current RAW_BUF_SIZE value), minus a few dozen bytes to allow for
> palloc's and malloc's respective headers.  That way we bet on an
> mmap()ed allocation with minimum space wastage across all layers.

In general there might be cases where that's worthwhile, although
M_MMAP_THRESHOLD IIRC isn't a constant anymore, complicating things. The
specific case this thread is discussing doesn't seem worthy of such
attention though, there's afaict no actual practical problem here.


> Not sure whether we want to try to minimize wastage through
> clever use of malloc_usable_size() on each backend's first allocation --
> that'd probably just be overengineering.

Likely also dangerous, I don't think this is a constant.

Greetings,

Andres Freund



Re: WIP Patch: Precalculate stable functions, infrastructure v1

2018-01-24 Thread Tom Lane
Andres Freund  writes:
> On 2018-01-16 17:05:01 -0500, Tom Lane wrote:
>> I'm curious to know whether Andres has some other ideas, or whether he
>> feels this is all a big wart on the compiled-expression concept.

> I don't have too many "artistic" concerns from the compiled expression
> POV. The biggest issue I see is that it'll make it a bit harder to
> separate out the expression compilation phase from the expression
> instantiation phase - something I think we definitely want.

Hmm, there's no such distinction now, so could you explain what you
have in mind there?

>> I don't think there are any existing cases where we keep any
>> meaningful state across executions of a compiled-expression data
>> structure; maybe that's a bad idea in itself.

> Storing all cached values in an EState or ExprContext (the
> latter referring to the former) somewhat alike the values for Param's
> sounds a lot more reasonable to me.
> ...
> This all reminds me a lot of the infrastructure for Params...

Yeah, one thing I was thinking about in connection with this is the
stuff associated with propagating changes in outer-reference Params
(the extParam/allParam/chgParam mess).  I wonder if we could find
a way to unify that with this feature.

Keeping the stored value of a CachedExpr in a Param slot is an
interesting idea indeed.

regards, tom lane



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 14:31:47 -0500, Tom Lane wrote:
> However ... if you spend any time looking at the behavior of that,
> the hashjoin tests are still problematic.

I think my main problem with your arguments is that you basically seem
to say that one of the more complex features in postgres can't increase
the test time. And I just don't agree with that.  If we can reduce some
unnecessary overhead (as Thomas iirc has done somewhere nearby) - great,
if we can hide the overhead by scheduling the test better or breaking it
up - also great. But if that's a good chunk of work I think it's
entirely reasonable to not necessarily consider that the best use of
time.

It doesn't seem too surprising that a test that relies on starting
multiple background processes in multiple places will be among the more
expensive ones.  We clearly would e.g. benefit from being able to reuse
workers, to avoid constantly starting/stopping them.


> (The overall runtime for "make installcheck-parallel" on this machine
> is about 17.3 seconds right now.)  The next slowest test script in
> the join test's group is "update", at 0.373 seconds; so over 1.5 sec
> of the total 17.3 sec runtime is being spent solely in the join script.

Might be worth breaking up join a bit; that won't get rid of all the
wall time overhead, but it should reduce it. Reordering it to run in
parallel with other slow tests might also be worthwhile.


> So I continue to maintain that an unreasonable fraction of the total
> resources devoted to the regular regression tests is going into these
> new hashjoin tests.

> > One caveat is that old machines also
> > somewhat approximate testing with more instrumentation / debugging
> > enabled (say valgrind, CLOBBER_CACHE_ALWAYS, etc). So removing excessive
> > test overhead has still quite some benefits. But I definitely do not
> > want to lower coverage to achieve it.
> 
> I don't want to lower coverage either.  I do want some effort to be
> spent on achieving test coverage intelligently, rather than just throwing
> large test cases at the code without consideration of the costs.

I think this accusation is unfair. Are you really suggesting that nobody
else cares about the runtime of the new tests? Just because other
people's tradeoffs come down at a somewhat different place, doesn't mean
they add tests "without consideration of the costs".


> Based on these numbers, it seems like one easy thing we could do to
> reduce parallel check time is to split the plpgsql test into several
> scripts that could run in parallel.  But independently of that,
> I think we need to make an effort to push hashjoin's time back down.

If we had a dependency based system as I suggested nearby, we could have
pg_regress order the tests so that the slowest ones that have
dependencies fulfilled are started first...

Greetings,

Andres Freund



Re: copy.c allocation constant

2018-01-24 Thread Alvaro Herrera
Andres Freund wrote:
> On 2018-01-24 14:25:37 -0500, Robert Haas wrote:
> > On Wed, Jan 24, 2018 at 1:43 PM, Andres Freund  wrote:
> > > Indeed. Don't think RAW_BUF_SIZE is quite big enough for that on most
> > > platforms though. From man mallopt:
> > >  Balancing  these  factors  leads  to a default setting of 128*1024 for 
> > > the M_MMAP_THRESHOLD parameter.
> > > Additionally, even when malloc() chooses to use mmap() to back an
> > > allocation, it'll still needs a header to know the size of the
> > > allocation and such. So exactly using a size of a multiple of 4KB will
> > > still leave you with wasted space.  Due to the latter I can't see it
> > > mattering whether or not we add +1 to a power-of-two size.
> > 
> > Well, it depends on how it works.  dsa_allocate, for example, never
> > adds a header to the size of the allocation.
> 
> glibc's malloc does add a header. My half-informed suspicion is that
> most newer malloc backing allocators will have a header, because
> maintaining a shared lookup-by-address table is pretty expensive to
> maintain. A bit of metadata indicating size and/or source of the
> allocation makes using thread-local information a lot easier.

Sounds like it'd be smart to allocate something closer to
M_MMAP_THRESHOLD (which with typical values would be about double the
current RAW_BUF_SIZE value), minus a few dozen bytes to allow for
palloc's and malloc's respective headers.  That way we bet on an
mmap()ed allocation with minimum space wastage across all layers.
Not sure whether we want to try to minimize wastage through
clever use of malloc_usable_size() on each backend's first allocation --
that'd probably just be overengineering.
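
The arithmetic being suggested is roughly the following (a sketch only: the
header sizes are illustrative guesses, and 128kB is merely the glibc default
threshold quoted earlier, not something copy.c actually uses):

/* Illustrative sizing arithmetic only; the overhead numbers are guesses. */
#include <stdio.h>
#include <stddef.h>

#define MMAP_THRESHOLD_DEFAULT	(128 * 1024)	/* glibc default, tunable */
#define GUESSED_PALLOC_OVERHEAD	32	/* chunk header plus context slop */
#define GUESSED_MALLOC_OVERHEAD	16	/* allocator chunk header */

int
main(void)
{
	size_t		raw_buf_size = MMAP_THRESHOLD_DEFAULT
		- GUESSED_PALLOC_OVERHEAD
		- GUESSED_MALLOC_OVERHEAD;

	printf("request %zu bytes rather than 64 * 1024 + 1\n", raw_buf_size);
	return 0;
}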

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Robert Haas
On Tue, Jan 23, 2018 at 9:45 PM, Amit Kapila  wrote:
> Hmm, I think that case will be addressed because tuple queues can
> detect if the leader is not attached.  It does in code path
> shm_mq_receive->shm_mq_counterparty_gone.  In
> shm_mq_counterparty_gone, it can detect if the worker is gone by using
> GetBackgroundWorkerPid.  Moreover, I have manually tested this
> particular case before saying your patch is fine.  Do you have some
> other case in mind which I am missing?

Oh, right.  I had forgotten that I taught shm_mq_attach() to take an
optional BackgroundWorkerHandle precisely to solve this kind of
problem.  I can't decide whether to feel smart because I did that or
dumb because I forgot that I did it.

But I think there's still a problem: given recent commits, that will
cause us to ERROR out if we tried to read from a tuple queue for a
worker which failed to start, or that starts and then dies before
attaching to the queue.  But there's no guarantee we'll try to read
from that queue in the first place.  Normally, that gets triggered
when we get PROCSIG_PARALLEL_MESSAGE, but if the worker never started
up in the first place, it can't send that, and the postmaster won't.

In Thomas's test case, he's using 4 workers, and if even one of those
manages to start, then it'll probably do so after any fork failures
have already occurred, and the error will be noticed when that process
sends its first message to the leader through the error queue, because
it'll send PROCSIG_PARALLEL_MESSAGE via all the queues.  If all of the
workers fail to start, then that doesn't help.  But it still manages
to detect the failure in that case because it reaches
WaitForParallelWorkersToFinish, which we just patched.

But how does that happen, anyway?  Shouldn't it get stuck waiting for
the tuple queues to which nobody's attached yet?  The reason it
doesn't is because
ExecParallelCreateReaders() calls shm_mq_set_handle() for each queue,
which causes the tuple queues to be regarded as detached if the worker
fails to start.  A detached tuple queue, in general, is not an error
condition: it just means that worker has no more tuples.  So I think
what happens is as follows.  If parallel_leader_participation = on,
then the leader runs the whole plan, producing however many tuples the
plan should have generated, and then afterwards waits for the workers
to finish, tripping an error.  If parallel_leader_participation = off,
the leader doesn't try to run the plan and is perfectly happy with the
fact that every worker has produced no tuples; since, as far as it
knows, the plan is now fully executed, it waits for them to shut down,
tripping an error.

I guess that works, but it seems more like blind luck than good
design.  Parallel CREATE INDEX fails to be as "lucky" as Gather
because there's nothing in parallel CREATE INDEX that lets it skip
waiting for a worker which doesn't start -- and in general it seems
unacceptable to impose a coding requirement that all future parallel
operations must fail to notice the difference between a worker that
completed successfully and one that never ran at all.

If we made the Gather node wait only for workers that actually reached
the Gather -- either by using a Barrier or by some other technique --
then this would be a lot less fragile.  And the same kind of technique
would work for parallel CREATE INDEX.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Peter Geoghegan
On Wed, Jan 24, 2018 at 1:57 AM, Thomas Munro
 wrote:
> On Wed, Jan 24, 2018 at 5:25 PM, Thomas Munro
>  wrote:
>> If there were some way for the postmaster to cause reason
>> PROCSIG_PARALLEL_MESSAGE to be set in the leader process instead of
>> just notification via kill(SIGUSR1) when it fails to fork a parallel
>> worker, we'd get (1) for free in any latch/CFI loop code.  But I
>> understand that we can't do that by project edict.
>
> Based on the above observation, here is a terrible idea you'll all
> hate.  It is pessimistic and expensive: it thinks that every latch
> wake might be the postmaster telling us it's failed to fork() a
> parallel worker, until we've seen a sign of life on every worker's
> error queue.  Untested illustration code only.  This is the only way
> I've come up with to discover fork failure in any latch/CFI loop (ie
> without requiring client code to explicitly try to read either error
> or tuple queues).

The question, I suppose, is how expensive this is in the real world.
If it's actually not a cost that anybody is likely to notice, then I
think we should pursue this approach. I wouldn't put too much weight
on keeping this simple for users of the parallel infrastructure,
though, because something like Amit's WaitForParallelWorkersToAttach()
idea still seems acceptable. "Call this function before trusting the
finality of nworkers_launched" isn't too onerous a rule to have to
follow.
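
As a rough sketch, the calling pattern would presumably be something
like this (WaitForParallelWorkersToAttach() is only proposed at this
point, so the whole thing is illustrative):

LaunchParallelWorkers(pcxt);
WaitForParallelWorkersToAttach(pcxt);	/* proposed: would report fork failure here */
/* only now treat pcxt->nworkers_launched as final */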

-- 
Peter Geoghegan



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Tom Lane
I wrote:
> I find that to be a completely bogus straw-man argument.  The point of
> looking at the prairiedog time series is just to see a data series in
> which the noise level is small enough to discern the signal.  If anyone's
> got years worth of data off a more modern machine, and they can extract
> a signal from that, by all means let's consider that data instead.

Just to make the point, I scraped the numbers for skink's "installcheck"
and "check" steps, which are data series I imagine at least Andres will
concede are worth paying attention to.  I made no attempt to clean
outliers, so these curves are pretty noisy, but I think there is a very
clear upward bump since mid-December.

regards, tom lane



Re: copy.c allocation constant

2018-01-24 Thread Andres Freund
On 2018-01-24 14:25:37 -0500, Robert Haas wrote:
> On Wed, Jan 24, 2018 at 1:43 PM, Andres Freund  wrote:
> > Indeed. Don't think RAW_BUF_SIZE is quite big enough for that on most
> > platforms though. From man mallopt:
> >  Balancing  these  factors  leads  to a default setting of 128*1024 for the 
> > M_MMAP_THRESHOLD parameter.
> > Additionally, even when malloc() chooses to use mmap() to back an
> > allocation, it'll still need a header to know the size of the
> > allocation and such. So exactly using a size of a multiple of 4KB will
> > still leave you with wasted space.  Due to the latter I can't see it
> > mattering whether or not we add +1 to a power-of-two size.
> 
> Well, it depends on how it works.  dsa_allocate, for example, never
> adds a header to the size of the allocation.

glibc's malloc does add a header. My half-informed suspicion is that
most newer malloc backing allocators will have a header, because
maintaining a shared lookup-by-address table is pretty expensive.
A bit of metadata indicating size and/or source of the
allocation makes using thread-local information a lot easier.


> Allocations < 8kB are
> bucketed by size class and stored in superblocks carved up into
> equal-sized chunks.  Allocations > 8kB are rounded to a multiple of
> the 4kB page size and we grab that many consecutive free pages.  I
> didn't make those behaviors up; I copied them from elsewhere.  Some
> other allocator I read about did small-medium-large allocations: large
> with mmap(), medium with multiples of the page size, small with
> closely-spaced size classes.

Sure - all I'm trying to say is that it likely won't matter whether we use
power-of-two or power-of-two + 1, because due to overhead considerations
we'll likely not quite fit into a size class anyway.


> It doesn't seem like a particularly good idea to take a 64kB+1 byte
> allocation, stick a header on it, and pack it tightly up against other
> allocations on both sides.  Seems like that could lead to
> fragmentation problems.  Is that really what it does?

No, I'm fairly sure it's not.

Greetings,

Andres Freund



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-24 Thread Peter Geoghegan
On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila  wrote:
> Right, but what if the worker dies due to something proc_exit(1) or
> something like that before calling BarrierArriveAndWait.  I think this
> is part of the problem we have solved in
> WaitForParallelWorkersToFinish such that if the worker exits abruptly
> at any point due to some reason, the system should not hang.

I have used Thomas' chaos-monkey-fork-process.patch to verify:

1. The problem of fork failure causing nbtsort.c to wait forever is a
real problem. Sure enough, the coding pattern within
_bt_leader_heapscan() can cause us to wait forever even with commit
2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
consequence of the patch not using tuple queues (it uses the new
tuplesort sharing thing instead).

2. Simply adding a single call to WaitForParallelWorkersToFinish()
within _bt_leader_heapscan() before waiting on our condition variable
fixes the problem -- errors are reliably propagated, and we never end
up waiting forever.

3. This short term fix works just as well with
parallel_leader_participation=off.

At this point, my preferred solution is for someone to go implement
Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
like the logical person for the job). Once that's committed, I can
post a new version of the patch that uses that new infrastructure --
I'll add a call to the new function, without changing anything else.
Failing that, we could actually just use
WaitForParallelWorkersToFinish(). I still don't want to use a barrier,
mostly because it complicates  parallel_leader_participation=off,
something that Amit is in agreement with [2][3].

For now, I am waiting for feedback from Robert on next steps.

[1] 
https://postgr.es/m/CAH2-Wzm6dF=g9lywthgcqzrc4dzbe-8tv28yvg0xj8q6e4+...@mail.gmail.com
[2] 
https://postgr.es/m/CAA4eK1LEFd28p1kw2Fst9LzgBgfMbDEq9wPh9jWFC0ye6ce62A%40mail.gmail.com
[3] 
https://postgr.es/m/caa4ek1+a0of4m231vbgpr_0ygg_bnmrgzlib7wqde-fybsy...@mail.gmail.com
-- 
Peter Geoghegan



Re: pgindent run?

2018-01-24 Thread Andres Freund


On January 24, 2018 11:34:07 AM PST, Tom Lane  wrote:
>Andres Freund  writes:
>> There'd be one or two edge cases of bad formatting, but the
>> end result would be far less painful than what we have today, where
>> basically nobody can format their patches without a lot of manual
>> cherry-picking or large scale unrelated whitespace changes.
>
>IMO, the big problem here is that people commit large changes without
>having pgindent'd them first.

Well, I think it'd really have to be every patch that's indented properly. And 
that's hard given the way typedefs.list is maintained.  Without most new 
typedefs included one continually reindents with a lot of damage.

It'd be less bad if we automated the maintenance of the lists so
a) patch authors can automatically add to the list and include that in commits
b) the committed list gets updated automatically every few days based on bf 
results
c) there's automated whole tree pgindent runs every few days.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: pgindent run?

2018-01-24 Thread Tom Lane
Andres Freund  writes:
> FWIW, I think this problem could just as well be addressed with a few
> printing heuristics instead of needing an actual list of
> typedefs.

Step right up and implement that, and we'd all be happier.  Certainly
the typedefs list is a pain in the rear.

> There'd be one or two edge cases of bad formatting, but the
> end result would be far less painful than what we have today, where
> basically nobody can format their patches without a lot of manual
> cherry-picking or large scale unrelated whitespace changes.

IMO, the big problem here is that people commit large changes without
having pgindent'd them first.

regards, tom lane



Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > I'm afraid we may still get some push-back from existing users of
> > --clean since, with the change you're proposing, we wouldn't be cleaning
> > up anything that's been done to the public schema when it comes to
> > comment changes or ACL changes, right?
> 
> No, if you have a nondefault ACL, that will still get applied.  This
> arrangement would drop comment changes, but I can't get excited about
> that; it's certainly far less of an inconvenience in that scenario
> than dumping the comment is in non-superuser-restore scenarios.

That nondefault ACL from the system the pg_dump was run on will get
applied *over top* of whatever the current ACL is on the system where the
restore is being run, which may or may not be what's expected.

That's different from the case where the public schema is dropped and
recreated because, in that case, the schema will start out with an
empty ACL and the resulting ACL should end up matching what was on the
source system (ignoring the current issue with a non-clean pg_dump into
a custom format dump being used with a pg_restore --clean).

Just to be clear, I'm not suggesting that's a huge deal, but it's
certainly a change and something we should at least recognize and
contemplate.

I'm also not really worried about losing the comment.  I definitely
don't think it's worth putting in the kind of infrastructure that we put
in for init ACLs to handle comments being changed on initdb-time
objects.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [HACKERS] Proposal: Local indexes for partitioned table

2018-01-24 Thread Alvaro Herrera
I think this is the right fix for this problem.  I wonder about
exploring other callers of RelationGetIndexList to see who else could be
confused ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 53506fd3a9f0cb81334024bf5a0e8856fd8e5e82 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 24 Jan 2018 14:40:26 -0300
Subject: [PATCH] Ignore partitioned indexes in get_relation_info

---
 src/backend/optimizer/util/plancat.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8c60b35068..60f21711f4 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -208,6 +208,16 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 		}
 
 		/*
+		 * Ignore partitioned indexes, since they are not usable for
+		 * queries.
+		 */
+		if (indexRelation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+		{
+			index_close(indexRelation, NoLock);
+			continue;
+		}
+
+		/*
 		 * If the index is valid, but cannot yet be used, ignore it; but
 		 * mark the plan we are generating as transient. See
 		 * src/backend/access/heap/README.HOT for discussion.
-- 
2.11.0



Re: Foreign keys and partitioned tables

2018-01-24 Thread Alvaro Herrera
Alvaro Herrera wrote:
> This patch enables foreign key constraints to and from partitioned
> tables.

This version is rebased on current master.

0001: fix for a get_relation_info bug in current master.
  Posted in <20180124174134.ma4ui2kczmqwb4um@alvherre.pgsql>
0002: Allows local partitioned index to be unique;
  Posted in <2018015559.7pbzomvgp5iwmath@alvherre.pgsql>
0003: Allows FOR EACH ROW triggers on partitioned tables;
  Posted in <20180123221027.2qenwwpvgplrrx3d@alvherre.pgsql>

0004: the actual matter of this thread.
0005: bugfix for 0004, after recent changes I introduced in 0004.
  It's separate because I am undecided about it being the best
  approach; maybe further changes in 0003 are a better approach.

No further changes from the version I posted upthread.  Tests pass.  I'm
going to review this code now to see what further changes are needed (at
the very least, I think some dependency changes are in order; plus need
to add a few more tests for various ri_triggers.c code paths.)

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 53506fd3a9f0cb81334024bf5a0e8856fd8e5e82 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 24 Jan 2018 14:40:26 -0300
Subject: [PATCH v2 1/5] Ignore partitioned indexes in get_relation_info

---
 src/backend/optimizer/util/plancat.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8c60b35068..60f21711f4 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -208,6 +208,16 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 		}
 
 		/*
+		 * Ignore partitioned indexes, since they are not usable for
+		 * queries.
+		 */
+		if (indexRelation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+		{
+			index_close(indexRelation, NoLock);
+			continue;
+		}
+
+		/*
 		 * If the index is valid, but cannot yet be used, ignore it; but
 		 * mark the plan we are generating as transient. See
 		 * src/backend/access/heap/README.HOT for discussion.
-- 
2.11.0

>From f67c07baa0a211a42e42b1480c4ffd47b0a7fb06 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 6 Nov 2017 17:04:55 +0100
Subject: [PATCH v2 2/5] allow indexes on partitioned tables to be unique

---
 doc/src/sgml/ref/alter_table.sgml |   9 +-
 doc/src/sgml/ref/create_table.sgml|  16 +-
 src/backend/bootstrap/bootparse.y |   2 +
 src/backend/catalog/index.c   |  45 -
 src/backend/catalog/pg_constraint.c   |  76 
 src/backend/catalog/toasting.c|   4 +-
 src/backend/commands/indexcmds.c  | 125 +++--
 src/backend/commands/tablecmds.c  |  62 ++-
 src/backend/parser/analyze.c  |   7 +
 src/backend/parser/parse_utilcmd.c|  31 +---
 src/backend/tcop/utility.c|   1 +
 src/include/catalog/index.h   |   5 +-
 src/include/catalog/pg_constraint_fn.h|   4 +-
 src/include/commands/defrem.h |   1 +
 src/include/parser/parse_utilcmd.h|   3 +-
 src/test/regress/expected/alter_table.out |   8 -
 src/test/regress/expected/create_table.out|  12 --
 src/test/regress/expected/indexing.out| 254 +-
 src/test/regress/expected/insert_conflict.out |   2 +-
 src/test/regress/sql/alter_table.sql  |   2 -
 src/test/regress/sql/create_table.sql |   8 -
 src/test/regress/sql/indexing.sql | 151 ++-
 22 files changed, 740 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/ref/alter_table.sgml 
b/doc/src/sgml/ref/alter_table.sgml
index 286c7a8589..c00fd09fe1 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -804,8 +804,9 @@ ALTER TABLE [ IF EXISTS ] name
   This form attaches an existing table (which might itself be partitioned)
   as a partition of the target table. The table can be attached
   as a partition for specific values using FOR VALUES
-   or as a default partition by using DEFAULT
-  .  For each index in the target table, a corresponding
+   or as a default partition by using
+  DEFAULT.
+  For each index in the target table, a corresponding
   one will be created in the attached table; or, if an equivalent
   index already exists, will be attached to the target table's index,
   as if ALTER INDEX ATTACH 

Re: copy.c allocation constant

2018-01-24 Thread Robert Haas
On Wed, Jan 24, 2018 at 1:43 PM, Andres Freund  wrote:
> Indeed. Don't think RAW_BUF_SIZE is quite big enough for that on most
> platforms though. From man mallopt:
>  Balancing  these  factors  leads  to a default setting of 128*1024 for the 
> M_MMAP_THRESHOLD parameter.
> Additionally, even when malloc() chooses to use mmap() to back an
> allocation, it'll still need a header to know the size of the
> allocation and such. So exactly using a size of a multiple of 4KB will
> still leave you with wasted space.  Due to the latter I can't see it
> mattering whether or not we add +1 to a power-of-two size.

Well, it depends on how it works.  dsa_allocate, for example, never
adds a header to the size of the allocation.  Allocations < 8kB are
bucketed by size class and stored in superblocks carved up into
equal-sized chunks.  Allocations > 8kB are rounded to a multiple of
the 4kB page size and we grab that many consecutive free pages.  I
didn't make those behaviors up; I copied them from elsewhere.  Some
other allocator I read about did small-medium-large allocations: large
with mmap(), medium with multiples of the page size, small with
closely-spaced size classes.
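
To make the large-allocation rule above concrete, it amounts to
something like this (sketch only, not the actual dsa.c code):

/* Requests over 8kB are rounded up to whole 4kB pages, so a 64kB+1 byte
 * request would consume 68kB -- nearly a full page of waste. */
static size_t
round_to_pages(size_t request)
{
	return ((request + 4096 - 1) / 4096) * 4096;
}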

It doesn't seem like a particularly good idea to take a 64kB+1 byte
allocation, stick a header on it, and pack it tightly up against other
allocations on both sides.  Seems like that could lead to
fragmentation problems.  Is that really what it does?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Would a BGW need shmem_access or database_connection to enumerate databases?

2018-01-24 Thread Bruce Momjian
On Wed, Jan 24, 2018 at 01:48:05PM -0500, Chapman Flack wrote:
> Thanks! I had actually registered that one (with a related one)
> for CF 2018-03, having missed the deadline for -01:
> 
> https://commitfest.postgresql.org/17/1467/

OK, thanks.  I added a commitfest item annotation to say that the doc
part of this entry has been applied.

---


> 
> -Chap
> 
> On 01/24/2018 01:20 PM, Bruce Momjian wrote:
> > On Thu, Dec 14, 2017 at 06:12:35PM -0500, Chapman Flack wrote:
> >> On 12/04/2017 09:13 AM, Craig Ringer wrote:
> >>> On 1 December 2017 at 23:04, Chapman Flack  wrote:
>  Can I call RegisterDynamicBackgroundWorker when not in the postmaster,
>  but also not in a "regular backend", but rather another BGW?
> 
> >>> Yes. BDR does it a lot.
> >>
> >> Would this doc patch be acceptable to clarify that, in case
> >> I'm not the last person who might wonder?
> > 
> > Thanks, patch applied to head.
> > 
> > ---
> > 
> >> >From 3308ef5647e8ce4a84855b4d0ca09ba6aeb7 Mon Sep 17 00:00:00 2001
> >> From: Chapman Flack 
> >> Date: Thu, 14 Dec 2017 18:09:14 -0500
> >> Subject: [PATCH] Clarify that a BGW can register a dynamic BGW.
> >>
> >> ---
> >>  doc/src/sgml/bgworker.sgml | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
> >> index 4bc2b69..e490bb8 100644
> >> --- a/doc/src/sgml/bgworker.sgml
> >> +++ b/doc/src/sgml/bgworker.sgml
> >> @@ -41,7 +41,7 @@
> >>*worker, BackgroundWorkerHandle **handle).  Unlike
> >>RegisterBackgroundWorker, which can only be called 
> >> from within
> >>the postmaster, RegisterDynamicBackgroundWorker 
> >> must be
> >> -  called from a regular backend.
> >> +  called from a regular backend, possibly another background worker.
> >>   
> >>  
> >>   
> >> -- 
> >> 1.8.3.1
> >>
> > 
> > 
> 
> 

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: WIP Patch: Precalculate stable functions, infrastructure v1

2018-01-24 Thread Andres Freund
Hi,


On 2018-01-16 17:05:01 -0500, Tom Lane wrote:
> [ I'm sending this comment separately because I think it's an issue
> Andres might take an interest in. ]

Thanks for that. I indeed am interested. Sorry for the late response,
was very deep into the JIT patch.


> Marina Polyakova  writes:
> > [ v7-0001-Precalculate-stable-and-immutable-functions.patch ]
> 
> Another thing that's bothering me is that the execution semantics
> you're proposing for CachedExpr seem rather inflexible.  AFAICS, once a
> CachedExpr has run once, it will hang on to the result value and keep
> returning that for the entire lifespan of the compiled expression.
> We already noted that that breaks plpgsql's "simple expression"
> logic, and it seems inevitable to me that it will be an issue for
> other places as well.  I think it'd be a better design if we had some
> provision for resetting the cached values, short of recompiling the
> expression from scratch.

Hm. Yes, that makes me uncomfortable as well.


> One way that occurs to me to do this is to replace the simple boolean
> isExecuted flags with a generation counter, and add a master generation
> counter to ExprState.  The rule for executing CachedExpr would be "if my
> generation counter is different from the ExprState's counter, then
> evaluate the subexpression and copy the ExprState's counter into mine".
> Then the procedure for forcing recalculation of cached values is just to
> increment the ExprState's counter.  There are other ways one could imagine
> doing this --- for instance, I initially thought of keeping the master
> counter in the ExprContext being used to run the expression.  But you need
> some way to remember what counter value was used last with a particular
> expression, so probably keeping it in ExprState is better.
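
For reference, the quoted counter rule boils down to roughly the
following; the field and helper names are invented for illustration and
are not the actual executor structures:

/* Illustration only -- invented names. */
if (op->d.cachedexpr.generation != state->cache_generation)
{
	op->d.cachedexpr.cached_value = EvalCachedSubexpr(state, op);	/* invented helper */
	op->d.cachedexpr.generation = state->cache_generation;
}
*op->resvalue = op->d.cachedexpr.cached_value;
/* forcing recalculation is then just: state->cache_generation++ */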

I'm not a big fan of this solution. We seem to be inventing more and
more places we keep state, rather than the contrary.


> Or we could just brute-force it by providing a function that runs through
> a compiled expression step list and resets the isExecuted flag for each
> EEOP_CACHEDEXPR_IF_CACHED step it finds.  A slightly less brute-force
> way is to link those steps together in a list, so that the function
> doesn't have to visit irrelevant steps.  If the reset function were seldom
> used then the extra cycles for this wouldn't be very expensive.  But I'm
> not sure it will be seldom used --- it seems like plpgsql simple
> expressions will be doing this every time --- so I think the counter
> approach might be a better idea.

Hm, that sounds like it'd not be cheap.


> I'm curious to know whether Andres has some other ideas, or whether he
> feels this is all a big wart on the compiled-expression concept.

I don't have too many "artistic" concerns from the compiled expression
POV. The biggest issue I see is that it'll make it a bit harder to
separate out the expression compilation phase from the expression
instantiation phase - something I think we definitely want.


> I don't think there are any existing cases where we keep any
> meaningful state across executions of a compiled-expression data
> structure; maybe that's a bad idea in itself.

To me, who has *not* followed the thread in detail, it sounds like the
relevant data shouldn't be stored inside the expression itself.  For
one, we do not want to have to visit every single simple expression and
reset them; for another, it architecturally doesn't seem like the right place
to me.  Storing all cached values in an EState or ExprContext (the
latter referring to the former), somewhat like the values for Params,
sounds a lot more reasonable to me.
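
Something along these lines, purely as an illustration (the field names
are invented, placed alongside the existing parameter-related fields in
EState):

/* hypothetical EState additions, illustration only */
Datum	   *es_cached_stable_values;	/* one slot per cached stable-function call */
bool	   *es_cached_stable_isset;		/* has the corresponding slot been filled? */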

Besides that it seems to make it a lot easier to reset the values, it
also seems like it makes it a lot cleaner to cache stable functions
across multiple expressions in different places in a query? ISTM having
expression steps to actually compute the expression value in every
referencing expression is quite the waste.

This all reminds me a lot of the infrastructure for Params...

Greetings,

Andres Freund



Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Tom Lane
Stephen Frost  writes:
> I'm afraid we may still get some push-back from existing users of
> --clean since, with the change you're proposing, we wouldn't be cleaning
> up anything that's been done to the public schema when it comes to
> comment changes or ACL changes, right?

No, if you have a nondefault ACL, that will still get applied.  This
arrangement would drop comment changes, but I can't get excited about
that; it's certainly far less of an inconvenience in that scenario
than dumping the comment is in non-superuser-restore scenarios.

regards, tom lane



Re: Would a BGW need shmem_access or database_connection to enumerate databases?

2018-01-24 Thread Chapman Flack
Thanks! I had actually registered that one (with a related one)
for CF 2018-03, having missed the deadline for -01:

https://commitfest.postgresql.org/17/1467/

-Chap

On 01/24/2018 01:20 PM, Bruce Momjian wrote:
> On Thu, Dec 14, 2017 at 06:12:35PM -0500, Chapman Flack wrote:
>> On 12/04/2017 09:13 AM, Craig Ringer wrote:
>>> On 1 December 2017 at 23:04, Chapman Flack  wrote:
 Can I call RegisterDynamicBackgroundWorker when not in the postmaster,
 but also not in a "regular backend", but rather another BGW?

>>> Yes. BDR does it a lot.
>>
>> Would this doc patch be acceptable to clarify that, in case
>> I'm not the last person who might wonder?
> 
> Thanks, patch applied to head.
> 
> ---
> 
>> >From 3308ef5647e8ce4a84855b4d0ca09ba6aeb7 Mon Sep 17 00:00:00 2001
>> From: Chapman Flack 
>> Date: Thu, 14 Dec 2017 18:09:14 -0500
>> Subject: [PATCH] Clarify that a BGW can register a dynamic BGW.
>>
>> ---
>>  doc/src/sgml/bgworker.sgml | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
>> index 4bc2b69..e490bb8 100644
>> --- a/doc/src/sgml/bgworker.sgml
>> +++ b/doc/src/sgml/bgworker.sgml
>> @@ -41,7 +41,7 @@
>>*worker, BackgroundWorkerHandle **handle).  Unlike
>>RegisterBackgroundWorker, which can only be called 
>> from within
>>the postmaster, RegisterDynamicBackgroundWorker must 
>> be
>> -  called from a regular backend.
>> +  called from a regular backend, possibly another background worker.
>>   
>>  
>>   
>> -- 
>> 1.8.3.1
>>
> 
> 




Re: copy.c allocation constant

2018-01-24 Thread Andres Freund
On 2018-01-24 13:19:19 -0500, Robert Haas wrote:
> On Wed, Jan 24, 2018 at 12:25 PM, Tomas Vondra
>  wrote:
> > At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
> > with similar ideas (freelists, ...) so hopefully it's fine too.
> >
> > And then there are the systems without glibc, or with other libc
> > implementations. No idea about those.
> 
> My guess is that a fairly common pattern for larger chunks will be to
> round the size up to a multiple of 4kB, the usual memory page size.

Indeed. Don't think RAW_BUF_SIZE is quite big enough for that on most
platforms though. From man mallopt:
 Balancing  these  factors  leads  to a default setting of 128*1024 for the 
M_MMAP_THRESHOLD parameter.
Additionally, even when malloc() chooses to use mmap() to back an
allocation, it'll still need a header to know the size of the
allocation and such. So exactly using a size of a multiple of 4KB will
still leave you with wasted space.  Due to the latter I can't see it
mattering whether or not we add +1 to a power-of-two size.

There's malloc_usable_size() returning the actual size of an
allocation. It'd probably be worthwhile to use that in a few of our own
allocator routines. If not available on a platform we can just have a
stub that returns the requested size...
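
A minimal sketch of that fallback; HAVE_MALLOC_USABLE_SIZE is assumed
here to be a configure-provided macro, it doesn't exist today:

#ifdef HAVE_MALLOC_USABLE_SIZE
#include <malloc.h>
#define actual_allocation_size(ptr, requested)	malloc_usable_size(ptr)
#else
#define actual_allocation_size(ptr, requested)	(requested)		/* stub */
#endif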

Greetings,

Andres Freund



Re: [PROPOSAL] Shared Ispell dictionaries

2018-01-24 Thread Arthur Zakirov
2018-01-24 20:57 GMT+03:00 Tomas Vondra :
>
> Thanks. I don't have time to review/test this before FOSDEM, but a
> couple of comments regarding some of the points you mentioned.
>

Thank you for your thoughts.


> > I thought about it. And it seems to me that we can use functions
> > ts_unload() and ts_reload() instead of new syntax. We already have
> > text search functions like ts_lexize() and ts_debug(), and it is
> > better to keep consistency.
>
> This argument seems a bit strange. Both ts_lexize() and ts_debug() are
> operating on text values, and are meant to be executed as functions from
> SQL - particularly ts_lexize(). It's hard to imagine this implemented as
> DDL commands.
>
> The unload/reload is something that operates on a database object
> (dictionary), which already has create/drop/alter DDL. So it seems
> somewhat natural to treat unload/reload as another DDL action.
>
> Taken to an extreme, this argument would essentially mean we should not
> have any DDL commands because we have SQL functions.
>
> That being said, I'm not particularly attached to having this DDL now.
> Implementing it seems straight-forward (particularly when we already
> have the stuff implemented as functions), and some of the other open
> questions seem more important to tackle now.
>

I understand your opinion. I don't have a strong opinion on the subject yet.
And I agree that they can be implemented as future improvements for shared
dictionaries.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: pgindent run?

2018-01-24 Thread Andres Freund
On 2018-01-23 22:22:47 -0500, Bruce Momjian wrote:
> On Tue, Nov 28, 2017 at 04:38:12PM -0500, Tom Lane wrote:
> > Thomas Munro  writes:
> > > On Wed, Nov 29, 2017 at 9:47 AM, Tom Lane  wrote:
> > >> I think that'd be taking it too far, especially given that the dependency
> > >> on a typedefs list means that the git hook might have a different idea
> > >> of what's correctly indented than the committer does.  It'd be very hard
> > >> to debug such discrepancies and figure out what would satisfy the hook.
> > 
> > > Is there no way to get reasonable indentation that doesn't depend on
> > > that typedefs list?
> > 
> > Perhaps, but not with the tool we've got.
> > 
> > It's well known that C is unparseable without knowing which identifiers
> > are typedefs, so it doesn't exactly surprise me that it might not be
> > sanely indentable without knowing that.  But I've not thought hard about
> > it, nor looked for alternate tools.
> 
> For people curious about the C typedef parsing details:
> 
>   http://calculist.blogspot.com/2009/02/c-typedef-parsing-problem.html
>   
> https://stackoverflow.com/questions/243383/why-cant-c-be-parsed-with-a-lr1-parser

FWIW, I think this problem could just as well be addressed with a few
printing heuristics instead of needing an actual list of
typedefs. There'd be one or two edge cases of bad formatting, but the
end result would be far less painful than what we have today, where
basically nobody can format their patches without a lot of manual
cherry-picking or large scale unrelated whitespace changes.

Greetings,

Andres Freund



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Andres Freund
Hi,

On 2018-01-24 13:11:22 -0500, Robert Haas wrote:
> So for me, the additional hash index tests don't cost anything
> measurable and the additional hash join tests cost about a second.  I
> think this probably accounts for why committers other than you keep
> "adding so much time to the regression tests".  On modern hardware,
> the costs just don't matter.

I very much agree with the general sentiment, but a second of a 25s test
certainly isn't nothing.  As I've just written a few messages upthread,
I think we can hide the overall timing costs to a much larger degree
than we're doing, but I don't think we can ignore the cost entirely.


> Now, how much should we care about the performance of software with a
> planned release date of 2018 on hardware discontinued in 2001,
> hardware that is apparently about 20 times slower than a modern
> laptop?  Some, perhaps, but maybe not a whole lot.  Removing tests
> that have found actual bugs because they cost runtime on ancient
> systems that nobody uses for serious work doesn't make sense to me.

I again agree with the sentiment. One caveat is that old machines also
somewhat approximate testing with more instrumentation / debugging
enabled (say valgrind, CLOBBER_CACHE_ALWAYS, etc). So removing excessive
test overhead still has quite some benefit. But I definitely do not
want to lower coverage to achieve it.

Greetings,

Andres Freund



Re: PATCH: Exclude unlogged tables from base backups

2018-01-24 Thread David Steele
Hi Masahiko,

Thanks for the review!

On 1/22/18 3:14 AM, Masahiko Sawada wrote:
> On Thu, Dec 14, 2017 at 11:58 PM, Robert Haas  wrote:
>>
>> We would also have a problem if the missing file caused something in
>> recovery to croak on the grounds that the file was expected to be
>> there, but I don't think anything works that way; I think we just
>> assume missing files are an expected failure mode and silently do
>> nothing if asked to remove them.
> 
> I also couldn't see a problem in this approach.
> 
> Here is the first review comments.
> 
> +   unloggedDelim = strrchr(path, '/');
> 
> I think it doesn't work fine on windows. How about using
> last_dir_separator() instead?

I think this function is OK on Windows -- we use it quite a bit.
However, last_dir_separator() is clearer so I have changed it.
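
That is, the sketch of the change is simply:

	unloggedDelim = last_dir_separator(path);	/* also copes with '\' separators on Windows */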

> 
> + * Find all unlogged relations in the specified directory and return
> their OIDs.
> 
> What the ResetUnloggedrelationsHash() actually returns is a hash
> table. The comment of this function seems not appropriate.

Fixed.

> +   /* Part of path that contains the parent directory. */
> +   int parentPathLen = unloggedDelim - path;
> +
> +   /*
> +* Build the unlogged relation hash if the parent path is 
> either
> +* $PGDATA/base or a tablespace version path.
> +*/
> +   if (strncmp(path, "./base", parentPathLen) == 0 ||
> +   (parentPathLen >=
> (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) &&
> +strncmp(unloggedDelim -
> (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
> +TABLESPACE_VERSION_DIRECTORY,
> +
> sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
> +   unloggedHash = ResetUnloggedRelationsHash(path);
> +   }
> 
> How about using get_parent_directory() to get parent directory name?

get_parent_directory() munges the string that is passed to it which I
was trying to avoid (we'd need a copy) - and I don't think it makes the
rest of the logic any simpler without constructing yet another string to
hold the tablespace path.

I know performance isn't the most important thing here, so if the
argument is for clarity perhaps it makes sense. Otherwise I don't know
if it's worth it.
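
For what it's worth, the copy-based variant under discussion would look
roughly like this (sketch only; the comparison logic is elided):

/* get_parent_directory() modifies its argument in place, hence the copy. */
char		parent[MAXPGPATH];

strlcpy(parent, path, sizeof(parent));
get_parent_directory(parent);
/* ... compare parent against "./base" or the tablespace version dir ... */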

> Also, I think it's better to destroy the unloggedHash after use.

Whoops! Fixed.

> +   /* Exclude all forks for unlogged tables except the
> init fork. */
> +   if (unloggedHash && ResetUnloggedRelationsMatch(
> +   unloggedHash, de->d_name) == unloggedOther)
> +   {
> +   elog(DEBUG2, "unlogged relation file \"%s\"
> excluded from backup",
> +de->d_name);
> +   continue;
> +   }
> 
> I think it's better to log this debug message at DEBUG2 level for
> consistency with other messages.

I think you mean DEBUG1?  It's already at DEBUG2.

I considered using DEBUG1 but decided against it.  The other exclusions
will produce a limited amount of output because there are only a few of
them.  In the case of unlogged tables there could be any number of
exclusions and I thought that was too noisy for DEBUG1.

> +   ok(!-f "$tempdir/tbackup/tblspc1/$tblspc1UnloggedBackupPath",
> +   'unlogged imain fork not in tablespace backup');
> 
> s/imain/main/

Fixed.

> If a new unlogged relation is created after the unloggedHash is
> constructed but before the files are sent, we cannot exclude that
> relation. It would not be a problem if taking the backup is quick,
> because the new unlogged relation is unlikely to become very large.
> However, if taking a backup takes a long time, we could include a
> large main fork in the backup.

This is a good point.  It's per database directory which makes it a
little better, but maybe not by much.

Three options here:

1) Leave it as is knowing that unlogged relations created during the
backup may be copied and document it that way.

2) Construct a list for SendDir() to work against so the gap between
creating that and creating the unlogged hash is as small as possible.
The downside here is that the list may be very large and take up a lot
of memory.

3) Check each file that looks like a relation in the loop to see if it
has an init fork.  This might affect performance since an
opendir/readdir loop would be required for every relation.

Personally, I'm in favor of #1, at least for the time being.  I've
updated the docs as indicated in case you and Adam agree.

New patches attached.

Thanks!
-- 
-David
da...@pgmasters.net
diff --git a/src/test/recovery/t/014_unlogged_reinit.pl 
b/src/test/recovery/t/014_unlogged_reinit.pl
new file mode 100644
index 00..ac2e251158
--- /dev/null
+++ b/src/test/recovery/t/014_unlogged_reinit.pl
@@ -0,0 +1,117 @@
+# Tests that unlogged tables are properly reinitialized after a crash.
+#
+# The 

Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> In further testing of that, I noticed that it made the behavior of our
> >> other bugaboo, the public schema, rather inconsistent.  With this
> >> builtin-extensions hack, the plpgsql extension doesn't get dumped,
> >> whether or not you say "clean".  But the public schema does get
> >> dropped and recreated if you say "clean".  That's not helpful for
> >> non-superuser users of pg_dump, so I think we should try to fix it.
> 
> > I'm not entirely sure about trying to also support --clean for
> > non-superusers..  We've long had that the public schema is dropped and
> > recreated with --clean and it seems likely that at least some users are
> > depending on us doing that.  In any case, it's certainly not a change
> > that I think we could backpatch.  Perhaps we could just change it moving
> > forward (which would make me happier, really, since what I think we do
> > with these initdb-time things currently is a bit bizarre).
> 
> Sure, I was not proposing this for back-patch --- it depends on the
> other stuff we've committed recently, anyway.
> 
> > Yes, having that in getNamespaces() isn't correct but we need to do
> > something there, and I've been trying to figure out what.
> 
> I claim this is what ;-)

Well, that would make things in v11 better, certainly.  I suppose for
back-patching, I can just go make the change to pg_restore that I
outlined and that would deal with the complaints we've gotten there.

I'm afraid we may still get some push-back from existing users of
--clean since, with the change you're proposing, we wouldn't be cleaning
up anything that's been done to the public schema when it comes to
comment changes or ACL changes, right?

A typical use-case of pg_dump with --clean is to 'reset' a system
after, say, dumping out some subset of a production system and then
loading it into the development environment that all of the devs have
full access to, and where there might have been changes to the 'public'
that you want to revert (to get it back to looking like how 'prod' was).
In current versions this should result in at least the ACLs on public
matching what they are on prod, along with the comment and any other
changes done to it, but we would lose that if we stop doing drop/create
of the public schema on --clean.

Then again, the DROP SCHEMA will fail if any objects end up created in
the public schema anyway, so it isn't like that's a complete solution
regardless.  I suppose it's a bit surprising that we haven't been asked
for a --clean-cascade option.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: FOR EACH ROW triggers on partitioned tables

2018-01-24 Thread Robert Haas
On Tue, Jan 23, 2018 at 5:10 PM, Alvaro Herrera  wrote:
> The main question is this: when running the trigger function, it is
> going to look as it is running in the context of the partition, not in
> the context of the parent partitioned table (TG_RELNAME etc).  That
> seems mildly ugly: some users may be expecting that the partitioning
> stuff is invisible to the rest of the system, so if you have triggers on
> a regular table and later on decide to partition that table, the
> behavior of triggers will change, which is maybe unexpected.  Maybe this
> is not really a problem, but I'm not sure and would like further
> opinions.

It doesn't seem either great or horrible.

Also, what about logical replication?  Amit just raised this issue for
the UPDATE row movement patch, and it seems like the issues are
similar here.  If somebody's counting on the same kinds of per-row
triggers to fire during logical replication as we do during the
original operation, they will be disappointed.

>> Also, does ALTER TABLE ... ENABLE/DISABLE TRIGGER do the right things on
>> partitioned tables?
>
> Haven't done this yet, either.  I like Simon's suggestion of outright
> disallowing this.

Why not just make it work?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Would a BGW need shmem_access or database_connection to enumerate databases?

2018-01-24 Thread Bruce Momjian
On Thu, Dec 14, 2017 at 06:12:35PM -0500, Chapman Flack wrote:
> On 12/04/2017 09:13 AM, Craig Ringer wrote:
> > On 1 December 2017 at 23:04, Chapman Flack  wrote:
> >> Can I call RegisterDynamicBackgroundWorker when not in the postmaster,
> >> but also not in a "regular backend", but rather another BGW?
> >>
> > Yes. BDR does it a lot.
> 
> Would this doc patch be acceptable to clarify that, in case
> I'm not the last person who might wonder?

Thanks, patch applied to head.

---

> >From 3308ef5647e8ce4a84855b4d0ca09ba6aeb7 Mon Sep 17 00:00:00 2001
> From: Chapman Flack 
> Date: Thu, 14 Dec 2017 18:09:14 -0500
> Subject: [PATCH] Clarify that a BGW can register a dynamic BGW.
> 
> ---
>  doc/src/sgml/bgworker.sgml | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
> index 4bc2b69..e490bb8 100644
> --- a/doc/src/sgml/bgworker.sgml
> +++ b/doc/src/sgml/bgworker.sgml
> @@ -41,7 +41,7 @@
>*worker, BackgroundWorkerHandle **handle).  Unlike
>RegisterBackgroundWorker, which can only be called 
> from within
>the postmaster, RegisterDynamicBackgroundWorker must 
> be
> -  called from a regular backend.
> +  called from a regular backend, possibly another background worker.
>   
>  
>   
> -- 
> 1.8.3.1
> 


-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Andres Freund
On 2018-01-23 14:24:56 -0500, Robert Haas wrote:
> Right, but this doesn't seem to show any big spike in the runtime at
> the time when parallel hash was committed, or when the preparatory
> patch to add test coverage for hash joins got committed.  Rather,
> there's a gradual increase over time.  Either we're making the server
> slower (which would be bad) or we're adding proper test coverage for
> all the new features that we're adding (which would be good).  We
> can't expect every feature patch to preserve the runtime of the tests
> absolutely unchanged; figuring out what can be optimized is a separate
> exercise from adding test coverage either for new things or for things
> that weren't previously covered.

Agreed.

On the improvement front, my observation is that we rarely are actually
cpu bound across processes. One thing I've been wondering is whether we
can get a pretty large win from just rescheduling
parallel_schedule. There definitely are individual testfiles that take a
lot longer than others, but their positioning in groups doesn't
necessarily reflect that.

Besides manually reordering the schedule, I think it might be time that
we improve pg_regress's scheduling. One big first step would e.g. be to
not manually limit the number of parallel tests in a group to 20, but
instead allow larger groups and only run a limited number of them in
parallel. If done right we could start the next test in a group as soon
as *one* task in a group has finished, rather than waiting for all of
them to finish as we currently do for (sub-)groups.
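
In pseudo-C the idea is essentially a bounded work queue rather than
lock-step groups (sketch only; start_test() and reap_one_test() are
invented stand-ins, not pg_regress functions):

int		running = 0;
int		next = 0;

while (next < ntests || running > 0)
{
	while (running < max_parallel && next < ntests)
	{
		start_test(tests[next++]);
		running++;
	}
	reap_one_test();			/* e.g. block in waitpid() until any child exits */
	running--;
}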

Besides larger groups and starting the next test(s) earlier, another way to
gain pretty large improvements would be a test schedule feature that
allowed stating dependencies between tests. So instead of manually
grouping the schedule, have 'numerology' state that it depends on int2,
int4, int8, float4, float8, which means it can actually be started
earlier than it currently can in many cases.

Greetings,

Andres Freund



Re: copy.c allocation constant

2018-01-24 Thread Robert Haas
On Wed, Jan 24, 2018 at 12:25 PM, Tomas Vondra
 wrote:
> At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
> with similar ideas (freelists, ...) so hopefully it's fine too.
>
> And then there are the systems without glibc, or with other libc
> implementations. No idea about those.

My guess is that a fairly common pattern for larger chunks will be to
round the size up to a multiple of 4kB, the usual memory page size.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgsql: Add parallel-aware hash joins.

2018-01-24 Thread Robert Haas
On Tue, Jan 23, 2018 at 6:10 PM, Tom Lane  wrote:
> Looking more closely at the shorter series, there are four pretty obvious
> step changes since 2016-09.  The PNG's x-axis doesn't have enough
> resolution to match these up to commits, but looking at the underlying
> data, they clearly correspond to:
>
> Branch: master Release: REL_10_BR [b801e1200] 2016-10-18 15:57:58 -0400
> Improve regression test coverage for hash indexes.
>
> Branch: master Release: REL_10_BR [4a8bc39b0] 2017-04-12 16:17:53 -0400
> Speed up hash_index regression test.
>
> Branch: master [fa330f9ad] 2017-11-29 16:06:50 -0800
> Add some regression tests that exercise hash join code.
>
> Branch: master [180428404] 2017-12-21 00:43:41 -0800
> Add parallel-aware hash joins.
>
> I thought that the hash index test case was excessively expensive for
> what it covered, and I'm now thinking the same about hash joins.

Hmm.  I guess I'm insulated from some of the problem here by my choice
of hardware. On my laptop, 'make check' takes between 25.5 and 26
seconds (on 28e04155f17cabda7a18aee31d130aa10e25ee86).  If I remove
the hash_index test from parallel_schedule, it still takes between
25.5 and 26 seconds.  If I also remove the join test in its entirety,
it drops down to 24-24.5 seconds.  If I put hash_index and join back
in the schedule file but revert join.sql and join.out to the version
just before fa330f9ad, it takes about 24.5 seconds.  So for me, the
additional hash index tests don't cost anything measurable and the
additional hash join tests cost about a second.  I think this probably
accounts for why committers other than you keep "adding so much time
to the regression tests".  On modern hardware, the costs just don't
matter.  As a further point of reference, on this machine, 9.5 stable
is 24.5-25 seconds, and 9.3 is 25.5-26 seconds, so from here it looks
like in the last 5 years the speed of 'make check' is within a half
second or so of the performance we had 5 years ago even though the
volume of the regression tests in terms of lines of SQL code has
increased by more than 50% in the same time period.

Now, how much should we care about the performance of software with a
planned release date of 2018 on hardware discontinued in 2001,
hardware that is apparently about 20 times slower than a modern
laptop?  Some, perhaps, but maybe not a whole lot.  Removing tests
that have found actual bugs because they cost runtime on ancient
systems that nobody uses for serious work doesn't make sense to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> In further testing of that, I noticed that it made the behavior of our
>> other bugaboo, the public schema, rather inconsistent.  With this
>> builtin-extensions hack, the plpgsql extension doesn't get dumped,
>> whether or not you say "clean".  But the public schema does get
>> dropped and recreated if you say "clean".  That's not helpful for
>> non-superuser users of pg_dump, so I think we should try to fix it.

> I'm not entirely sure about trying to also support --clean for
> non-superusers..  We've long had that the public schema is dropped and
> recreated with --clean and it seems likely that at least some users are
> depending on us doing that.  In any case, it's certainly not a change
> that I think we could backpatch.  Perhaps we could just change it moving
> forward (which would make me happier, really, since what I think we do
> with these initdb-time things currently is a bit bizarre).

Sure, I was not proposing this for back-patch --- it depends on the
other stuff we've committed recently, anyway.

> Yes, having that in getNamespaces() isn't correct but we need to do
> something there, and I've been trying to figure out what.

I claim this is what ;-)

regards, tom lane



Re: Is it valid to have logical replication between 2 databases on the same postgres server?

2018-01-24 Thread Ryan Murphy
Thanks all!  I'll try creating the replication slot manually.


Re: [PROPOSAL] Shared Ispell dictionaries

2018-01-24 Thread Tomas Vondra
Hi,

On 01/24/2018 06:20 PM, Arthur Zakirov wrote:
> On Sat, Jan 13, 2018 at 06:22:41PM +0300, Arthur Zakirov wrote:
>> I think your proposals may be implemented in several patches, so
>> they can be applyed independently but consistently. I suppose I
>> will prepare new version of the patch with fixes and with initial
>> design of new functions and commands soon.
> 
> I attached new version of the patch.
> 

Thanks. I don't have time to review/test this before FOSDEM, but a
couple of comments regarding some of the points you mentioned.

>> 3) How do I unload a dictionary from the shared memory?
>> ...
>> ALTER TEXT SEARCH DICTIONARY x UNLOAD
>>
>> 4) How do I reload a dictionary?
>> ...
>> ALTER TEXT SEARCH DICTIONARY x RELOAD
> 
> I thought about it. And it seems to me that we can use functions 
> ts_unload() and ts_reload() instead of new syntax. We already have
> text search functions like ts_lexize() and ts_debug(), and it is
> better to keep consistency.

This argument seems a bit strange. Both ts_lexize() and ts_debug() are
operating on text values, and are meant to be executed as functions from
SQL - particularly ts_lexize(). It's hard to imagine this implemented as
DDL commands.

The unload/reload is something that operates on a database object
(dictionary), which already has create/drop/alter DDL. So it seems
somewhat natural to treat unload/reload as another DDL action.

Taken to an extreme, this argument would essentially mean we should not
have any DDL commands because we have SQL functions.

That being said, I'm not particularly attached to having this DDL now.
Implementing it seems straight-forward (particularly when we already
have the stuff implemented as functions), and some of the other open
questions seem more important to tackle now.

> I think there are two approaches for ts_unload():
> - use DSM's pin and unpin methods and the invalidation callback, as
> was done while fixing the memory leak. It has the drawback that it won't
> have an immediate effect, because the DSM segment will be released only
> when all backends unpin the DSM mapping.
> - use DSA and the dsa_free() method. As far as I understand, dsa_free()
> frees allocated memory immediately. But it requires more work,
> because we will need some more locks. For instance, what happens
> when someone calls ts_lexize() and someone else calls dsa_free() at
> the same time?


No opinion on this yet, I have to think about it for a bit and look at
the code first.

>> 7) You mentioned you had to get rid of the compact_palloc0 - can you
>> elaborate a bit why that was necessary? Also, when benchmarking the
>> impact of this make sure to measure not only the time but also memory
>> consumption.
> 
> It seems to me that there is no need for compact_palloc0() anymore. Tests
> show that the Czech dictionary doesn't consume more memory after the
> patch.
> 

That's interesting. I'll do some additional tests to verify the finding.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

2018-01-24 Thread Stephen Frost
Greetings,

* chenhj (chjis...@163.com) wrote:
> 
> At 2018-01-23 09:56:48, "Stephen Frost"  wrote:
> >I've only read through the thread to try and understand what's going on
> >and the first thing that comes to mind is that you're changing
> >pg_rewind to not remove the WAL from before the divergence (split)
> >point, but I'm not sure why.  As noted, that WAL isn't needed for
> >anything (it's from before the split, after all), so why keep it?  Is
> >there something in this optimization that depends on the old WAL being
> >there and, if so, what and why?
> 
> After running pg_rewind, the first startup of postgres will do crash recovery.
> And crash recovery will begin from the previous redo point preceding the 
> divergence.
> So, the WAL after the redo point and before the divergence is needed.

Right.

> Of course, the WAL before the redo point is not needed, but from my point
> of view,
> recycling unwanted WAL does not have to be done by pg_rewind.

That's what pg_rewind has been doing though, isn't it?  And it's not
like that WAL is useful for anything, is it?  That's also how
pg_basebackup works.

> >That's also different from how pg_basebackup works, which I don't think
> >is good (seems like pg_rewind should operate in a pretty similar manner
> >to pg_basebackup).
> 
> Thanks for your comments!
> I also considered copying WAL just like pg_basebackup does, but an
> implementation similar to pg_basebackup's manner may not be so simple.

Using the replication protocol to fetch WAL would be a good thing to do
(actually, making pg_rewind entirely work through a connection to the
current primary would be great) but that's independent of what I'm
asking for here.  Here I'm just suggesting that we not change what
pg_rewind is doing today when it comes to the existing WAL on the
old-primary.

> And the WAL which contains the last redo point preceding the divergence
> may exist only on the target server and may have already been recycled
> on the source. That's a difference between pg_rewind and pg_basebackup.

Hm, pg_rewind was removing that and expecting it to be on the new
primary?  If that's the case then I could see an argument for keeping
WAL that's from the divergence point onward, but I still don't think we
should have pg_rewind just leave all of the prior WAL in place.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> I wrote:
> > I went looking and realized that actually what we're interested in here
> > is the plpgsql extension, not the plpgsql language ... and in fact the
> > behavior I was thinking of is already there, except for some reason it's
> > only applied during binary upgrade.  So I think we should give serious
> > consideration to the attached, which just removes the binary_upgrade
> > condition (and adjusts nearby comments).
> 
> In further testing of that, I noticed that it made the behavior of our
> other bugaboo, the public schema, rather inconsistent.  With this
> builtin-extensions hack, the plpgsql extension doesn't get dumped,
> whether or not you say "clean".  But the public schema does get
> dropped and recreated if you say "clean".  That's not helpful for
> non-superuser users of pg_dump, so I think we should try to fix it.

I'm not entirely sure about trying to also support --clean for
non-superusers.  We've long had the behavior that the public schema is
dropped and recreated with --clean, and it seems likely that at least
some users are depending on us doing that.  In any case, it's certainly
not a change that I think we could backpatch.  Perhaps we could just
change it moving forward (which would make me happier, really, since what
I think we do with these initdb-time things currently is a bit bizarre).

> Hence, the attached updated patch also makes selection of the public
> schema use the DUMP_COMPONENT infrastructure rather than hardwired
> code.
> 
> I note btw that the removed bit in getNamespaces() is flat out wrong
> independently of these other considerations, because it makes the SQL
> put into an archive file vary depending on whether --clean was specified
> at pg_dump time.  That's not supposed to happen.

Yes, having that in getNamespaces() isn't correct but we need to do
something there, and I've been trying to figure out what.  The reason
it's there in the first place is that if you do --clean then the public
schema is dropped and recreated *but* the initdb-time ACL isn't put back
for it (and there are no ACL entries in the dump file, be it text or
custom, if the user didn't change the ACL from the initdb-time one).

What I was planning to do there was somehow inject the initdb-time ACL
for public into the set of things to restore when we're asked to do a
restore from a custom-format (or directory or other type which is
handled by pg_restore) dump.  We also have to account for someone asking
for pg_dump --clean --no-privileges.  And we would still need to do
something for text-based output.
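
For reference, the initdb-time state we're talking about is roughly the
following (just a sketch - 'postgres' stands in for whatever the bootstrap
superuser happens to be named):

    CREATE SCHEMA public;
    ALTER SCHEMA public OWNER TO postgres;
    COMMENT ON SCHEMA public IS 'standard public schema';
    GRANT USAGE, CREATE ON SCHEMA public TO PUBLIC;

That GRANT to PUBLIC is what gets lost when --clean drops and recreates the
schema and the dump contains no ACL entries for it.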

Note that this fix really needs to be back-patched and that we had
complaints about the --clean issue with the public schema in text-based
dumps (which is why the hack was added to getNamespaces() in the first
place) and, more recently, in non-text-based dumps.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: copy.c allocation constant

2018-01-24 Thread Bruce Momjian
On Wed, Jan 24, 2018 at 06:25:01PM +0100, Tomas Vondra wrote:
> > 
> 
> I think there are two things to consider - aset.c and glibc.
> 
> In AllocSet this is handled as an over-sized chunk, that is, a chunk
> greater than ALLOC_CHUNK_LIMIT (which ends up being 8kB). That means it's
> allocated as a separate block, and does not go through the freelists. So
> the size does not follow the usual "power of 2" pattern, is only
> maxalign-ed, and the block is freed immediately on pfree.
> 
> So for AllocSet we're fine, and this should not result in any memory
> inefficiency, assuming there are not many such requests (which would
> result in many malloc/free calls, which may be somewhat expensive).
> 
> At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
> with similar ideas (freelists, ...) so hopefully it's fine too.
> 
> And then there are the systems without glibc, or with other libc
> implementations. No idea about those.

Thanks for the analysis.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: copy.c allocation constant

2018-01-24 Thread Tomas Vondra


On 01/24/2018 04:14 AM, Bruce Momjian wrote:
> On Tue, Nov 28, 2017 at 11:51:28AM -0500, Andrew Dunstan wrote:
>>
>>
>> While reading copy.c I noticed this line:
>>
>>
>> #define RAW_BUF_SIZE 65536        /* we palloc RAW_BUF_SIZE+1 bytes */
>>
>>
>> Doesn't that seem rather odd? If we're adding 1 wouldn't it be better as
>> 65535 so we palloc a power of 2?
>>
>>
>> I have no idea if this affects performance, but it did strike me
>> as strange.
> 
> Coming in late here, but it does seem very odd.
> 

I think there are two things to consider - aset.c and glibc.

In AllocSet this is handled as an over-sized chunk, that is, a chunk greater
than ALLOC_CHUNK_LIMIT (which ends up being 8kB). That means it's
allocated as a separate block, and does not go through the freelists. So
the size does not follow the usual "power of 2" pattern, is only
maxalign-ed, and the block is freed immediately on pfree.

So for AllocSet we're fine, and this should not result in any memory
inefficiency, assuming there are not many such requests (which would
result in many malloc/free calls, which may be somewhat expensive).

At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
with similar ideas (freelists, ...) so hopefully it's fine too.

And then there are the systems without glibc, or with other libc
implementations. No idea about those.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re:Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

2018-01-24 Thread chenhj

At 2018-01-23 09:56:48, "Stephen Frost"  wrote:

>
>I've only read through the thread to try and understand what's going on
>and the first thing that comes to mind is that you're changing
>pg_rewind to not remove the WAL from before the divergence (split)
>point, but I'm not sure why.  As noted, that WAL isn't needed for
>anything (it's from before the split, after all), so why keep it?  Is
>there something in this optimization that depends on the old WAL being
>there and, if so, what and why?



After running pg_rewind, the first startup of postgres will do crash recovery.
Crash recovery will begin from the last redo point preceding the divergence.
So the WAL after that redo point and before the divergence is needed.

Of course, the WAL before the redo point is not needed, but from my point of
view, recycling unwanted WAL does not have to be done by pg_rewind.

reference code:

pg_rewind.c:
main(int argc, char **argv)
{
...
findLastCheckpoint(datadir_target, divergerec,
   lastcommontliIndex,
   &chkptrec, &chkpttli, &chkptredo);
...
createBackupLabel(chkptredo, chkpttli, chkptrec);
...
}
...
createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, XLogRecPtr 
checkpointloc)
{
...
len = snprintf(buf, sizeof(buf),
   "START WAL LOCATION: %X/%X (file %s)\n"
   "CHECKPOINT LOCATION: %X/%X\n"
   "BACKUP METHOD: pg_rewind\n"
   "BACKUP FROM: standby\n"
   "START TIME: %s\n",
/* omit LABEL: line */
   (uint32) (startpoint >> 32), (uint32) 
startpoint, xlogfilename,
   (uint32) (checkpointloc >> 32), (uint32) 
checkpointloc,
   strfbuf);
...
}




>That's also different from how pg_basebackup works, which I don't think
>is good (seems like pg_rewind should operate in a pretty similar manner
>to pg_basebackup).
>

Thanks for your comments!
I also considered copying WAL just like pg_basebackup does, but an
implementation similar to pg_basebackup's manner may not be so simple.
Two transfers of files from source to target may be needed: first for data
files, and another for WAL.
And the WAL which contains the last redo point preceding the divergence may
exist only on the target server and may have already been recycled on the
source. That's a difference between pg_rewind and pg_basebackup.


Regards,
Chen Huajun







Re: [PROPOSAL] Shared Ispell dictionaries

2018-01-24 Thread Arthur Zakirov
On Sat, Jan 13, 2018 at 06:22:41PM +0300, Arthur Zakirov wrote:
> I think your proposals may be implemented in several patches, so they can
> be applied independently but consistently. I suppose I will prepare a new
> version of the patch with fixes and with an initial design of the new
> functions and commands soon.

I attached a new version of the patch.

0001-Fix-ispell-memory-handling-v3.patch:

> allocate memory in the buildCxt. What about adding tmpstrdup to copy a
> string into the context? I admit this is mostly nitpicking though.

Fixed. Added tmpstrdup.

0002-Retreive-shmem-location-for-ispell-v3.patch:

dshash.c is now used instead of dynahash.c. The hash table is created during the
first call of a text search function in an instance. The hash table uses the OID
of a dictionary instead of file names, so there is no cross-db sharing at all.
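
The key is simply the dictionary's regular catalog OID, i.e. what you get
from a query like:

    SELECT oid, dictname FROM pg_ts_dict WHERE dictname = 'english_stem';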

Added a max_shared_dictionaries_size GUC instead of shared_dictionaries. In the
current version it can be set only at server start. If a dictionary is
allocated in a backend's memory instead of shared memory, then a LOG message is
raised which includes the OID of the dictionary.
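
A minimal sketch of setting it (the GUC name comes from this patch; since it
is start-time only, a new value takes effect only after a restart):

    ALTER SYSTEM SET max_shared_dictionaries_size = '100MB';
    -- or set it in postgresql.conf, then restart the server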

Fixed a memory leak. When a dictionary is removed or the dictionary cache is
invalidated, ts_dict_shmem_release() is called. It unpins the mapping of a
dictionary; if the reference count reaches zero then the DSM segment is
unpinned, so the allocated shared memory will be released by Postgres.

0003-Store-ispell-structures-in-shmem-v3.patch:

Added documentation fixes. dispell_init() (tmplinit too) now has a second
argument, dictid.

0004-Update-tmplinit-arguments-v3.patch:

It is necessary to fix all dictionaries, including contrib extensions, because
of the second argument for tmplinit.

tmplinit has the following signature now:

dict_init(internal, internal)
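
As a sketch of what that means for a template declared from SQL (e.g. in a
contrib extension script), the init function is now declared with two
internal arguments - all the names below are hypothetical:

    CREATE FUNCTION my_dict_init(internal, internal)
        RETURNS internal
        AS 'MODULE_PATHNAME'
        LANGUAGE C STRICT;

    CREATE FUNCTION my_dict_lexize(internal, internal, internal, internal)
        RETURNS internal
        AS 'MODULE_PATHNAME'
        LANGUAGE C STRICT;

    CREATE TEXT SEARCH TEMPLATE my_template (
        INIT = my_dict_init,
        LEXIZE = my_dict_lexize
    );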

0005-pg-ts-shared-dictinaries-view-v3.patch:

Added a pg_ts_shared_dictionaries() function and a pg_ts_shared_dictionaries
system view. They return a list of dictionaries currently in shared memory,
with the columns:
- dictoid
- schemaname
- dictname
- size
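
A quick way to inspect the view would be something like this (a sketch,
assuming the size column is reported in bytes):

    SELECT schemaname, dictname, pg_size_pretty(size) AS size
    FROM pg_ts_shared_dictionaries
    ORDER BY size DESC;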

0006-Shared-memory-ispell-option-v3.patch:

Added a SharedMemory option for the Ispell dictionary template. It is true by
default, because I think it would be good that people don't have to do
anything to get dictionaries allocated in shared memory.

Setting SharedMemory=false via ALTER TEXT SEARCH DICTIONARY doesn't have an
immediate effect. That is because ALTER doesn't force invalidation of the
dictionary cache, if I'm not mistaken.
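
For illustration, with the proposed option a dictionary can be kept out of
shared memory roughly like this (dictionary and file names are placeholders):

    CREATE TEXT SEARCH DICTIONARY my_hunspell (
        TEMPLATE = ispell,
        DictFile = my_lang,
        AffFile = my_lang,
        SharedMemory = false
    );

    -- or later, subject to the delayed effect mentioned above:
    ALTER TEXT SEARCH DICTIONARY my_hunspell (SharedMemory = false);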

> 3) How do I unload a dictionary from the shared memory?
> ...
> ALTER TEXT SEARCH DICTIONARY x UNLOAD
>
> 4) How do I reload a dictionary?
> ...
> ALTER TEXT SEARCH DICTIONARY x RELOAD

I thought about it. And it seems to me that we can use functions ts_unload()
and ts_reload() instead of new syntax. We already have text search functions
like ts_lexize() and ts_debug(), and it is better to keep consistency. I think
there are two approaches for ts_unload():
- use DSM's pin and unpin methods and the invalidation callback, as was done
for the memory leak fix. It has the drawback that it won't have an immediate
effect, because the DSM will be released only when all backends unpin the DSM
mapping.
- use DSA and the dsa_free() method. As far as I understand, dsa_free() frees
allocated memory immediately. But it requires more work, because we will need
some more locks. For instance, what happens when someone calls ts_lexize()
and someone else calls dsa_free() at the same time?

> 7) You mentioned you had to get rid of the compact_palloc0 - can you
> elaborate a bit why that was necessary? Also, when benchmarking the
> impact of this make sure to measure not only the time but also memory
> consumption.

It seems to me that there is no need for compact_palloc0() anymore. Tests show
that the czech dictionary doesn't consume more memory after the patch.

Tests
-

I've measured the creation time of dictionaries on my 64-bit machine. You can
get them from [1]. Here master is 434e6e1484418c55561914600de9e180fc408378.
I've measured the french dictionary too because it has an even bigger affix
file than the czech dictionary.

With patch:
czech_hunspell - 247 ms
english_hunspell - 59 ms
french_hunspell - 103 ms

Master:
czech_hunspell - 224 ms
english_hunspell - 52 ms
french_hunspell - 101 ms

Memory:

With patch (shared memory size + backend's memory):
czech_hunspell - 9573049 + 192584 total in 5 blocks; 1896 free (11 chunks); 
190688 used
english_hunspell - 1985299 + 21064 total in 6 blocks; 7736 free (13 chunks); 
13328 used
french_hunspell - 4763456 + 626960 total in 7 blocks; 7680 free (14 chunks); 
619280 used

Here the french dictionary uses more backend memory because it has a big affix
file. Regular expression structures are still stored in backend memory.

Master (backend's memory):
czech_hunspell - 17181544 total in 2034 blocks; 3584 free (10 chunks); 17177960 
used
english_hunspell - 4160120 total in 506 blocks; 2792 free (10 chunks); 4157328 
used
french_hunspell - 11439184 total in 1187 blocks; 18832 free (171 

Re: [HACKERS] Patch: Add --no-comments to skip COMMENTs with pg_dump

2018-01-24 Thread Tom Lane
I wrote:
> I went looking and realized that actually what we're interested in here
> is the plpgsql extension, not the plpgsql language ... and in fact the
> behavior I was thinking of is already there, except for some reason it's
> only applied during binary upgrade.  So I think we should give serious
> consideration to the attached, which just removes the binary_upgrade
> condition (and adjusts nearby comments).

In further testing of that, I noticed that it made the behavior of our
other bugaboo, the public schema, rather inconsistent.  With this
builtin-extensions hack, the plpgsql extension doesn't get dumped,
whether or not you say "clean".  But the public schema does get
dropped and recreated if you say "clean".  That's not helpful for
non-superuser users of pg_dump, so I think we should try to fix it.

Hence, the attached updated patch also makes selection of the public
schema use the DUMP_COMPONENT infrastructure rather than hardwired
code.

I note btw that the removed bit in getNamespaces() is flat out wrong
independently of these other considerations, because it makes the SQL
put into an archive file vary depending on whether --clean was specified
at pg_dump time.  That's not supposed to happen.

I've also updated the regression test this time, as I think this
is committable.

regards, tom lane

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index f55aa36..fb03793 100644
*** a/src/bin/pg_dump/pg_backup_archiver.c
--- b/src/bin/pg_dump/pg_backup_archiver.c
*** _printTocEntry(ArchiveHandle *AH, TocEnt
*** 3453,3481 
  {
  	RestoreOptions *ropt = AH->public.ropt;
  
- 	/*
- 	 * Avoid dumping the public schema, as it will already be created ...
- 	 * unless we are using --clean mode (and *not* --create mode), in which
- 	 * case we've previously issued a DROP for it so we'd better recreate it.
- 	 *
- 	 * Likewise for its comment, if any.  (We could try issuing the COMMENT
- 	 * command anyway; but it'd fail if the restore is done as non-super-user,
- 	 * so let's not.)
- 	 *
- 	 * XXX it looks pretty ugly to hard-wire the public schema like this, but
- 	 * it sits in a sort of no-mans-land between being a system object and a
- 	 * user object, so it really is special in a way.
- 	 */
- 	if (!(ropt->dropSchema && !ropt->createDB))
- 	{
- 		if (strcmp(te->desc, "SCHEMA") == 0 &&
- 			strcmp(te->tag, "public") == 0)
- 			return;
- 		if (strcmp(te->desc, "COMMENT") == 0 &&
- 			strcmp(te->tag, "SCHEMA public") == 0)
- 			return;
- 	}
- 
  	/* Select owner, schema, and tablespace as necessary */
  	_becomeOwner(AH, te);
  	_selectOutputSchema(AH, te->namespace);
--- 3453,3458 
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d65ea54..50399c9 100644
*** a/src/bin/pg_dump/pg_dump.c
--- b/src/bin/pg_dump/pg_dump.c
*** selectDumpableNamespace(NamespaceInfo *n
*** 1407,1412 
--- 1407,1425 
  		/* Other system schemas don't get dumped */
  		nsinfo->dobj.dump_contains = nsinfo->dobj.dump = DUMP_COMPONENT_NONE;
  	}
+ 	else if (strcmp(nsinfo->dobj.name, "public") == 0)
+ 	{
+ 		/*
+ 		 * The public schema is a strange beast that sits in a sort of
+ 		 * no-mans-land between being a system object and a user object.  We
+ 		 * don't want to dump creation or comment commands for it, because
+ 		 * that complicates matters for non-superuser use of pg_dump.  But we
+ 		 * should dump any ACL changes that have occurred for it, and of
+ 		 * course we should dump contained objects.
+ 		 */
+ 		nsinfo->dobj.dump = DUMP_COMPONENT_ACL;
+ 		nsinfo->dobj.dump_contains = DUMP_COMPONENT_ALL;
+ 	}
  	else
  		nsinfo->dobj.dump_contains = nsinfo->dobj.dump = DUMP_COMPONENT_ALL;
  
*** selectDumpableAccessMethod(AccessMethodI
*** 1617,1637 
   * selectDumpableExtension: policy-setting subroutine
   *		Mark an extension as to be dumped or not
   *
!  * Normally, we dump all extensions, or none of them if include_everything
!  * is false (i.e., a --schema or --table switch was given).  However, in
!  * binary-upgrade mode it's necessary to skip built-in extensions, since we
   * assume those will already be installed in the target database.  We identify
   * such extensions by their having OIDs in the range reserved for initdb.
   */
  static void
  selectDumpableExtension(ExtensionInfo *extinfo, DumpOptions *dopt)
  {
  	/*
! 	 * Use DUMP_COMPONENT_ACL for from-initdb extensions, to allow users to
! 	 * change permissions on those objects, if they wish to, and have those
! 	 * changes preserved.
  	 */
! 	if (dopt->binary_upgrade && extinfo->dobj.catId.oid <= (Oid) g_last_builtin_oid)
  		extinfo->dobj.dump = extinfo->dobj.dump_contains = DUMP_COMPONENT_ACL;
  	else
  		extinfo->dobj.dump = extinfo->dobj.dump_contains =
--- 1630,1650 
   * selectDumpableExtension: policy-setting subroutine
   *		Mark an extension as to be dumped or not
   *
!  * Built-in extensions should be 

Documentation update

2018-01-24 Thread John Scalia
Not sure if this is the correct place to report this, but a colleague and I
discovered a problem with the systemd service file as described in the
documentation.  In the sample, there is a need for a line reading

"After=syslogd.target network.target"

under the [Unit] section. What we found on Red Hat 7.4 was that replica servers
were coming up unable to find their primary servers because DNS wasn't quite
available yet, although the servers were otherwise operational. In any case, we
don't think this addition would break anyone's configuration, and the
current documentation does not reflect this.
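
For illustration, the relevant fragment of the sample unit file with the
proposed line added would look roughly like this (only the [Unit] section is
shown; the Description line is whatever the documented sample already uses):

    [Unit]
    Description=PostgreSQL database server
    After=syslogd.target network.target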

--
Jay

