Re: AS OF queries

2017-12-27 Thread Konstantin Knizhnik



On 27.12.2017 10:29, Craig Ringer wrote:
On 25 December 2017 at 15:59, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:




On 25.12.2017 06:26, Craig Ringer wrote:

On 24 December 2017 at 04:53, konstantin knizhnik
<k.knizh...@postgrespro.ru> wrote:



But what if I just forbid changing recent_global_xmin?
If it is stalled at FirstNormalTransactionId and never changes,
will that protect all versions from being deleted?


That's totally impractical, you'd have unbounded bloat and a
nonfunctional system in no time.

You'd need a mechanism - akin to what we have with replication
slots - to set a threshold for age.


Well, there are systems with "never delete" and "append only"
semantics.
For example, I participated in the SciDB project: a database for
scientific applications.
One of the key requirements for scientific research is
reproducibility.
From the database point of view this means that we need to store all
raw data and never delete it.


PostgreSQL can't cope with that for more than 2^31 xacts, you have to 
"forget" details of which xacts created/updated tuples and the 
contents of deleted tuples, or you exceed our xid limit. You'd need 
64-bit XIDs, or a redo-buffer based heap model (like the zheap stuff) 
with redo buffers marked with an xid epoch, or something like that.


Yes, but PgPro-EE already has 64-bit xids, and we have spent a lot of 
time trying to push them to the community.
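
For illustration only (a generic sketch, not the PgPro-EE implementation), the
usual idea is to qualify the 32-bit xid with a wraparound epoch, so that
ordering becomes a plain integer comparison and no wraparound handling is needed:

#include <stdint.h>

/* Illustrative sketch only, not the PgPro-EE code. */
typedef uint32_t TransactionId32;
typedef uint64_t TransactionId64;

/* Combine the wraparound epoch with a 32-bit xid into one monotonic value. */
static inline TransactionId64
xid64_make(uint32_t epoch, TransactionId32 xid)
{
    return ((uint64_t) epoch << 32) | xid;
}

/* With 64-bit xids, "precedes" is a plain comparison, with no wraparound logic. */
static inline int
xid64_precedes(TransactionId64 a, TransactionId64 b)
{
    return a < b;
}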



I am not sure that it should be similar to a logical replication slot.

Here the semantics are quite clear: we preserve WAL segments until
they are replicated to the subscribers.


Er, what?

This isn't to do with restart_lsn. That's why I mentioned *logical* 
replication slots.


I'm talking about how they interact with GetOldestXmin using their 
xmin and catalog_xmin.


You probably won't want to re-use slots, but you'll want something 
akin to that, a transaction age threshold. Otherwise your system has a 
finite end date where it can no longer function due to xid count, or 
if you solve that, it'll slowly choke on table bloat etc. I guess if 
you're willing to accept truly horrible performance...


Supporting time travel through frequently updated data may certainly 
cause database bloat and awful performance.
I still think that this feature will mostly be interesting for 
append-only or rarely updated data.


In any case, I have set vacuum_defer_cleanup_age = 100 and ran 
pgbench for some time.

There was no significant performance degradation.

Unfortunately, neither replication slots nor vacuum_defer_cleanup_age 
allows keeping versions just for particular table(s).

And that seems to be the major problem I do not know how to solve now.


With time travel the situation is less obscure: we may specify some
threshold for age - for example, keep data for one year.


Sure. You'd likely do that by mapping commit timestamps => xids and 
using an xid threshold though.
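
For instance (a sketch only, relying on the commit timestamp lookup that
track_commit_timestamp enables), an age threshold could be applied by comparing a
candidate xid's commit time against a cutoff:

#include "postgres.h"
#include "access/commit_ts.h"
#include "access/transam.h"
#include "utils/timestamp.h"

/*
 * Sketch only: decide whether row versions created by 'xid' are older than
 * the retention cutoff and may therefore be reclaimed.
 */
static bool
xid_is_beyond_retention(TransactionId xid, TimestampTz cutoff)
{
	TimestampTz committs;

	if (!TransactionIdIsNormal(xid))
		return false;

	/* Commit time unknown (e.g. before tracking was enabled): keep the data. */
	if (!TransactionIdGetCommitTsData(xid, &committs, NULL))
		return false;

	return committs < cutoff;
}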


But unfortunately the trick with the snapshot (no matter how we set up
the oldest xmin horizon) affects all tables.


You'd need to be able to pass more info into HeapTupleSatisfiesMVCC 
etc. I expect you'd probably add a new snapshot type (like logical 
decoding did with historic snapshots), that has a new Satisfies 
function. But you'd have to be able to ensure all snapshot Satisfies 
callers had the required extra info - like maybe a Relation - which 
could be awkward for some call sites.
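
For concreteness, a rough sketch of what such a Satisfies function might look
like (an illustration only, assuming the prototype's asofTimestamp snapshot
field; handling of xmax/deleted versions is omitted):

#include "postgres.h"
#include "access/commit_ts.h"
#include "access/htup_details.h"
#include "access/transam.h"
#include "utils/snapshot.h"
#include "utils/timestamp.h"
#include "utils/tqual.h"

/*
 * Sketch only: an AS OF-aware visibility check layered on top of the normal
 * MVCC one.  Requires track_commit_timestamp = on.
 */
static bool
HeapTupleSatisfiesAsOf(HeapTuple htup, Snapshot snapshot, Buffer buffer)
{
	TransactionId xmin = HeapTupleHeaderGetXmin(htup->t_data);
	TimestampTz committs;

	/* First apply the ordinary MVCC rules. */
	if (!HeapTupleSatisfiesMVCC(htup, snapshot, buffer))
		return false;

	/* No AS OF timestamp requested: behave exactly like MVCC. */
	if (snapshot->asofTimestamp == 0)
		return true;

	/* Hide rows whose inserting transaction committed after the AS OF time. */
	if (TransactionIdIsNormal(xmin) &&
		TransactionIdGetCommitTsData(xmin, &committs, NULL) &&
		committs > snapshot->asofTimestamp)
		return false;

	return true;
}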




Yes, it seems to be the only possible choice.

The user would have to be responsible for ensuring sanity of FK 
relationships etc when specifying different snapshots for different 
relations.


Per-relation time travel doesn't seem totally impractical so long as 
you can guarantee that there is some possible snapshot for which the 
catalogs defining all the relations and types are simultaneously 
valid, i.e. there's no disjoint set of catalog changes. Avoiding messy 
performance implications with normal queries might not even be too bad 
if you use a separate snapshot model, so long as you can avoid 
callsites having to do extra work in the normal case.


Dealing with dropped columns and rewrites would be a pain though. 
You'd have to preserve the dropped column data when you re-projected 
the rewrite tuples.


There is similar (but not the same) problem with logical
replication: assume that we need to replicate only one small
table. But we have to pin in WAL all updates of other huge table
which is not involved in logical replication at all.


I don't really see how that's similar. It's concerned with WAL, whereas 
what you're looking at is heaps and bloat from old versions. 
Completely different, unless you propose to somehow reconstruct data 
from old WAL to do historic queries, which would be o_O ...


Well, I am really not sure about users' demands for time travel.
This is one of the reasons for initiating this discussion.

Re: Postgres with pthread

2017-12-27 Thread Konstantin Knizhnik



On 21.12.2017 16:25, Konstantin Knizhnik wrote:

I continue experiments with my pthread prototype.
Latest results are the following:

1. I have eliminated all (I hope) calls of non-reentrant functions 
(getopt, setlocale, setitimer, localtime, ...). So now the parallel tests 
pass.


2. I have implemented deallocation of the top memory context (at thread 
exit) and cleanup of all opened file descriptors.
I had to replace several places where malloc is used with top_malloc: 
allocation in the top context (a minimal sketch of such a wrapper appears 
after this list).


3. My prototype now passes all regression tests, but handling 
of errors is still far from complete.


4. I have performed experiments with replacing the synchronization 
primitives used in Postgres with pthread analogues.

Unfortunately this has almost no influence on performance.

5. Handling a large number of connections.
The maximal number of Postgres connections is almost the same: 100k.
But the memory footprint in the pthread case was significantly smaller: 
18GB vs 38GB.

And the difference in performance was much larger: 60k TPS vs. 600k TPS.
Compare this with the performance for 10k clients: 1300k TPS.
This is a read-only pgbench -S test with 1000 connections per pgbench 
instance. Since pgbench doesn't allow specifying more than 1000 clients, I 
spawned several instances of pgbench.
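
As mentioned in item 2 above, a minimal sketch of the kind of wrapper meant by 
top_malloc (illustrative; the prototype's actual code may differ):

#include "postgres.h"
#include "utils/memutils.h"

/*
 * Illustrative sketch of a "top_malloc": allocate in TopMemoryContext so the
 * memory survives per-query contexts but is still released when the thread's
 * top context is deleted at thread exit.  Unlike malloc, this ereports on
 * out-of-memory instead of returning NULL.
 */
static void *
top_malloc(Size size)
{
	return MemoryContextAlloc(TopMemoryContext, size);
}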


Why is handling a large number of connections important?
It allows applications to access Postgres directly, without 
pgbouncer or any other external connection pooling tool.
In this case an application can use prepared statements, which can 
almost double the speed of simple queries.
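
To make the prepared-statement point concrete, a minimal client-side libpq 
sketch (the connection string and query are placeholders in the spirit of a 
pgbench -S workload):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");	/* placeholder DSN */
	PGresult   *res;
	const char *params[1] = {"42"};

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/* Prepare once: parsing and planning are paid only here. */
	res = PQprepare(conn, "get_acc",
					"SELECT abalance FROM pgbench_accounts WHERE aid = $1",
					1, NULL);
	PQclear(res);

	/* Execute many times: only the bind/execute cost is paid per call. */
	for (int i = 0; i < 3; i++)
	{
		res = PQexecPrepared(conn, "get_acc", 1, params, NULL, NULL, 0);
		if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0)
			printf("abalance = %s\n", PQgetvalue(res, 0, 0));
		PQclear(res);
	}

	PQfinish(conn);
	return 0;
}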


Unfortunately Postgres sessions are not lightweight. Each backend 
maintains its private catalog and relation caches, prepared statement 
cache, ...
For a real database the size of these caches in memory will be several 
megabytes, and warming these caches can take a significant amount of time.
So if we really want to support a large number of connections, we should 
rewrite the caches to be global (shared).
That would save a lot of memory but add synchronization 
overhead. Also, on NUMA systems private caches may be more efficient than one 
global cache.
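
To illustrate the trade-off, a toy sketch (plain pthreads, not Postgres code) of 
a shared, partitioned lookup table where the synchronization cost is confined to 
per-partition locks:

#include <pthread.h>
#include <stdint.h>

/*
 * Toy sketch of a shared, partitioned cache: one rwlock per partition, so
 * concurrent lookups mostly take shared locks on different partitions rather
 * than contending on a single global lock.
 */
#define NPARTITIONS 64
#define NBUCKETS    1024

typedef struct CacheEntry
{
	uint32_t	key;
	void	   *value;
	struct CacheEntry *next;
} CacheEntry;

typedef struct CachePartition
{
	pthread_rwlock_t lock;
	CacheEntry *buckets[NBUCKETS];
} CachePartition;

static CachePartition partitions[NPARTITIONS];

static void
cache_init(void)
{
	for (int i = 0; i < NPARTITIONS; i++)
		pthread_rwlock_init(&partitions[i].lock, NULL);
}

static void *
cache_lookup(uint32_t key)
{
	CachePartition *p = &partitions[key % NPARTITIONS];
	void	   *result = NULL;

	pthread_rwlock_rdlock(&p->lock);	/* shared lock: readers do not block each other */
	for (CacheEntry *e = p->buckets[key % NBUCKETS]; e != NULL; e = e->next)
	{
		if (e->key == key)
		{
			result = e->value;
			break;
		}
	}
	pthread_rwlock_unlock(&p->lock);
	return result;
}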


My prototype can be found at: 
git://github.com/postgrespro/postgresql.pthreads.git




Finally I managed to run Postgres with 100k active connections.
Not sure that this result qualifies for the Guinness book of records, but I am 
almost sure that nobody has done it before (at least with the original version 
of Postgres).
But it was really a "Pyrrhic victory". Performance for 100k connections is 
1000 times slower than for 10k. All threads are blocked on semaphores.
This is a more or less expected result, but the scale of degradation is still 
impressive:



#Connections   TPS
100k           550
10k            558k
6k             745k
4k             882k
2k             1100k
1k             1300k



As is clear from these stack traces, a shared catalog cache and statement cache 
are badly needed to provide good performance with such a large number of 
active backends:



(gdb) thread apply all bt

Thread 17807 (LWP 660863):
#0  0x7f4c1cb46576 in do_futex_wait.constprop () from 
/lib64/libpthread.so.0
#1  0x7f4c1cb46668 in __new_sem_wait_slow.constprop.0 () from 
/lib64/libpthread.so.0

#2  0x00697a32 in PGSemaphoreLock ()
#3  0x00702a64 in LWLockAcquire ()
#4  0x006fbf2d in LockAcquireExtended ()
#5  0x006f9fa3 in LockRelationOid ()
#6  0x004b2ffd in relation_open ()
#7  0x004b31d6 in heap_open ()
#8  0x007f1ed1 in CatalogCacheInitializeCache ()
#9  0x007f3835 in SearchCatCache1 ()
#10 0x00800510 in get_tablespace ()
#11 0x008006e1 in get_tablespace_page_costs ()
#12 0x0065a4e1 in cost_seqscan ()
#13 0x0068bf92 in create_seqscan_path ()
#14 0x006568b4 in set_rel_pathlist ()
#15 0x00656eb8 in make_one_rel ()
#16 0x006740d0 in query_planner ()
#17 0x00676526 in grouping_planner ()
#18 0x00679812 in subquery_planner ()
#19 0x0067a66c in standard_planner ()
#20 0x0070ffe1 in pg_plan_query ()
#21 0x007100b6 in pg_plan_queries ()
#22 0x007f6c6f in BuildCachedPlan ()
#23 0x007f6e5c in GetCachedPlan ()
#24 0x00711ccf in PostgresMain ()
#25 0x006a5535 in backend_main_proc ()
#26 0x006a353d in thread_trampoline ()
#27 0x7f4c1cb3d36d in start_thread () from /lib64/libpthread.so.0
#28 0x7f4c1c153b8f in clone () from /lib64/libc.so.6

Thread 17806 (LWP 660861):
#0  0x7f4c1cb46576 in do_futex_wait.constprop () from 
/lib64/libpthread.so.0
#1  0x7f4c1cb46668 in __new_sem_wait_slow.constprop.0 () from 
/lib64/libpthread.so.0

#2  0x00697a32 in PGSemaphoreLock ()
#3  0x00702a64 in LWLockAcquire ()
#4  0x006fbf2d in LockAcquireExtended ()
#5  0x006f9fa3 in LockRelationOid ()
#6  0x004b2ffd in relation_open ()
#7  0x004b31d6 in heap_open ()
#8  0x007f1ed1 in CatalogCacheInitializeCache ()
#9  0x000

Re: [HACKERS] [PATCH] Tap test support for backup with tablespace mapping

2017-12-27 Thread Tels
Dear Vaishnavi,

On Tue, December 26, 2017 8:58 pm, Vaishnavi Prabakaran wrote:
> Hi All,
>
> I have added support in Postgres TAP test framework to backup a data
> directory with tablespace mapping. Also added support to move the backup
> directory contents to standby node, because current option to init the
> standby from backup does not support copying softlinks, which is needed
> when tablespace mapping is involved in backup.
>
> Added a new test to existing streaming replication tap test to demonstrate
> the usage of these new APIs.
>
> Attached the patch, Hope this enhancement is useful.
>
> Thanks & Regards,
> Vaishnavi,
> Fujitsu Australia.
>

Thank you for the patch. I saw these things:

* backup_withtablespace() does not have any documentation?

* The mkdir calls do not set a mask for the created dir, defaulting to
0777 - is this what is wanted here?

* none of the mkdir, chdir etc. calls check any error code, e.g. what
happens if one of them fails?

* different indentation between rmdir and move (tab vs. spaces):

+rmdir($data_path);
+   move("$backup_path", "$self->{_basedir}/pgdata")

Best regards,

Tels



Re: [PROPOSAL] Shared Ispell dictionaries

2017-12-27 Thread Arthur Zakirov
On Tue, Dec 26, 2017 at 07:03:48PM +0100, Pavel Stehule wrote:
> 
> Tomas had some workable patches related to this topic
> 

Tomas, are you planning to propose them?

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: Tracking of page changes for backup purposes. PTRACK [POC]

2017-12-27 Thread Andrey Borodin
Hi!

> On 21 Dec 2017, at 5:51, Michael Paquier  
> wrote:
> 
> On Thu, Dec 21, 2017 at 7:35 AM, Robert Haas  wrote:
>> On Wed, Dec 20, 2017 at 3:45 PM, Tomas Vondra
>>  wrote:
 Isn't it more effective to hold this info in Postgres than in backup software?
 Then any backup software can use this implementation.
>> [Skipped]
>> I agree with all of that.
> 
> +1. This summarizes a bunch of concerns about all kinds of backend
> implementations proposed. Scanning for a list of blocks modified via
> streaming gives more availability, but knowing that you will need to
> switch to a new segment anyway when finishing a backup, does it really
> matter? Doing it once a segment has finished would be cheap enough,
> and you can even do it in parallel with a range of segments.
> 
> Also, since 9.4 and the introduction of the new WAL API to track
> modified blocks, you don't need to know about the record types to know
> which blocks are being changed. Here is an example of tool I hacked up
> in a couple of hours that does actually what you are looking for, aka
> a scanner of the blocks modified per record for a given WAL segment
> using xlogreader.c:
> https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks
> 
> You could just use that and shape the data in the way you want and you
> would be good to go.

Michael, that's almost what I want. I've even filed a GSoC proposal for this [0].
But can we have something like this in Postgres itself?
The tool I'm hacking on is in Go; I cannot just embed a bunch of Postgres C into it. 
That is why an API, like PTRACK, suits my needs better. Not because it uses any 
superior mechanics, but because it is an API, ready for external 3rd-party backup 
software (having backup software in Pg would be even better).


Anastasia, I've implemented PTRACK support in WAL-G.
First, there are a few minor issues with the patch:
1. There is a malformed comment.
2. The function pg_ptrack_version() is absent.

Then, I think that the API is far from perfect: pg_ptrack_get_and_clear() changes 
the global ptrack_clear_lsn, which introduces some weakness (for the paranoid). 
Maybe use something like "pg_ptrack_get_and_clear(oid, oid, previous_lsn)", which 
would fail if previous_lsn does not match? Also, pg_ptrack_get_and_clear() does 
not return errors when there is no table with the given oid. Finally, I had to 
interpret any empty map as the absence of a map. From my POV, the function must 
fail on errors like: invalid oid passed, no table found, no PTRACK map exists, etc.
I use an external file-tracking mechanism, so the function 
pg_ptrack_init_get_and_clear() was of no use to me.

Last, but most important for me: my tests showed lost page updates. Probably 
it is a bug, or paranoia, in my test software. But may I ask you to check this [1] 
code, which converts a PTRACK map into a list of block numbers? Do I get the 
meaning of the PTRACK map right? Thank you very much.

[0] 
https://wiki.postgresql.org/index.php?title=GSoC_2018#WAL-G_delta_backups_with_WAL_scanning_.282018.29
[1] https://github.com/wal-g/wal-g/blob/ptrack/pagefile.go#L167-L173
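
For reference, here is a C rendering of the interpretation my conversion code
assumes (one bit per block, least-significant bit first, a set bit meaning the
block changed since the map was last cleared); please correct me if this is wrong:

#include <stddef.h>
#include <stdint.h>

/*
 * Assumed interpretation of a PTRACK map: a bitmap with one bit per block
 * (least-significant bit first), where a set bit marks a block changed since
 * the map was last cleared.  Returns how many block numbers were written.
 */
static size_t
ptrack_map_to_blocks(const uint8_t *map, size_t map_bytes,
					 uint32_t *blocks, size_t max_blocks)
{
	size_t		n = 0;

	for (size_t byte = 0; byte < map_bytes; byte++)
	{
		if (map[byte] == 0)
			continue;			/* no changed blocks in this group of 8 */
		for (int bit = 0; bit < 8; bit++)
		{
			if ((map[byte] & (1u << bit)) && n < max_blocks)
				blocks[n++] = (uint32_t) (byte * 8 + bit);
		}
	}
	return n;
}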




Re: Postgres with pthread

2017-12-27 Thread james

> All threads are blocked in semaphores.
That they are blocked is inevitable - I guess the issue is that they are 
thrashing.
I guess it would be necessary to separate the internals to have some 
internal queueing and effectively reduce the number of actively 
executing threads.

In effect make the connection pooling work internally.

Would it be possible to make the caches have persistent (functional) 
data structures - effectively CoW?


And how easy would it be to abort if the master view had subsequently 
changed when it comes to execution?





Re: Postgres with pthread

2017-12-27 Thread Andres Freund


On December 27, 2017 11:05:52 AM GMT+01:00, james 
 wrote:
> > All threads are blocked in semaphores.
>That they are blocked is inevitable - I guess the issue is that they
>are 
>thrashing.
>I guess it would be necessary to separate the internals to have some 
>internal queueing and effectively reduce the number of actively 
>executing threads.
>In effect make the connection pooling work internally.
>
>Would it be possible to make the caches have persistent (functional) 
>data structures - effectively CoW?
>
>And how easy would it be to abort if the master view had subsequently 
>changed when it comes to execution?

Optimizing for this seems like a pointless exercise. If the goal is efficient 
processing of 100k connections the solution is a session / connection 
abstraction and a scheduler.   Optimizing for this amount of concurrency just 
will add complexity and slowdowns for a workload that nobody will run.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Using ProcSignal to get memory context stats from a running backend

2017-12-27 Thread Maksim Milyutin

On 27.12.2017 10:44, Craig Ringer wrote:

On 22 December 2017 at 23:19, Maksim Milyutin wrote:


Noticing the interest in calling some routines on a remote
backend through signals, in a parallel thread [1] I have proposed the
possibility of defining user-defined signal handlers in extensions.
There is a patch taken from the pg_query_state module.

1.

https://www.postgresql.org/message-id/3f905f10-cf7a-d4e0-64a1-7fd9b8351592%40gmail.com




Haven't read the patch, but the idea seems useful if a bit hairy in 
practice. It'd be done on a "use at your own risk, and if it breaks 
you get to keep the pieces" basis, where no guarantees are made by Pg 
about how and when the function is called so it must be utterly 
defensive. The challenge there being that you can't always learn 
enough about the state of the system without doing things that might 
break it, so lots of things you could do would be fragile.


Yes, I agree with your thesis that the state of the system has to be 
checked/rechecked before starting to manipulate it in a signal handler. 
And the responsibility is on the extension developer to provide the 
necessary level of defensiveness. Perhaps it makes sense to add a try-catch 
block around the signal handler to intercept potential errors raised there.


--
Regards,
Maksim Milyutin



Re: Tracking of page changes for backup purposes. PTRACK [POC]

2017-12-27 Thread Magnus Hagander
On Thu, Dec 21, 2017 at 1:51 AM, Michael Paquier 
wrote:

> On Thu, Dec 21, 2017 at 7:35 AM, Robert Haas 
> wrote:
> > On Wed, Dec 20, 2017 at 3:45 PM, Tomas Vondra
> >  wrote:
> >>> Isn't it more effective to hold this info in Postgres than in backup software?
> >>> Then any backup software can use this implementation.
> >>
> >> I don't think it means it can't be implemented in Postgres, but does it
> >> need to be done in backend?
> >>
> >> For example, it might be a command-line tool similar to pg_waldump,
> >> which processes WAL segments and outputs list of modified blocks,
> >> possibly with the matching LSN. Or perhaps something like pg_receivewal,
> >> doing that in streaming mode.
> >>
> >> This part of the solution can still be part of PostgreSQL codebase, and
> >> the rest has to be part of backup solution anyway.
> >
> > I agree with all of that.
>
> +1. This summarizes a bunch of concerns about all kinds of backend
> implementations proposed. Scanning for a list of blocks modified via
> streaming gives more availability, but knowing that you will need to
> switch to a new segment anyway when finishing a backup, does it really
> matter? Doing it once a segment has finished would be cheap enough,
> and you can even do it in parallel with a range of segments.
>

There's definitely a lot of value to that, in particular being able to
do it entirely outside the backend, making it possible to extract the data
from an existing log archive.

Just to throw another option out  there, it could also be implemented at
least partially as a walsender command. That way you can get it out through
a replication connection and piggyback on things like replication slots to
make sure you have the data you need, without having to send the full
volume of data. And it would make it possible to do
incremental/differential *without* having a WAL archive in the first place.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Postgres with pthread

2017-12-27 Thread james

On 27/12/2017 10:08, Andres Freund wrote:

Optimizing for this seems like a pointless exercise. If the goal is efficient 
processing of 100k connections the solution is a session / connection 
abstraction and a scheduler.   Optimizing for this amount of concurrency just 
will add complexity and slowdowns for a workload that nobody will run.


Isn't that what's suggested?
The difference is that the session scheduler is inside the server.
Accepting 100k connections is no problem these days - giving each of 
them an active thread seems to be the issue.





Re: Postgres with pthread

2017-12-27 Thread Konstantin Knizhnik



On 27.12.2017 13:08, Andres Freund wrote:


On December 27, 2017 11:05:52 AM GMT+01:00, james 
 wrote:

All threads are blocked in semaphores.

That they are blocked is inevitable - I guess the issue is that they
are
thrashing.
I guess it would be necessary to separate the internals to have some
internal queueing and effectively reduce the number of actively
executing threads.
In effect make the connection pooling work internally.

Would it be possible to make the caches have persistent (functional)
data structures - effectively CoW?

And how easy would it be to abort if the master view had subsequently
changed when it comes to execution?

Optimizing for this seems like a pointless exercise. If the goal is efficient 
processing of 100k connections the solution is a session / connection 
abstraction and a scheduler.   Optimizing for this amount of concurrency just 
will add complexity and slowdowns for a workload that nobody will run.
I agree with you that supporting 100k active connections does not make much 
practical sense now.
But there are many systems with hundreds of cores, and to utilize them we 
still need to spawn thousands of backends.

In this case Postgres snapshots and local caches become inefficient.
Switching to CSNs would somewhat solve the problem with snapshots.
But the problems with private caches should also be addressed: it seems 
very wasteful to perform the same work 1000 times and maintain 
1000 copies.
Also, in the case of global prepared statements, the presence of a global cache 
allows spending more time on plan optimization or using manual tuning.


Switching to the pthread model significantly simplifies development of 
shared caches: there are no problems with statically allocated shared 
address space or dynamic segments mapped at different addresses, which 
prevent using normal pointers. Also, invalidation of a shared cache is 
easier: there is no need to send invalidation notifications to all backends.
But it still requires a lot of work. For example, the catalog cache is 
tightly integrated with the resource owner information.
Also, a shared cache requires synchronization, and this synchronization 
itself can become a bottleneck.



Andres


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Ethiopian calendar year(DATE TYPE) are different from the Gregorian calendar year

2017-12-27 Thread Hannu Krosing
On 26.12.2017 09:23, Lelisa Diriba wrote:
> In Ethiopia the year has 13 months, but in the Gregorian calendar the
> year has 12 months. Ethiopian society wants to use the Ethiopian
> calendar year. I have the algorithm, but should I add it to the
> PostgreSQL source code as an EXTENSION, or to the backend (to the kernel)?
> The 13th month has 5 days, and every fourth year the 13th month has 6 days;
> that means 1 year has 365.25 days. My algorithm converts a Gregorian
> calendar year to an Ethiopian calendar year.

You might want to check first whether the system locale support already
solves this for you.
If so, you may want to introduce this not as an extension, but as a locale.

At least a cursory search shows that on some systems the locales do.

There was some discussion of this at

https://stackoverflow.com/questions/20001463/postgresql-ethiopian-date-format

and also at least Apple seems to have an Ethiopian calendar locale

https://developer.apple.com/documentation/foundation/nscalendar

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
https://2ndquadrant.com/




Re: AS OF queries

2017-12-27 Thread Hannu Krosing
On 20.12.2017 14:45, Konstantin Knizhnik wrote:
> I wonder if Postgres community is interested in supporting time travel
> queries in PostgreSQL (something like AS OF queries in Oracle:
> https://docs.oracle.com/cd/B14117_01/appdev.101/b10795/adfns_fl.htm).
> As far as I know something similar is now developed for MariaDB.
>
> It seems to me that it will be not so difficult to implement them in
> Postgres - we already have versions of tuples.
> Looks like we only need to do three things:
> 1. Disable autovacuum (autovacuum = off)
In the design of the original University Postgres (which was a full
history database geared towards WORM drives),
it was the task of vacuum to move old tuples to "an archive", from where
the AS OF queries would then fetch
them as needed.

This might also be a good place to do Commit LSN to Commit Timestamp
translation.

Hannu

> 2. Enable commit timestamp (track_commit_timestamp = on)
> 3. Add asofTimestamp to snapshot and patch XidInMVCCSnapshot to
> compare commit timestamps when it is specified in snapshot.
>
>
> Attached please find my prototype implementation of it.
> Most of the efforts are needed to support asof timestamp in grammar
> and add it to query plan.
> I failed to support AS OF clause (as in Oracle) because of
> shift-reduce conflicts with aliases,
> so I have to introduce new ASOF keyword. May be yacc experts can
> propose how to solve this conflict without introducing new keyword...
>
> Please notice that now ASOF timestamp is used only for data snapshot,
> not for catalog snapshot.
> I am not sure that it is possible (and useful) to travel through
> database schema history...
>
> Below is an example of how it works:
>
> postgres=# create table foo(pk serial primary key, ts timestamp
> default now(), val text);
> CREATE TABLE
> postgres=# insert into foo (val) values ('insert');
> INSERT 0 1
> postgres=# insert into foo (val) values ('insert');
> INSERT 0 1
> postgres=# insert into foo (val) values ('insert');
> INSERT 0 1
> postgres=# select * from foo;
>  pk | ts |  val
> ++
>   1 | 2017-12-20 14:59:17.715453 | insert
>   2 | 2017-12-20 14:59:22.933753 | insert
>   3 | 2017-12-20 14:59:27.87712  | insert
> (3 rows)
>
> postgres=# select * from foo asof timestamp '2017-12-20 14:59:25';
>  pk | ts |  val
> ++
>   1 | 2017-12-20 14:59:17.715453 | insert
>   2 | 2017-12-20 14:59:22.933753 | insert
> (2 rows)
>
> postgres=# select * from foo asof timestamp '2017-12-20 14:59:20';
>  pk | ts |  val
> ++
>   1 | 2017-12-20 14:59:17.715453 | insert
> (1 row)
>
> postgres=# update foo set val='upd',ts=now() where pk=1;
> UPDATE 1
> postgres=# select * from foo asof timestamp '2017-12-20 14:59:20';
>  pk | ts |  val
> ++
>   1 | 2017-12-20 14:59:17.715453 | insert
> (1 row)
>
> postgres=# select * from foo;
>  pk | ts |  val
> ++
>   2 | 2017-12-20 14:59:22.933753 | insert
>   3 | 2017-12-20 14:59:27.87712  | insert
>   1 | 2017-12-20 15:09:17.046047 | upd
> (3 rows)
>
> postgres=# update foo set val='upd2',ts=now() where pk=1;
> UPDATE 1
> postgres=# select * from foo asof timestamp '2017-12-20 15:10';
>  pk | ts |  val
> ++
>   2 | 2017-12-20 14:59:22.933753 | insert
>   3 | 2017-12-20 14:59:27.87712  | insert
>   1 | 2017-12-20 15:09:17.046047 | upd
> (3 rows)
>
>
> Comments and feedback are welcome:)
>

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
https://2ndquadrant.com/




Re: Add hint about replication slots when nearing wraparound

2017-12-27 Thread Michael Paquier
On Wed, Dec 27, 2017 at 08:47:20AM +0100, Feike Steenbergen wrote:
> Changed the block from a note to a caution,

Thanks for the new version.

- "You might also need to commit or roll back old prepared transactions.")));
+ "You might also need to commit or roll back old prepared transactions, or 
drop stale replication slots.")));
Would "or TO drop stale replication slots" be more correct English?

ereport(WARNING,
(errmsg("oldest xmin is far in the past"),
-errhint("Close open transactions soon to avoid wraparound 
problems.")));
+errhint("Close open transactions soon to avoid wraparound 
problems. You might also need to commit or roll back old prepared transactions, 
or drop stale replication slots.")));

I am not convinced that you need this bit. autovacuum_freeze_max_age can
be set to even lower values than the default.

Still, those are minor comments, so I am marking this patch as ready for
committer to get more input from higher-ups.
--
Michael




postgres_fdw: perform UPDATE/DELETE .. RETURNING on a join directly

2017-12-27 Thread Etsuro Fujita
(2017/04/08 10:28), Robert Haas wrote:
> On Wed, Mar 22, 2017 at 6:20 AM, Etsuro Fujita
>   wrote:
>> On 2017/02/22 19:57, Rushabh Lathia wrote:
>>> Marked this as Ready for Committer.
>>
>> I noticed that this item in the CF app was incorrectly marked as Committed.
>> This patch isn't committed, so I returned it to the previous status.  I also
>> rebased the patch.  Attached is a new version of the patch.
>
> Sorry, I marked the wrong patch as committed.  Apologies for that.

No problem.  My apologies for the long long delay.

> This doesn't apply any more because of recent changes.
>
> git diff --check complains:
> contrib/postgres_fdw/postgres_fdw.c:3653: space before tab in indent.

I rebased the patch.

> +/* Shouldn't contain the target relation. */
> +Assert(target_rel == 0);
>
> This comment should give a reason.

Done.

>   void
>   deparseDirectUpdateSql(StringInfo buf, PlannerInfo *root,
>  Index rtindex, Relation rel,
> +   RelOptInfo *foreignrel,
>  List *targetlist,
>  List *targetAttrs,
>  List *remote_conds,
>
> Could you add a comment explaining the meaning of these various
> arguments?  It takes rtindex, rel, and foreignrel, which apparently
> are all different things, but the meaning is not explained.

Done.

>   /*
> + * Add a RETURNING clause, if needed, to an UPDATE/DELETE on a join.
> + */
> +static void
> +deparseExplicitReturningList(List *rlist,
> + List **retrieved_attrs,
> + deparse_expr_cxt *context)
> +{
> +deparseExplicitTargetList(rlist, true, retrieved_attrs, context);
> +}
>
> Do we really want to add a function for one line of code?

I don't have any strong opinion about that, so I removed that function
and modified deparseDirectUpdateSql/deparseDirectDeleteSql so it calls
deparseExplicitTargetList directly.

> +/*
> + * Look for conditions mentioning the target relation in the given
join tree,
> + * which will be pulled up into the WHERE clause.  Note that this is
safe due
> + * to the same reason stated in comments in deparseFromExprForRel.
> + */
>
> The comments for deparseFromExprForRel do not seem to address the
> topic of why this is safe.  Also, the answer to the question "safe
> from what?" is not clear.

Maybe my explanation was not enough, but I think the reason why
this is safe can be derived from the comments for
deparseFromExprForRel.  BUT: on reflection, I think I made the deparsing
logic more complex than necessary.  So I simplified the logic, which
doesn't pull_up_target_conditions any more, and added comments about that.

> -deparseReturningList(buf, root, rtindex, rel, false,
> - returningList, retrieved_attrs);
> +if (foreignrel->reloptkind == RELOPT_JOINREL)
> +deparseExplicitReturningList(returningList,
retrieved_attrs,&context);
> +else
> +deparseReturningList(buf, root, rtindex, rel, false,
> + returningList, retrieved_attrs);
>
> Why do these cases need to be handled differently?  Maybe add a brief
comment?

The reason for that is 1)

+   /*
+* When performing an UPDATE/DELETE .. RETURNING on a join directly,
+* we fetch from the foreign server any Vars specified in RETURNING
+* that refer not only to the target relation but to non-target
+* relations.  So we'll deparse them into the RETURNING clause
of the
+* remote query;

and 2) deparseReturningList can retrieve Vars of the target relation,
but can't retrieve Vars of non-target relations.

> +if ((outerrel->reloptkind == RELOPT_BASEREL&&
> + outerrel->relid == target_rel) ||
> +(innerrel->reloptkind == RELOPT_BASEREL&&
> + innerrel->relid == target_rel))
>
> 1. Surely it's redundant to check the RelOptKind if the RTI matches?

Good catch!  Revised.

> 2. Generally, the tests in this patch against various RelOptKind
> values should be adapted to use the new macros introduced in
> 7a39b5e4d11229ece930a51fd7cb29e535db4494.

I left some of the tests alone because I think that's more strict.

> The regression tests remove every remaining case where an update or
> delete gets fails to get pushed to the remote side.  I think we should
> still test that path, because we've still got that code.  Maybe use a
> non-pushable function in the join clause, or something.

Done.

> The new test cases could use some brief comments explaining their purpose.

Done.  Also, I think some of the tests in the previous version are
redundant, so I simplified the tests a bit.

>   if (plan->returningLists)
> +{
>   returningList = (List *) list_nth(plan->returningLists,
subplan_index);
>
> +/*
> + * If UPDATE/DELETE on a join, create a RETURNING list used
in the
> + * remote query.
> + */
> +if (fscan->scan.sc

Re: [HACKERS] Pluggable storage

2017-12-27 Thread Alexander Korotkov
Hi!

On Wed, Dec 27, 2017 at 6:54 AM, Haribabu Kommi 
wrote:

>
> On Tue, Dec 12, 2017 at 3:06 PM, Haribabu Kommi 
> wrote:
>
>>
>> I restructured that patch files to avoid showing unnecessary
>> modifications,
>> and also it will be easy for adding of new API's based on the all the
>> functions
>> that are exposed by heapam module easily compared earlier.
>>
>> Attached are the latest set of patches. I will work on the remaining
>> pending
>> items.
>>
>
> Apart from rebase to the latest master code, following are the additional
> changes,
>
> 1. Added API for bulk insert and rewrite functionality(Logical rewrite is
> not touched yet)
> 2. Tuple lock API interface redesign to remove the traversal logic from
> executor module.
>

Great, thank you!


> The tuple lock API interface changes are from "Alexander Korotkov" from
> "PostgresPro".
> Thanks Alexander. Currently we both are doing joint development for faster
> closure of
> open items that are pending to bring the "pluggable storage API" into a
> good shape.
>

Thank you for announcing this.  Yes, pluggable storage API requires a lot
of work to get into committable shape.  This is why I've decided to join
the development.

Let me explain the idea behind the new tuple lock API and the further patches I
plan to send.  As I noted upthread, I consider the possibility of alternative
MVCC implementations a vital property of the pluggable storage API.  These
include an undo log option, where the tuple is updated in place while the old
version of the tuple is displaced to some special area.  In this case, the new
version of the tuple resides at the same TID as the old version.  This is an
important point, because the TID is not really a tuple identifier anymore.
Naturally, the TID becomes a row identifier, while a tuple may be identified by
the pair (tid, snapshot).  For the current heap, the snapshot is redundant and
can be used just for assert checking (the tuple at the given tid is really
visible using the given snapshot).  For a heap with an undo log, the appropriate
tuple can be found by snapshot in the undo chain associated with the given tid.

One consequence of the above is that we cannot use the fact that the tid isn't
changed after an update as a sign that the tuple was deleted.  This is why I've
introduced the HTSU_Result HeapTupleDeleted.  Another consequence is that our
tid traversal logic in the executor layer is not valid anymore.  For instance,
the traversal from an older tuple to a later tuple doesn't make any sense for a
heap with an undo log, where the later tuple is more easily accessible than the
older tuple.  This is why I decided to hide this logic in the storage layer and
provide a TUPLE_LOCK_FLAG_FIND_LAST_VERSION flag which indicates that
lock_tuple() has to find the latest updated version and lock it.  I've also
changed the follow_updates bool to the more explicit
TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS flag, in order not to confuse it with
the previous flag, which also kind of follows updates.  A third consequence is
that we have to pass a snapshot to the tuple_update() and tuple_delete() methods
to let them check whether the row was concurrently updated while residing on the
same TID.  I'm going to provide this change as a separate patch.
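
To make the proposed flags a bit more concrete, a hypothetical sketch (the values
are assumptions for illustration, not the final pluggable storage API):

/*
 * Hypothetical sketch only: values are illustrative; the exact lock_tuple()
 * signature is still being worked out.
 */
typedef enum TupleLockFlags
{
	/* follow an in-progress updater and lock the updated version as well */
	TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS = 1 << 0,

	/* let the storage AM walk the update chain and lock the latest version */
	TUPLE_LOCK_FLAG_FIND_LAST_VERSION = 1 << 1
} TupleLockFlags;

With the second flag, an EvalPlanQual-style recheck asks the storage layer for
the latest version instead of the executor walking the update chain itself.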

Also, I appreciate that the tuple_insert() and tuple_update() methods are now
responsible for inserting index tuples.  This frees pluggable storages to
implement another way of interacting with indexes.  However, I didn't get
the point of passing an InsertIndexTuples IndexFunc to them.  Now, we're
always passing ExecInsertIndexTuples() as this argument.  As I understood it, a
storage is free to either call ExecInsertIndexTuples() or implement its own
logic of interacting with indexes.  But I don't understand why we need
a callback when tuple_insert() and tuple_update() can call
ExecInsertIndexTuples() directly if needed.  Another thing is that
tuple_delete() could also interact with indexes (especially when we
enhance the index access method API), and we need to pass meta-information
about indexes to tuple_delete() too.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Comment typo in postgres_fdw.c

2017-12-27 Thread Etsuro Fujita
Here is a small patch for fixing $Subject: s/it's/its/.

Best regards,
Etsuro Fujita
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fb65e2e..44db449 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -5115,7 +5115,7 @@ conversion_error_callback(void *arg)
 
 		/*
 		 * Target list can have Vars and expressions.  For Vars, we can get
-		 * it's relation, however for expressions we can't.  Thus for
+		 * its relation, however for expressions we can't.  Thus for
 		 * expressions, just show generic context message.
 		 */
 		if (IsA(tle->expr, Var))


Re: [HACKERS] [PATCH] Lockable views

2017-12-27 Thread Yugo Nagata
On Tue, 26 Dec 2017 22:22:33 +0900
Michael Paquier  wrote:

> On Tue, Dec 26, 2017 at 06:37:06PM +0900, Yugo Nagata wrote:
> > I have created a new entry in CF-2017-1 and registered this thread again.
> 
> Fine for me. Thanks for the update. And I guess that you are planning to
> send a new version before the beginning of the next commit fest using
> the feedback provided?

Yes. I'll update the patch according to Ishii-san's comments.

> --
> Michael


-- 
Yugo Nagata 



Re: AS OF queries

2017-12-27 Thread PostgreSQL - Hans-Jürgen Schönig


On 12/20/2017 01:45 PM, Konstantin Knizhnik wrote:
> I wonder if Postgres community is interested in supporting time travel
> queries in PostgreSQL (something like AS OF queries in Oracle:
> https://docs.oracle.com/cd/B14117_01/appdev.101/b10795/adfns_fl.htm).
> As far as I know something similar is now developed for MariaDB.
>
> It seems to me that it will be not so difficult to implement them in
> Postgres - we already have versions of tuples.
> Looks like we only need to do three things:
> 1. Disable autovacuum (autovacuum = off)
> 2. Enable commit timestamp (track_commit_timestamp = on)
> 3. Add asofTimestamp to snapshot and patch XidInMVCCSnapshot to
> compare commit timestamps when it is specified in snapshot.
>

that sounds really awesome ... i would love to see that.
my question is: while MVCC is fine when a tuple is still there ...
what are you going to do with TRUNCATE and so on?
it is not uncommon that a table is truncated frequently. in this case
MVCC won't help.
what are your thoughts on this ?

    many thanks,

        hans

-- 
Hans-Jürgen Schönig
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com




Re: [HACKERS] [PATCH] Lockable views

2017-12-27 Thread Yugo Nagata
Hi,

Attached is the updated patch.

On Mon, 16 Oct 2017 10:07:48 +0900 (JST)
Tatsuo Ishii  wrote:
> >> > It would be nice if the message would be something like:
> >> > 
> >> > DETAIL:  Views that return aggregate functions are not lockable

> You could add a flag to view_query_is_auto_updatable() to switch the
> message between
> 
>  DETAIL:  Views that return aggregate functions are not automatically 
> updatable.
> 
> and
> 
>  DETAIL:  Views that return aggregate functions are not lockable

I didn't want to change the interface of view_query_is_auto_updatable()
because it might be called from other third-party software, so I renamed
this function to view_query_is_auto_updatable_or_lockable() and added the flag
to it. I kept view_query_is_auto_updatable() as a wrapper around this function.
I also added view_query_is_lockable(), which returns a different message than 
view_query_is_auto_updatable().

> On Tue, 17 Oct 2017 11:59:05 +0900 (JST)
> Tatsuo Ishii  wrote:
> > 1) Leave as it is (ignore tables appearing in a subquery)
> > 
> > 2) Lock all tables including in a subquery
> > 
> > 3) Check subquery in the view 

> > So it seem #1 is the most reasonable way to deal with the problem
> > assuming that it's user's responsibility to take appropriate locks on
> > the tables in the subquery.

I adopted #1 and I didn't change anything about this.

Regards,


-- 
Yugo Nagata 
diff --git a/src/backend/commands/lockcmds.c b/src/backend/commands/lockcmds.c
index 9fe9e02..80e00f7 100644
--- a/src/backend/commands/lockcmds.c
+++ b/src/backend/commands/lockcmds.c
@@ -19,15 +19,20 @@
 #include "commands/lockcmds.h"
 #include "miscadmin.h"
 #include "parser/parse_clause.h"
+#include "parser/parsetree.h"
 #include "storage/lmgr.h"
 #include "utils/acl.h"
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
+#include "rewrite/rewriteHandler.h"
+#include "access/heapam.h"
 
-static void LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait);
-static AclResult LockTableAclCheck(Oid relid, LOCKMODE lockmode);
+static void LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait, Oid userid);
+static AclResult LockTableAclCheck(Oid relid, LOCKMODE lockmode, Oid userid);
 static void RangeVarCallbackForLockTable(const RangeVar *rv, Oid relid,
 			 Oid oldrelid, void *arg);
+static void LockViewRecurse(Oid reloid, Oid root_reloid, LOCKMODE lockmode, bool nowait);
+static void LockViewCheck(Relation view, Query *viewquery);
 
 /*
  * LOCK TABLE
@@ -62,8 +67,10 @@ LockTableCommand(LockStmt *lockstmt)
 		  RangeVarCallbackForLockTable,
 		  (void *) &lockstmt->mode);
 
-		if (recurse)
-			LockTableRecurse(reloid, lockstmt->mode, lockstmt->nowait);
+		if (get_rel_relkind(reloid) == RELKIND_VIEW)
+			LockViewRecurse(reloid, reloid, lockstmt->mode, lockstmt->nowait);
+		else if (recurse)
+			LockTableRecurse(reloid, lockstmt->mode, lockstmt->nowait, GetUserId());
 	}
 }
 
@@ -86,15 +93,17 @@ RangeVarCallbackForLockTable(const RangeVar *rv, Oid relid, Oid oldrelid,
 		return;	/* woops, concurrently dropped; no permissions
  * check */
 
-	/* Currently, we only allow plain tables to be locked */
-	if (relkind != RELKIND_RELATION && relkind != RELKIND_PARTITIONED_TABLE)
+
+	/* Currently, we only allow plain tables or auto-updatable views to be locked */
+	if (relkind != RELKIND_RELATION && relkind != RELKIND_PARTITIONED_TABLE &&
+		relkind != RELKIND_VIEW)
 		ereport(ERROR,
 (errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("\"%s\" is not a table",
+ errmsg("\"%s\" is not a table or view",
 		rv->relname)));
 
 	/* Check permissions. */
-	aclresult = LockTableAclCheck(relid, lockmode);
+	aclresult = LockTableAclCheck(relid, lockmode, GetUserId());
 	if (aclresult != ACLCHECK_OK)
 		aclcheck_error(aclresult, ACL_KIND_CLASS, rv->relname);
 }
@@ -107,7 +116,7 @@ RangeVarCallbackForLockTable(const RangeVar *rv, Oid relid, Oid oldrelid,
  * multiply-inheriting children more than once, but that's no problem.
  */
 static void
-LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait)
+LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait, Oid userid)
 {
 	List	   *children;
 	ListCell   *lc;
@@ -120,7 +129,7 @@ LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait)
 		AclResult	aclresult;
 
 		/* Check permissions before acquiring the lock. */
-		aclresult = LockTableAclCheck(childreloid, lockmode);
+		aclresult = LockTableAclCheck(childreloid, lockmode, userid);
 		if (aclresult != ACLCHECK_OK)
 		{
 			char	   *relname = get_rel_name(childreloid);
@@ -157,15 +166,82 @@ LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait)
 			continue;
 		}
 
-		LockTableRecurse(childreloid, lockmode, nowait);
+		LockTableRecurse(childreloid, lockmode, nowait, userid);
 	}
 }
 
 /*
+ * Apply LOCK TABLE recursively over a view
+ */
+static void
+LockViewRecurse(Oid reloid, Oid root_reloid, LOCKMODE lockmode, bool nowait)
+{
+	Relation		 view;
+	Query			*viewquery;
+	RangeTblRef

Re: TupleDescCopy doesn't clear atthasdef, attnotnull, attidentity

2017-12-27 Thread Vik Fearing
On 11/29/2017 08:58 AM, Thomas Munro wrote:
> Hi hackers,
> 
> Andrew Gierth complained off-list that TupleDescCopy() doesn't clear
> atthasdef.  Yeah, that's an oversight.  The function is new in commit
> cc5f81366c36 and was written by me to support "flat" (pointer-free)
> tuple descriptors for use in DSM.  Following the example of
> CreateTupleDescCopy() I think it should also clear attnotnull and
> attidentity.  Please see attached.

This trivial patch is a clear oversight in the original patch.  Thank
you to Andrew for some explanation off-list to help me understand tuple
descriptors better.

Ready for committer.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: AS OF queries

2017-12-27 Thread Konstantin Knizhnik



On 27.12.2017 17:14, PostgreSQL - Hans-Jürgen Schönig wrote:


On 12/20/2017 01:45 PM, Konstantin Knizhnik wrote:

I wonder if Postgres community is interested in supporting time travel
queries in PostgreSQL (something like AS OF queries in Oracle:
https://docs.oracle.com/cd/B14117_01/appdev.101/b10795/adfns_fl.htm).
As far as I know something similar is now developed for MariaDB.

It seems to me that it will be not so difficult to implement them in
Postgres - we already have versions of tuples.
Looks like we only need to do three things:
1. Disable autovacuum (autovacuum = off)
2. Enable commit timestamp (track_commit_timestamp = on)
3. Add asofTimestamp to snapshot and patch XidInMVCCSnapshot to
compare commit timestamps when it is specified in snapshot.


that sounds really awesome ... i would love to see that.
my question is: while MVCC is fine when a tuple is still there ...
what are you going to do with TRUNCATE and so on?
it is not uncommon that a table is truncated frequently. in this case
MVCC won't help.
what are your thoughts on this ?


You should not use drop/truncate if you want to access old versions:)
Yes, truncate is much faster than delete, but that is because it 
operates at the file level.

I think that it is quite a natural limitation.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: AS OF queries

2017-12-27 Thread Konstantin Knizhnik



On 27.12.2017 00:52, Jeff Janes wrote:
On Thu, Dec 21, 2017 at 6:00 AM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:


There is still one significant difference between my prototype
implementation and the SQL standard: it associates the timestamp with
the select statement, not with a particular table.
It seems to be more difficult to support, and I am not sure that
joining tables from different timelines makes much sense.
But certainly it can also be fixed.


I think the main use I would find for this feature is something like:

select * from foo except select * from foo as old_foo as of '';

So I would be grateful if you can make that work. Also, I think 
conforming to the standards is pretty important where it is feasible 
to do that.


Cheers,

Jeff


I attach a new version of the patch which supports the "standard" syntax, where 
the AS OF clause is associated with a table reference.

So it is possible to write a query like:

    select * from SomeTable as t as of timestamp '2017-12-27 14:54:40' 
where id=100;


Also I introduced a "time_travel" GUC which implicitly assigns some other 
GUCs:


        track_commit_timestamp = true;
        vacuum_defer_cleanup_age = 10;
        vacuum_freeze_min_age = 10;
        autovacuum_freeze_max_age = 20;
        autovacuum_multixact_freeze_max_age = 20;
        autovacuum_start_daemon = false;

So it disables autovacuum and microvacuum and enables commit timestamp 
tracking.

It provides access to the past for up to a billion transactions.

There is still no way to keep all versions only for particular tables, or to 
truncate versions that are too old.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index eb5bbb5..7acaf30 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -78,6 +78,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	ExprContext *econtext;
 	HeapScanDesc scan;
 	TIDBitmap  *tbm;
+	EState	   *estate;
 	TBMIterator *tbmiterator = NULL;
 	TBMSharedIterator *shared_tbmiterator = NULL;
 	TBMIterateResult *tbmres;
@@ -85,11 +86,13 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	TupleTableSlot *slot;
 	ParallelBitmapHeapState *pstate = node->pstate;
 	dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
+	TimestampTz outerAsofTimestamp = 0;
 
 	/*
 	 * extract necessary information from index scan node
 	 */
 	econtext = node->ss.ps.ps_ExprContext;
+	estate = node->ss.ps.state;
 	slot = node->ss.ss_ScanTupleSlot;
 	scan = node->ss.ss_currentScanDesc;
 	tbm = node->tbm;
@@ -99,6 +102,25 @@ BitmapHeapNext(BitmapHeapScanState *node)
 		shared_tbmiterator = node->shared_tbmiterator;
 	tbmres = node->tbmres;
 
+	outerAsofTimestamp = estate->es_snapshot->asofTimestamp;
+
+	if (node->ss.asofExpr)
+	{
+		if (!node->ss.asofTimestampSet)
+		{
+			Datum		val;
+			bool		isNull;
+
+			val = ExecEvalExprSwitchContext(node->ss.asofExpr,
+			node->ss.ps.ps_ExprContext,
+			&isNull);
+			/* Interpret NULL timestamp as no timestamp */
+			node->ss.asofTimestamp = isNull ? 0 : DatumGetInt64(val);
+			node->ss.asofTimestampSet = true;
+		}
+		estate->es_snapshot->asofTimestamp = node->ss.asofTimestamp;
+	}
+
 	/*
 	 * If we haven't yet performed the underlying index scan, do it, and begin
 	 * the iteration over the bitmap.
@@ -364,11 +386,21 @@ BitmapHeapNext(BitmapHeapScanState *node)
 			}
 		}
 
-		/* OK to return this tuple */
+		/*
+		 * Restore ASOF timestamp for the current snapshot
+		 */
+		estate->es_snapshot->asofTimestamp = outerAsofTimestamp;
+
+	/* OK to return this tuple */
 		return slot;
 	}
 
 	/*
+	 * Restore ASOF timestamp for the current snapshot
+	 */
+	estate->es_snapshot->asofTimestamp = outerAsofTimestamp;
+
+	/*
 	 * if we get here it means we are at the end of the scan..
 	 */
 	return ExecClearTuple(slot);
@@ -746,6 +778,8 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
 {
 	PlanState  *outerPlan = outerPlanState(node);
 
+	node->ss.asofTimestampSet = false;
+
 	/* rescan to release any page pin */
 	heap_rescan(node->ss.ss_currentScanDesc, NULL);
 
@@ -902,7 +936,8 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	 * most cases it's probably not worth working harder than that.
 	 */
 	scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
- node->scan.plan.targetlist == NIL);
+ node->scan.plan.targetlist == NIL &&
+ node->scan.asofTimestamp == NULL);
 
 	/*
 	 * Miscellaneous initialization
@@ -920,6 +955,18 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 		ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
 
 	/*
+	 * Initlialize AS OF expression of any
+	 */
+	if (node->scan.asofTimestamp)
+	{
+		scanstate->ss.asofExpr = ExecInitExpr((Expr *) node->scan.asofTimestamp,
+		   &scanstat

Re: Deadlock in multiple CIC.

2017-12-27 Thread Jeremy Finzel
On Tue, Dec 26, 2017 at 11:23 AM, Jeff Janes  wrote:

> On Tue, Dec 26, 2017 at 8:31 AM, Alvaro Herrera 
> wrote:
>
>> Jeff Janes wrote:
>> > c3d09b3bd23f5f6 fixed it so concurrent CIC would not deadlock (or at
>> least
>> > not as reliably as before) by dropping its own snapshot before waiting
>> for
>> > all the other ones to go away.
>> >
>> > With commit 8aa3e47510b969354ea02a, concurrent CREATE INDEX
>> CONCURRENTLY on
>> > different tables in the same database started deadlocking against each
>> > other again quite reliably.
>> >
>> > I think the solution is simply to drop the catalog snapshot as well, as
>> in
>> > the attached.
>>
>> Thanks for the analysis -- it sounds reasonable to me.  However, I'm
>> wondering why you used the *Conditionally() version instead of plain
>> InvalidateCatalogSnapshot().
>
>
> My thinking was that if there was for some reason another snapshot hanging
> around, that dropping the catalog snapshot unconditionally would be a
> correctness bug, while doing it conditionally would just fail to avoid a
> theoretically avoidable deadlock.  So it seemed safer.
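
A minimal sketch of the shape of the change under discussion (assumed from this
thread, not the actual attached patch):

#include "postgres.h"
#include "utils/snapmgr.h"

/*
 * Sketch only: before CREATE INDEX CONCURRENTLY waits for other transactions'
 * snapshots to go away, it should stop advertising an xmin itself, which means
 * dropping the catalog snapshot in addition to its own MVCC snapshot.
 */
static void
cic_release_snapshots(Snapshot my_snapshot)
{
	PopActiveSnapshot();
	UnregisterSnapshot(my_snapshot);

	/*
	 * The conditional variant drops the catalog snapshot only if no other
	 * snapshot remains registered in this backend.
	 */
	InvalidateCatalogSnapshotConditionally();
}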
>
>
>>   I think they must have the same effect in
>> practice (the assumption being that you can't run CIC in a transaction
>> that may have other snapshots) but the theory seems simpler when calling
>> the other routine: just do away with the snapshot always, period.
>>
>
> That is probably true.  But I never even knew that catalog snapshots
> existed until yesterday, so didn't want to make make assumptions about what
> else might exist, to avoid introducing new bugs similar to the one that
> 8aa3e47510b969354ea02a fixed.
>
>
>>
>> This is back-patchable to 9.4, first branch which has MVCC catalog
>> scans.  It's strange that this has gone undetected for so long.
>
>
> Since the purpose of CIC is to build an index with minimal impact on other
> users, I think wanting to use it in concurrent cases might be rather rare.
> In a maintenance window, I wouldn't want to use CIC because it is slower
> and I'd rather just hold the stronger lock and do it fast, and as a hot-fix
> outside a maintenance window I usually wouldn't want to hog the CPU with
> concurrent builds when I could do them sequentially instead.  Also, since
> deadlocks are "expected" errors rather than "should never happen" errors,
> and since the docs don't promise that you can do parallel CIC without
> deadlocks, many people would probably shrug it off (which I initially did)
> rather than report it as a bug.  I was looking into it as an enhancement
> rather than a bug until I discovered that it was already enhanced and then
> undone.
>
> Cheers,
>
> Jeff
>

I was able to get this compiled, and ran the test first on stock 9.6.6,
then on this patched version.  I reproduced the deadlock on 9.6.6, and the
patched version indeed fixes my issue.

Let me know if I can be of further help.

Thanks,
Jeremy


Re: [HACKERS] generated columns

2017-12-27 Thread Peter Eisentraut
On 9/12/17 15:35, Jaime Casanova wrote:
> On 10 September 2017 at 00:08, Jaime Casanova
>  wrote:
>>
>> During my own tests, though, i found some problems:

Here is an updated patch that should address the problems you have found.

> also is interesting that in triggers, both before and after, the
> column has a null. that seems reasonable in a before trigger but not
> in an after trigger

Logically, you are correct.  But it seems excessive to compute all
virtual columns for every trigger.  I don't know how to consolidate
that, especially with the current trigger API that lets
you look more or less directly into the tuple.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 62f578fcd7f5cbf1dbafd43cbb6de0df2e6c148b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 27 Dec 2017 11:49:35 -0500
Subject: [PATCH v2] Generated columns

This is an SQL-standard feature that allows creating columns that are
computed from expressions rather than assigned, similar to a view or
materialized view but on a column basis.

The plan is to implement two kinds of generated columns:
virtual (computed on read) and stored (computed on write).  This patch
only implements the virtual kind, leaving stubs to implement the stored
kind later.
---
 doc/src/sgml/catalogs.sgml  |  11 +
 doc/src/sgml/information_schema.sgml|  10 +-
 doc/src/sgml/ref/copy.sgml  |   3 +-
 doc/src/sgml/ref/create_table.sgml  |  46 ++-
 src/backend/access/common/tupdesc.c |   5 +
 src/backend/catalog/genbki.pl   |   3 +
 src/backend/catalog/heap.c  |  93 +-
 src/backend/catalog/index.c |   1 +
 src/backend/catalog/information_schema.sql  |   8 +-
 src/backend/commands/copy.c |  10 +-
 src/backend/commands/indexcmds.c|  24 +-
 src/backend/commands/tablecmds.c| 145 -
 src/backend/commands/trigger.c  |  36 +++
 src/backend/commands/typecmds.c |   6 +-
 src/backend/executor/execMain.c |   7 +-
 src/backend/nodes/copyfuncs.c   |   2 +
 src/backend/nodes/equalfuncs.c  |   2 +
 src/backend/nodes/outfuncs.c|   9 +
 src/backend/parser/gram.y   |  26 +-
 src/backend/parser/parse_agg.c  |  11 +
 src/backend/parser/parse_expr.c |   5 +
 src/backend/parser/parse_func.c |  12 +
 src/backend/parser/parse_utilcmd.c  |  87 -
 src/backend/rewrite/rewriteHandler.c| 149 -
 src/backend/utils/cache/lsyscache.c |  32 ++
 src/backend/utils/cache/relcache.c  |   1 +
 src/bin/pg_dump/pg_dump.c   |  39 ++-
 src/bin/pg_dump/pg_dump.h   |   1 +
 src/bin/pg_dump/pg_dump_sort.c  |  10 +
 src/bin/pg_dump/t/002_pg_dump.pl|  39 +++
 src/bin/psql/describe.c |  28 +-
 src/include/catalog/heap.h  |   5 +-
 src/include/catalog/pg_attribute.h  |  23 +-
 src/include/catalog/pg_class.h  |   2 +-
 src/include/nodes/parsenodes.h  |  12 +-
 src/include/parser/kwlist.h |   2 +
 src/include/parser/parse_node.h |   3 +-
 src/include/rewrite/rewriteHandler.h|   1 +
 src/include/utils/lsyscache.h   |   1 +
 src/test/regress/expected/create_table_like.out |  46 +++
 src/test/regress/expected/generated.out | 407 
 src/test/regress/parallel_schedule  |   2 +-
 src/test/regress/serial_schedule|   1 +
 src/test/regress/sql/create_table_like.sql  |  14 +
 src/test/regress/sql/generated.sql  | 233 ++
 45 files changed, 1520 insertions(+), 93 deletions(-)
 create mode 100644 src/test/regress/expected/generated.out
 create mode 100644 src/test/regress/sql/generated.sql

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 3f02202caf..76237d9eb8 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1160,6 +1160,17 @@ pg_attribute 
Columns
   
  
 
+ 
+  attgenerated
+  char
+  
+  
+   If a zero byte (''), then not a generated column.
+   Otherwise, s = stored, v =
+   virtual.
+  
+ 
+
  
   attisdropped
   bool
diff --git a/doc/src/sgml/information_schema.sgml 
b/doc/src/sgml/information_schema.sgml
index 0faa72f1d3..6edf04cc69 100644
--- a/doc/src/sgml/information_schema.sgml
+++ b/doc/src/sgml/information_schema.sgml
@@ -1648,13 +1648,19 @@ columns Columns
  
   is_generated
   character_data
-  Applies to a feature not available in 
PostgreSQL
+  
+   If the column i

Re: Deadlock in multiple CIC.

2017-12-27 Thread Jerry Sievers
Jeff Janes  writes:

> On Tue, Dec 26, 2017 at 8:31 AM, Alvaro Herrera <
> alvhe...@alvh.no-ip.org> wrote:
>
> Jeff Janes wrote:
> > c3d09b3bd23f5f6 fixed it so concurrent CIC would not deadlock
> (or at least
> > not as reliably as before) by dropping its own snapshot before
> waiting for
> > all the other ones to go away.
> >
> > With commit 8aa3e47510b969354ea02a, concurrent CREATE INDEX
> CONCURRENTLY on
> > different tables in the same database started deadlocking
> against each
> > other again quite reliably.
> >
> > I think the solution is simply to drop the catalog snapshot as
> well, as in
> > the attached.
>
> Thanks for the analysis -- it sounds reasonable to me.  However,
> I'm
> wondering why you used the *Conditionally() version instead of
> plain
> InvalidateCatalogSnapshot().
>
>
> My thinking was that if there was for some reason another snapshot
> hanging around, that dropping the catalog snapshot unconditionally
> would be a correctness bug, while doing it conditionally would just
> fail to avoid a theoretically avoidable deadlock.  So it seemed
> safer.
>  
>
>   I think they must have the same effect in
> practice (the assumption being that you can't run CIC in a
> transaction
> that may have other snapshots) but the theory seems simpler when
> calling
> the other routine: just do away with the snapshot always, period.
>
>
> That is probably true.  But I never even knew that catalog snapshots
> existed until yesterday, so didn't want to make assumptions
> about what else might exist, to avoid introducing new bugs similar to
> the one that 8aa3e47510b969354ea02a fixed.
>  
>
>
> This is back-patchable to 9.4, first branch which has MVCC
> catalog
> scans.  It's strange that this has gone undetected for so long.
>
>
> Since the purpose of CIC is to build an index with minimal impact on
> other users, I think wanting to use it in concurrent cases might be
> rather rare.  In a maintenance window, I wouldn't want to use CIC
> because it is slower and I'd rather just hold the stronger lock and
> do it fast, and as a hot-fix outside a maintenance window I usually
> wouldn't want to hog the CPU with concurrent builds when I could do

Hmmm, given that most/all large sites lately are probably running on hw
with dozens or perhaps hundreds of CPUs/threads, I can see DBAs not
being too concerned about "hogging".

> them sequentially instead.  Also, since deadlocks are "expected"
> errors rather than "should never happen" errors, and since the docs
> don't promise that you can do parallel CIC without deadlocks, many
> people would probably shrug it off (which I initially did) rather
> than report it as a bug.  I was looking into it as an enhancement
> rather than a bug until I discovered that it was already enhanced and

Agreed, such an edge case is not a high priority to support for the above
reasons, but good to have fixed assuming no breakage in some other regard :-)

> then undone.
>
> Cheers,
>
> Jeff
>
>
>
>

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consult...@comcast.net
p: 312.241.7800



Re: [HACKERS] taking stdbool.h into use

2017-12-27 Thread Peter Eisentraut
On 12/26/17 23:10, Michael Paquier wrote:
> It would be nice to do something like that for GinTernaryValue in
> tsginidx.c by mapping directly to GIN_FALSE and GIN_TRUE depending on
> the input coming in gin_tsquery_consistent. The fix is more trivial
> there.

For GinTernaryValue, I think it's easier to just make it the same size
as bool, since it doesn't go onto disk.  My earlier patch did that.  I'm
not sure it's worth adding more code to copy the array around.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Deadlock in multiple CIC.

2017-12-27 Thread Jeff Janes
On Wed, Dec 27, 2017 at 8:50 AM, Jeremy Finzel  wrote:


> I was able to get this compiled and ran the test, first on stock 9.6.6 and
> then on the patched version.  I reproduced the deadlock on 9.6.6, and the
> patched version indeed fixes my issue.
>
> Let me know if I can be of further help.
>

Thanks for reporting and testing.

Cheers,

Jeff


Re: Deadlock in multiple CIC.

2017-12-27 Thread Jeremy Finzel
>
> Since the purpose of CIC is to build an index with minimal impact on other
> users, I think wanting to use it in concurrent cases might be rather rare.
> In a maintenance window, I wouldn't want to use CIC because it is slower
> and I'd rather just hold the stronger lock and do it fast, and as a hot-fix
> outside a maintenance window I usually wouldn't want to hog the CPU with
> concurrent builds when I could do them sequentially instead.  Also, since
> deadlocks are "expected" errors rather than "should never happen" errors,
> and since the docs don't promise that you can do parallel CIC without
> deadlocks, many people would probably shrug it off (which I initially did)
> rather than report it as a bug.  I was looking into it as an enhancement
> rather than a bug until I discovered that it was already enhanced and then
> undone.
>
>
FWIW, yes I agree it is a rather rare use case.  For me, it's for doing
concurrent index builds on logical replica tables, especially after the
initial copy of non-indexed tables, where replicated tables may have traffic
coming in constantly.  That means the DBA doesn't have to wait for hours for
them to build one by one, and also doesn't have to worry about long locks.

However IIRC, we have also run into deadlocks before when trying to build
multiple indexes in an OLTP system, which may have been due to this issue
as opposed to only trying to index the same table.


Converting plpgsql to use DTYPE_REC for named composite types

2017-12-27 Thread Tom Lane
Those with long memories will recall that for some time now I've been
arguing that plpgsql should be changed to treat all composite-type
variables (both "record" and named composite types) via its DTYPE_REC
code paths, instead of the current situation where it handles variables of
named composite types via its DTYPE_ROW code paths.  DTYPE_ROW is great
for what it was meant for, which is to allow multiple target variables
in usages like "SELECT ... INTO a, b, c" to be represented by a single
PLpgSQL_datum.  It's not great for composite-type variables.  There are
two really fundamental problems with doing things that way:

* DTYPE_ROW cannot represent a simple NULL value of the composite datum;
the closest it can manage is to set the component variables to nulls,
which is equivalent to ROW(NULL, NULL, ...), which is not the same thing
as a simple NULL (a short sketch of this follows below).  We've had
complaints about this before, eg:
https://www.postgresql.org/message-id/flat/52A9010D.3070202%40ultimeth.com
https://www.postgresql.org/message-id/flat/20120920210519.241290%40gmx.com

* If the composite type changes, say by adding or renaming a column,
plpgsql cannot hope to cope with that without fully recompiling the
function, since its symbol table (list of PLpgSQL_datums) would have
to change.  Currently it doesn't even try.  We've had complaints about
that too:
https://www.postgresql.org/message-id/flat/CAL870DWGgkr7W6ZW%3DjGeqE4bHi0E%3DTLwjcQx95C9pYAVL3%3DU%3DQ%40mail.gmail.com
https://www.postgresql.org/message-id/flat/CA%2BTgmoYDf7dkXhKtk7u_YnAfSt47SgK5J8gD8C1JfSiouU194g%40mail.gmail.com
https://www.postgresql.org/message-id/flat/CAFj8pRC8_-Vppe_sx7%2Bjn-4UQ_YVXGdhWP5O0rtmv6qZxShmFg%40mail.gmail.com
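
To make the first problem concrete, here is a minimal sketch (not taken
from the linked threads; type and variable names are arbitrary):

create type two_ints as (a int, b int);

do $$
declare
    v two_ints := null;
begin
    -- With the DTYPE_ROW representation, assigning NULL to "v" really
    -- just nulls the individual fields, so "v" behaves like
    -- ROW(NULL, NULL) rather than a simple NULL composite value.
    raise notice 'v = %, (v is null) = %', v, (v is null);
end
$$;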

A slightly lesser problem is that we've been discouraged from adding
features like allowing composite-typed variables to be given initializers
or marked CONSTANT because we'd have to fix two completely different
code paths, doubling the work needed.  This also applies to the issue
of fixing plpgsql to support domains over composites correctly, which
is what got me interested in doing something about it now.

Hence, attached is a proposed patch to convert plpgsql to use DTYPE_REC
for all user-declared composite-typed variables.  DTYPE_ROW remains, but
is used only for the purpose of collecting lists of target variables.

Since the main objection that's been raised to this change in the past
is possible performance loss in some cases, I've gone to considerable
trouble to try to minimize such losses.  Work remains to be done for
most of the feature issues mentioned above, though this does fix the
issue of allowing composite-typed variables to be simple NULLs.

(I did yield to the temptation to allow plpgsql functions to take
arguments declared as just "record", since there seems no good reason
why that's still disallowed.)

I believe I've gotten things to the point where this is acceptable from
a performance standpoint.  Testing various microbenchmarks on variables
declared as named composites, some things are faster and some are slower,
but nothing seems to get more than circa 5% worse.  The same benchmarks
on variables declared as "record" are uniformly improvements from HEAD,
by significant margins ranging up to 2X faster.  The worst issues I've
noted are with trivial trigger functions, where there's a noticeable
increase in startup overhead.  I have some basically-independent patches
that can buy that back, but since this patch is more than large enough,
I'll post those as a separate thread.

The core idea of the patch is to introduce an implementation of composite
values as "expanded objects", extending the infrastructure that we used
to improve performance of plpgsql array variables in commit 1dc5ebc90
and follow-on patches.  So far, only plpgsql has been taught about such
objects --- later it might be interesting to extend some of the core
operations such as FieldSelect to deal with them explicitly.

My plan is that expandedrecord.c will grow the ability to store values of
domains-over-composite, including the ability to apply domain constraint
checks during assignments.  That's not there yet, though some comments
make reference to it, and a few bits of code are ready for it.

Also, this expands the idea I had in commit 687f096ea to get the typcache
to assign unique-within-a-process numbers to different versions of a
composite type, so that dependent code can easily recognize that a change
has happened.  Now, the numbers are unique within a process across all
composite types, rather than being just unique per type.  Since they're
64-bit counters there seems no risk of wraparound within the lifespan
of a backend process.

I added a bunch of new regression test cases.  Those are mainly meant
to ensure adequate test coverage of the new code.  Running those cases
on HEAD shows no behavioral changes except the expected ones around
handling of composite NULLs and the addition of RECORD as an allowed
argument type.

I'll stick this into the January commitfest.

Re: [HACKERS] pow support for pgbench

2017-12-27 Thread Robert Haas
On Tue, Dec 26, 2017 at 4:45 PM, Raúl Marín Rodríguez
 wrote:
> I've implemented the overflow checks and made some benchmarks and the ipow()
> version became slower except with some specific inputs (base 0 for example).
> It's true that the new auxiliary functions could be optimized, but I don't
> think it makes sense to keep working on them just to match pow() speed.
>
> I'm attaching both patches in case someone wants to have a look but I would
> go with the simpler solution (pgbench_pow_v10.patch).

Committed the simpler solution after fixing it so that it compiles.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: How to Works with Centos

2017-12-27 Thread Robert Haas
On Mon, Dec 25, 2017 at 4:07 PM, Michael Paquier
 wrote:
> On Mon, Dec 25, 2017 at 06:48:09PM -0500, Jaime Casanova wrote:
>> so you have two options:
>>
>> 1) use the packages from yum.postgresql.org for a supported version
>> 2) get commercial support for your out-of-community-support verssion
>>
>> but even if you do 2, that would be a preparatory step  looking
>> forward to upgrade to a newer version
>
> You need to think long-term here. The product that you are developing
> and/or maintaining will need to stay around for a couple of years as
> well; those are years where you should keep up with the community support
> window of 5 years for a major version of PostgreSQL. That's what I do
> on the stuff I work with, and the outcome is much better at the end
> as there is no need to finish with a private fork of an out-of-support
> version, normally at least.

+1.  Someone with whom I was speaking recently mentioned that they
were upgrading from 9.2 to 9.6, and that sounds like a pretty good
plan to me if you want to have a release that will be supported for a
while but is thought to be stable now.  We're still finding bugs in
v10 at a slightly alarming rate, but that will, I hope, settle down
over the next few months.  Meanwhile, 9.6 has been out for a year and,
as far as I've seen, all indications seem to be that it's a pretty
solid release.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PathNameCreateTemporaryDir() vs concurrency

2017-12-27 Thread Robert Haas
On Tue, Dec 26, 2017 at 3:11 AM, Thomas Munro
 wrote:
> While testing parallel hash join today, I saw a couple of errors like this:
>
> 2017-12-26 23:34:37.402 NZDT [13082] ERROR:  cannot create temporary
> subdirectory "base/pgsql_tmp/pgsql_tmp13080.0.sharedfileset": File
> exists
>
> There is a thinko in PathNameCreateTemporaryDir(), a new function
> added in commit dc6c4c9d.  It was designed to tolerate directories
> existing already but forgot to handle it in the third mkdir call, so
> it fails with unlucky timing.  Please see attached.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Comment typo in postgres_fdw.c

2017-12-27 Thread Robert Haas
On Wed, Dec 27, 2017 at 4:35 AM, Etsuro Fujita
 wrote:
> Here is a small patch for fixing $Subject: s/it's/its/.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: AS OF queries

2017-12-27 Thread Peter van Hardenberg
On Wed, Dec 27, 2017 at 7:37 AM, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
>
> On 27.12.2017 00:52, Jeff Janes wrote:
>
> On Thu, Dec 21, 2017 at 6:00 AM, Konstantin Knizhnik <
> k.knizh...@postgrespro.ru> wrote:
>
>
>> There is still one significant difference of my prototype implementation
>> with SQL standard: it associates timestamp with select statement, not with
>> particular table.
>> It seems to be more difficult to support and I am not sure that joining
>> tables from different timelines has much sense.
>> But certainly it also can be fixed.
>
>
> I think the main use I would find for this feature is something like:
>
> select * from foo except select * from foo as old_foo as of '';
>
>
Just a quick report from the world of ORMs and web applications.

Today the idiomatic approach for an ORM like Ruby on Rails is to support
temporal(ish) queries using three additional TIMESTAMP_TZ columns:
"created_at", "updated_at" and "deleted_at". This idiom is bundled up into
a plugin called "acts_as_paranoid" (See:
https://github.com/rubysherpas/paranoia). We used this extensively at
Heroku in our production
code for auditability reasons.

In general, this gets implemented on a per-table basis and usually has no
expiry short of manual cleanup. (It would be interesting to contemplate how
an end-user would clean up a table without losing their entire history in
the event of some kind of bug or bloat.)

I think a quality PostgreSQL-core implementation would be a fantastic
enhancement, though it would obviously introduce a bunch of interesting
decisions around how to handle things like referential integrity.

Personally, I frequently used these columns to query for things like "how
many users were created in each of the last twelve months", and the ability
to index on those dates was often important.
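
For the record, the kind of query I mean is nothing fancy (table and
column names follow the Rails convention described above):

-- "how many users were created in each of the last twelve months",
-- ignoring soft-deleted rows
select date_trunc('month', created_at) as month, count(*)
from users
where created_at >= now() - interval '12 months'
  and deleted_at is null
group by 1
order by 1;

-- and the index that made such queries cheap
create index on users (created_at);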

I'm confident that if this feature made it into PostgreSQL there would be
interested people in downstream communities that would take advantage of it.

Hope all that helps,

-- 
Peter van Hardenberg
San Francisco, California
"Everything was beautiful, and nothing hurt."—Kurt Vonnegut


Contributing some code

2017-12-27 Thread Antonio Belloni
Hi,

This is my first post on the list. My name is Antonio. I am a CS grad
student and my field of study is about databases and information retrieval.
To get some practical knowledge, I've been studying Postgresql codebase for
a while.

Now I would like to contribute some code and I've chosen the following
topic of the TODO list:

Allow reporting of which objects are in which tablespaces

This item is difficult because a tablespace can contain objects from
multiple databases. There is a server-side function that returns the
databases which use a specific tablespace, so this requires a tool that
will call that function and connect to each database to find the objects in
each database for that tablespace.
The topic suggests using pg_tablespace_databases to discover which
databases are using a specific tablespace and then connecting to each
database to find the objects in its tablespaces.

I checked the code of pg_tablespace_databases, defined in
src/backend/utils/adt/misc.c, and see that it uses a much simpler
approach: it just reads the tablespace directories and returns the names
of the directories that represent database OIDs.

Although the function works as expected, I can see some issues not
addressed in the code:

- It does not check for permissions. Any user can execute it.
- It does not check if the platform supports symlinks, which can cause an
  error because the function is trying to follow the links defined in
  base/pg_tblspc.

I could use the same approach and write a function that goes down one more
level in the directory structure and finds the objects' OIDs inside each
database directory, but I don't know if this is the best way to do that.

Please, could someone give me feedback and help me with this topic?

Regards,
Antonio Belloni


plpgsql function startup-time improvements

2017-12-27 Thread Tom Lane
Attached are patches for two performance-improvement ideas that came
to me while working on
https://www.postgresql.org/message-id/8962.1514399...@sss.pgh.pa.us
The three patches are logically independent and could be committed in
any order.  But they touch some overlapping code, so as presented,
you need to apply that other patch first and then these two in
sequence.

The motivation for the first patch is that I noticed that for simple
plpgsql functions, especially triggers, the per-datum palloc()s performed
by copy_plpgsql_datum() during function entry amounted to a significant
fraction of the total runtime.  To fix that, the patch simply does one
palloc for the space needed by all datums, relying on a space calculation
performed at the end of function compilation by plpgsql_finish_datums().
This does nothing much for trivial functions with only a few datums, but
for ones with more, it's a worthwhile savings.

BTW, I experimented with a more drastic solution involving separating
the "read only" and "writable" parts of PLpgSQL_datum structs and then
instantiating only the "writable" parts, thus considerably reducing the
amount of data to be copied during per-call initialization.  But that
was a lot more invasive to the code, and it seemed to be slower than
what I present here, because performance-critical accesses to variables
had to compute the addresses of both structs associated with the variable.
I've not totally given up hope for that idea, but it'll require more
tuning than I had time for.

In addition to the core idea of the patch, I noticed that there is no
good reason for PLpgSQL_expr to be treated as a kind of PLpgSQL_datum;
those structs are never members of the estate->datums[] array, nor is
there any code expecting them to be structural supersets of PLpgSQL_datum.
So the patch also removes PLPGSQL_DTYPE_EXPR and the useless fields of
PLpgSQL_expr.

Also, I changed PLpgSQL_var.isconst and PLpgSQL_var.notnull from "int"
to "bool", which is what they should have been all along, and relocated
them in the PLpgSQL_var struct.  There are two motivations for this.
It saves a whole 8 bytes per PLpgSQL_var, at least on 64-bit machines,
because the fields now fit into what had been wasted padding space;
reducing the size of what we have to copy during copy_plpgsql_datums
has to be worth something.  Second, those fields are now adjacent to
the common PLpgSQL_variable fields, which will simplify migrating them
into PLpgSQL_variable, as I anticipate we'll want to do at some point
when we allow composite variables to be marked CONSTANT and maybe NOT
NULL.


The idea of the second patch came from noticing that in simple trigger
functions, quite a large fraction of cycles went into setting up the
"built in" variables such as tg_name, even though many trigger functions
probably never read most of those variables.  We could improve that by
not computing the values until/unless they're read.  There are various
names for this technique, but the one that seemed most evocative to me
was to say that these variables have "promises" attached to them.  So
that's what the patch calls them.  We mark the variables with an enum
indicating which promise needs to be fulfilled for each one, and then
when about to read a datum, we fulfill the promise if needed.

The method I settled on for that was to invent a separate DTYPE_PROMISE,
which otherwise is treated exactly like DTYPE_VAR, and to code places
like exec_eval_datum() like this:
  
switch (datum->dtype)
{
+   case PLPGSQL_DTYPE_PROMISE:
+   /* fulfill promise if needed, then handle like regular var */
+   plpgsql_fulfill_promise(estate, (PLpgSQL_var *) datum);
+ 
+   /* FALL THRU */
+ 
case PLPGSQL_DTYPE_VAR:
{
PLpgSQL_var *var = (PLpgSQL_var *) datum;

The extra DTYPE is a little bit grotty, but it's not awful.  One
alternative I experimented with was to just treat these variables
as plain DTYPE_VAR, requiring coding like

case PLPGSQL_DTYPE_VAR:
{
PLpgSQL_var *var = (PLpgSQL_var *) datum;
  
+   if (unlikely(var->promise != PLPGSQL_PROMISE_NONE))
+   plpgsql_fulfill_promise(estate, var);
+ 
*typeid = var->datatype->typoid;
*typetypmod = var->datatype->atttypmod;
*value = var->value;

However, this way is injecting an additional test-and-branch into
hot code paths, and it was demonstrably slower.

With these patches, I see performance improvements of 10% to 20%
on simple but not totally unrealistic triggers, for example

create or replace function mytrig() returns trigger language plpgsql as
$$
begin
  if (new.f1 != new.f2) or (new.f3 != new.f4) then
new.f3 = 42;
  end if;
  return new;
end$$ stable;

(BTW, those are percentages of total INSERT runtime, not just of
the trigger proper; though I cheated to the extent of using a
temp not regular table.)

It seems possible that the "promis

Re: pgsql: Get rid of copy_partition_key

2017-12-27 Thread Alvaro Herrera
Pushed, thanks.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Unique indexes & constraints on partitioned tables

2017-12-27 Thread Alvaro Herrera
Thanks for the patch.

Amit Langote wrote:

> I mentioned this case at [1] and had a WIP patch to address that.  Please
> find it attached here.  It is to be applied on top of both of your patches.

In this bit:

> + /*
> +  * When specific arbiter indexes requested, only examine them.  
> If
> +  * this is a partition (after a tuple is routed to it from the
> +  * parent into which the original tuple has been inserted), we 
> must
> +  * check the parent index id, instead of our own id, because 
> that's
> +  * the one that appears in the arbiterIndexes list.
> +  */
>   if (arbiterIndexes != NIL &&
> - !list_member_oid(arbiterIndexes,
> -  
> indexRelation->rd_index->indexrelid))
> + !(list_member_oid(arbiterIndexes,
> +   
> indexRelation->rd_index->indexrelid) ||
> +   list_member_oid(arbiterIndexes,
> +   
> indexRelation->rd_index->indparentidx)))

I think this would fail if there is two-level partitioning (or more),
because the index mentioned in the arbiter indexes list would be the
grand-parent index and would not appear in indparentidx.  Maybe what we
need is to map the parent index ids to partition indexes, all the way up
in ExecInsert before calling ExecCheckIndexConstraints, which looks
pretty annoying.
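
For instance, the multi-level case would look roughly like this (using
the syntax these patches are aiming for; unique constraints on
partitioned tables don't exist in core yet, so this is only a sketch):

create table p (a int primary key) partition by range (a);
create table p1 partition of p
    for values from (0) to (1000) partition by range (a);
create table p11 partition of p1
    for values from (0) to (1000);

-- The arbiter index resolved here is p's index, but p11's index only
-- records p1's index in indparentidx, so the membership check quoted
-- above would not match.
insert into p values (1)
    on conflict (a) do nothing;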

Another thing I'm mildly worried about is the use of ModifyState->nominalRelation,
which is said to be "for the use of EXPLAIN"; in this new code, we're
extending its charter so that we're actually relying on it for
execution.  Maybe this is not a problem and we just need to update the
comments (if we believe that we maintain it reliably enough, which is
probably true), but I'm not certain.

I also wonder about short-circuiting the build of the on-conflict stuff
for the parent table in the partitioned case, because surely we don't
need it.  But that seems fairly minor.

Thanks again,

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Reproducible builds: genbki.pl and Gen_fmgrtab.pl

2017-12-27 Thread Tom Lane
Peter Eisentraut  writes:
> On 12/21/17 08:13, Andrew Dunstan wrote:
>> Looks reasonable. Regarding the change to TestLib.pm, we should make
>> sure that the tests have unique names. There is a small amount of
>> duplication currently:
>> 
>> ./src/bin/pg_dump/t/001_basic.pl
>> ./src/bin/pg_rewind/t/001_basic.pl
>> ./src/test/modules/commit_ts/t/001_base.pl
>> ./src/test/modules/test_pg_dump/t/001_base.pl

> But that's one of the reasons one has directories, so you don't need to
> have globally unique names.

I agree that it's not obvious that we need to rename these scripts.
In the current setup for running regression tests, the associated
postmaster logs would go into different subdirectories, so the conflict
in "application" names appearing in log entries doesn't seem like it
would lead to real confusion.

> I don't actually see why the change in
> TestLib.pm is necessary or how it relates to this thread.

Basically, as the code stood:

$ENV{PGAPPNAME} = $0;

running any TAP test in a VPATH build would result in PGAPPNAME becoming a
full path, resulting in different log entries than you get in a non-VPATH
build.  That seemed undesirable to me, so I added basename() to eliminate
the difference in behavior.

There's some room to argue that given that we have duplicate script names,
a (partial) path would be a good thing.  But if you believe that, then we
should attempt to make that happen the same way with or without VPATH.

regards, tom lane



Re: [HACKERS] [PATCH] Lockable views

2017-12-27 Thread Tatsuo Ishii
> I didn't want to change the interface of view_query_is_auto_updatable()
> because this might be called from other third-party software, so I renamed
> this function to view_query_is_auto_updatable_or_lockable() and added the flag
> to this. I created view_query_is_auto_updatable() as a wrapper of this
> function.
> I also made view_query_is_lockable() that returns a different message than
> view_query_is_auto_updatable().
> 
>> On Tue, 17 Oct 2017 11:59:05 +0900 (JST)
>> Tatsuo Ishii  wrote:
>> > 1) Leave as it is (ignore tables appearing in a subquery)
>> > 
>> > 2) Lock all tables including in a subquery
>> > 
>> > 3) Check subquery in the view 
> 
>> > So it seem #1 is the most reasonable way to deal with the problem
>> > assuming that it's user's responsibility to take appropriate locks on
>> > the tables in the subquery.
> 
> I adopted #1 and I didn't change anything about this.

Looks good to me except that the patch lacks documentation and the
regression test needs more cases. For example, it needs a test for
case #1 above: probably using pg_locks to make sure that no locks are
held on the tables appearing in the subquery.
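
Something along these lines could work for that, from a single session
(with the patch applied; the view and table names are placeholders):

begin;
lock table lockable_view in access exclusive mode;

-- the view's base table should show up here, but the tables that only
-- appear in its subquery should not (locks taken by this query itself
-- will also be listed)
select relation::regclass, mode, granted
from pg_locks
where locktype = 'relation'
  and pid = pg_backend_pid();
rollback;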

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] taking stdbool.h into use

2017-12-27 Thread Michael Paquier
On Wed, Dec 27, 2017 at 12:52:55PM -0500, Peter Eisentraut wrote:
> On 12/26/17 23:10, Michael Paquier wrote:
> > It would be nice to do something like that for GinTernaryValue in
> > tsginidx.c by mapping directly to GIN_FALSE and GIN_TRUE depending on
> > the input coming in gin_tsquery_consistent. The fix is more trivial
> > there.
> 
> For GinTernaryValue, I think it's easier to just make it the same size
> as bool, since it doesn't go onto disk.  My earlier patch did that.  I'm
> not sure it's worth adding more code to copy the array around.

But on prairiedog the sizes of bool and char are different, so compilation
would fail, no? checkcondition_gin is used only by
gin_tsquery_consistent, so I think that it is possible to take advantage
of that by using a secondary type of GinChkVal which directly uses a
bool array and converts the check value for the operand to a
GinTernaryValue on-the-fly. I agree that this would make the code
more complex, though, for not much gain on modern platforms.
--
Michael




MCV lists for highly skewed distributions

2017-12-27 Thread Jeff Janes
I want to revive a patch I sent  couple years ago to the performance list,
as I have seen the issue pop up repeatedly since then.

Original thread is here:
https://www.postgresql.org/message-id/CAK_s-G2tc8r0N3AqPr8fW5QsRQMbZNurgAQ%3D_ME1aaf4vOmnnA%40mail.gmail.com

The problem is that if you have a highly skewed distribution, such as
exponential frequencies, it usually stores too few entries in the MCV list.

For example:

create table foo as select floor(-log(random())/log(2))::int as who from
generate_series(1,10000000);

This will usually store 0 to 5 as MCV with the default stat target[1].
Then value 6 is not stored, and its ~75k repetitions get averaged over the
remaining distinct values.  This leads to horrible misplanning for rare
values.  For example, who=17 has 37 entries, but is estimated to have
20,000.  The correct join for 37 values is often not the correct one for
20,000.
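
One easy way to see what actually got stored is to look at pg_stats
after an ANALYZE:

analyze foo;

select most_common_vals, most_common_freqs
from pg_stats
where tablename = 'foo' and attname = 'who';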

If we stored just a few more values, their inclusion in the MCV would mean
they are depleted from the residual count, correctly lowering the estimate
we would get for very rare values not included in the sample.

So instead of having the threshold of 1.25x the average frequency over all
values, I think we should use 1.25x the average frequency of only those
values not already included in the MCV, as in the attached.

As it is, you can partially overcome the too short MCV list by cranking up
the default statistics target, but this is a weak effect and comes at a
high cost of CPU time.  In some of the real distributions I've looked at,
cranking up default statistics target is almost entirely ineffective.
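
Just to spell out what I mean by cranking it up:

alter table foo alter column who set statistics 1000;
analyze foo;

or raising default_statistics_target globally.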

I think that perhaps maxmincount should also use the dynamic
values_cnt_remaining rather than the static one.  After all, things
included in the MCV don't get represented in the histogram.  When I've seen
planning problems due to skewed distributions I also usually see redundant
values in the histogram boundary list which I think should be in the MCV
list instead. But I have not changed that here, pending discussion.

[1] Occasionally it will store a much longer MCV list, because no value
was sampled exactly once, which triggers a different code path in which all
seen values are put in the MCV and the histogram is NULL.  This is not
reliable, as whether the least-sample value is present in the sample once
or twice is pretty brittle.

Cheers,

Jeff
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index f952b3c..b02ac20 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2575,7 +2575,8 @@ compute_scalar_stats(VacAttrStatsP stats,
 		 * the MCV list is not complete, it's generally worth being more
 		 * selective, and not just filling it all the way up to the stats
 		 * target.  So for an incomplete list, we try to take only MCVs that
-		 * are significantly more common than average.
+		 * are significantly more common than average frequency for values which
+		 * will not be represented in the MCV list.
 		 */
 		if (track_cnt == ndistinct && toowide_cnt == 0 &&
 			stats->stadistinct > 0 &&
@@ -2587,6 +2588,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 		else
 		{
 			double		ndistinct_table = stats->stadistinct;
+			double		values_cnt_remaining = (double) values_cnt;
 			double		avgcount,
 		mincount,
 		maxmincount;
@@ -2594,25 +2596,27 @@ compute_scalar_stats(VacAttrStatsP stats,
 			/* Re-extract estimate of # distinct nonnull values in table */
 			if (ndistinct_table < 0)
 ndistinct_table = -ndistinct_table * totalrows;
-			/* estimate # occurrences in sample of a typical nonnull value */
-			avgcount = (double) nonnull_cnt / ndistinct_table;
-			/* set minimum threshold count to store a value */
-			mincount = avgcount * 1.25;
-			if (mincount < 2)
-mincount = 2;
-			/* don't let threshold exceed 1/K, however */
-			maxmincount = (double) values_cnt / (double) num_bins;
-			if (mincount > maxmincount)
-mincount = maxmincount;
-			if (num_mcv > track_cnt)
-num_mcv = track_cnt;
-			for (i = 0; i < num_mcv; i++)
-			{
+			/* estimate # of occurrences in sample of a typical nonnull value */
+  			for (i = 0; i < num_mcv; i++)
+  			{
+avgcount = values_cnt_remaining / ndistinct_table;
+/* set minimum threshold count to store a value */
+mincount = avgcount * 1.25;
+if (mincount < 2)
+	mincount = 2;
+/* don't let threshold exceed 1/K, however */
+maxmincount = (double) values_cnt / (double) (num_bins);
+if (mincount > maxmincount)
+	mincount = maxmincount;
+if (num_mcv > track_cnt)
+	num_mcv = track_cnt;
 if (track[i].count < mincount)
 {
 	num_mcv = i;
 	break;
 }
+values_cnt_remaining -= track[i].count;
+ndistinct_table--;
 			}
 		}
 


Re: [HACKERS] path toward faster partition pruning

2017-12-27 Thread David Rowley
On 22 December 2017 at 17:25, Amit Langote
 wrote:
> Please find attached updated patches.

Hi Amit,

I've just completed a pass over the v17 patch set. I've found a number
of things that need to be addressed. Some might seem a bit nit-picky,
sorry about that. However, many of the others genuinely need to be
addressed.

1. The following code calls RelationGetPartitionQual, but the output
of that is only needed when partition_bound_has_default(boundinfo) is
true.

Can you change this to only get the RelationGetPartitionQual when it's required?

Bitmapset *
get_partitions_from_clauses(Relation relation, int rt_index,
ParamListInfo prmlist, ExprContext *econtext,
List *partclauses)
{
Bitmapset*result;
List*partconstr = RelationGetPartitionQual(relation);

2. The header comment in match_clauses_to_partkey() does not give any
warning that 'clauses' is modified within the function.

The function should create a copy of the clauses before modifying
them. This will save you having to do any list_copy calls when you're
calling the function.

The header comment is also not very clear about what the return value
of the function is.

3. "method" I think should be "strategy". We've pretty much
standardised on that term everywhere else, so let's keep to standard.

/*
* Does the query specify a key to be null or not null?  Partitioning
* handles null partition keys specially depending on the partitioning
* method in use, we store this information.
*/

4. "relation" should be in single quotes, since you're talking about
the parameter named "relation". Likewise with "partclauses", otherwise
it just seems like bad English.

 * Determine the set of partitions of relation that will satisfy all
 * the clauses contained in partclauses

5. partdesc's assignment can be delayed until it's needed. This will
save generating it when constfalse == true

static Bitmapset *
get_partitions_from_clauses_recurse(Relation relation, int rt_index,
List *clauses)
{
PartitionDesc partdesc = RelationGetPartitionDesc(relation);

6. In the following comment, I'd have expected almost identical code,
just with some other List, but the code probably differs a bit too
much to use "Ditto".

/*
* Ditto, but this time or_clauses.
*/

7. Comment claims we use "set union", but we're really just collecting
the members from the other set in:

/*
* Partition sets obtained from mutually-disjunctive clauses are
* combined using set union.
*/
result = bms_add_members(result, arg_partset);

8. These arrays could just be initialized up to partkey->partnatts.
I'd imagine most of the time this will be just 1, so would save
needlessly setting the 31 other elements, although, perhaps it's a bad
idea to optimize this.

memset(keyclauses_all, 0, sizeof(keyclauses_all));
/* false means we don't know if a given key is null */
memset(keyisnull, false, sizeof(keyisnull));
/* false means we don't know if a given key is not null */
memset(keyisnotnull, false, sizeof(keyisnull));

The last two of these could just be Bitmapsets and you'd not need any
memset at all. PARTITION_MAX_KEYS just so happens to be the same as
BITS_PER_BITMAPWORD in a standard build, so you'd always be able to
mark everything with a single bitmap word. This would help a bit in
the various places that you're counting the true elements, for
example, the following code:

for (i = 0; i < partkey->partnatts; i++)
{
if (!keys->keyisnotnull[i])
{
include_def = true;
break;
}
}

could become:

include_def = !bms_is_empty(keys->keyisnotnull);

if you converted all these to Bitmapsets.

9. The following comment would be better read as: /* clause does not
match this partition key */

/* Clause not meant for this column. */

10. The following comment talks about handling less than operators for
hash opfamilies, but the code only handles <> for btree and list
partitioning.

* Handle some cases wherein the clause's operator may not
* belong to the partitioning operator family.  For example,
* operators named '<>' are not listed in any operator
* family whatsoever.  Also, ordering opertors like '<' are
* not listed in the hash operator family.

"opertors" should be spelled "operators"

11. In the following comment "operator" should be "operators":

 * are constrained by clauses containing equality operator, unless hash

Likewise for:

 * IS NULL clauses while remaining have clauses with equality operator.

12. The following code in classify_partition_bounding_keys probably
should use EXPR_MATCHES_PARTKEY.

/* Does leftop match with this partition key column? */
if ((IsA(arg, Var) &&
((Var *) arg)->varattno == partattno) ||
equal(arg, partexpr))

13. The comment for classify_partition_bounding_keys does not seem to
define well what the meaning of the return value is:

 * classify_partition_bounding_keys
 * Classify partition clauses into equal, min, and max keys, along with
 * any Nullness constraints and return that information in the output
 * argument keys (number of keys is the return value)

I had thought a

RE: [HACKERS] Transactions involving multiple postgres foreign servers

2017-12-27 Thread Tsunakawa, Takayuki
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> I've updated documentation of patches, and fixed some bugs. I did some
> failure tests of this feature using a fault simulation tool[1] for
> PostgreSQL that I created.
> 
> 0001 patch adds a mechanism to keep track of writes on the local server. This is
> required to determine whether we should use 2pc at commit. 0002 patch is
> the main part. It adds a distributed transaction manager (currently only
> for atomic commit), APIs for 2pc and foreign transaction manager resolver
> process. 0003 patch makes postgres_fdw support atomic commit using 2pc.
> 
> Please review patches.

I'd like to join the review and testing of this functionality.  First, some 
comments after taking a quick look at 0001:

(1)
Why don't you use the existing global variable MyXactFlags instead of the new 
TransactionDidWrite?  Or, how about using XactLastRecEnd != 0 to determine the 
transaction did any writes?  When the transaction only modified temporary 
tables on the local database and some data on one remote database, I think 2pc 
is unnecessary.

(2)
If TransactionDidWrite is necessary, I don't think you need to provide setter 
functions, because other transaction state variables are accessed globally 
without getter/setters.  And you didn't create getter function for 
TransactionDidWrite.

(3)
heap_multi_insert() doesn't modify TransactionDidWrite.  Is it sufficient to 
just remember heap modifications?  Are other modifications on the coordinator 
node covered, such as TRUNCATE and REINDEX?


Questions before looking at 0002 and 0003:

Q1: Does this functionality work when combined with XA 2pc transactions?

Q2: Does the atomic commit cascade across multiple remote databases? For 
example:
* The local transaction modifies data on remote database 1 via a foreign table.
* A trigger fires on the remote database 1, which modifies data on remote 
database 2 via a foreign table.
* The local transaction commits.

Regards
Takayuki Tsunakawa



Re: Contributing some code

2017-12-27 Thread Craig Ringer
On 28 December 2017 at 01:40, Antonio Belloni 
wrote:

> Hi,
>
> This is my first post on the list. My name is Antonio. I am a CS grad
> student and my field of study is about databases and information retrieval.
> To get some practical knowledge, I've been studying Postgresql codebase for
> a while.
>
> Now I would like to contribute with some code and I've chosen the
> following topic of the TODO list :
>
> Allow reporting of which objects are in which tablespaces
>
> This item is difficult because a tablespace can contain objects from
> multiple databases. There is a server-side function that returns the
> databases which use a specific tablespace, so this requires a tool that
> will call that function and connect to each database to find the objects in
> each database for that tablespace.
> The topic suggests to use the pg_tablespace_databases to discover which
> database is using a specific tablespace and then connect to each database
> and find the objects in the tablespaces.
> I checked the code of pg_tablespace_databases, defined in
> src/backend/utils/adt/misc.c, and see that it uses a much simpler approach
> : It just reads the tablespaces directories and return the name of the
> directories that represents databases OIDs.
> Although the function works as expected, I  can see some issues not
> addressed in the code :
> - It does not check for permissions. Any user can execute it;- It does
> not check if the platform supports symlinks, which can cause an error
> because the function is trying to follow the links defined in
> base/pg_tblspc.
> I could use the same approach and write a function that goes down one more
> level in the directory structure and find the objects' OIDs inside each
> database directory, but I don't know if this is the better way to do that.
>
>

There's a bit of a trap hidden here. The names of relation extents look
like oids, possibly with an extent number for relations bigger than 1GB.
But they aren't.  They're relfilenode numbers.

PostgreSQL maps relation oids to relfilenodes. By default on a new system,
relations will often have the same relfilenode as oid. That's a pity IMO;
it'd be way less confusing if we allocated relfilenodes from a wholly
different counter, because as it is, it gives people the false impression
they can expect the filename relfilenode to be the relation oid.

In fact, what happens (per my probably imperfect understanding) is that
PostgreSQL checks pg_class (via the relcache) for the oid of the table. It
then uses RelationIsMapped to see if it's a normal relation with the
filenode number in pg_class or not. If it's a normal (non-mapped) relation,
it uses the Relation's rd_node to find the relation's physical address
tablespace, dboid, and relfilenode. If it's a mapped relation, it instead
consults the relmapper to find the relation's storage; see
src/backend/utils/cache/relmapper.c .

See also src/backend/storage/smgr/README.

This means you can't determine relation oids from on-disk state without
scanning pg_class. And pg_class is per-database, not a shared relation, so
you must look at each db in turn, since Pg doesn't support cross-DB
queries. Logical decoding handles this with the RelidByRelfilenode
function, but there are issues there around making sure you have the right
snapshot etc.
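
Concretely, the SQL side of such a tool would be something like this
rough sketch: first list the databases per tablespace with the existing
function,

select t.spcname, d.datname
from pg_tablespace t
     cross join lateral pg_tablespace_databases(t.oid) as dbs(dboid)
     join pg_database d on d.oid = dbs.dboid;

and then, connected to each of those databases in turn, resolve objects
from pg_class rather than from the directory names (reltablespace = 0
means "the database's default tablespace", so a complete report also
needs pg_database.dattablespace as a fallback):

select c.oid::regclass as object,
       c.relkind,
       t.spcname,
       pg_relation_filenode(c.oid::regclass) as on_disk_name
from pg_class c
     join pg_tablespace t on t.oid = c.reltablespace
where c.reltablespace <> 0;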

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2017-12-27 Thread David Rowley
I've attached a rebased patch.

The previous patch was conflicting with parallel Hash Join (180428404)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


remove_singleton_appends_2017-12-28.patch
Description: Binary data


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-12-27 Thread Masahiko Sawada
On Thu, Dec 28, 2017 at 11:40 AM, Tsunakawa, Takayuki
 wrote:
> From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
>> I've updated documentation of patches, and fixed some bugs. I did some
>> failure tests of this feature using a fault simulation tool[1] for
>> PostgreSQL that I created.
>>
>> 0001 patch adds a mechanism to keep track of writes on the local server. This is
>> required to determine whether we should use 2pc at commit. 0002 patch is
>> the main part. It adds a distributed transaction manager (currently only
>> for atomic commit), APIs for 2pc and foreign transaction manager resolver
>> process. 0003 patch makes postgres_fdw support atomic commit using 2pc.
>>
>> Please review patches.
>
> I'd like to join the review and testing of this functionality.  First, some 
> comments after taking a quick look at 0001:

Thank you so much!

> (1)
> Why don't you use the existing global variable MyXactFlags instead of the new 
> TransactionDidWrite?  Or, how about using XactLastRecEnd != 0 to determine 
> the transaction did any writes?  When the transaction only modified temporary 
> tables on the local database and some data on one remote database, I think 
> 2pc is unnecessary.

Perhaps we can use (XactLastRecEnd != 0 && markXidCommitted) to see if
we did any writes on the local node that require the atomic commit. Will
fix.

>
> (2)
> If TransactionDidWrite is necessary, I don't think you need to provide setter 
> functions, because other transaction state variables are accessed globally 
> without getter/setters.  And you didn't create getter function for 
> TransactionDidWrite.
>
> (3)
> heap_multi_insert() doesn't modify TransactionDidWrite.  Is it sufficient to 
> just remember heap modifications?  Are other modifications on the coordinator 
> node covered such as TRUNCATEand and REINDEX?

I think using (XactLastRecEnd != 0 && markXidCommitted) to check
if we did any writes on the local node would be better. After that change I
will be able to deal with all the above concerns.

>
>
> Questions before looking at 0002 and 0003:
>
> Q1: Does this functionality work when combined with XA 2pc transactions?

All transactions, including the local transaction and foreign transactions,
are prepared at PREPARE, and all transactions are
committed/rolled back by the foreign transaction resolver process at
COMMIT/ROLLBACK PREPARED.

> Q2: Does the atomic commit cascade across multiple remote databases? For 
> example:
> * The local transaction modifies data on remote database 1 via a foreign 
> table.
> * A trigger fires on the remote database 1, which modifies data on remote 
> database 2 via a foreign table.
> * The local transaction commits.

I've not yet tested more complex failure situations, but as far as I have
tested in my environment, the cascading atomic commit works. I'll test
these cases more deeply.

Regards,

Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: How to Works with Centos

2017-12-27 Thread Asif Naeem
On Tue, Dec 26, 2017 at 4:48 AM, Jaime Casanova <
jaime.casan...@2ndquadrant.com> wrote:

> On 25 December 2017 at 09:39, Benyamin Guedj 
> wrote:
> >
> > Upon doing so, our DevOps team in response insisted (and still insists)
> that
> > we keep using version 9.2 as it is part of the Centos 7 distribution, and
> > they believe that version to be “best practice”, even though PostgreSQL
> 9.2
> > is no longer supported.
> >
> > My question is:
> >
> > Is working with the default distribution’s version (9.2) really the “best
> > practice”, even though it is no longer supported?
> >
>
> clearly no, our versioning page says
> (https://www.postgresql.org/support/versioning/):
> """
> The PostgreSQL project aims to fully support a major release for five
> years. After its end-of-life (EOL) month ends, a major version
> receives one final minor release. After that final minor release, bug
> fixing ceases for that major version.
> """
>
> so, if bug fixing ceases for a non-supported version it's clearly no
> "best practice" to continue using it.
>
> so you have two options:
>
> 1) use the packages from yum.postgresql.org for a supported version
> 2) get commercial support for your out-of-community-support verssion
>

I would like to add here, since your team seems more interested in packages
available from CentOS, that there is a third option which might interest you.
Along with the default PostgreSQL 9.2 package, the CentOS 7 project also
provides PostgreSQL 9.6, but via Software Collections (SCL), which targets
exactly the problem that you are facing, i.e.

https://wiki.centos.org/AdditionalResources/Repositories/SCL
http://mirror.centos.org/centos/7/sclo/x86_64/rh/rh-postgresql96/


I hope that will help. Thanks.

Regards,
Muhammad Asif Naeem


>
> but even if you do 2, that would be a preparatory step  looking
> forward to upgrade to a newer version
>
> --
> Jaime Casanova  www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>


Re: Getting rid of "tuple concurrently updated" elog()s with concurrent DDLs (at least ALTER TABLE)

2017-12-27 Thread Michael Paquier
On Wed, Dec 27, 2017 at 04:53:42PM +0900, Michael Paquier wrote:
> Indeed, this bit is important, and RangeVarGet does so. Looking at
> get_object_address(), it also handles locking of the object. Using that
> as interface to address all the concurrency problems could be appealing,
> and less invasive for a final result as many DDLs are visibly
> impacted (still need to make a list here). What do you think?

So, I have looked at all the object types to spot concurrency problems,
and I have found some more. And the winners are:
- database
- role
- domain
- event trigger
- FDW
- server
- user mapping
- type
- tablespace

Most of the failures show "tuple concurrently updated" with a given set
of ALTER commands running concurrently, but I have also been able to trigger
one "cache lookup failure" for domains, and a duplicate key value
violation for databases.
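
Reduced to plain SQL, the database case from the attached spec looks
like this (two psql sessions on the same server):

-- session 1
begin;
alter database regress_database with allow_connections true;

-- session 2 (this one blocks until session 1 commits)
begin;
alter database regress_database with allow_connections true;

-- session 1
commit;

-- session 2 then fails with: ERROR:  tuple concurrently updated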

I have done some extra checks with parallel ALTER commands for the
following objects as well, without spotting issues:
- access method
- cast
- attribute
- extension
- view
- policy
- publication
- subscription
- rule
- transform
- text search stuff
- statistics
- operator [family], etc.

As looking for those failures manually is no fun, I am attaching a patch
which includes a set of isolation regression tests able to trigger
them. You just need to look for unwelcome ERROR messages and you are
good to go. The goal to move forward will be to take care of all those
failures. Please note that isolation tests don't allow dynamic input, so
those have no tests but you can reproduce an error easily with ALTER
TABLESPACE SET SCHEMA between two sessions. Note as well that the domain
test will fail because the cache lookup error on domains exposes the
domain OID, but that's nothing to care about.

Opinions or thoughts?
--
Michael
diff --git a/src/test/isolation/expected/concurrent-database.out 
b/src/test/isolation/expected/concurrent-database.out
new file mode 100644
index 00..59da795117
--- /dev/null
+++ b/src/test/isolation/expected/concurrent-database.out
@@ -0,0 +1,105 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_begin s1_alterdb_with s2_begin s2_alterdb_with 
s1_commit s2_commit
+step s1_begin: BEGIN;
+step s1_alterdb_with: 
+   ALTER DATABASE regress_database WITH ALLOW_CONNECTIONS true;
+step s2_begin: BEGIN;
+step s2_alterdb_with: 
+   ALTER DATABASE regress_database WITH ALLOW_CONNECTIONS true; 
+step s1_commit: COMMIT;
+step s2_alterdb_with: <... completed>
+error in steps s1_commit s2_alterdb_with: ERROR:  tuple concurrently updated
+step s2_commit: COMMIT;
+
+starting permutation: s1_begin s1_alterdb_with s2_begin s2_alterdb_set s1_commit s2_commit
+step s1_begin: BEGIN;
+step s1_alterdb_with: 
+   ALTER DATABASE regress_database WITH ALLOW_CONNECTIONS true;
+step s2_begin: BEGIN;
+step s2_alterdb_set: 
+   ALTER DATABASE regress_database SET commit_delay = '10';
+step s1_commit: COMMIT;
+step s2_commit: COMMIT;
+
+starting permutation: s1_begin s1_alterdb_with s2_begin s2_alterdb_owner s1_commit s2_commit
+step s1_begin: BEGIN;
+step s1_alterdb_with: 
+   ALTER DATABASE regress_database WITH ALLOW_CONNECTIONS true;
+step s2_begin: BEGIN;
+step s2_alterdb_owner: 
+   ALTER DATABASE regress_database OWNER TO CURRENT_USER;
+step s1_commit: COMMIT;
+step s2_commit: COMMIT;
+
+starting permutation: s1_begin s1_alterdb_with s2_begin s2_alterdb_tbc s1_commit s2_commit
+step s1_begin: BEGIN;
+step s1_alterdb_with: 
+   ALTER DATABASE regress_database WITH ALLOW_CONNECTIONS true;
+step s2_begin: BEGIN;
+step s2_alterdb_tbc: 
+   ALTER DATABASE regress_database SET TABLESPACE TO DEFAULT;
+step s1_commit: COMMIT;
+step s2_commit: COMMIT;
+
+starting permutation: s1_begin s1_alterdb_set s2_begin s2_alterdb_set s1_commit s2_commit
+step s1_begin: BEGIN;
+step s1_alterdb_set: 
+   ALTER DATABASE regress_database SET commit_delay = '10';
+step s2_begin: BEGIN;
+step s2_alterdb_set: 
+   ALTER DATABASE regress_database SET commit_delay = '10'; 
+step s1_commit: COMMIT;
+step s2_alterdb_set: <... completed>
+error in steps s1_commit s2_alterdb_set: ERROR:  duplicate key value violates unique constraint "pg_db_role_setting_databaseid_rol_index"
+step s2_commit: COMMIT;
+
+starting permutation: s1_begin s1_alterdb_set s2_begin s2_alterdb_owner s1_commit s2_commit
+step s1_begin: BEGIN;
+step s1_alterdb_set: 
+   ALTER DATABASE regress_database SET commit_delay = '10';
+step s2_begin: BEGIN;
+step s2_alterdb_owner: 
+   ALTER DATABASE regress_database OWNER TO CURRENT_USER;
+step s1_commit: COMMIT;
+step s2_commit: COMMIT;
+
+starting permutation: s1_begin s1_alterdb_set s2_begin s2_alterdb_tbc s1_commit s2_commit
+step s1_begin: BEGIN;
+step s1_alterdb_set: 
+   ALTER DATABASE regress_database SET commit_delay = '10';
+step s2_begin: BEGIN;
+step s2_alterdb_tbc: 
+   ALTER DATABASE regress_database SET TABLESPACE TO DEFAULT;
+step s1_commit: COMMIT;
+step s2_commit: COMMIT;
+

Re: [JDBC] [HACKERS] Channel binding support for SCRAM-SHA-256

2017-12-27 Thread Michael Paquier
On Wed, Dec 27, 2017 at 09:27:40AM +0900, Michael Paquier wrote:
> On Tue, Dec 26, 2017 at 03:28:09PM -0500, Peter Eisentraut wrote:
>> On 12/22/17 03:10, Michael Paquier wrote:
>>> Second thoughts on 0002 as there is actually no need to move around
>>> errorMessage if the PGconn* pointer is saved in the SCRAM status data
>>> as both are linked. The attached simplifies the logic even more.
>>> 
>> 
>> That all looks pretty reasonable.
> 
> Thanks for the review. Don't you think that the refactoring
> simplifications should be done first though? This would result in
> producing the patch set in reverse order. I'll be fine to produce them
> if need be.

Well, here is a patch set doing things in the reverse order: the refactoring
comes first in 0001 and support for tls-server-end-point is in 0002. Hope
this helps.
--
Michael
From 551f9a037d6e38036998337f703758b41d2e1c72 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 22 Dec 2017 17:04:19 +0900
Subject: [PATCH 1/2] Refactor channel binding code to fetch cbind_data only
 when necessary

As things stand now, channel binding data is fetched from OpenSSL and
saved into the SASL exchange context for any SSL connection attempted
for a SCRAM authentication, resulting in data fetched but not used if no
channel binding is used or if a different channel binding type is used
than what the data is here for.

Refactor the code in such a way that binding data is fetched from the
SSL stack only when a specific channel binding type is used by both the
frontend and the backend. In order to achieve that, save the libpq
connection context directly in the SCRAM exchange state, and add a
dependency to SSL in the low-level SCRAM routines.

This makes the interface in charge of initializing the SCRAM context
cleaner as all its data comes from either PGconn* (for frontend) or
Port* (for the backend).
---
 src/backend/libpq/auth-scram.c   |  34 +++-
 src/backend/libpq/auth.c |  19 +
 src/include/libpq/scram.h|   6 +-
 src/interfaces/libpq/fe-auth-scram.c | 159 +--
 src/interfaces/libpq/fe-auth.c   |  27 +-
 src/interfaces/libpq/fe-auth.h   |  10 +--
 6 files changed, 104 insertions(+), 151 deletions(-)

diff --git a/src/backend/libpq/auth-scram.c b/src/backend/libpq/auth-scram.c
index d52a763457..72973d3789 100644
--- a/src/backend/libpq/auth-scram.c
+++ b/src/backend/libpq/auth-scram.c
@@ -110,10 +110,8 @@ typedef struct
 
const char *username;   /* username from startup packet */
 
+   Port   *port;
charcbind_flag;
-   boolssl_in_use;
-   const char *tls_finished_message;
-   size_t  tls_finished_len;
char   *channel_binding_type;
 
int iterations;
@@ -172,21 +170,15 @@ static char *scram_mock_salt(const char *username);
  * it will fail, as if an incorrect password was given.
  */
 void *
-pg_be_scram_init(const char *username,
-const char *shadow_pass,
-bool ssl_in_use,
-const char *tls_finished_message,
-size_t tls_finished_len)
+pg_be_scram_init(Port *port,
+const char *shadow_pass)
 {
scram_state *state;
boolgot_verifier;
 
state = (scram_state *) palloc0(sizeof(scram_state));
+   state->port = port;
state->state = SCRAM_AUTH_INIT;
-   state->username = username;
-   state->ssl_in_use = ssl_in_use;
-   state->tls_finished_message = tls_finished_message;
-   state->tls_finished_len = tls_finished_len;
state->channel_binding_type = NULL;
 
/*
@@ -209,7 +201,7 @@ pg_be_scram_init(const char *username,
 */
ereport(LOG,
(errmsg("invalid SCRAM verifier 
for user \"%s\"",
-   username)));
+   
state->port->user_name)));
got_verifier = false;
}
}
@@ -220,7 +212,7 @@ pg_be_scram_init(const char *username,
 * authentication with an MD5 hash.)
 */
state->logdetail = psprintf(_("User \"%s\" does not have a valid SCRAM verifier."),
-   state->username);
+   state->port->user_name);
got_verifier = false;
}
}
@@ -242,8 +234,8 @@ pg_be_scram_init(const char *username,
 */
if (!got_verifier)
{
-   mock_scram_verifier(username, &state->iterations, &state->salt

Re: [HACKERS] pgbench more operators & functions

2017-12-27 Thread Fabien COELHO


Hello Teodor,

So, I intend to push this patch in its current state. I saw several objections
from committers in the thread, but it seems those objections have been lifted. Am I right?


Here is a rebase after "pow" addition.
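For anyone wanting to poke at the new constructs, a tiny custom script
could look like the sketch below (assuming the patched pgbench; the
variable names and the ops.sql file name are arbitrary):

    \set x random(1, 10)
    \set y CASE WHEN :x <> 0 THEN 100 / :x ELSE NULL END
    \set in_range CASE WHEN (:x >= 3) AND (:x <= 7) THEN 1 ELSE 0 END
    \set mask (:x << 2) | 1
    SELECT :x AS x, :y AS y, :in_range AS in_range, :mask AS mask;

Something like "pgbench -n -f ops.sql -t 10" then exercises booleans,
comparisons, bitwise operators and CASE in one go.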

--
Fabien.
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 1519fe7..3dd492c 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -904,14 +904,32 @@ pgbench  options  d
  
   Sets variable varname to a value calculated
   from expression.
-  The expression may contain integer constants such as 5432,
+  The expression may contain the NULL constant,
+  boolean constants TRUE and FALSE,
+  integer constants such as 5432,
   double constants such as 3.14159,
   references to variables :variablename,
-  unary operators (+, -) and binary operators
-  (+, -, *, /,
-  %) with their usual precedence and associativity,
-  function calls, and
-  parentheses.
+  operators
+  with their usual SQL precedence and associativity,
+  function calls,
+  SQL CASE generic conditional
+  expressions and parentheses.
+ 
+
+ 
+  Functions and most operators return NULL on
+  NULL input.
+ 
+
+ 
+  For conditional purposes, non-zero numerical values are
+  TRUE, zero numerical values and NULL
+  are FALSE.
+ 
+
+ 
+  When no final ELSE clause is provided to a
+  CASE, the default value is NULL.
  
 
  
@@ -920,6 +938,7 @@ pgbench  options  d
 \set ntellers 10 * :scale
 \set aid (1021 * random(1, 10 * :scale)) % \
(10 * :scale) + 1
+\set divx CASE WHEN :x <> 0 THEN :y/:x ELSE NULL END
 
 

@@ -996,6 +1015,177 @@ pgbench  options  d
   
  
 
+ 
+  Built-In Operators
+
+  
+   The arithmetic, bitwise, comparison and logical operators listed in
+are built into pgbench
+   and may be used in expressions appearing in
+   \set.
+  
+
+  
+   pgbench Operators by increasing precedence
+   
+
+ 
+  Operator
+  Description
+  Example
+  Result
+ 
+
+
+ 
+  OR
+  logical or
+  5 or 0
+  TRUE
+ 
+ 
+  AND
+  logical and
+  3 and 0
+  FALSE
+ 
+ 
+  NOT
+  logical not
+  not false
+  TRUE
+ 
+ 
+  IS [NOT] (NULL|TRUE|FALSE)
+  value tests
+  1 is null
+  FALSE
+ 
+ 
+  ISNULL|NOTNULL
+  null tests
+  1 notnull
+  TRUE
+ 
+ 
+  =
+  is equal
+  5 = 4
+  FALSE
+ 
+ 
+  <>
+  is not equal
+  5 <> 4
+  TRUE
+ 
+ 
+  !=
+  is not equal
+  5 != 5
+  FALSE
+ 
+ 
+  <
+  lower than
+  5 < 4
+  FALSE
+ 
+ 
+  <=
+  lower or equal
+  5 <= 4
+  FALSE
+ 
+ 
+  >
+  greater than
+  5 > 4
+  TRUE
+ 
+ 
+  >=
+  greater or equal
+  5 >= 4
+  TRUE
+ 
+ 
+  |
+  integer bitwise OR
+  1 | 2
+  3
+ 
+ 
+  #
+  integer bitwise XOR
+  1 # 3
+  2
+ 
+ 
+  &
+  integer bitwise AND
+  1 & 3
+  1
+ 
+ 
+  ~
+  integer bitwise NOT
+  ~ 1
+  -2
+ 
+ 
+  <<
+  integer bitwise shift left
+  1 << 2
+  4
+ 
+ 
+  >>
+  integer bitwise shift right
+  8 >> 2
+  2
+ 
+ 
+  +
+  addition
+  5 + 4
+  9
+ 
+ 
+  -
+  subtraction
+  3 - 2.0
+  1.0
+ 
+ 
+  *
+  multiplication
+  5 * 4
+  20
+ 
+ 
+  /
+  division (integer truncates the results)
+  5 / 3
+  1
+ 
+ 
+  %
+  modulo
+  3 % 2
+  1
+ 
+ 
+  -
+  opposite
+  - 2.0
+  -2.0
+ 
+
+   
+  
+ 
+
  
   Built-In Functions
 
@@ -1042,6 +1232,13 @@ pgbench  options  d
5432.0
   
   
+   exp(x)
+   double
+   exponential
+   exp(1.0)
+   2.718281828459045
+  
+  
greatest(a [, ... ] )
double if any a is double, else integer
largest value among arguments
@@ -1063,6 +1260,20 @@ pgbench  options  d
2.1
   
   
+   ln(x)
+   double
+   natural logarithm
+   ln(2.718281828459045)
+   1.0
+  
+  
+   mod(i, j)
+   integer
+   modulo
+   mod(54, 32)
+   22
+  
+  
pi()
double
value of the constant PI
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index 74ffe5e..4104035 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -19,13 +19,17 @@
 PgBenchExpr *expr_parse_result;
 
 static PgBenchExprList *make_elist(PgBenchExpr *exp, PgBenchExprList *list);
+static PgBenchExpr *make_null_constant(void);
+static PgBenchExpr *make_boolean_constant(bool bval);
 static PgBenchExpr *make_integer_constan