Re: [HACKERS] Pluggable storage

2017-10-19 Thread Amit Kapila
On Sat, Oct 14, 2017 at 1:09 AM, Alexander Korotkov
 wrote:
> On Fri, Oct 13, 2017 at 9:41 PM, Robert Haas  wrote:
>>
>> On Fri, Oct 13, 2017 at 1:59 PM, Peter Geoghegan  wrote:
>> >> Fully agreed.
>> >
>> > If we implement that interface, where does that leave EvalPlanQual()?
>
>
> At first glance, it seems that pluggable storage should override
> EvalPlanQualFetch(); the rest of EvalPlanQual() looks quite generic.
>

I think there is more to it.  Currently, EState->es_epqTuple is a
HeapTuple that is filled as part of the EvalPlanQual mechanism and then
used later during the scan.  We need to make it pluggable in some way
so that other heaps can work.  We also need some work for
EvalPlanQualFetchRowMarks, as that also seems to be tightly coupled
with HeapTuple.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pluggable storage

2017-10-19 Thread Amit Kapila
On Fri, Oct 13, 2017 at 1:58 PM, Haribabu Kommi
 wrote:
>
> On Fri, Oct 13, 2017 at 11:55 AM, Robert Haas  wrote:
>>
>> On Thu, Oct 12, 2017 at 8:00 PM, Haribabu Kommi
>>  wrote:
>> > Currently I added a snapshot_satisfies API to find out whether the tuple
>> > satisfies the visibility or not with different types of visibility
>> > routines.
>> > I feel these are somehow enough to develop different storage methods
>> > like UNDO.
>> > The storage methods can decide internally how to provide the visibility.
>> >
>> > + amroutine->snapshot_satisfies[MVCC_VISIBILITY] =
>> > HeapTupleSatisfiesMVCC;
>> > + amroutine->snapshot_satisfies[SELF_VISIBILITY] =
>> > HeapTupleSatisfiesSelf;
>> > + amroutine->snapshot_satisfies[ANY_VISIBILITY] = HeapTupleSatisfiesAny;
>> > + amroutine->snapshot_satisfies[TOAST_VISIBILITY] =
>> > HeapTupleSatisfiesToast;
>> > + amroutine->snapshot_satisfies[DIRTY_VISIBILITY] =
>> > HeapTupleSatisfiesDirty;
>> > + amroutine->snapshot_satisfies[HISTORIC_MVCC_VISIBILITY] =
>> > HeapTupleSatisfiesHistoricMVCC;
>> > + amroutine->snapshot_satisfies[NON_VACUUMABLE_VISIBILTY] =
>> > HeapTupleSatisfiesNonVacuumable;
>> > +
>> > + amroutine->snapshot_satisfiesUpdate = HeapTupleSatisfiesUpdate;
>> > + amroutine->snapshot_satisfiesVacuum = HeapTupleSatisfiesVacuum;
>> >
>> > Currently no changes are carried out in snapshot logic as that is kept
>> > separate
>> > from storage API.
>>
>> That seems like a strange choice of API.  I think it should be more
>> integrated with the scan logic.  For example, if I'm doing an index
>> scan, and I get a TID, then I should be able to just say "here's a
>> TID, give me any tuples associated with that TID that are visible to
>> the scan snapshot".  Then for the current heap it will do
>> heap_hot_search_buffer, and for zheap it will walk the undo chain and
>> return the relevant tuple from the chain.
>
>
> OK, Understood.
> I will check along these lines and come up with storage API's.
>

I think what we need here is a way to register the satisfies function
(SnapshotSatisfiesFunc) in SnapshotData for different storage engines.
That is the core API for deciding visibility with respect to different
storage engines.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] per-session errors after interrupting CLUSTER pg_attrdef

2017-10-19 Thread Justin Pryzby
This was briefly scary but seems to have been limited to my psql session (no
other errors logged).  Issue with catcache (?)

I realized that the backup job I'd kicked off was precluding the CLUSTER from
running, but that CLUSTER was still holding lock and stalling everything else
under the sun.

psql (10.0, server 9.6.5)
...

ts=# CLUSTER pg_attribute USING pg_attribute_relid_attnum_index ;   

^CCancel request sent   


ERROR:  canceling statement due to user request

ts=# \df qci_add*
ERROR:  could not read block 8 in file "base/16400/999225102": read only 0 of 
8192 bytes
ts=# \dt+ pg_att

ts=# \dt+ pg_attrdef
ERROR:  could not read block 8 in file "base/16400/999225102": read only 0 of 
8192 bytes
ts=# ^C
ts=# \q

postgres=# SELECT session_line ln, user_name, error_severity sev, 
left(message,66) , left(query,66) FROM postgres_log_2017_10_19_1700 WHERE 
session_id='59e93953.20c9'  ORDER BY 1,2 DESC ;
 ln | user_name | sev   | left(message,66)                                                     | left(query,66)
----+-----------+-------+----------------------------------------------------------------------+-----------------------------------------------------------------------
  1 | pryzbyj   | LOG   | statement: CLUSTER pg_attribute USING pg_attribute_relid_attnum_in  |
  2 | pryzbyj   | ERROR | canceling statement due to user request                              | CLUSTER pg_attribute USING pg_attribute_relid_attnum_index ;
  3 | pryzbyj   | LOG   | statement: SELECT n.nspname as "Schema",+   p.proname as "Name",+    |
  4 | pryzbyj   | ERROR | could not read block 8 in file "base/16400/999225102": read only 0  | SELECT n.nspname as "Schema",+   p.proname as "Name",+   pg_catalog.
  5 | pryzbyj   | LOG   | statement: SELECT pg_catalog.quote_ident(c.relname) FROM pg_catalo  |
  6 | pryzbyj   | ERROR | could not read block 1 in file "base/16400/999225102": read only 0  | SELECT pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class
  7 | pryzbyj   | LOG   | statement: SELECT pg_catalog.quote_ident(c.relname) FROM pg_catalo  |
  8 | pryzbyj   | ERROR | could not read block 1 in file "base/16400/999225102": read only 0  | SELECT pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class
  9 | pryzbyj   | LOG   | statement: SELECT pg_catalog.quote_ident(c.relname) FROM pg_catalo  |
 10 | pryzbyj   | ERROR | could not read block 1 in file "base/16400/999225102": read only 0  | SELECT pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class
 11 | pryzbyj   | LOG   | statement: SELECT n.nspname as "Schema",+   c.relname as "Name",+    |
 12 | pryzbyj   | ERROR | could not read block 8 in file "base/16400/999225102": read only 0  | SELECT n.nspname as "Schema",+   c.relname as "Name",+   CASE c.relk
(12 rows)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Queuing all tables for analyze after recovery

2017-10-19 Thread Tomasz Ostrowski

Hi.

Some (maybe all) row statistics are lost after the database has 
recovered after a failover. So it's recommended to ANALYZE all databases 
in a cluster after recovery.


Amazon's AWS RDS (their managed SQL databases service) even sends an 
email "consider running analyze if your database is slow" after a 
failover of so-called MultiAZ databases (with fast automatic failover 
for double the price). Funny that they send it for both PostgreSQL and 
Oracle databases, which, I suppose, confuses Oracle DBAs greatly.


And in AWS RDS MultiAZ a failover is pretty common. Minor version 
upgrade - failover. A storage hiccup - failover. Out of memory - failover.


Shouldn't this analyze be queued and all tables analyzed automatically 
after failover by autovacuum daemon? With up to autovacuum_max_workers 
in parallel?


It might save some DBAs from a couple of lost sleeping hours, for sure. 
What do you think? A GUC option? On by default? Maybe even backported, 
but off by default in released versions?


--
Tomasz "Tometzky" Ostrowski


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUBE seems a bit confused about ORDER BY

2017-10-19 Thread Alexander Korotkov
Hi!

On Fri, Oct 20, 2017 at 12:52 AM, Tomas Vondra  wrote:

> I've noticed this suspicious behavior of "cube" data type with ORDER BY,
> which I believe is a bug in the extension (or the GiST index support).
> The following example comes directly from regression tests added by
> 33bd250f (so CC Teodor and Stas, who are mentioned in the commit).
>
> This query should produce results with ordering "ascending by 2nd
> coordinate or upper right corner". To make it clear, I've added the
> "c~>4" expression to the query, otherwise it's right from the test.
>
> test=# SELECT c~>4 "c~>4", * FROM test_cube ORDER BY c~>4 LIMIT 15;
>  c~>4 | c
> --+---
>50 | (30333, 50),(30273, 6)
>75 | (43301, 75),(43227, 43)
>   142 | (19650, 142),(19630, 51)
>   160 | (2424, 160),(2424, 81)
>   171 | (3449, 171),(3354, 108)
>   155 | (18037, 155),(17941, 109)
>   208 | (28511, 208),(28479, 114)
>   217 | (19946, 217),(19941, 118)
>   191 | (16906, 191),(16816, 139)
>   187 | (759, 187),(662, 163)
>   266 | (22684, 266),(22656, 181)
>   255 | (24423, 255),(24360, 213)
>   249 | (45989, 249),(45910, 222)
>   377 | (11399, 377),(11360, 294)
>   389 | (12162, 389),(12103, 309)
> (15 rows)
>
> As you can see, it's not actually sorted by the c~>4 coordinate (but by
> c~>2, which is the last number).
>
> Moreover, disabling index scans fixes the ordering:
>
> test=# set enable_indexscan = off;
> SET
> test=# SELECT c~>4, * FROM test_cube ORDER BY c~>4 LIMIT 15; --
> ascending by 2nd coordinate or upper right corner
>  ?column? | c
> --+---
>50 | (30333, 50),(30273, 6)
>75 | (43301, 75),(43227, 43)
>   142 | (19650, 142),(19630, 51)
>   155 | (18037, 155),(17941, 109)
>   160 | (2424, 160),(2424, 81)
>   171 | (3449, 171),(3354, 108)
>   187 | (759, 187),(662, 163)
>   191 | (16906, 191),(16816, 139)
>   208 | (28511, 208),(28479, 114)
>   217 | (19946, 217),(19941, 118)
>   249 | (45989, 249),(45910, 222)
>   255 | (24423, 255),(24360, 213)
>   266 | (22684, 266),(22656, 181)
>   367 | (31018, 367),(30946, 333)
>   377 | (11399, 377),(11360, 294)
> (15 rows)
>
>
> Seems like a bug somewhere in gist_cube_ops, I guess?
>

+1, that definitely looks like a bug. Thank you for reporting!
I'll take a look at it in a couple of days.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] CUBE seems a bit confused about ORDER BY

2017-10-19 Thread Tomas Vondra
Hi,

I've noticed this suspicious behavior of "cube" data type with ORDER BY,
which I believe is a bug in the extension (or the GiST index support).
The following example comes directly from regression tests added by
33bd250f (so CC Teodor and Stas, who are mentioned in the commit).

This query should produce results with ordering "ascending by 2nd
coordinate or upper right corner". To make it clear, I've added the
"c~>4" expression to the query, otherwise it's right from the test.

test=# SELECT c~>4 "c~>4", * FROM test_cube ORDER BY c~>4 LIMIT 15;
 c~>4 | c
--+---
   50 | (30333, 50),(30273, 6)
   75 | (43301, 75),(43227, 43)
  142 | (19650, 142),(19630, 51)
  160 | (2424, 160),(2424, 81)
  171 | (3449, 171),(3354, 108)
  155 | (18037, 155),(17941, 109)
  208 | (28511, 208),(28479, 114)
  217 | (19946, 217),(19941, 118)
  191 | (16906, 191),(16816, 139)
  187 | (759, 187),(662, 163)
  266 | (22684, 266),(22656, 181)
  255 | (24423, 255),(24360, 213)
  249 | (45989, 249),(45910, 222)
  377 | (11399, 377),(11360, 294)
  389 | (12162, 389),(12103, 309)
(15 rows)

As you can see, it's not actually sorted by the c~>4 coordinate (but by
c~>2, which is the last number).

Moreover, disabling index scans fixes the ordering:

test=# set enable_indexscan = off;
SET
test=# SELECT c~>4, * FROM test_cube ORDER BY c~>4 LIMIT 15; --
ascending by 2nd coordinate or upper right corner
 ?column? | c
--+---
   50 | (30333, 50),(30273, 6)
   75 | (43301, 75),(43227, 43)
  142 | (19650, 142),(19630, 51)
  155 | (18037, 155),(17941, 109)
  160 | (2424, 160),(2424, 81)
  171 | (3449, 171),(3354, 108)
  187 | (759, 187),(662, 163)
  191 | (16906, 191),(16816, 139)
  208 | (28511, 208),(28479, 114)
  217 | (19946, 217),(19941, 118)
  249 | (45989, 249),(45910, 222)
  255 | (24423, 255),(24360, 213)
  266 | (22684, 266),(22656, 181)
  367 | (31018, 367),(30946, 333)
  377 | (11399, 377),(11360, 294)
(15 rows)


Seems like a bug somewhere in gist_cube_ops, I guess?


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Queuing all tables for analyze after recovery

2017-10-19 Thread Vik Fearing
On 10/19/2017 11:26 PM, Tom Lane wrote:
> Vik Fearing  writes:
>> On 10/19/2017 10:54 PM, Tom Lane wrote:
>>> Uh ... recommended by whom?  pg_statistic has exactly the same reliability
>>> guarantees as the rest of the system catalogs.
> 
>> For data statistics, sure.  One thing I'm unhappy about is that
>> pg_stat_all_tables is blank.
> 
> Well, that's because we throw away the stats collector's stats after a
> crash -- or, in the failover case, because the promoted slave has its own
> counters and not the master's.  ANALYZE isn't going to help that at all.

Sure it will.  ANALYZE estimates n_dead_tup, which is often accurate enough
for autovacuum to figure out what to do.
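
For reference, autovacuum's trigger condition is based on exactly that
estimate. Ignoring per-table reloptions, a query along these lines (using
only the standard pg_stat_user_tables and pg_class columns) shows which
tables currently pass the vacuum threshold:

SELECT s.relname, s.n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::float8 * c.reltuples
         AS vacuum_threshold
  FROM pg_stat_user_tables s
  JOIN pg_class c ON c.oid = s.relid
 WHERE s.n_dead_tup >
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::float8 * c.reltuples;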
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Queuing all tables for analyze after recovery

2017-10-19 Thread Tom Lane
Vik Fearing  writes:
> On 10/19/2017 10:54 PM, Tom Lane wrote:
>> Uh ... recommended by whom?  pg_statistic has exactly the same reliability
>> guarantees as the rest of the system catalogs.

> For data statistics, sure.  One thing I'm unhappy about is that
> pg_stat_all_tables is blank.

Well, that's because we throw away the stats collector's stats after a
crash -- or, in the failover case, because the promoted slave has its own
counters and not the master's.  ANALYZE isn't going to help that at all.

The fact that we drop those stats in a crash cycle is probably mostly
an overabundance of caution.  We could likely quit doing that, maybe
with a bit more validation effort when reading the files.

Not sure whether we ought to change anything about the failover case.
It's certainly reasonable for a standby server to have its own stats.

The one case where it might make sense to explicitly discard the counters
is when we do PITR to a previous system state.  That's not too common
though ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Queuing all tables for analyze after recovery

2017-10-19 Thread Tomasz Ostrowski

On 10/19/2017 10:54 PM, Tom Lane wrote:

Uh ... recommended by whom?  pg_statistic has exactly the same reliability
guarantees as the rest of the system catalogs.


Actually I'm not exactly sure what is lost and what is preserved. I'm 
pretty sure that pg_stat_all_tables and similar views turn out with no 
data after a failover.


Also I have some experience with badly performing databases after a 
failover, which went back to normal performance after whole cluster 
analyze. This email from AWS suggests that it's not only me.



I don't deny that there might be cases where this is worth doing, but
it does not seem so likely that it should be part of one's standard
checklist.  Much less something that we should expend a great deal
of effort to automate.


I assumed that the effort here shouldn't be that large. I imagined a 
simple check if the statistics are missing when considering tables for 
analyze by autovacuum. But I'm not a programmer, so I might misestimate 
this effort badly.


--
Regards,
Tomasz "Tometzky" Ostrowski


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Queuing all tables for analyze after recovery

2017-10-19 Thread Vik Fearing
On 10/19/2017 10:54 PM, Tom Lane wrote:
> Tomasz Ostrowski  writes:
>> Some (maybe all) row statistics are lost after the database has 
>> recovered after a failover. So it's recommended to ANALYZE all databases 
>> in a cluster after recovery.
> 
> Uh ... recommended by whom?  pg_statistic has exactly the same reliability
> guarantees as the rest of the system catalogs.
> 
> I don't deny that there might be cases where this is worth doing, but
> it does not seem so likely that it should be part of one's standard
> checklist.  Much less something that we should expend a great deal
> of effort to automate.

For data statistics, sure.  One thing I'm unhappy about is that
pg_stat_all_tables is blank.

An idea I've been throwing around in my head is to have autovacuum work
on tables that have vacuum_count and autovacuum_count both zero (and
likewise for analyze).

This will cause a flurry of activity after failover or crash, but the
alternative is autovacuum not knowing anything about the state of the
tables and allowing massive bloat to potentially occur.

For example, if you have a 1 billion row table, and crash when there are
199,999,999 dead tuples, you currently get to wait for another 200
million to die before anything gets cleaned up.
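
For reference, a rough manual approximation of that idea is to look for
tables whose collector counters are all zero and ANALYZE those after a
failover, e.g. (using only the standard pg_stat_all_tables columns):

SELECT schemaname, relname
  FROM pg_stat_all_tables
 WHERE vacuum_count = 0
   AND autovacuum_count = 0
   AND analyze_count = 0
   AND autoanalyze_count = 0
 ORDER BY schemaname, relname;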
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Queuing all tables for analyze after recovery

2017-10-19 Thread Tom Lane
Tomasz Ostrowski  writes:
> Some (maybe all) row statistics are lost after the database has 
> recovered after a failover. So it's recommended to ANALYZE all databases 
> in a cluster after recovery.

Uh ... recommended by whom?  pg_statistic has exactly the same reliability
guarantees as the rest of the system catalogs.

I don't deny that there might be cases where this is worth doing, but
it does not seem so likely that it should be part of one's standard
checklist.  Much less something that we should expend a great deal
of effort to automate.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Domains and arrays and composites, oh my

2017-10-19 Thread Tom Lane
I wrote:
> Andrew Dunstan  writes:
>> On 09/28/2017 01:02 PM, Tom Lane wrote:
>>>> I do think that treating a function returning a domain-over-composite
>>>> differently from one returning a base composite is a POLA violation.
>>>> We'd be very hard put to explain the reasons for it to an end user.

>>> Do you have any thoughts about how we ought to resolve that?

>> Not offhand. Maybe we need to revisit the decision not to modify the
>> executor at all.

> I think it's more of a parse analysis change: the issue is whether to
> smash a function's result type to base when determining whether it emits
> columns.  Maybe we could just do that in that context, and otherwise leave
> domains alone.

After fooling with that for awhile, I concluded that the only reasonable
path forward is to go ahead and modify the behavior of
get_expr_result_type and sibling routines.  While this fixes the parser
behavior to be pretty much what I think we want, it means that we've got
holes to fill in a lot of other places.  Most of them will manifest as
unexpected "domaintypename is not a composite type" errors, but there
are definitely places where the net effect is to silently fail to enforce
domain constraints against a constructed row value :-(.  In the attached
still-WIP patch, I think that I've got most of the core code fixed, but
there are at least these holes remaining to fill:

* json_populate_record and sibling routines won't enforce domain
constraints; depending on how they're called, you might or might not
get a "not a composite type" error.  This is because they use two
different methods for getting the target type OID depending on whether
the input prototype record is NULL.  Maybe that was a bad idea.
(I'm disinclined to try to fix this code right now since there are
pending bug fixes nearby; better to wait till that dust settles.)

* Ditto for hstore's populate_record, which is pretty much same logic.

* plpgsql mostly seems to work, but not quite 100%: RETURN QUERY will
fail to enforce domain constraints if the return type is domain over
composite.  It also still needs feature extension to handle d-over-c
variables more fully (e.g. allow field assignment).

* I haven't looked at the other PLs much; I believe they will mostly
fail safe with "not a composite type" errors, but I wouldn't swear
that all code paths will.

It seems like this is probably the way forward, but I'm slightly
discouraged by the fact that the patch footprint is getting bigger
and there are paths where we can get domain-enforcement omissions
rather than something more benign.  Still, we had lots of
domain-enforcement omissions in the early days of the existing
domain feature, if memory serves.  Maybe we should just accept
that working through that will be a process.

>> One thought I had was that we could invent a new return
>> type of TYPEFUNC_DOMAIN_COMPOSITE so there would be less danger of a PL
>> just treating it as an unconstrained base type as it might do if it saw
>> TYPEFUNC_COMPOSITE.

> Hmm.  That would be a way of forcing the issue, no doubt ...

I did that, but it turns out not to help much; turns out a lot of the
broken code is doing stuff on the basis of type_is_rowtype(), which
this patch allows to return true for domains over composite.  Maybe
we should undo that and invent a separate type_is_rowtype_or_domain()
function to be used only by repaired code, but that seems pretty ugly :-(

Anyway, PFA an updated patch that also fixes some conflicts with the
already-committed arrays-of-domains patch.

regards, tom lane

diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 245a374..1bd8a58 100644
*** a/src/backend/catalog/pg_inherits.c
--- b/src/backend/catalog/pg_inherits.c
*** has_superclass(Oid relationId)
*** 301,306 
--- 301,311 
  /*
   * Given two type OIDs, determine whether the first is a complex type
   * (class type) that inherits from the second.
+  *
+  * This essentially asks whether the first type is guaranteed to be coercible
+  * to the second.  Therefore, we allow the first type to be a domain over a
+  * complex type that inherits from the second; that creates no difficulties.
+  * But the second type cannot be a domain.
   */
  bool
  typeInheritsFrom(Oid subclassTypeId, Oid superclassTypeId)
*** typeInheritsFrom(Oid subclassTypeId, Oid
*** 314,322 
  	ListCell   *queue_item;
  
  	/* We need to work with the associated relation OIDs */
! 	subclassRelid = typeidTypeRelid(subclassTypeId);
  	if (subclassRelid == InvalidOid)
! 		return false;			/* not a complex type */
  	superclassRelid = typeidTypeRelid(superclassTypeId);
  	if (superclassRelid == InvalidOid)
  		return false;			/* not a complex type */
--- 319,327 
  	ListCell   *queue_item;
  
  	/* We need to work with the associated relation OIDs */
! 	subclassRelid = typeOrDomainTypeRelid(subclassTypeId);
  	if (subclassRelid == 

Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains

2017-10-19 Thread Peter Geoghegan
On Thu, Oct 19, 2017 at 9:03 AM, Peter Geoghegan  wrote:
>> /me studies the problem for a while.
>>
>> What's bothering me about this is: how is cutoff_xid managing to be a
>> new enough transaction ID for this to happen in the first place?  The
>> cutoff XID should certainly be older than anything any current
>> snapshot can see, because it's computed using GetOldestXmin() -- which
>> means that XIDs older than the cutoff can't still be running (except
>> maybe if old_snapshot_threshold is set to a non-default value).

> Now that the problem is fixed by a5736bf7, this test case will prune
> and get an LP_REDIRECT ItemId (not an LP_DEAD one), as we're no longer
> confused about the continuity of HOT chains within heap_prune_chain().

In case I was unclear: the key point here is that it wasn't wrong to
prune the tuples because they might still be visible to somebody's
snapshot. It was wrong to do so in a specific way that could still
happen prior to a5736bf7, leaving only a stub LP_DEAD line pointer,
because index scans needed to do HOT chain traversal in order to get
to the visible tuple that wasn't pruned (the one that was still
frozen). You might say that the HOT chain "appeared to split in two"
from heap_prune_chain()'s perspective, a problem that has nothing to
do with an XID cutoff.

That's why after 20b65522 there was no problem with sequential scans.
There were remaining problems with index scans, though, purely because
of this HOT pruning business -- that was fixed in a5736bf7. (And, as
I've said, problems with "traversal of half-frozen update chains" are
presumed to have existed for routines like EvalPlanQual() -- even more
subtle problems.)

One thing I'm not clear on is if we could or should have limited this
new HeapTupleUpdateXmaxMatchesXmin() stuff to HOT chains, and so not
altered behavior for cross-page cases (for routines like
EvalPlanQualFetch() -- not routines like heap_prune_chain(), that only
ever deal with a single page at a time).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] WIP: BRIN bloom indexes

2017-10-19 Thread Tomas Vondra
Hi,

The BRIN minmax opclasses work well only for data where the column is
somewhat correlated to physical location in a table. So it works great
for timestamps in append-only log tables, for example. When that is not
the case (non-correlated columns) the minmax ranges get very "wide" and
we end up scanning large portions of the tables.

A typical example are columns with types that are inherently random (or
rather non-correlated) like for example UUIDs, IP or MAC addresses, and
so on. But it can just as easily happen even with simple IDs generated
from a sequence.

This WIP patch implements another type of BRIN index, with the "summary"
represented by a bloom filter instead of a [min,max] range for each page
range. The index is of course larger than the
regular "minmax" BRIN index, but it's still orders of magnitude smaller
than a btree index.

Note: This is different from the index type implemented in the "bloom"
extension. Those are not related to BRIN at all, and the index builds a
bloom filter on individual rows (values in different columns).

BTW kudos to everyone who worked on the BRIN infrastructure and API. I
found it very intuitive and well-documented. Adding the new opclass was
extremely straightforward.


To demonstrate this on a small example, consider this table:

CREATE TABLE bloom_test (id uuid, padding text);

INSERT INTO bloom_test
SELECT md5((mod(i,100)/100)::text)::uuid, md5(i::text)
  FROM generate_series(1,2) s(i);

VACUUM ANALYZE bloom_test;

 List of relations
 Schema |Name| Type  | Owner | Size  | Description
++---+---+---+-
 public | bloom_test | table | tomas | 16 GB |
(1 row)

The table was built so that each range contains a relatively small number
of distinct UUID values - this is typical e.g. for user activity with
"bursts" of actions from a particular user separated by long periods of
inactivity (with actions from other users).

Now, let's build BRIN "minmax", BRIN "bloom" and BTREE indexes on the id
column.

create index test_brin_idx on bloom_test
 using brin (id);

create index test_bloom_idx on bloom_test
 using brin (id uuid_bloom_ops);

create index test_btree_idx on bloom_test (id);

which gives us this:

  List of relations
 Schema |  Name  | Type  | Owner |   Table|  Size
++---+---++-
 public | test_bloom_idx | index | tomas | bloom_test | 12 MB
 public | test_brin_idx  | index | tomas | bloom_test | 832 kB
 public | test_btree_idx | index | tomas | bloom_test | 6016 MB
(3 rows)

So yeah - the "bloom" index is larger than the default "minmax" index,
but it's still orders of magnitude smaller than the regular btree one.

Now, what about query performance? Unlike the "minmax" indexes, the
"bloom" filter can only handle equality conditions.

Let's see a query like this:

select * from bloom_test
 where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';

The minmax index produces this plan

QUERY PLAN

 Bitmap Heap Scan on bloom_test
 (cost=390.31..2130910.82 rows=20593 width=49)
 (actual time=197.974..22707.311 rows=2 loops=1)
   Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
   Rows Removed by Index Recheck: 19998
   Heap Blocks: lossy=2061856
   ->  Bitmap Index Scan on test_brin_idx
   (cost=0.00..385.16 rows=5493161 width=0)
   (actual time=133.463..133.463 rows=20619520 loops=1)
 Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
 Planning time: 0.165 ms
 Execution time: 22707.891 ms
(8 rows)

Which is not that great, and the long duration is a direct consequence
of "wide" ranges - the bitmap heap scan had to scan pretty much the
whole table. (A sequential scan takes only about 15 seconds.)

Now, the bloom index:

QUERY PLAN

 Bitmap Heap Scan on bloom_test
 (cost=5898.31..2136418.82 rows=20593 width=49)
 (actual time=24.229..338.324 rows=2 loops=1)
   Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
   Rows Removed by Index Recheck: 2500448
   Heap Blocks: lossy=25984
   ->  Bitmap Index Scan on test_bloom_idx
 (cost=0.00..5893.16 rows=5493161 width=0)
 (actual time=22.811..22.811 rows=259840 loops=1)
 Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
 Planning time: 0.201 ms
 Execution time: 338.946 ms
(8 rows)

So, way better.

For comparison, a simple index scan / bitmap index scan using the btree
take only about ~10ms in this case, but that's mostly thanks to the
whole dataset being entirely in-memory.

There are some remaining open questions.

1) 

Re: [HACKERS] [PATCH] Add ALWAYS DEFERRED option for constraints

2017-10-19 Thread Nico Williams
Rebased (there were conflicts in the SGML files).

Nico
-- 
>From 80d284ecefa22945d507d2822f1f1a195e2af751 Mon Sep 17 00:00:00 2001
From: Nicolas Williams 
Date: Tue, 3 Oct 2017 00:33:09 -0500
Subject: [PATCH] Add ALWAYS DEFERRED option for CONSTRAINTs

and CONSTRAINT TRIGGERs.

This is important so that one can have triggers and constraints that
must run after all of the user/client's statements in a transaction
(i.e., at COMMIT time), so that the user/client may make no further
changes (triggers, of course, still can).
---
 doc/src/sgml/catalogs.sgml | 17 -
 doc/src/sgml/ref/alter_table.sgml  |  4 +-
 doc/src/sgml/ref/create_table.sgml | 10 ++-
 doc/src/sgml/ref/create_trigger.sgml   |  2 +-
 doc/src/sgml/trigger.sgml  |  1 +
 src/backend/bootstrap/bootparse.y  |  2 +
 src/backend/catalog/heap.c |  1 +
 src/backend/catalog/index.c|  8 +++
 src/backend/catalog/information_schema.sql |  8 +++
 src/backend/catalog/pg_constraint.c|  2 +
 src/backend/catalog/toasting.c |  2 +-
 src/backend/commands/indexcmds.c   |  2 +-
 src/backend/commands/tablecmds.c   | 20 +-
 src/backend/commands/trigger.c | 28 +++--
 src/backend/commands/typecmds.c|  3 +
 src/backend/nodes/copyfuncs.c  |  3 +
 src/backend/nodes/outfuncs.c   |  4 ++
 src/backend/parser/gram.y  | 99 ++
 src/backend/parser/parse_utilcmd.c | 46 +-
 src/backend/utils/adt/ruleutils.c  |  4 ++
 src/bin/pg_dump/pg_dump.c  | 31 --
 src/bin/pg_dump/pg_dump.h  |  2 +
 src/bin/psql/describe.c| 34 +++---
 src/bin/psql/tab-complete.c|  4 +-
 src/include/catalog/index.h|  2 +
 src/include/catalog/pg_constraint.h| 42 +++--
 src/include/catalog/pg_constraint_fn.h |  1 +
 src/include/catalog/pg_trigger.h   | 16 ++---
 src/include/commands/trigger.h |  1 +
 src/include/nodes/parsenodes.h |  6 +-
 src/include/utils/reltrigger.h |  1 +
 src/test/regress/input/constraints.source  | 51 +++
 src/test/regress/output/constraints.source | 54 +++-
 33 files changed, 418 insertions(+), 93 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index ef60a58..1bc35dc 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -,6 +,13 @@ SCRAM-SHA-256$iteration count:
  
 
  
+  conalwaysdeferred
+  bool
+  
+  Is the constraint always deferred?
+ 
+
+ 
   convalidated
   bool
   
@@ -6968,6 +6975,13 @@ SCRAM-SHA-256$iteration count:
  
 
  
+  tgalwaysdeferred
+  bool
+  
+  True if constraint trigger is always deferred
+ 
+
+ 
   tgnargs
   int2
   
@@ -7029,7 +7043,8 @@ SCRAM-SHA-256$iteration count:

 When tgconstraint is nonzero,
 tgconstrrelid, tgconstrindid,
-tgdeferrable, and tginitdeferred are
+tgdeferrable, tginitdeferred, and
+tgalwaysdeferred are
 largely redundant with the referenced pg_constraint entry.
 However, it is possible for a non-deferrable trigger to be associated
 with a deferrable constraint: foreign key constraints can have some
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index b4b8dab..fe24521 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -55,7 +55,7 @@ ALTER TABLE [ IF EXISTS ] name
 ALTER [ COLUMN ] column_name SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }
 ADD table_constraint [ NOT VALID ]
 ADD table_constraint_using_index
-ALTER CONSTRAINT constraint_name [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
+ALTER CONSTRAINT constraint_name [ DEFERRABLE | NOT DEFERRABLE | ALWAYS DEFERRED ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
 VALIDATE CONSTRAINT constraint_name
 DROP CONSTRAINT [ IF EXISTS ]  constraint_name [ RESTRICT | CASCADE ]
 DISABLE TRIGGER [ trigger_name | ALL | USER ]
@@ -89,7 +89,7 @@ ALTER TABLE [ IF EXISTS ] name
 
 [ CONSTRAINT constraint_name ]
 { UNIQUE | PRIMARY KEY } USING INDEX index_name
-[ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
+[ DEFERRABLE | NOT DEFERRABLE | ALWAYS DEFERRED ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
 
  
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 2db2e9f..cf1ba1c 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -67,7 +67,7 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
   PRIMARY KEY index_parameters |
   REFERENCES reftable [ ( refcolumn ) ] [ MATCH 

[HACKERS] What is the point of setrefs.c's is_converted_whole_row_reference?

2017-10-19 Thread Tom Lane
AFAICS, setrefs.c's special treatment of "converted whole row references"
is completely pointless.  Why aren't they just treated by the regular
"non var" code paths, thus saving code space and cycles?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Add recovery_min_apply_delay_reconnect recovery option

2017-10-19 Thread Eric Radman
On Tue, Oct 17, 2017 at 12:34:17PM +0900, Michael Paquier wrote:
> On Tue, Oct 17, 2017 at 12:51 AM, Eric Radman  wrote:
> > This administrative compromise is necessary because the WalReceiver is
> > not resumed after a network interruption until all records are read,
> > verified, and applied from the archive on disk.
> 
> I see what you are trying to achieve and that seems worth it. It is
> indeed a waste to not have a WAL receiver online while waiting for a
> delay to be applied.
... 
> If you think about it, no parameters are actually needed. What you
> should try to achieve is to make recoveryApplyDelay() smarter,

This would be even better. Attached is the 2nd version of this patch
that I'm using until an alternate solution is developed.

> Your patch also breaks actually the use case of standbys doing
> recovery using only archives and no streaming

This version disarms recovery_min_apply_delay_reconnect if a primary is
not defined. It also relies on XLogCtl->recoveryWakeupLatch to return if
the WalReceiver is shut down--this does work reliably.

-- 
Eric Radman  |  http://eradman.com
commit 36b5a022241c1ade9dcf5ffc46f926e46f4ee696
Author: Eric Radman 
Date:   Tue Oct 17 19:10:22 2017 -0400

Add recovery_min_apply_delay_reconnect recovery option

'recovery_min_apply_delay_reconnect' allows an administrator to specify
how a standby using 'recovery_min_apply_delay' responds when streaming
replication is interrupted.

Combining these two parameters provides a fixed delay under normal
operation while maintaining some assurance that the standby contains an
up-to-date copy of the WAL.

This administrative compromise is necessary because the WalReceiver is
not resumed after a network interruption until all records are read,
verified, and applied from the archive on disk.

It would be better if a second option was not added, but second delay
parameter provides a workaround for some use cases without complecting
xlog.c.

diff --git a/doc/src/sgml/recovery-config.sgml 
b/doc/src/sgml/recovery-config.sgml
index 4e1aa74c1f..8e395edae0 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -502,6 +502,30 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' 
 # Windows
   
  
 
+ 
+  recovery_min_apply_delay_reconnect 
(integer)
+  
+recovery_min_apply_delay_reconnect recovery 
parameter
+  
+  
+  
+   
+If streaming replication is interrupted while
+recovery_min_apply_delay is set, WAL records will be
+replayed from the archive. After all records have been processed from
+local disk, PostgreSQL will attempt to resume streaming
+and connect to the master.
+   
+   
+This parameter is used to compromise the fixed apply delay in order to
+re-establish streaming. In this way a standby server can be run in fair
+conditions with a long delay (hours or days) while specifying the maximum
+delay that can be expected before the WAL archive is brought back up to
+date with the master after a network failure.
+   
+  
+ 
+
  

 
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index dd028a12a4..6f4c7bf3e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -267,6 +267,7 @@ static TimestampTz recoveryTargetTime;
 static char *recoveryTargetName;
 static XLogRecPtr recoveryTargetLSN;
 static int recovery_min_apply_delay = 0;
+static int recovery_min_apply_delay_reconnect = 0;
 static TimestampTz recoveryDelayUntilTime;
 
 /* options taken from recovery.conf for XLOG streaming */
@@ -5227,6 +5228,7 @@ readRecoveryCommandFile(void)
   *head = NULL,
   *tail = NULL;
boolrecoveryTargetActionSet = false;
+   const char  *hintmsg;
 
 
fd = AllocateFile(RECOVERY_COMMAND_FILE, "r");
@@ -5452,8 +5454,6 @@ readRecoveryCommandFile(void)
}
else if (strcmp(item->name, "recovery_min_apply_delay") == 0)
{
-   const char *hintmsg;
-
if (!parse_int(item->value, &recovery_min_apply_delay, GUC_UNIT_MS,
   &hintmsg))
ereport(ERROR,
@@ -5463,6 +5463,25 @@ readRecoveryCommandFile(void)
 hintmsg ? errhint("%s", 
_(hintmsg)) : 0));
ereport(DEBUG2,

(errmsg_internal("recovery_min_apply_delay = '%s'", item->value)));
+   recovery_min_apply_delay_reconnect = 
recovery_min_apply_delay;
+   }
+   else if (strcmp(item->name, 
"recovery_min_apply_delay_reconnect") == 0)
+

Re: [HACKERS] Interest in a SECURITY DEFINER function current_user stack access mechanism?

2017-10-19 Thread Nico Williams
A bit more about why I want this.

Suppose you have an app like PostgREST (a RESTful, Haskell-coded, HTTP
front-end for PostgreSQL).  PostgREST is basically a proxy for PG access.

Users authenticate to the proxy.  The proxy authenticates to PG with its
own credentials, then it does something like SET ROLE or SET SESSION
AUTHORIZATION to impersonate the user to PG.

Now suppose you want to support impersonation in such a proxy.  That is,
that user "Joe" can impersonate "Jane", for example.  So what you do is
you have the proxy do this:

  -- This is the role authenticated to the proxy:
  SET SESSION AUTHORIZATION <authenticated-role>;

  -- This is the requested impersonation, and succeeds only if the first
  -- role has been GRANTed the second:
  SET SESSION ROLE <impersonated-role>;

Convenient!

Now, if you want to... audit what Joe does, you may want to record both,
the session_user (Joe) and the current_user (Jane) as you write the
audit trail.  Naturally, that's desirable, and it works...

...unless the audit procedures are SECURITY DEFINER.  Then they don't
see the current_user.  They see the session_user, but can't see who the
session user was impersonating.

Hence I want to be able to look back through the [security definer]
function invocation stack to find what current_user was at the
top-level, or even just the calling function's current_user.

Now, since this is a proxy, a workaround is to store the impersonated
role name in an application defined GUC.  But if ever you wanted to give
users direct PG access (one might! it should be possible), then that's
not enough because they can set that GUC.  So it really has to be that
the audit procedures can look up the stack.

(When you give out direct PG access you really want to make those audit
procedures SECURITY DEFINER, so they can do DMLs on tables that the
session_user can't.)
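
To make this concrete, here is a minimal sketch of the audit setup (role,
table, and function names are purely illustrative):

  -- proxy session, as described above:
  SET SESSION AUTHORIZATION joe;   -- authenticated user
  SET ROLE jane;                   -- impersonated user

  CREATE TABLE audit_log (
      at        timestamptz NOT NULL DEFAULT now(),
      logged_in text        NOT NULL,   -- the authenticated role
      acting_as text        NOT NULL,   -- the impersonated role
      action    text        NOT NULL
  );

  CREATE FUNCTION audit(what text) RETURNS void
  LANGUAGE sql SECURITY DEFINER AS $$
      -- Inside a SECURITY DEFINER function, current_user reports the
      -- function's owner rather than "jane", so the impersonated role is
      -- lost; hence the wish for access to the caller's current_user.
      INSERT INTO audit_log (logged_in, acting_as, action)
      VALUES (session_user, current_user, what);
  $$;

  SELECT audit('did something');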

This isn't urgent _for me_, but it is a real problem.

Nico
-- 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-10-19 Thread Andres Freund
On 2017-10-19 14:36:56 +0300, Sokolov Yura wrote:
> > > + init_local_spin_delay();
> > 
> > The way you moved this around has the disadvantage that we now do this -
> > a number of writes - even in the very common case where the lwlock can
> > be acquired directly.
> 
> Excuse me, I don't quite understand.
> Do you complain against init_local_spin_delay placed here?

Yes.


> Placing it in other place will complicate code.


> Or you complain against setting `mask` and `add`?

That seems right.


> In both cases, I think simpler version should be accepted first. It acts
> as algorithm definition. And it already gives measurable improvement.

Well, in scalability. I'm less sure about uncontended performance.



> > > +  * We intentionally do not call finish_spin_delay here, because
> > > the loop
> > > +  * above usually finished by queuing into the wait list on
> > > contention, and
> > > +  * doesn't reach spins_per_delay thereby doesn't sleep inside of
> > > +  * perform_spin_delay. Also, different LWLocks has very different
> > > +  * contention pattern, and it is wrong to update spin-lock
> > > statistic based
> > > +  * on LWLock contention.
> > > +  */
> > 
> > Huh? This seems entirely unconvincing. Without adjusting this here we'll
> > just spin the same way every iteration. Especially for the case where
> > somebody else holds LW_FLAG_LOCKED that's quite bad.
> 
> LWLock's are very different. Some of them are always short-term
> (BufferLock), others are always locked for a long time.

That seems not particularly relevant. The same is true for
spinlocks. The relevant question isn't how long the lwlock is held, it's
how long LW_FLAG_LOCKED is held - and that should only depend on
contention (i.e. bus speed, amount of times put into sleep while holding
lock, etc), not on how long the lock is held.

> I've tried to place this delay into lock itself (it has 2 free bytes),
> but this attempt performed worse.

That seems unsurprising - there's a *lot* of locks, and we'd have to
tune all of them. Additionally there's a bunch of platforms where we do
*not* have free bytes (consider the embedding in BufferTag).


> Now I understand, that delays should be stored in array indexed by
> tranche. But I have no time to test this idea. And I doubt it will give
> cardinally better results (ie > 5%), so I think it is better to accept
> patch in this way, and then experiment with per-tranche delay.

I don't think tranches have any decent predictive power here.


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains

2017-10-19 Thread Peter Geoghegan
On Thu, Oct 19, 2017 at 7:21 AM, Robert Haas  wrote:
> The commit message for a5736bf7 doesn't say anything about a race; it
> just claims that it is fixing traversal of half-frozen update chains,
> without making any reference to how such a thing as a half-frozen
> update chain came to exist in the first place.  I think the commit
> that talks about that race condition is 20b65522.

Well, the race was that the wrong tuple is pruned, because of naivety
in the heap_prune_chain() update chain handling, which is what started
work that became commit a5736bf7. That's by far the worst consequence
of the traversal of half-frozen update chain business that a5736bf7's
commit message describes, since we know it leads to wrong answers to
index scans (and REINDEX fails with "failed to find parent tuple").
It's the only consequence that's been directly observed, but probably
not the only one that's possible. Most other places that suffer the
problem are in obscure-to-users paths like EvalPlanQual().

> /me studies the problem for a while.
>
> What's bothering me about this is: how is cutoff_xid managing to be a
> new enough transaction ID for this to happen in the first place?  The
> cutoff XID should certainly be older than anything any current
> snapshot can see, because it's computed using GetOldestXmin() -- which
> means that XIDs older than the cutoff can't still be running (except
> maybe if old_snapshot_threshold is set to a non-default value).

The problem that we directly observed during pruning, which caused
"failed to find parent tuple" even after the first freeze-the-dead
commit (commit 20b65522), was that the mixture of frozen and unfrozen
tuples in a hot chain caused heap_prune_chain() to make an incorrect
conclusion about the continuity of a HOT chain. We'd prune the wrong
tuples -- it failed to find the visible tuple at the end of the HOT
chain, and respect that that meant it could not prune down to a stub
ItemId (blow away what it *incorrectly* thought was an entire HOT
chain).

Now that the problem is fixed by a5736bf7, this test case will prune
and get an LP_REDIRECT ItemId (not an LP_DEAD one), as we're no longer
confused about the continuity of HOT chains within heap_prune_chain().

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Supporting Windows SChannel as OpenSSL replacement

2017-10-19 Thread Robert Haas
On Thu, Oct 19, 2017 at 1:15 AM, Satyanarayana Narlapuram
 wrote:
> Tom, Robert, Microsoft is interested in supporting windows SChannel for 
> Postgres. Please let know how we can help taking this forward. We would love 
> contributing to this either by enhancing the original patch provided by 
> Heikki, or test the changes on Windows.

That would be great!  I think the first thing to do would be look over
Heikki's comments about what was left to be done and maybe try to do
some of those things.  And then test the result.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization

2017-10-19 Thread Tom Lane
Leon Winter  writes:
> The loops are more complex in reality of course, more like:

> open cursor for select from table1
> loop
> { fetch some entries from cursor
>   call some external application
>   do some crazy complicated calculations based on some user input in the UI *
>   update table2
>   commit
> }

Hm, somehow it's pretty hard to credit that the materialized cursor
is the principal performance bottleneck in this configuration.

> The calculations inside the loop are written in some dynamic high-level
> language and cannot easily be translated into SQL.

You don't really have to --- PG supports functions written in non-SQL
languages.  Not sure if your problem is big enough to justify developing
a new PL interface for $random-4GL-language, but that might be something
to consider.

But, to get back to the original point: exactly what "sizeable performance
impact of declaring a cursor "with hold"" do you think there is?  It
shouldn't be noticeably more expensive than just selecting the rows would
be.  Especially not for a number of rows that wouldn't make the
above-depicted application structure completely untenable.  And for sure
I'm not understanding why you think that materializing the result on the
client side instead would be better.

> Naturally I am now wondering why the postgres cursor/portal is not also
> employing the same trick (at least as an option): Postpone
> materialization of "with hold" cursors until it is required (like a
> commit operation is dispatched).

We already do that, and have done it since the feature was invented,
AFAIR.

FWIW, the primary reason for materializing the cursor contents at commit,
rather than just holding onto the active query as you seem to think would
be better, is that materializing allows us to release locks on the
underlying table(s).  If we kept the active query we'd have to keep those
locks, as well as the query's active snapshot, thus certainly blocking
VACUUM cleanup, and possibly blocking subsequent DDL.

The approach of using a separate connection to read the cursor suffers
from exactly those same problems.  Postgres isn't that happy with very
long-lived transactions (neither is any other RDBMS I'm familiar with).
So really I think that materializing the cursor right away and then
doing your application calculations in whatever chunk size seems
convenient is probably your best bet here.
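
In other words, something along these lines (table and column names
invented for illustration):

BEGIN;
DECLARE c1 CURSOR WITH HOLD FOR SELECT id FROM table1;
COMMIT;   -- cursor result is materialized here, table1's locks are released

-- then, in as many short transactions as is convenient:
BEGIN;
FETCH FORWARD 100 FROM c1;
-- call the external application, do the calculations, then e.g.:
UPDATE table2 SET processed = true WHERE ref_id = 42;
COMMIT;

-- repeat until FETCH returns no rows, then:
CLOSE c1;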

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization

2017-10-19 Thread Geoff Winkless
On 19 October 2017 at 15:06, Leon Winter  wrote:

> The calculations inside the loop are written in some dynamic high-level
> language and cannot easily be translated into SQL.
>

Can you not simply create a second connection to perform the updates?

Geoff


Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains

2017-10-19 Thread Robert Haas
On Wed, Oct 18, 2017 at 5:52 PM, Peter Geoghegan  wrote:
> There is a race where we cannot prune the page, though. That's why we
> had to revisit what I suppose was a tacit assumption, and address its
> problems in the commit that started this thread (commit a5736bf7).

The commit message for a5736bf7 doesn't say anything about a race; it
just claims that it is fixing traversal of half-frozen update chains,
without making any reference to how such a thing as a half-frozen
update chain came to exist in the first place.  I think the commit
that talks about that race condition is 20b65522.

/me studies the problem for a while.

What's bothering me about this is: how is cutoff_xid managing to be a
new enough transaction ID for this to happen in the first place?  The
cutoff XID should certainly be older than anything any current
snapshot can see, because it's computed using GetOldestXmin() -- which
means that XIDs older than the cutoff can't still be running (except
maybe if old_snapshot_threshold is set to a non-default value).  And
that means that the state of any tuple to which
heap_prepare_freeze_tuple() does anything shouldn't be able to change
between the time the HOT-prune is done and the time freezing is
completed.  To put that the other way around, an xmin or xmax that was
still running when we did the HOT prune should also be too new to get
frozen.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization

2017-10-19 Thread David Fetter
On Thu, Oct 19, 2017 at 04:06:47PM +0200, Leon Winter wrote:
> > What other things did you try, and how did they fail?  In particular,
> > what happened when you used
> > 
> > UPDATE table2
> > SET [things based on table1]
> > FROM table1 [qualified] JOIN table2 ON ([conditions])
> 
> well, it is not the ideal way of doing things but then again this SQL is 
> merely
> a consequence of the legacy 4GL language and runtime environment we are 
> running
> (and trying to migrate to Postgres). We have a lot of those SQL structures and
> would prefer not to change all of them to avoid this situation. Currently 
> there
> are also two database backends, one being the old legacy database and the 
> other
> being Postgres and we are extremely limited from the (lacking) capabilities  
> of
> the old database. We are surely planning to change many SQL statements to make
> better use of the database but not at this point for this issue.
> 
> The loops are more complex in reality of course, more like:
> 
> open cursor for select from table1
> loop
> { fetch some entries from cursor
>   call some external application
>   do some crazy complicated calculations based on some user input in the UI *
>   update table2
>   commit
> }
> 
> The calculations inside the loop are written in some dynamic
> high-level language and cannot easily be translated into SQL.
> 
> tl;dr: Existing code base has a lot of these patterns. General
> solution on database (interfacing) level is required.

I don't know quite how to put this, but it's not clear to me that the
difficulties in this situation are things PostgreSQL could resolve
even with much larger development resources than are currently
available.

If you're updating what are perforce small batches of records in the
UI, there's excellent reason to pull only those batches, mark them as
being "in process," process them, then update the marked ones as
"done" or whatever other states they can get to.

As to "crazy complicated calculations," this is what active databases
are all about.  SQL is Turing complete, so you really can do it.

Would you want something that compiles from the user inputs to SQL?
Might that have a more general utility?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization

2017-10-19 Thread Leon Winter
> What other things did you try, and how did they fail?  In particular,
> what happened when you used
> 
> UPDATE table2
> SET [things based on table1]
> FROM table1 [qualified] JOIN table2 ON ([conditions])

well, it is not the ideal way of doing things but then again this SQL is merely
a consequence of the legacy 4GL language and runtime environment we are running
(and trying to migrate to Postgres). We have a lot of those SQL structures and
would prefer not to change all of them to avoid this situation. Currently there
are also two database backends, one being the old legacy database and the other
being Postgres and we are extremely limited by the (lacking) capabilities of
the old database. We are surely planning to change many SQL statements to make
better use of the database but not at this point for this issue.

The loops are more complex in reality of course, more like:

open cursor for select from table1
loop
{ fetch some entries from cursor
  call some external application
  do some crazy complicated calculations based on some user input in the UI *
  update table2
  commit
}

The calculations inside the loop are written in some dynamic high-level
language and cannot easily be translated into SQL.

tl;dr: Existing code base has a lot of these patterns. General solution on
database (interfacing) level is required.




Re: [HACKERS] Cursor With_Hold Performance Workarounds/Optimization

2017-10-19 Thread David Fetter
On Thu, Oct 19, 2017 at 03:20:48PM +0200, Leon Winter wrote:
> Hi,
> 
> I originally brought up this issue on the pgsql-performance mailing list [^] to
> no avail so I am trying again here.
> 
> During implementation of a runtime environment and the adjoining database
> abstraction layer I noticed (like many before me [0] and as correctly mentioned
> in the documentation) the sizeable performance impact of declaring a cursor
> "with hold" for queries with large result sets.
> 
> Our use case very often looks like this:
> 
> open cursor for select from table1
> loop
> { fetch some entries from cursor
>   update table2
>   commit
> }

This seems like a very odd construct based on ideas about databases
that aren't actually true of PostgreSQL, e.g. that joins are
expensive, or that some substantial benefit comes of committing at
some higher frequency than the logical transaction.

What other things did you try, and how did they fail?  In particular,
what happened when you used

UPDATE table2
SET [things based on table1]
FROM table1 [qualified] JOIN table2 ON ([conditions])
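
For concreteness, one possible instance of that shape, with hypothetical table
and column names:

-- Hypothetical columns: table1(id, amount), table2(id, total).
-- The fetch/calculate/update loop collapses into a single statement:
UPDATE table2
   SET total = table2.total + t1.amount
  FROM table1 AS t1
 WHERE table2.id = t1.id
   AND t1.amount > 0;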

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




[HACKERS] Cursor With_Hold Performance Workarounds/Optimization

2017-10-19 Thread Leon Winter
Hi,

I originally brought up this issue on the pgsql-performance mailing list [^] to
no avail so I am trying again here.

During implementation of a runtime environment and the adjoining database
abstraction layer I noticed (like many before me [0] and as correctly mentioned
in the documentation) the sizeable performance impact of declaring a cursor
"with hold" for queries with large result sets.

Our use case very often looks like this:

open cursor for select from table1
loop
{ fetch some entries from cursor
  update table2
  commit
}

During iteration of the result set we commit changes to the database, so we
must make sure to keep the cursor alive. One option is to use "with hold".
Unfortunately the result set is then instantly materialized, which is a huge
performance burden.
In our use case the "commit" of changes often does not affect the iteration set.
Also, the loop might be aborted before the result set was fully read, so we
never needed the whole materialized set anyway.

To work around these problems, we already employ some static analysis to avoid
"with hold" in all situations where there are no commits during the lifetime of
the cursor or portal. For other cursors we choose to use a different database
connection inside the same application to protect the cursors from commit
operations and to avoid costly copy operations (which would occur if they were
used "with hold" on the main database connection).

In an attempt to further minimize the performance impact I am thinking about
employing a lazy "with hold" where I would fetch all the remaining result rows
from a cursor or portal before a commit statement. This way I could at least
have great performance in all conflict-free situations until one arrives at an
impasse. Naturally I am now wondering why the Postgres cursor/portal does not
also employ the same trick (at least as an option): postpone materialization of
"with hold" cursors until it is required (i.e. when a commit operation is
dispatched). A sketch of that idea at the SQL level is shown below.

Probably I am also missing many (internal) aspects, but at that point it might
be possible to optimize further. When, for instance, no changes were made to
the result set of the "with hold" cursor, it need not be materialized at all.
From a previous discussion [1] I heard that one can in fact accomplish that by
using a different database connection, which is one workaround we are using.

I am not sure whether this kind of workaround/optimization work should be done
in the database abstraction/interface layer or in the database itself. Since a
lot of people seem to run into this performance issue, many might profit from
some optimization magic in the database for such use cases. We are very
invested in this performance issue and are happy to resolve it on either level.
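
For illustration, here is the lazy "with hold" idea sketched at the SQL level
(hypothetical cursor name and fetch sizes; in practice our abstraction layer
would drive this):

BEGIN;
-- Plain (non-holdable) cursor: nothing is materialized up front.
DECLARE cur CURSOR FOR SELECT * FROM table1;

FETCH 100 FROM cur;      -- iterate cheaply as long as no commit is needed
-- ... update table2 based on the fetched rows ...

-- Only when a COMMIT becomes unavoidable, drain the cursor first:
FETCH ALL FROM cur;      -- buffer the remaining rows on the client side
COMMIT;                  -- the cursor is gone, but its rows are already fetched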

Regards,
Leon

[^]
https://www.postgresql.org/message-id/20171010122039.2xp4ipqokoke45zk%40bfw-online.de
[0] https://trac.osgeo.org/qgis/ticket/1175
https://stackoverflow.com/questions/33635405/postgres-cursor-with-hold
https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/
[1] https://bytes.com/topic/postgresql/answers/420717-cursors-transactions-why
http://www.postgresql-archive.org/setFetchSize-td4935215.html




Re: [HACKERS] [PATCH] Tests for reloptions

2017-10-19 Thread Alvaro Herrera
Oh, one more thing: be careful when editing parallel_schedule.  There
are constraints on the number of entries in each group; you had added a
20th entry after the comment that the group can only have 19.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH] Tests for reloptions

2017-10-19 Thread Alvaro Herrera
Nikolay Shaplov wrote:
> In a message of 3 October 2017 11:48:43, Michael Paquier wrote:

> I've been thinking a lot, and rereading the patch. When I reread it I've been
> thinking that I would like to add more tests to it now... ;-)
> 
> If the only purpose of tests is to get better coverage, then I would agree
> with you. But I've been thinking that tests should also check that everything
> behaves as expected (or as written in the documentation).
> 
> I would say that option names are a part of the SQL dialect that postgres
> uses, kind of part of the syntax. It is good to be sure that they are still
> supported. So I would add a test for every option heap supports.

Yeah, it would perhaps be a good idea to ensure we don't break things that
are documented to work.  If the tests don't take too long, I'm not
opposed to testing every single option.  As you say, code coverage is
important but it's not the only goal.

I'm hesitant to hardcode things like the number of bits in bloom, as you
had in the original.  If I understand correctly, that number could
change with compile options (different blocksize?), so I removed that
part.  I also fixed a few spelling errors.

And pushed.  Let's see what the buildfarm says about this.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-10-19 Thread Sokolov Yura

On 2017-10-19 02:28, Andres Freund wrote:

On 2017-06-05 16:22:58 +0300, Sokolov Yura wrote:

The algorithm for LWLockWaitForVar is also refactored. The new version is:
1. If the lock is not held by anyone, exit immediately.
2. Otherwise, check whether the WaitList lock can be taken, because the
   variable is protected by it. If so, a CAS is performed, and if it is
   successful, the loop breaks to step 4.
3. Otherwise, a spin_delay is performed, and the loop returns to step 1.
4. var is checked for change.
   If it has changed, we unlock the wait list lock and exit.
   Note: it cannot change in the following steps because we are holding
   the wait list lock.
5. Otherwise, a CAS setting the necessary flags is performed. If it
   succeeds, we queue the proc on the wait list and exit.
6. If the CAS failed, there is a possibility that the LWLock has already
   been released - if so, we should unlock the wait list and exit.
7. Otherwise, the loop returns to step 5.

So the invariant is preserved:
- either we found the LWLock free,
- or we found the variable changed,
- or we set the flags and queued ourselves while the LWLock was held.

spin_delay is not performed at step 7, because we should release the wait
list lock as soon as possible.


That seems unconvincing - by not delaying you're more likely to
*increase* the time till the current locker that holds the lock can
release the lock.


But why? If our CAS wasn't satisfied, then someone else's was, so there
is always overall progress. And if we were to sleep in this loop, other
waiters would spin in the first loop of this function. But given the
concrete usage of LWLockWaitForVar, it is probably not too bad to hold
other waiters in the first loop of this function (they loop until we
release the wait list lock or the lock is released entirely).

I'm in doubt about which is better. May I keep it unchanged for now?


In fact, if a waiter sleeps here, the lock holder will not be able to set
the variable we are waiting for, and therefore will not release the lock.
It is stated in the comment for the loop:

+ * Note: value could not change again cause we are holding WaitList lock.


So delaying here would certainly degrade things.



BTW, I found a small mistake in this place: I forgot to set
LW_FLAG_LOCKED in the state before this CAS. It looks like it wasn't a real
error, because the CAS always failed on the first loop iteration (since the
real `lock->state` already had LW_FLAG_LOCKED set), and after the failed CAS
the state absorbs the value from `lock->state`.
I'll fix it.


The attached version contains the fix.

--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

lwlock_v4.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] A handful of typos in allpaths.c

2017-10-19 Thread Magnus Hagander
On Wed, Oct 18, 2017 at 4:45 AM, David Rowley 
wrote:

> A small patch to fix these is attached.
>

Applied, thanks.

I backpatched the actual error message typo fix. Left the comment alone in
backbranches because it conflicted, so it didn't seem worth it.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [HACKERS] Fix a typo in libpq/auth.c

2017-10-19 Thread Magnus Hagander
On Thu, Oct 19, 2017 at 12:05 PM, Masahiko Sawada 
wrote:

> Hi,
>
> Attached a patch for $subject.
>
> s/RAIDUS/RADIUS/
>

Applied, thanks.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [HACKERS] I've just started working on Full Text Search with version 10 on Ubuntu 16

2017-10-19 Thread Aleksandr Parfenov
On Wed, 18 Oct 2017 22:53:16 -0400
Ronald Jewell  wrote:

> and I'm getting error ...
> 
> ERROR:  could not open extension control file
> "/usr/share/postgresql/10/extension/tsearch2.control": No such file or
> directory
> 
> when I try to create the tsearch2 extension.

Hi,

tsearch2 is an extension which provides an FTS interface for
applications that were developed for old versions of PostgreSQL.

Since version 8.3, full-text search has been part of the PostgreSQL core.
tsearch2 solved the API incompatibility problem in some respects and
was kept for backward-compatibility reasons. It was removed in
version 10.

For more information about the in-core full-text search API, check the
documentation at
https://www.postgresql.org/docs/10/static/textsearch-intro.html
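
For example, a minimal sketch of the in-core API (hypothetical table) covering
what tsearch2 was typically used for:

-- Hypothetical table of documents.
CREATE TABLE docs (id serial PRIMARY KEY, body text);

-- Search with the built-in functions; no extension needed on version 10.
SELECT id
  FROM docs
 WHERE to_tsvector('english', body) @@ to_tsquery('english', 'search & query');

-- Optional expression index to make such searches fast.
CREATE INDEX docs_body_fts_idx ON docs USING gin (to_tsvector('english', body));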

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-10-19 Thread Sokolov Yura

Hi,

On 2017-10-19 03:03, Andres Freund wrote:

Hi,

On 2017-09-08 22:35:39 +0300, Sokolov Yura wrote:

 /*
  * Internal function that tries to atomically acquire the lwlock in the passed
- * in mode.
+ * in mode. If it could not grab the lock, it doesn't puts proc into wait
+ * queue.
  *
- * This function will not block waiting for a lock to become free - that's the
- * callers job.
+ * It is used only in LWLockConditionalAcquire.
  *
- * Returns true if the lock isn't free and we need to wait.
+ * Returns true if the lock isn't free.
  */
 static bool
-LWLockAttemptLock(LWLock *lock, LWLockMode mode)
+LWLockAttemptLockOnce(LWLock *lock, LWLockMode mode)


This has now become a fairly special-cased function; I'm not convinced
it makes much sense with the current naming and functionality.


It just doesn't have to be as complex as LWLockAttemptLockOrQueueSelf.
Their functionality is too different.


+/*
+ * Internal function that tries to atomically acquire the lwlock in the passed
+ * in mode or put it self into waiting queue with waitmode.
+ * This function will not block waiting for a lock to become free - that's the
+ * callers job.
+ *
+ * Returns true if the lock isn't free and we are in a wait queue.
+ */
+static inline bool
+LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode waitmode)
+{
+	uint32		old_state;
+	SpinDelayStatus delayStatus;
+	bool		lock_free;
+	uint32		mask,
+				add;
+
+	AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED);
+
+	if (mode == LW_EXCLUSIVE)
+	{
+		mask = LW_LOCK_MASK;
+		add = LW_VAL_EXCLUSIVE;
+	}
+	else
+	{
+		mask = LW_VAL_EXCLUSIVE;
+		add = LW_VAL_SHARED;
+	}
+
+	init_local_spin_delay();


The way you moved this around has the disadvantage that we now do this -
a number of writes - even in the very common case where the lwlock can
be acquired directly.


Excuse me, I don't quite understand.
Is your objection that init_local_spin_delay is placed here?
Placing it elsewhere would complicate the code.

Or is it about setting `mask` and `add`?
At least it simplifies the loop body. For example, it is simpler to write
specialized Power assembly using already-filled `mask`+`add` registers
(I have a working draft with Power assembly).

In both cases, I think the simpler version should be accepted first. It acts
as the algorithm definition. And it already gives a measurable improvement.
After some testing, an attempt to improve it can be made (assuming the
release will be next autumn, there is enough time for testing and
improvements). I could be mistaken.


+	/*
+	 * We intentionally do not call finish_spin_delay here, because the loop
+	 * above usually finished by queuing into the wait list on contention, and
+	 * doesn't reach spins_per_delay thereby doesn't sleep inside of
+	 * perform_spin_delay. Also, different LWLocks has very different
+	 * contention pattern, and it is wrong to update spin-lock statistic based
+	 * on LWLock contention.
+	 */


Huh? This seems entirely unconvincing. Without adjusting this here we'll
just spin the same way every iteration. Especially for the case where
somebody else holds LW_FLAG_LOCKED that's quite bad.


LWLocks are very different. Some of them are always short-term
(BufferLock), others are always locked for a long time. Some are sparse, and
some are crowded. It is quite bad to tune the single variable `spins_per_delay`
for such different loads.

And it is already tuned by spin-locks. And spin-locks tune it quite
well.


I've tried to place this delay into the lock itself (it has 2 free bytes),
but that attempt performed worse.
Now I understand that delays should be stored in an array indexed by
tranche. But I have no time to test this idea. And I doubt it will give
dramatically better results (i.e. > 5%), so I think it is better to accept the
patch as it is, and then experiment with per-tranche delays.




From e5b13550fc48d62b0b855bedd7fcd5848b806b09 Mon Sep 17 00:00:00 2001
From: Sokolov Yura 
Date: Tue, 30 May 2017 18:54:25 +0300
Subject: [PATCH 5/6] Fix implementation description in a lwlock.c .

---
 src/backend/storage/lmgr/lwlock.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 334c2a2d96..0a41c2c4e2 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -62,16 +62,15 @@
  * notice that we have to wait. Unfortunately by the time we have finished
  * queuing, the former locker very well might have already finished it's
  * work. That's problematic because we're now stuck waiting inside the OS.
-
- * To mitigate those races we use a two phased attempt at locking:
- *	Phase 1: Try to do it atomically, if we succeed, nice
- *	

[HACKERS] I've just started working on Full Text Search with version 10 on Ubuntu 16

2017-10-19 Thread Ronald Jewell
and I'm getting error ...

ERROR:  could not open extension control file
"/usr/share/postgresql/10/extension/tsearch2.control": No such file or
directory

when I try to create the tsearch2 extension.

thanks

Ron


Re: [HACKERS] Supporting Windows SChannel as OpenSSL replacement

2017-10-19 Thread Satyanarayana Narlapuram
Tom, Robert, Microsoft is interested in supporting Windows SChannel for
Postgres. Please let us know how we can help take this forward. We would love
to contribute to this, either by enhancing the original patch provided by Heikki
or by testing the changes on Windows.

Thanks,
Satya

-Original Message-
From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
Sent: Wednesday, October 18, 2017 11:51 AM
To: Robert Haas 
Cc: hlinnaka ; Jeff Janes ; Andreas 
Karlsson ; Martijn van Oosterhout ; 
Magnus Hagander ; PostgreSQL-development 

Subject: Re: [HACKERS] Supporting Windows SChannel as OpenSSL replacement

Robert Haas  writes:
> Heikki, do you have any plans to work more on this?
> Or does anyone else?

FWIW, I have some interest in the Apple Secure Transport patch that is in the 
CF queue, and will probably pick that up at some point if no one beats me to it 
(but it's not real high on my to-do list).
I won't be touching the Windows version though.  I suspect that the folk who 
might be competent to review the Windows code may have correspondingly little 
interest in the macOS patch.  This is a bit of a problem, since it would be 
good for someone to look at both of them, with an eye to whether there are any 
places in our SSL abstraction API that ought to be rethought now that we have 
actual non-OpenSSL implementations to compare to.

regards, tom lane






Re: [HACKERS] show "aggressive" or not in autovacuum logs

2017-10-19 Thread Alvaro Herrera
Kyotaro HORIGUCHI wrote:

> How about the followings?
> 
> "automatic [agressive ]vacuum of table \"%s..."
> "[aggressive ]vacuuming \"%s..."

That form of log message seems acceptable to me (first one is missing a 'g').

In any case, please do not construct the sentence with %s expanding the
word, because that is going to cause a problem for translations.  Use an
'if' test with two full sentences instead.  Yes, as Robert said, it's
going to add more strings, but that's okay.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Fix a typo in libpq/auth.c

2017-10-19 Thread Masahiko Sawada
Hi,

Attached a patch for $subject.

s/RAIDUS/RADIUS/

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_typo_in_auth_c.patch
Description: Binary data



Re: [HACKERS] show "aggressive" or not in autovacuum logs

2017-10-19 Thread Masahiko Sawada
On Tue, Sep 5, 2017 at 3:41 PM, Kyotaro HORIGUCHI
 wrote:
> Thank you for the opinions.
>
> At Tue, 29 Aug 2017 15:00:57 +0900, Masahiko Sawada  
> wrote in 

Re: [HACKERS] Parallel Append implementation

2017-10-19 Thread Amit Khandekar
On 13 October 2017 at 00:29, Robert Haas  wrote:
> On Wed, Oct 11, 2017 at 8:51 AM, Amit Khandekar  
> wrote:
>> [ new patch ]
>
> + parallel_append
> + Waiting to choose the next subplan during Parallel Append 
> plan
> + execution.
> +
> +
>
> Probably needs to update a morerows values of some earlier entry.

From what I observed from the other places, the morerows value is one
less than the number of following entries. I have changed it to 10
since it has 11 entries.

>
> +   enable_parallelappend configuration
> parameter
>
> How about enable_parallel_append?

Done.

>
> + * pa_finished : workers currently executing the subplan. A worker which
>
> The way the colon is used here is not a standard comment style for PostgreSQL.

Changed it to "pa_finished:".

>
> + * Go on to the "next" subplan. If no more subplans, return the empty
> + * slot set up for us by ExecInitAppend.
> + * Note: Parallel-aware Append follows different logic for choosing 
> the
> + * next subplan.
>
> Formatting looks wrong, and moreover I don't think this is the right
> way of handling this comment anyway.  Move the existing comment inside
> the if (!node->padesc) block and leave it unchanged; the else block
> explains the differences for parallel append.

I think the first couple of lines do apply to both parallel-append and
sequential append plans. I have moved the remaining couple of lines
inside the else block.

>
> + *ExecAppendEstimate
> + *
> + *estimates the space required to serialize Append node.
>
> Ugh, this is wrong, but I notice it follows various other
> equally-wrong comments for other parallel-aware node types. I guess
> I'll go fix that.  We are not in serializing the Append node.

I didn't clearly get this. Do you think it should be "space required to
copy the Append node into the shared memory"?

>
> I do not think that it's a good idea to call
> exec_append_parallel_next() from ExecAppendInitializeDSM,
> ExecAppendReInitializeDSM, and ExecAppendInitializeWorker.  We want to
> postpone selecting which plan to run until we're actually ready to run
> that plan.  Otherwise, for example, the leader might seize a
> non-partial plan (if only such plans are included in the Parallel
> Append) when it isn't really necessary for it to do so.  If the
> workers would've reached the plans and started returning tuples to the
> leader before it grabbed a plan, oh well, too bad.  The leader's still
> claimed that plan and must now run it.
>
> I concede that's not a high-probability scenario, but I still maintain
> that it is better for processes not to claim a subplan until the last
> possible moment.  I think we need to initialize as_whichplan as
> PA_INVALID plan and then fix it when ExecProcNode() is called for the
> first time.

Done. Set as_whichplan to PA_INVALID_PLAN in
ExecAppendInitializeDSM(), ExecAppendReInitializeDSM() and
ExecAppendInitializeWorker(). Then when ExecAppend() is called for the
first time, we notice that as_whichplan is PA_INVALID_PLAN, which means
we need to choose the plan.

>
> +if (!IsParallelWorker())
>
> This is not a great test, because it would do the wrong thing if we
> ever allowed an SQL function called from a parallel worker to run a
> parallel query of its own.  Currently that's not allowed but we might
> want to allow it someday.  What we really want to test is whether
> we're the leader for *this* query.  Maybe use a flag in the
> AppendState for that, and set it correctly in
> ExecAppendInitializeWorker.

Done. Set a new AppendState->is_parallel_worker field to true in
ExecAppendInitializeWorker().

>
> I think maybe the loop in exec_append_parallel_next should look more like 
> this:
>
> /* Pick the next plan. */
> state->as_whichplan = padesc->pa_nextplan;
> if (state->as_whichplan != PA_INVALID_PLAN)
> {
> int nextplan = state->as_whichplan;
>
> /* Mark non-partial plans done immediately so that they can't be
> picked again. */
> if (nextplan < first_partial_plan)
> padesc->pa_finished[nextplan] = true;
>
> /* Figure out what plan the next worker should pick. */
> do
> {
> /* If we've run through all the plans, loop back through
> partial plans only. */
> if (++nextplan >= state->as_nplans)
> nextplan = first_partial_plan;
>
> /* No plans remaining or tried them all?  Then give up. */
> if (nextplan == state->as_whichplan || nextplan >= state->as_nplans)
> {
> nextplan = PA_INVALID_PLAN;
> break;
> }
> } while (padesc->pa_finished[nextplan]);
>
> /* Store calculated next plan back into shared memory. */
> padesc->pa_next_plan = nextplan;
> }
>
> This might not be exactly right and the comments may need work, but
> here are a couple of points:
>
> - As you have it coded, the loop exit condition is whichplan !=

Re: [HACKERS] 64-bit queryId?

2017-10-19 Thread Julien Rouhaud
On Thu, Oct 19, 2017 at 1:08 AM, Michael Paquier
 wrote:
> On Thu, Oct 19, 2017 at 4:12 AM, Robert Haas  wrote:
>> On Wed, Oct 18, 2017 at 9:20 AM, Julien Rouhaud  wrote:
>>> With the current pgssHashKey definition, there shouldn't be padding bits,
>>> so it should be safe.  But I wonder if adding an explicit memset() of
>>> the key in pgss_store() could avoid extension authors to have
>>> duplicate entries if they rely on this code, or prevent future issue
>>> in the unlikely case of adding other fields to pgssHashKey.
>>
>> I guess we should probably add additional comment to the definition of
>> pgssHashKey warning of the danger.  I'm OK with adding a memset if
>> somebody can promise me it will get optimized away by all reasonably
>> commonly-used compilers, but I'm not that keen on adding more cycles
>> to protect against a hypothetical danger.
>
> A comment is an adapted answer for me too.

I agree, and I'm perfectly fine with adding a comment around pgssHashKey.

PFA a patch to warn about the danger.
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index b04b4d6ce1..829ee69f51 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -124,7 +124,10 @@ typedef enum pgssVersion
 
 /*
  * Hashtable key that defines the identity of a hashtable entry.  We separate
- * queries by user and by database even if they are otherwise identical.
+ * queries by user and by database even if they are otherwise identical.  Be
+ * careful when adding new fields, tag_hash() is used to compute the hash key,
+ * so we rely on the fact that no padding bit is present in this structure.
+ * Otherwise, we'd have to zero the key variable in pgss_store.
  */
 typedef struct pgssHashKey
 {
