Re: [HACKERS] Proposal: scan key push down to heap [WIP]

2016-10-28 Thread Dilip Kumar
On Wed, Oct 26, 2016 at 12:01 PM, Andres Freund  wrote:
> The gains are quite noticeable in some cases. So if we can make it work
> without noticeable downsides...
>
> What I'm worried about though is that this, afaics, will quite
> noticeably *increase* total cost in cases with a noticeable number of
> columns and a not that selective qual. The reason for that being that
> HeapKeyTest() uses heap_getattr(), whereas upper layers use
> slot_getattr(). The latter "caches" repeated deforms, the former
> doesn't... That'll lead to deforming being essentially done twice, and
> it's quite often already a major cost of query processing.

What about putting a slot reference inside HeapScanDesc?  I know it
will make the heap layer use an executor structure, but it's just a
thought.

I have quickly hacked it up this way, keeping a slot reference in
HeapScanDesc and calling slot_getattr directly inside HeapKeyTest (only
when we have a valid slot, otherwise falling back to heap_getattr), and
measured the worst-case performance you mentioned above.
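
Roughly, the hack looks like this (an illustrative sketch only, with
made-up names, not the actual diff): HeapScanDesc carries an optional
TupleTableSlot pointer set by the executor, and key evaluation prefers
slot_getattr() when that slot already holds the current tuple, so the
deforming work is cached in the slot instead of being redone later by
the upper layers.

static bool
HeapKeyTestWithSlot(HeapTuple tuple, TupleDesc tupdesc, TupleTableSlot *slot,
					int nkeys, ScanKey keys)
{
	int			i;

	for (i = 0; i < nkeys; i++)
	{
		ScanKey		key = &keys[i];
		Datum		attval;
		bool		isnull;

		if (key->sk_flags & SK_ISNULL)
			return false;

		/*
		 * slot_getattr() caches deformed attributes in the slot's
		 * tts_values/tts_isnull arrays; heap_getattr() is the fallback
		 * when no slot is available.
		 */
		if (slot != NULL)
			attval = slot_getattr(slot, key->sk_attno, &isnull);
		else
			attval = heap_getattr(tuple, key->sk_attno, tupdesc, &isnull);

		if (isnull)
			return false;

		if (!DatumGetBool(FunctionCall2Coll(&key->sk_func, key->sk_collation,
											attval, key->sk_argument)))
			return false;
	}

	return true;
}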

My test: a 21-column table with a varchar near the beginning, quals on
the last few columns, and varying selectivity.

postgres=# \d test
       Table "public.test"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 f1     | integer           |
 f2     | character varying |
 f3     | integer           |
 f4     | integer           |
 f5     | integer           |
 f6     | integer           |
 f7     | integer           |
 f8     | integer           |
 f9     | integer           |
 f10    | integer           |
 f11    | integer           |
 f12    | integer           |
 f13    | integer           |
 f14    | integer           |
 f15    | integer           |
 f16    | integer           |
 f17    | integer           |
 f18    | integer           |
 f19    | integer           |
 f20    | integer           |
 f21    | integer           |

tuple count: 10,000,000 (10 million)
explain analyze select * from test where f21 < $1 and f20 < $1 and f19
< $1 and f15 < $1 and f10 < $1;  ($1 varies from 1 million to 10 million.)

Target code base:
---
1. Head
2. Heap_scankey_pushdown_v1
3. My hack for keeping slot reference in HeapScanDesc
(v1+use_slot_in_HeapKeyTest)

Result (execution time in ms):

Selectivity   Head   scan_key_pushdown_v1   v1+use_slot_in_HeapKeyTest
0.1           3880   2980                   2747
0.2           4041   3187                   2914
0.5           5051   4921                   3626
0.8           5378   7296                   3879
1.0           6161   8525                   4575

The performance graph is attached to this mail.

Observation:

1. Heap_scankey_pushdown_v1 starts degrading at very high selectivity
(this behaviour is only visible if the table has 20 or more columns; I
also tested with 10 columns and saw no regression in v1 there).

2. (v1+use_slot_in_HeapKeyTest) is always the winner, even at very high selectivity.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-28 Thread Merlin Moncure
On Fri, Oct 28, 2016 at 3:16 PM, Jim Nasby  wrote:
> On 10/28/16 8:23 AM, Merlin Moncure wrote:
>>
>> On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark  wrote:
>>>
>>> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure 
>>> wrote:
>>>>
>>>> I think we can rule out faulty storage
>>>
>>>
>>> Nobody ever expects the faulty storage
>
>
> LOL
>
>> Believe me, I know.  But the evidence points elsewhere in this case;
>> this is clearly application driven.
>
>
> FWIW, just because it's triggered by specific application behavior doesn't
> mean there isn't a storage bug. That's what makes data corruption bugs such
> a joy to figure out.
>
> BTW, if you haven't already, I would reset all your storage related options
> and GUCs to safe defaults... plain old FSYNC, no cute journal / FS / mount
> options, etc. Maybe this is related to the app, but the most helpful thing
> right now is to find some kind of safe config so you can start bisecting.

Upthread, you might have noticed that I already did that.  Here is
the other evidence:
*) server running fine for 5+ years
*) other database on same cluster not impacted with 10x write activity
*) no interesting logs reported in /var/log/messages, dmesg, etc.
*) SAN fabric turns over petabytes/day with no corruption. 100+
postgres clusters, 1000+ sql server clusters (and that's not
production)
*) storage/network teams have been through everything. nothing
interesting/unusual to report
*) we have an infrequently run routine (posted upthread) that, when
run, crashed the database within minutes
*) after turning on checksums, 30% of invocations of the routine
resulted in checksum errors
*) problem re-occurred after dump-restore and full cluster rebuild
*) checksum errors caused the routine to roll back.  FWICT this
prevented the damage
*) everything is fine now that the routine is not being run anymore

You can come up with your own conclusion; I've come up with mine.  The
only frustrating thing here is that I can't reproduce this outside the
production environment.  If this database goes down I have 30 people
sitting around, so I can't take downtime lightly.

> I would also consider alternatives to plsh, just to rule it out if nothing
> else. I'd certainly look at some way to get sqsh out of the loop (again,
> just to get something that doesn't crash). First idea that comes to mind is
> a stand-alone shell script that watches a named pipe for a filename; when it
> gets that file it runs it with sqsh and does something to signal completion.

I do a lot of ETL to/from SQL Server and it's all sqsh based.  If I
can figure out how to reproduce this in a better way, I'll zero in on
the problem in about 10 minutes.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed of user-defined aggregates using array_append as transfn

2016-10-28 Thread Tom Lane
I wrote:
> 3. We could try to fix it mostly from array_append's side, by teaching it
> to return the expanded array in the aggregation context when it's being
> called as an aggregate function, and making some
> hopefully-not-performance-killing tweaks to nodeAgg to play along with
> that.  This would amount to additional complication in the API contract
> for C-coded aggregate functions, but I think it wouldn't affect functions
> that weren't attempting to invoke the optimization, so it should be OK.
> A larger objection is that it still doesn't do anything for the memory
> consumption angle.

Hm, that actually seems to work, at least from a speed standpoint; see
the attached not-terribly-well-documented patch.  The additions to nodeAgg
are only in places where we're going to be doing datum copying or freeing
anyway.

I'm still worried about chewing up a kilobyte-at-least for each transition
value, but maybe that's something we could leave to fix later.  Another
idea is that we could teach the planner to know about that in its hash
table size estimates.

regards, tom lane

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b06e1c1..e0288ba 100644
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*** advance_transition_function(AggState *ag
*** 791,798 
  
  	/*
  	 * If pass-by-ref datatype, must copy the new value into aggcontext and
! 	 * pfree the prior transValue.  But if transfn returned a pointer to its
! 	 * first input, we don't need to do anything.
  	 */
  	if (!pertrans->transtypeByVal &&
  		DatumGetPointer(newVal) != DatumGetPointer(pergroupstate->transValue))
--- 791,800 
  
  	/*
  	 * If pass-by-ref datatype, must copy the new value into aggcontext and
! 	 * free the prior transValue.  But if transfn returned a pointer to its
! 	 * first input, we don't need to do anything.  Also, if transfn returned a
! 	 * pointer to a R/W expanded object that is already a child of the
! 	 * aggcontext, assume we can adopt that value without copying it.
  	 */
  	if (!pertrans->transtypeByVal &&
  		DatumGetPointer(newVal) != DatumGetPointer(pergroupstate->transValue))
*** advance_transition_function(AggState *ag
*** 800,811 
  		if (!fcinfo->isnull)
  		{
  			MemoryContextSwitchTo(aggstate->aggcontexts[aggstate->current_set]->ecxt_per_tuple_memory);
! 			newVal = datumCopy(newVal,
! 			   pertrans->transtypeByVal,
! 			   pertrans->transtypeLen);
  		}
  		if (!pergroupstate->transValueIsNull)
! 			pfree(DatumGetPointer(pergroupstate->transValue));
  	}
  
  	pergroupstate->transValue = newVal;
--- 802,826 
  		if (!fcinfo->isnull)
  		{
  			MemoryContextSwitchTo(aggstate->aggcontexts[aggstate->current_set]->ecxt_per_tuple_memory);
! 			if (DatumIsReadWriteExpandedObject(newVal,
! 			   false,
! 			   pertrans->transtypeLen) &&
! MemoryContextGetParent(DatumGetEOHP(newVal)->eoh_context) == CurrentMemoryContext)
!  /* do nothing */ ;
! 			else
! newVal = datumCopy(newVal,
!    pertrans->transtypeByVal,
!    pertrans->transtypeLen);
  		}
  		if (!pergroupstate->transValueIsNull)
! 		{
! 			if (DatumIsReadWriteExpandedObject(pergroupstate->transValue,
! 			   false,
! 			   pertrans->transtypeLen))
! DeleteExpandedObject(pergroupstate->transValue);
! 			else
! pfree(DatumGetPointer(pergroupstate->transValue));
! 		}
  	}
  
  	pergroupstate->transValue = newVal;
*** advance_combine_function(AggState *aggst
*** 1053,1060 
  
  	/*
  	 * If pass-by-ref datatype, must copy the new value into aggcontext and
! 	 * pfree the prior transValue.  But if the combine function returned a
! 	 * pointer to its first input, we don't need to do anything.
  	 */
  	if (!pertrans->transtypeByVal &&
  		DatumGetPointer(newVal) != DatumGetPointer(pergroupstate->transValue))
--- 1068,1078 
  
  	/*
  	 * If pass-by-ref datatype, must copy the new value into aggcontext and
! 	 * free the prior transValue.  But if the combine function returned a
! 	 * pointer to its first input, we don't need to do anything.  Also, if the
! 	 * combine function returned a pointer to a R/W expanded object that is
! 	 * already a child of the aggcontext, assume we can adopt that value
! 	 * without copying it.
  	 */
  	if (!pertrans->transtypeByVal &&
  		DatumGetPointer(newVal) != DatumGetPointer(pergroupstate->transValue))
*** advance_combine_function(AggState *aggst
*** 1062,1073 
  		if (!fcinfo->isnull)
  		{
  			MemoryContextSwitchTo(aggstate->aggcontexts[aggstate->current_set]->ecxt_per_tuple_memory);
! 			newVal = datumCopy(newVal,
! 			   pertrans->transtypeByVal,
! 			   pertrans->transtypeLen);
  		}
  		if (!pergroupstate->transValueIsNull)
! 			pfree(DatumGetPointer(pergroupstate->transValue));
  	}
  
  	pergroupstate->transValue = newVal;
--- 1080,1104 
  		if (!fcinfo->isnull)
  		{
  		

Re: [HACKERS] emergency outage requiring database restart

2016-10-28 Thread Jim Nasby

On 10/28/16 8:23 AM, Merlin Moncure wrote:

> On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark  wrote:
>> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure  wrote:
>>> I think we can rule out faulty storage
>>
>> Nobody ever expects the faulty storage

LOL

> Believe me, I know.  But the evidence points elsewhere in this case;
> this is clearly application driven.


FWIW, just because it's triggered by specific application behavior 
doesn't mean there isn't a storage bug. That's what makes data 
corruption bugs such a joy to figure out.


BTW, if you haven't already, I would reset all your storage related 
options and GUCs to safe defaults... plain old FSYNC, no cute journal / 
FS / mount options, etc. Maybe this is related to the app, but the most 
helpful thing right now is to find some kind of safe config so you can 
start bisecting.


I would also consider alternatives to plsh, just to rule it out if 
nothing else. I'd certainly look at some way to get sqsh out of the loop 
(again, just to get something that doesn't crash). First idea that comes 
to mind is a stand-alone shell script that watches a named pipe for a 
filename; when it gets that file it runs it with sqsh and does something 
to signal completion.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed of user-defined aggregates using array_append as transfn

2016-10-28 Thread Andrew Dunstan



On 10/28/2016 03:14 PM, Tom Lane wrote:

> Andrew Dunstan  writes:
>> My initial admittedly ugly thought was why not have a second append
>> function that doesn't use expanded arrays?
>
> That won't get us out of the O(N^2) behavior.  Also I don't see what's
> better about it than my suggestion of making array_append itself do
> that when called as an aggregate function.

Probably nothing, I was just thinking out loud. That suggestion seems
like the most obviously back-patchable solution.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed of user-defined aggregates using array_append as transfn

2016-10-28 Thread Tom Lane
Andrew Dunstan  writes:
> My initial admittedly ugly thought was why not have a second append 
> function that doesn't use expanded arrays?

That won't get us out of the O(N^2) behavior.  Also I don't see what's
better about it than my suggestion of making array_append itself do
that when called as an aggregate function.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed of user-defined aggregates using array_append as transfn

2016-10-28 Thread Andrew Dunstan



On 10/28/2016 02:04 PM, Tom Lane wrote:

> Comments, better ideas?

My initial admittedly ugly thought was why not have a second append 
function that doesn't use expanded arrays?


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to implement pg_current_logfile() function

2016-10-28 Thread Karl O. Pinc
On Fri, 28 Oct 2016 10:03:37 +0200
Gilles Darold  wrote:

> ...
> the v9 of the patch, attached here.

I notice that there are a number of user-supplied GUC
values for log_destination that are repeatedly used,
both in the GUC code and in your patch.  These are
presently written as hardcoded strings.

Attached are 2 patches which abstract the values a
user is supposed to supply for log_destination.


patch_pg_current_logfile-v9.diff.guc_values-part1
This applies to both master HEAD and on top of your v9
patch.  It abstracts the user-supplied values within
the GUC code.

patch_pg_current_logfile-v9.diff.guc_values-part2
This applies on top of your v9 patch.

I couldn't find a good place to put the newly defined symbols
in the existing code so the part1 patch creates
src/include/utils/guc_values.h.  Someone who knows
the code better than me would be better able to judge
if making a new .h file is a good idea.  Likewise, I presume
that a "GUCV_" prefix for the new symbols is good, but this
too could use review.
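
For reference, the new header boils down to something like this (a
sketch of the idea only; the actual patch may differ in names and
contents):

/* src/include/utils/guc_values.h */
#ifndef GUC_VALUES_H
#define GUC_VALUES_H

/* User-visible string values accepted by the log_destination GUC */
#define GUCV_STDERR		"stderr"
#define GUCV_CSVLOG		"csvlog"
#define GUCV_SYSLOG		"syslog"
#define GUCV_EVENTLOG	"eventlog"

#endif   /* GUC_VALUES_H */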

The odd part about the part1 patch is that GUCV_EVENTLOG
is never used anywhere but in src/backend/utils/misc/guc.c.
But it is used twice there and it seemed like as long as
I was doing the rest of the log_destination values I should
abstract eventlog too.

If we use these patches I propose that we keep the
part1 patch and submit it separately to the committers.
Seems like it'd be easier to review/commit when the changes to
existing code are kept separate from new code.

> Thanks a lot.

Thank you also for considering my ideas.  :)

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein


patch_pg_current_logfile-v9.diff.guc_values-part2
Description: Binary data


patch_pg_current_logfile-v9.diff.guc_values-part1
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Speed of user-defined aggregates using array_append as transfn

2016-10-28 Thread Tom Lane
I looked into this complaint:
https://www.postgresql.org/message-id/1477487162344-5927751.p...@n3.nabble.com
which boils down to array_append being a lot slower in 9.5 & up than it
was before.  Now, using array_append as an aggregate transfn has never
been a very good idea, because it's inherently O(N^2) in the number of
input rows.  But the constant factor is something like 15x worse as of
9.5.  For example you can try this simple case:

create aggregate adims(anyelement)
( sfunc = array_append,
  stype = anyarray,
  finalfunc = array_dims
);

In 9.4 I get times like this:

regression=# select adims(x) from generate_series(1,1000) x;
Time: 3.176 ms
regression=# select adims(x) from generate_series(1,10000) x;
Time: 62.792 ms
regression=# select adims(x) from generate_series(1,100000) x;
Time: 7318.999 ms

but 9.5 does this:

regression=# select adims(x) from generate_series(1,1000) x;
Time: 17.532 ms
regression=# select adims(x) from generate_series(1,10000) x;
Time: 1146.686 ms
regression=# select adims(x) from generate_series(1,100000) x;
Time: 113892.993 ms

For comparison, the nominally equivalent array_agg() is much faster,
giving roughly linear performance in either 9.4 or 9.5:

regression=# select array_dims(array_agg(x)) from generate_series(1,1000) x;
Time: 2.197 ms
regression=# select array_dims(array_agg(x)) from generate_series(1,10000) x;
Time: 6.353 ms
regression=# select array_dims(array_agg(x)) from generate_series(1,100000) x;
Time: 53.198 ms

The reason for the slowdown is that in 9.5, array_append prefers to work
on "expanded arrays", and we're converting back and forth between expanded
and flat array datums at each row of the aggregation.  Ick.

We knew that the expanded-array patch would be giving up performance in
some cases to win it in others, but it'd be nice if it weren't giving up
this much in a case that users actually care about.

If we could tell people not to use array_append as a transition function,
then maybe we would not have to solve this, but array_agg_transfn isn't
a very plausible solution for user-defined aggregates because of its
non-type-safe use of an "internal"-type transvalue.  That means you need
superuser privileges to create the aggregate, and you can't code your
final-function in a PL language.

I thought about various ways that we might fix this, but they all have
disadvantages of one sort or another:

1. We could add a code path in array_append that does it the old way
if it's being called as an aggregate function.  This would get us back
to 9.4-ish performance for this case ... not that that's very good.

2. We could teach nodeAgg.c to deal with expanded-object datums
explicitly, in more or less the same way that plpgsql deals with expanded
values in plpgsql variables.  I made a draft patch for this (attached),
and find that it puts this user-defined aggregate on par with array_agg
speed-wise.  But I'm not too happy with it because it adds some cycles to
advance_transition_function() whether or not we're actually dealing with
an expanded object.  We already know that adding even one instruction
there can make for measurable slowdown.  (The draft patch leaves some
things on the table there, in particular we could arrange not to call
MakeExpandedObjectReadOnly if the transvalue isn't of varlena type;
but there's still going to be some added overhead.)  Another problem
with this approach is that in hashed aggregation, an expanded object
per group might eat more storage than you'd like.  We had to teach
array_agg_transfn not to use a separate memory context per aggregation
value, and that issue would come up here too.  Lastly, the draft patch
hard-wires the optimization to be available only to array_append and
array_prepend.  That's the same as what we did in plpgsql, but it wasn't
very nice there and it's not nice here either.

3. We could try to fix it mostly from array_append's side, by teaching it
to return the expanded array in the aggregation context when it's being
called as an aggregate function, and making some
hopefully-not-performance-killing tweaks to nodeAgg to play along with
that.  This would amount to additional complication in the API contract
for C-coded aggregate functions, but I think it wouldn't affect functions
that weren't attempting to invoke the optimization, so it should be OK.
A larger objection is that it still doesn't do anything for the memory
consumption angle.
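
For illustration, the array_append side of #3 might look roughly like
the sketch below (a rough sketch under the stated assumptions, not a
working patch): when called as an aggregate transition function, make
sure the transition value is a read-write expanded array whose private
context is a child of the aggregate context, so that the nodeAgg tweaks
could adopt it instead of flattening and copying it on every row.  The
element-append step itself is unchanged and not shown.

#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"

/*
 * Sketch only: decide where the expanded transition array should live
 * when array_append is used as an aggregate transition function.
 */
static Datum
keep_transvalue_in_aggcontext(FunctionCallInfo fcinfo, Datum arraydatum)
{
	MemoryContext aggcontext;

	if (AggCheckCallContext(fcinfo, &aggcontext) &&
		!VARATT_IS_EXTERNAL_EXPANDED_RW(DatumGetPointer(arraydatum)))
	{
		/*
		 * First call (or a flat input): expand the array into a private
		 * context that is a child of the aggregate context.  expand_array()
		 * returns a read-write expanded-object datum, which nodeAgg could
		 * then recognize as already owned by the aggregate and keep without
		 * the per-row datumCopy()/pfree() cycle.
		 */
		arraydatum = expand_array(arraydatum, aggcontext, NULL);
	}

	return arraydatum;
}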

I have some other ideas about things we could do going forward, but they
don't seem likely to lead to back-patchable fixes.  The memory consumption
problem could be dealt with if we allowed expanded objects to sometimes
not have their own context, but that would involve API changes for
expanded objects.  And I'm wondering whether, if we did fix all this,
we could get rid of ArrayBuildState entirely in favor of making the
functions that use that build an expanded array directly.  And maybe
we could get rid of array_agg_transfn in favor of using array_append,
eliminating 

Re: [HACKERS] Radix tree for character conversion

2016-10-28 Thread David Fetter
On Fri, Oct 28, 2016 at 09:18:08AM -0400, Robert Haas wrote:
> On Thu, Oct 27, 2016 at 3:23 AM, Kyotaro HORIGUCHI
>  wrote:
> > | COPYRIGHT AND PERMISSION NOTICE
> > |
> > | Copyright (c) 1991-2016 Unicode, Inc. All rights reserved.
> > | Distributed under the Terms of Use in 
> > http://www.unicode.org/copyright.html.
> > |
> > | Permission is hereby granted, free of charge, to any person obtaining
> > | a copy of the Unicode data files and any associated documentation
> > | (the "Data Files") or Unicode software and any associated documentation
> > | (the "Software") to deal in the Data Files or Software
> > | without restriction, including without limitation the rights to use,
> > | copy, modify, merge, publish, distribute, and/or sell copies of
> > | the Data Files or Software, and to permit persons to whom the Data Files
> > | or Software are furnished to do so, provided that either
> > | (a) this copyright and permission notice appear with all copies
> > | of the Data Files or Software, or
> > | (b) this copyright and permission notice appear in associated
> > | Documentation.
> >
> > Perhaps we can put the files into our repository by providing some
> > notifications.
> 
> Uggh, I don't much like advertising clauses.

Your dislike is pretty common.

Might it be worth reaching out to the Unicode consortium about this?
They may well have added that as boilerplate without really
considering the effects, and they even have a popup that specifically
addresses licensing.

http://www.unicode.org/reporting.html

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fast Default WIP patch for discussion

2016-10-28 Thread Serge Rielau

> On Oct 28, 2016, at 5:46 AM, Robert Haas  wrote:
> 
> On Fri, Oct 21, 2016 at 7:15 PM, Serge Rielau  wrote:
>> Some key design points requiring discussion:
>> 1. Storage of the “exist” (working name) default
>>   Right now the patch stores the default value in its binary form as it
>> would be in the tuple into a BYTEA.
>>   It would be feasible to store the pretty printed literal, however this
>> requires calling the io functions when a
>>   tuple descriptor is built.
> 
> A pretty-printed literal is a terrible idea because input functions
> need not be immutable (e.g. timestamptz).  
I did not know that! Interesting.

> I would have expected a
> pg_node_tree column containing a CONST node, but your bytea thing
> might be OK.
I was debating following the example given by the current DEFAULT,
but figured it's overkill given that no user will ever have to look at it.
And in the end the pretty-printed CONST node is no more readable.

> 
>> 3. Delayed vs. early expansion of the tuples.
>>To avoid having to decide when to copy tuple descriptors with or without
>> constraints and defaults
>>I have opted to expand the tuple at the core entry points.
>>How do I know I have them all? An omission means wrong results!
> 
> Early expansion seems right.  "How do I know I have them all?" - and
> not only all of the current ones but all future ones - seems like a
> core sticking point for this patch.
Yes, there is a certain amount of inherent, hard-to-control risk towards the
future that we must be willing to accept.

> 
>> 4. attisnull()
>>This routine is used in many places, but to give correct result sit must
>> now be accompanied
>>by the tuple descriptor. This becomes moderately messy and it’s not
>> always clear where to get that.
>>Interestingly most usages are related to catalog lookups.
>>Assuming we have no intention to support fast default for catalog tables
>> we could keep using the
>>existing attisnull() api for catalog lookups and use a new version (name
>> tbd) for user tables.
> 
> attisnull is not a thing.  There's heap_attisnull and slot_attisnull.
My apologies.
Obviously slot_attisnull() is a non-issue since slots have tuple descriptors.
It's heap_attisnull() I am struggling with; it's rather popular.
Yet, in the large majority of cases exist defaults do not apply, so digging up
the descriptor is rather wasteful.

>> 5. My head hurts looking at the PK/FK code - it’s not always clear which
>> tuple descriptor belongs
>>to which tuple
> 
> I suggest ibuprofen and a stiff upper lip.
Sound advice.

>> 6. Performance of the expansion code.
>>The current code needs to walk all defaults and then start padding by
>> filling in values.
>>But the outcome is always the same. We will produce the same payload and
>> the name null map.
>>It would be feasible to cache an “all defaults tuple”, remember the
>> offsets (and VARWIDTH, HASNULL)
>>for each attribute and then simply splice the short and default tuples
>> together.
>>This ought to be faster, but the meta data to cache is not insignificant
>> and the expansion code is messy enough
>>without this already.
> 
> You could experiment to figure out if it makes any difference.
I think my first experiment will be to measure the performance impact of 
“expressed” vs. "non expressed" defaults 
with the current design. 
If that is considered excessive I will explore the more complex alternative.

> 
>> 7. Grooming
>>Obviously we can remove all exist defaults for a table from pg_attribute
>> whenever the table is rewrittem.
>>That’s easy.
> 
> But might cause the table to expand a lot, which would suck.
Well, at this point I must point back to the original goal, which is O(1) ADD
COLUMN performance.
The footprint of a rewritten table is no worse after this patch.
The reduced footprint of a table due to a rewrite avoided on ADD COLUMN is a
boon one must not rely upon.
The existing tuple layout would support an enhancement where trailing defaults
may be avoided.
But I believe we would be better served by a more general approach to table
compression.

> 
>>But could we/should we keep track of the short tuples and either
>> eliminate them or drop exist defaults once they
>>become obsolete because there is no tuple short enough for them to
>> matter.
> 
> I wouldn't mess with it.
*phew*

> 
>> 8. Do we need to worry about toasted defaults?
> 
> Presumably.
Time for me to dig into that then.

Cheers
Serge Rielau
salesforce.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Radix tree for character conversion

2016-10-28 Thread Tom Lane
Robert Haas  writes:
> On Thu, Oct 27, 2016 at 3:23 AM, Kyotaro HORIGUCHI
>  wrote:
>> Perhaps we can put the files into our repository by providing some
>> notifications.

> Uggh, I don't much like advertising clauses.

Even if the license were exactly compatible with ours, I'd be -1 on
bloating our tarballs with these files.  They're large and only a
tiny fraction of developers, let alone end users, will ever care
to look at them.

I think it's fine as long as we have a README file that explains
where to get them.  (I'm not even very thrilled with the proposed
auto-download script, as it makes undesirable assumptions about
which internet tools you use, not to mention that it won't work
at all on Windows.)

I'd actually vote for getting rid of the reference files we
have in the tree now (src/backend/utils/mb/Unicode/*txt), on
the same grounds.  That's 600K of stuff that does not need to
be in our tarballs.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] save redundant code for pseudotype I/O functions

2016-10-28 Thread Tom Lane
Peter Eisentraut  writes:
> On 10/27/16 11:49 PM, Peter Eisentraut wrote:
>> Here is a patch to refactor some of the I/O functions in pseudotypes.c
>> to save redundant code and reduce the number of distinct translatable
>> strings.

Would it be better to use CppAsString and CppConcat instead of directly
using # and ##, for consistency with what we do elsewhere?

Otherwise, +1.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-28 Thread Merlin Moncure
On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark  wrote:
> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure  wrote:
>> I think we can rule out faulty storage
>
> Nobody ever expects the faulty storage

Believe me, I know.  But the evidence points elsewhere in this case;
this is clearly application driven.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Radix tree for character conversion

2016-10-28 Thread Robert Haas
On Thu, Oct 27, 2016 at 3:23 AM, Kyotaro HORIGUCHI
 wrote:
> | COPYRIGHT AND PERMISSION NOTICE
> |
> | Copyright (c) 1991-2016 Unicode, Inc. All rights reserved.
> | Distributed under the Terms of Use in http://www.unicode.org/copyright.html.
> |
> | Permission is hereby granted, free of charge, to any person obtaining
> | a copy of the Unicode data files and any associated documentation
> | (the "Data Files") or Unicode software and any associated documentation
> | (the "Software") to deal in the Data Files or Software
> | without restriction, including without limitation the rights to use,
> | copy, modify, merge, publish, distribute, and/or sell copies of
> | the Data Files or Software, and to permit persons to whom the Data Files
> | or Software are furnished to do so, provided that either
> | (a) this copyright and permission notice appear with all copies
> | of the Data Files or Software, or
> | (b) this copyright and permission notice appear in associated
> | Documentation.
>
> Perhaps we can put the files into our repository by providing some
> notifications.

Uggh, I don't much like advertising clauses.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming basebackups vs pg_stat_tmp

2016-10-28 Thread David Steele

On 10/28/16 3:49 PM, Magnus Hagander wrote:

> On Fri, Oct 28, 2016 at 2:44 PM, David Steele <da...@pgmasters.net> wrote:
>> The patch looks sane to me, but I think it would be good to
>> backpatch the TAP test from the exclusion patch that tests
>> pg_replslot as a symlink.
>
> So that's the test that's in that same patch, 6ad8ac60, right? How much
> of the code for that is actually needed? (like the row which changes a
> 10 to a 11? which probably means something, but is it relevant here?) Or
> all of it?


The change from 10 to 11 increases the tests that are skipped on 
Windows, which is necessary because one extra symlink test is added.


I think you need:

-use Test::More tests => 54;
+use Test::More tests => 55;

and:

 SKIP:
 {
-   skip "symlinks not supported on Windows", 10 if ($windows_os);
+   skip "symlinks not supported on Windows", 11 if ($windows_os);
+
+   # Move pg_replslot out of $pgdata and create a symlink to it.
+   $node->stop;
+
+   rename("$pgdata/pg_replslot", "$tempdir/pg_replslot")
+   or BAIL_OUT "could not move $pgdata/pg_replslot";
+   symlink("$tempdir/pg_replslot", "$pgdata/pg_replslot")
+   or BAIL_OUT "could not symlink to $pgdata/pg_replslot";
+
+   $node->start;

# Create a temporary directory in the system location and symlink it
# to our physical temp location.  That way we can use shorter names
@@ -148,6 +186,8 @@ SKIP:
"tablespace symlink was updated");
closedir $dh;

+   ok(-d "$tempdir/backup1/pg_replslot", 'pg_replslot symlink copied as 
directory');

+
mkdir "$tempdir/tbl=spc2";

The rest of the tests are for exclusions.

--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUG] pg_basebackup from disconnected standby fails

2016-10-28 Thread Robert Haas
On Thu, Oct 27, 2016 at 7:38 PM, Michael Paquier
 wrote:
>> I committed and back-patched this with some additional work on the
>> comments, but I don't understand this remark.  That comment seems like
>> it should refer to the checkpointer in modern branches, but isn't that
>> point independent of this patch?
>
>  * During recovery, we keep a copy of the latest checkpoint record here.
> -* Used by the background writer when it wants to create a restartpoint.
> +* lastCheckPointRecPtr points to start of checkpoint record and
> +* lastCheckPointEndPtr points to end+1 of checkpoint record.  Used by the
> +* background writer when it wants to create a restartpoint.
>
> The patch committed introduces lastCheckPointEndPtr, which is not used
> to decide if a restart point should be created or not.

The comment doesn't say anything about those structure members being
used to decide "if a restart point should be created or not".  It just
says that they are used; it says nothing about the purpose for which
they are used.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Query regarding selectDumpableExtension()

2016-10-28 Thread Robert Haas
On Thu, Oct 27, 2016 at 2:11 AM, amul sul  wrote:
> selectDumpableExtension() function skips dump of built-in extensions
> in case of binary-upgrade only, why not in normal dump?
> Can't we assume those will already be installed in the target database
> at restore?

There's a comment in dumpExtension() that explains it.

/*
 * In a regular dump, we use IF NOT EXISTS so that there isn't a
 * problem if the extension already exists in the target database;
 * this is essential for installed-by-default extensions such as
 * plpgsql.
 *
 * In binary-upgrade mode, that doesn't work well, so instead we skip
 * built-in extensions based on their OIDs; see
 * selectDumpableExtension.
 */

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming basebackups vs pg_stat_tmp

2016-10-28 Thread Magnus Hagander
On Fri, Oct 28, 2016 at 2:44 PM, David Steele  wrote:

> On 10/28/16 11:53 AM, Magnus Hagander wrote:
>
>> In 9.6 and earlier, if you change pg_stat_tmp to be a symlink,
>> basebackups no longer work. That's because we create symlink entry in
>> the tarfile for it instead of an empty directory, but with no data,
>> which Breaks Everything (TM).
>>
>> This was fixed in head in 6ad8ac60, which introduced "more excludes",
>> due to the refactoring. That commit message refers to it also fixing
>> this bug, but it seems the bugfix was never backpatched.
>>
>> Or did I miss something?
>>
>
> I don't think so.  I guess it got lost in the CF rush and also slipped my
> mind when I reviewed the final commit.
>
>> Attached patch fixes this (based on 9.5 which is where I ran into it,
>> but it needs to go in other back branches as well) by bringing back a
>> (modified) version of the function _tarWriteDir() to the back branches.
>>
>> I'd appreciate a look-over before committing, but it works fine in my
>> tests.
>>
>
> The patch looks sane to me, but I think it would be good to backpatch the
> TAP test from the exclusion patch that tests pg_replslot as a symlink.


So that's the test that's in that same patch, 6ad8ac60, right? How much of
the code for that is actually needed? (like the row which changes a 10 to a
11? which probably means something, but is it relevant here?) Or all of it?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Danger of automatic connection reset in psql

2016-10-28 Thread Ashutosh Bapat
On Thu, Oct 20, 2016 at 5:25 PM, Oleksandr Shulgin
 wrote:
> On Thu, Oct 20, 2016 at 12:28 PM, Oleksandr Shulgin
>  wrote:
>>
>>
>> Since this is already an improvement, I'm attaching a patch.
>>
>> If on the other hand, someone is pasting into psql's terminal a block of
>> commands enclosed in BEGIN/COMMIT, the same bug is triggered: BEGIN doesn't
>> have effect and the rest of commands run outside of transaction.
>>
>> Is it possible at all to protect against the latter case?  How?
>
>
> One idea I was just suggested is to make interactive psql sessions buffer in
> all available input, before it's going to block.  This way it doesn't matter
> if the multiple statements are appearing on one line or are being pasted.
>
> Feasible?

psql seems to read one line (or up to EOF) at a time. If it finds a ';'
in the fetched input, it sends the query to the server and fetches the
result; that's when it will notice a server crash. The next time it
tries to fetch a line, I doubt psql has any way to tell whether the
block of commands was pasted all at once or entered one line at a time.
It's the same for psql as for any other interactive program, really. I
guess not all of the pasted input is in psql's buffers if it spans
multiple lines and has ';' in between. So I don't think it's feasible
to protect against that one.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fast Default WIP patch for discussion

2016-10-28 Thread Robert Haas
On Fri, Oct 21, 2016 at 7:15 PM, Serge Rielau  wrote:
> Some key design points requiring discussion:
> 1. Storage of the “exist” (working name) default
>Right now the patch stores the default value in its binary form as it
> would be in the tuple into a BYTEA.
>It would be feasible to store the pretty printed literal, however this
> requires calling the io functions when a
>tuple descriptor is built.

A pretty-printed literal is a terrible idea because input functions
need not be immutable (e.g. timestamptz).  I would have expected a
pg_node_tree column containing a CONST node, but your bytea thing
might be OK.

> 2. The exist default is cached alongside the “current” default in the tuple
> descriptor’s constraint structure.
> Seems most natural too me, but debatable.

No opinion (yet).

> 3. Delayed vs. early expansion of the tuples.
> To avoid having to decide when to copy tuple descriptors with or without
> constraints and defaults
> I have opted to expand the tuple at the core entry points.
> How do I know I have them all? An omission means wrong results!

Early expansion seems right.  "How do I know I have them all?" - and
not only all of the current ones but all future ones - seems like a
core sticking point for this patch.

> 4. attisnull()
> This routine is used in many places, but to give correct result sit must
> now be accompanied
> by the tuple descriptor. This becomes moderately messy and it’s not
> always clear where to get that.
> Interestingly most usages are related to catalog lookups.
> Assuming we have no intention to support fast default for catalog tables
> we could keep using the
> existing attisnull() api for catalog lookups and use a new version (name
> tbd) for user tables.

attisnull is not a thing.  There's heap_attisnull and slot_attisnull.

> 5. My head hurts looking at the PK/FK code - it’s not always clear which
> tuple descriptor belongs
> to which tuple

I suggest ibuprofen and a stiff upper lip.

> 6. Performance of the expansion code.
> The current code needs to walk all defaults and then start padding by
> filling in values.
> But the outcome is always the same. We will produce the same payload and
> the name null map.
> It would be feasible to cache an “all defaults tuple”, remember the
> offsets (and VARWIDTH, HASNULL)
> for each attribute and then simply splice the short and default tuples
> together.
> This ought to be faster, but the meta data to cache is not insignificant
> and the expansion code is messy enough
> without this already.

You could experiment to figure out if it makes any difference.

> 7. Grooming
> Obviously we can remove all exist defaults for a table from pg_attribute
> whenever the table is rewrittem.
> That’s easy.

But might cause the table to expand a lot, which would suck.

> But could we/should we keep track of the short tuples and either
> eliminate them or drop exist defaults once they
> become obsolete because there is no tuple short enough for them to
> matter.

I wouldn't mess with it.

> 8. Do we need to worry about toasted defaults?

Presumably.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Overlook in 2bd9e412?

2016-10-28 Thread Julien Rouhaud
It looks like pq_putmessage_hook and pq_flush_hook were meant for
development purposes and were not supposed to be committed.

The attached patch removes them.

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org
diff --git a/src/include/libpq/libpq.h b/src/include/libpq/libpq.h
index 18052cb..5fac817 100644
--- a/src/include/libpq/libpq.h
+++ b/src/include/libpq/libpq.h
@@ -81,9 +81,6 @@ extern char *ssl_key_file;
 extern char *ssl_ca_file;
 extern char *ssl_crl_file;
 
-extern int	(*pq_putmessage_hook) (char msgtype, const char *s, size_t len);
-extern int	(*pq_flush_hook) (void);
-
 extern int	secure_initialize(void);
 extern bool secure_loaded_verify_locations(void);
 extern void secure_destroy(void);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming basebackups vs pg_stat_tmp

2016-10-28 Thread David Steele

On 10/28/16 11:53 AM, Magnus Hagander wrote:

> In 9.6 and earlier, if you change pg_stat_tmp to be a symlink,
> basebackups no longer work. That's because we create symlink entry in
> the tarfile for it instead of an empty directory, but with no data,
> which Breaks Everything (TM).
>
> This was fixed in head in 6ad8ac60, which introduced "more excludes",
> due to the refactoring. That commit message refers to it also fixing
> this bug, but it seems the bugfix was never backpatched.
>
> Or did I miss something?

I don't think so.  I guess it got lost in the CF rush and also slipped
my mind when I reviewed the final commit.

> Attached patch fixes this (based on 9.5 which is where I ran into it,
> but it needs to go in other back branches as well) by bringing back a
> (modified) version of the function _tarWriteDir() to the back branches.
>
> I'd appreciate a look-over before committing, but it works fine in my
> tests.


The patch looks sane to me, but I think it would be good to backpatch 
the TAP test from the exclusion patch that tests pg_replslot as a symlink.


--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fast Default WIP patch for discussion

2016-10-28 Thread Robert Haas
On Wed, Oct 26, 2016 at 11:43 AM, Serge Rielau  wrote:
> Posting to this group on a Friday evening was obviously a Bad Idea(tm). :-)

Not really.  It's just that complex patches often don't get an
immediate response because people are too busy to look at them and
think about them in detail.  If you've added it to the CommitFest, it
will eventually get some attention.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] JIT compiler for expressions

2016-10-28 Thread Dmitry Melnik
Hello hackers,

We'd like to present our work on adding LLVM JIT compilation of expressions
in SQL queries for PostgreSQL. The source code (based on 9.6.1) along with a
brief manual is available on our GitHub: https://github.com/ispras/postgres .
Current speedup for TPC-H Q1 is 20% (on a 40GB workload). Please feel free to
test it and tell us what you think.

Currently, our JIT is used to compile expressions in every query, so for
short-running queries JIT compilation may take longer than query execution
itself. We plan to address this by using planner estimates to decide whether
a query is worth JIT compiling; we may also try parallel JIT compilation. But
for now we recommend testing it on a large workload in order to pay off the
compilation cost (we've tested Q1 on a 40GB database).

The changes in the PostgreSQL code itself are rather small, while the biggest
part of the new code in our repository is autogenerated (LLVM IR generators
for PostgreSQL backend functions). The only real reason for shipping
prebuild_llvm_backend.cpp is that it takes a patched LLVM version to generate;
otherwise it is generated right from the PostgreSQL source code (please see
more on automatic backend generation on our GitHub site). With the
pre-generated cpp file, building our GitHub PostgreSQL version with JIT
requires only a clean, non-patched LLVM 3.7.

JIT compilation was tested on Linux, and currently we have 5 actual tests
failing (which results in 24 errors in a regtest). It requires LLVM 3.7
(3.7.1) as a build dependency (you can specify the path to the proper
llvm-config with the --with-llvm-config= configure option; e.g. it could be
named llvm-config-3.7 on your system). Mac support is highly experimental and
wasn't tested much, but if you'd like to give it a try, you can do so with
LLVM 3.7 from MacPorts or Homebrew.

This work is part of our greater effort to implement a full JIT compiler in
PostgreSQL, where along with JITting expressions we've changed the iteration
model from Volcano-style to a push model and reimplemented code generation
with LLVM for most Scan/Aggregation/Join methods. That approach gives a much
better speedup (4-5x on Q1), but it takes many code changes, so we're
developing it as a PostgreSQL extension. It's not ready for release yet;
we're now working on performance and compatibility, as well as on making it
easier to maintain by building both the JIT compiler and the interpreter from
the same source code. More information about our full JIT compiler and
related work is available in our presentations at LLVM Cauldron
(http://llvm.org/devmtg/2016-09/slides/Melnik-PostgreSQLLLVM.pdf) and PGCon
(https://www.pgcon.org/2016/schedule/attachments/411_ISPRAS%20LLVM+Postgres%20Presentation.pdf).
We're also going to give a lightning talk at the upcoming PGConf.EU in
Tallinn and discuss further development with the PostgreSQL community. We'd
appreciate any feedback!

-- 
Best regards,
  Dmitry Melnik
  Institute for System Programming of the Russian Academy of Sciences
  ISP RAS (www.ispras.ru/en/)


Re: [HACKERS] Microvacuum support for Hash Index

2016-10-28 Thread Amit Kapila
On Mon, Oct 24, 2016 at 2:21 PM, Ashutosh Sharma  wrote:
> Hi All,
>
> I have added a microvacuum support for hash index access method and
> attached is the v1 patch for the same.
>

This is important functionality for hash indexes, as we already have
the same functionality for other index types such as btree.

> The patch basically takes care
> of the following things:
>
> 1. Firstly, it changes the marking of dead tuples from
> 'tuple-at-a-time' to 'page-at-a-time' during hash index scan. For this
> we accumulate the heap tids and offset of all the hash index tuples if
> it is pointed by kill_prior_tuple during scan and then mark all
> accumulated tids as LP_DEAD either while stepping from one page to
> another (assuming the scan in both forward and backward direction) or
> during end of the hash index scan or during rescan.
>
> 2. Secondly, when inserting tuple into hash index table, if not enough
> space is found on a current page then it ensures that we first clean
> the dead tuples if found in the current hash index page before moving
> to the next page in a bucket chain or going for a bucket split. This
> basically increases the page reusability and reduces the number of
> page splits, thereby reducing the overall size of hash index table.
>

Few comments on patch:

1.
+static void
+hash_xlog_vacuum_one_page(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ xl_hash_vacuum *xldata = (xl_hash_vacuum *) XLogRecGetData(record);
+ Buffer bucketbuf = InvalidBuffer;
+ Buffer buffer;
+ Buffer metabuf;
+ Page page;
+ XLogRedoAction action;

While replaying the delete/vacuum record on the standby, it can conflict
with some already-running queries.  Basically, the replay can remove
rows that are still visible to queries on the standby.  You need to
resolve conflicts similar to what we do for btree delete records (refer
btree_xlog_delete); a rough sketch follows after these comments.

2.
+ /*
+ * Write-lock the meta page so that we can decrement
+ * tuple count.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+ _hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
+  (buf == bucket_buf) ? true : false);
+
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);

It seems the metapage lock is acquired here for longer than required,
and it is also not needed when there are no deletable items on the
page.  You can take the metapage lock just before decrementing the
count.

3.
  Assert(offnum <= maxoff);
+

Spurious space.  There are some other similar spurious whitespace
changes in the patch; remove them as well.
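
For point 1, the fix could look roughly like the following fragment near
the top of hash_xlog_vacuum_one_page(), along the lines of
btree_xlog_delete() (a sketch only; it assumes the xl_hash_vacuum record
is extended with a latestRemovedXid field computed on the primary):

	if (InHotStandby)
	{
		RelFileNode rnode;

		/* block 0 is assumed to be the page the tuples are removed from */
		XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
		ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode);
	}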



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: scan key push down to heap [WIP]

2016-10-28 Thread Amit Kapila
On Fri, Oct 28, 2016 at 12:16 PM, Andres Freund  wrote:
> On 2016-10-28 11:23:22 +0530, Amit Kapila wrote:
>
>> I think if we decide to form the scan key from a qual only when qual
>> refers to fixed length column and that column is before any varlen
>> column, the increased cost will be alleviated.  Do you have any other
>> idea to alleviate such cost?
>
> Well, that'll also make the feature not particularly useful :(.
>

Yeah, the number of cases it can benefit will certainly be reduced, but
it can still be a win wherever it is usable, which is not bad.  Another
thing to consider here is whether we can use selectivity as a measure
to decide if we can push the quals down.  If the quals are selective,
then I think the impact of the less effective caching can be negated
and we can in turn see benefits.  For example, we can take a table with
20 to 40 columns and a large number of rows, use different columns in
the quals, and then test at different selectivities to see how the
performance varies with this patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Streaming basebackups vs pg_stat_tmp

2016-10-28 Thread Magnus Hagander
In 9.6 and earlier, if you change pg_stat_tmp to be a symlink, basebackups
no longer work. That's because we create symlink entry in the tarfile for
it instead of an empty directory, but with no data, which Breaks Everything
(TM).

This was fixed in head in 6ad8ac60, which introduced "more excludes", due
to the refactoring. That commit message refers to it also fixing this bug,
but it seems the bugfix was never backpatched.

Or did I miss something?

Attached patch fixes this (based on 9.5 which is where I ran into it, but
it needs to go in other back branches as well) by bringing back a
(modified) version of the function _tarWriteDir() to the back branches.

I'd appreciate a look-over before committing, but it works fine in my tests.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 6120c8f..c4e3a08 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -57,6 +57,8 @@ static bool sendFile(char *readfilename, char *tarfilename,
 static void sendFileWithContent(const char *filename, const char *content);
 static void _tarWriteHeader(const char *filename, const char *linktarget,
 struct stat * statbuf);
+static int64 _tarWriteDir(const char *pathbuf, int basepathlen, struct stat * statbuf,
+			 bool sizeonly);
 static void send_int8_string(StringInfoData *buf, int64 intval);
 static void SendBackupHeader(List *tablespaces);
 static void base_backup_cleanup(int code, Datum arg);
@@ -966,9 +968,7 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces,
 		if ((statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0) ||
 		  strncmp(de->d_name, PG_STAT_TMP_DIR, strlen(PG_STAT_TMP_DIR)) == 0)
 		{
-			if (!sizeonly)
-_tarWriteHeader(pathbuf + basepathlen + 1, NULL, &statbuf);
-			size += 512;
+			size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
 			continue;
 		}
 
@@ -978,9 +978,7 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces,
 		 */
 		if (strcmp(de->d_name, "pg_replslot") == 0)
 		{
-			if (!sizeonly)
-_tarWriteHeader(pathbuf + basepathlen + 1, NULL, &statbuf);
-			size += 512;		/* Size of the header just added */
+			size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
 			continue;
 		}
 
@@ -1245,6 +1243,30 @@ _tarWriteHeader(const char *filename, const char *linktarget,
 }
 
 /*
+ * Write tar header for a directory.  If the entry in statbuf is a link then
+ * write it as a directory anyway.
+ */
+static int64
+_tarWriteDir(const char *pathbuf, int basepathlen, struct stat * statbuf,
+			 bool sizeonly)
+{
+	if (sizeonly)
+		/* Directory headers are always 512 bytes */
+		return 512;
+
+	/* If symlink, write it as a directory anyway */
+#ifndef WIN32
+	if (S_ISLNK(statbuf->st_mode))
+#else
+	if (pgwin32_is_junction(pathbuf))
+#endif
+		statbuf->st_mode = S_IFDIR | S_IRWXU;
+
+	_tarWriteHeader(pathbuf + basepathlen + 1, NULL, statbuf);
+	return 512;
+}
+
+/*
  * Increment the network transfer counter by the given number of bytes,
  * and sleep if necessary to comply with the requested network transfer
  * rate.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to implement pg_current_logfile() function

2016-10-28 Thread Gilles Darold
Le 28/10/2016 à 05:09, Karl O. Pinc a écrit :
> Hi Gilles,
>
> On Thu, 27 Oct 2016 19:03:11 +0200
> Gilles Darold  wrote:
>
>> The current v8 of the patch
> For your consideration.
>
> Attached is a patch to apply to v8 of your patch.
>
> I moved the call to logfile_writename() in write_syslogger_file()
> into the open_csvlogfile().  That's where the new filename
> is generated and it makes sense to record the new name
> upon generation.

Yes, it makes sense. The last changes to the documentation and comments,
together with this move, have been included in v9 of the patch, attached
here.

Thanks a lot.

-- 
Gilles Darold
Consultant PostgreSQL
http://dalibo.com - http://dalibo.org
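
As a usage note (not part of the patch): once it is applied, a client
can query the new function over libpq. A minimal sketch, assuming the
usual PG* environment variables provide the connection settings:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");		/* connection info from environment */
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/* A NULL result means the requested format is not configured. */
	res = PQexec(conn, "SELECT pg_current_logfile('stderr')");
	if (PQresultStatus(res) == PGRES_TUPLES_OK &&
		PQntuples(res) == 1 && !PQgetisnull(res, 0, 0))
		printf("current stderr log file: %s\n", PQgetvalue(res, 0, 0));
	else
		printf("no stderr log file reported\n");

	PQclear(res);
	PQfinish(conn);
	return 0;
}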

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2e64cc4..d577bac 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15443,6 +15443,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
   
 
   
+   pg_current_logfile()
+   text
+   primary log file name in use by the logging collector
+  
+
+  
+   pg_current_logfile(text)
+   text
+   log file name, of log in the requested format, in use by the
+   logging collector
+  
+
+  
pg_my_temp_schema()
oid
OID of session's temporary schema, or 0 if none
@@ -15660,6 +15673,29 @@ SET search_path TO schema , schema, ..
 the time when the postmaster process re-read the configuration files.)

 
+
pg_current_logfile
+   
+
+   
+pg_current_logfile returns, as text,
+the path of either the csv or stderr log file currently in use by the
+logging collector.  This is a path including the
+ directory and the log file name.
+Log collection must be active or the return value
+is NULL.  When multiple logfiles exist, each in a
+different format, pg_current_logfile called
+without arguments returns the path of the file having the first format
+found in the ordered
+list: stderr, csvlog.
+NULL is returned when no log file has any of these
+formats.  To request a specific file format supply,
+as text, either csvlog
+or stderr as the value of the optional parameter. The
+return value is NULL when the log format requested
+is not a configured .
+   
+

 pg_my_temp_schema

diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index fddb69b..5bea196 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -61,6 +61,18 @@ Item
 
 
 
+ current_logfiles
+ A file recording the current log file(s) used by the syslogger and
+  its log format, stderr or csvlog. Each line
+  of the file is a space separated list of two elements: the log format and
+  the full path to the log file including the value
+  of . The log format must be present
+  in  to be listed in the file. This file
+  is present only when  is
+  activated.
+
+
+
  global
  Subdirectory containing cluster-wide tables, such as
  pg_database
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index fd62d66..83afc34 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -146,6 +146,7 @@ static char *logfile_getname(pg_time_t timestamp, const char *suffix);
 static void set_next_rotation_time(void);
 static void sigHupHandler(SIGNAL_ARGS);
 static void sigUsr1Handler(SIGNAL_ARGS);
+static void logfile_writename(char *filename, char *csvfilename);
 
 
 /*
@@ -348,6 +349,13 @@ SysLoggerMain(int argc, char *argv[])
 rotation_disabled = false;
 rotation_requested = true;
 			}
+
+			/*
+			 * Force rewriting last log filename when reloading configuration,
+			 * log_destination and logging_collector may have been changed. Do
+			 * it right now to not wait for the next file rotation.
+			 */
+			logfile_writename(last_file_name, last_csv_file_name);
 		}
 
 		if (Log_RotationAge > 0 && !rotation_disabled)
@@ -513,8 +521,13 @@ SysLogger_Start(void)
 	pid_t		sysloggerPid;
 	char	   *filename;
 
-	if (!Logging_collector)
+	if (!Logging_collector) {
+		/* Logging collector is not enabled. We don't know where messages are
+		 * logged.  Remove outdated file holding the current log filenames.
+		 */
+		unlink(CURRENT_LOG_FILENAME);
 		return 0;
+	}
 
 	/*
 	 * If first time through, create the pipe which will receive stderr
@@ -574,6 +587,8 @@ SysLogger_Start(void)
 
 	syslogFile = logfile_open(filename, "a", false);
 
+	logfile_writename(filename, NULL);
+
 	pfree(filename);
 
 #ifdef EXEC_BACKEND
@@ -1098,6 +1113,8 @@ open_csvlogfile(void)
 		pfree(last_csv_file_name);
 
 	last_csv_file_name = filename;
+
+	logfile_writename(last_file_name, last_csv_file_name);
 }
 
 /*
@@ -1209,6 +1226,8 @@ logfile_rotate(bool time_based_rotation, int size_rotation_for)
 			return;
 		}
 
+		logfile_writename(filename, csvfilename);
+
 		fclose(syslogFile);
 		syslogFile = fh;
 
@@ -1220,7 +1239,6 @@ logfile_rotate(bool time_based_rotation, int size_rotation_for)
 	}
 
 	/* Same 

Re: [HACKERS] Proposal : For Auto-Prewarm.

2016-10-28 Thread Andres Freund
On 2016-10-28 12:59:58 +0530, Amit Kapila wrote:
> On Fri, Oct 28, 2016 at 1:45 AM, Jim Nasby  wrote:
> > On 10/27/16 6:39 AM, Mithun Cy wrote:
> >>
> >> # pg_autoprewarm.
> >
> >
> > IMO it would be better to add this functionality to pg_prewarm instead of
> > creating another contrib module.
> >
> 
> There is not much common functionality between the two.

I don't really agree. For me, manual and automated prewarming are pretty
closely related. Sure, they have their independent uses, and not too
much code might be shared. But grouping them in the same extension seems
to make sense; it's confusing to have closely related but independent
extensions.

> One point that seems to be worth discussing is when should the buffer
> information be dumped to a file?  Shall we dump at each checkpoint or
> at regular intervals via auto prewarm worker or at some other time?

Should probably be at some regular interval - not sure if checkpoints
are the best time (or if it's even realistic to tie a bgworker to
checkpoints), since checkpoints have a significant impact on the state
of shared_buffers.
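
To illustrate the interval-based variant, a deliberately simplified,
standalone sketch (it does not use the real bgworker or latch APIs;
dump_buffer_info() and dump_interval are placeholders, not anything in
the proposed patch):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Placeholder: a real worker would write the shared buffer block list here. */
static void
dump_buffer_info(void)
{
	printf("dumped buffer list at %ld\n", (long) time(NULL));
}

int
main(void)
{
	const unsigned int dump_interval = 300;		/* seconds, hypothetical setting */

	for (;;)
	{
		dump_buffer_info();
		sleep(dump_interval);	/* a real worker would wait on its latch instead */
	}
	return 0;
}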

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2016-10-28 Thread Amit Kapila
On Fri, Oct 28, 2016 at 1:45 AM, Jim Nasby  wrote:
> On 10/27/16 6:39 AM, Mithun Cy wrote:
>>
>> # pg_autoprewarm.
>
>
> IMO it would be better to add this functionality to pg_prewarm instead of
> creating another contrib module.
>

There is not much common functionality between the two.  The main job
of this feature is to dump the contents of shared buffers and reload
them at startup, and it takes care not to override the existing shared
buffer contents while reloading the dump file.  This is somewhat
different from what pg_prewarm provides, which is to load relation
blocks into memory or buffers; it has nothing to do with prior shared
buffer contents.  So, I am not sure it is a good idea to add this as
functionality of the prewarm module.  OTOH, if more people want it that
way, then there is no harm in doing it that way.
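
To make the idea concrete, a hypothetical sketch of the kind of
per-block record such a dump file could contain; the actual patch may
well use a different layout, and the example values below are made up:

#include <stdio.h>
#include <stdint.h>

typedef struct BlockInfoRecord
{
	uint32_t	database;		/* database OID */
	uint32_t	tablespace;		/* tablespace OID */
	uint32_t	filenode;		/* relation filenode */
	int32_t		forknum;		/* main fork, FSM, VM, ... */
	uint32_t	blocknum;		/* block number within that fork */
} BlockInfoRecord;

/* Append one record per valid shared buffer while dumping. */
static int
dump_one_block(FILE *fp, const BlockInfoRecord *rec)
{
	return fwrite(rec, sizeof(*rec), 1, fp) == 1 ? 0 : -1;
}

int
main(void)
{
	FILE	   *fp = fopen("autoprewarm.dump", "wb");
	BlockInfoRecord rec = {1, 1663, 16384, 0, 0};	/* made-up example values */

	if (fp == NULL)
		return 1;
	dump_one_block(fp, &rec);
	fclose(fp);
	return 0;
}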

One point that seems to be worth discussing is when should the buffer
information be dumped to a file?  Shall we dump at each checkpoint or
at regular intervals via auto prewarm worker or at some other time?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2016-10-28 Thread Ashutosh Bapat
On Tue, Oct 18, 2016 at 9:09 PM, Robert Haas  wrote:
> On Fri, Oct 14, 2016 at 12:37 AM, Ashutosh Bapat
>  wrote:
>>> Have you tested the effect of this patch on planner memory consumption
>>> with multi-way joins between tables with many partitions?  If you
>>> haven't, you probably should. (Testing runtime would be good, too.)
>>> Does it grow linearly?  Quadratically?  Exponentially?  Minor leaks
>>> don't matter, but if we're generating too much garbage we'll have to
>>> make sure it gets cleaned up soon enough to prevent runaway memory
>>> usage.
>>
>> I tried to check memory usage with various combinations of number of
>> partitions and number of relations being joined. For higher number of
>> relations being joined like 10 with 100 partitions, OOM killer kicked
>> in during the planning phase. I am suspecting
>> adjust_partitionrel_attrs() (changed that name to
>> adjust_join_appendrel_attrs() to be in sync with
>> adjust_appendrel_attrs()) to be the culprit. It copies expression
>> trees every time for joining two children. That's an exponentially
>> increasing number as the number of legal joins increases
>> exponentially. I am still investigating this.
>
> I think the root of this problem is that the existing paths shares a
> lot more substructure than the ones created by the new code.  Without
> a partition-wise join, the incremental memory usage for a joinrel
> isn't any different whether the underlying rel is partitioned or not.
> If it's partitioned, we'll be pointing to an AppendPath; if not, we'll
> be pointing to some kind of Scan.  But the join itself creates exactly
> the same amount of new stuff regardless of what's underneath it.  With
> partitionwise join, that ceases to be true.  Every joinrel - and the
> number of those grows exponentially in the number of baserels, IIUC -
> needs its own list of paths for every member rel.  So if a
> non-partition-wise join created X paths, and there are K partitions, a
> partition-wise join creates X * K paths.  That's a lot.
>
> Although we might be able to save some memory by tightening things up
> here and there - for example, right now the planner isn't real smart
> about recycling paths that are evicted by add_path(), and there's
> probably other wastage as well - I suspect that what this shows is
> that the basic design of this patch is not going to be viable.
> Intuitively, it's often going to be the case that we want the "same
> plan" for every partition-set.  That is, if we have A JOIN B ON A.x =
> B.x JOIN C ON A.y = B.y, and if A, B, and C are all compatibly
> partitioned, then the result should be an Append plan with 100 join
> plans under it, and all 100 of those plans should be basically mirror
> images of each other.  Of course, that's not really right in general:
> for example, it could be that A1 is big and A2 is small while B1 is
> small and B2 is big, so that the right plan for (A1 JOIN B1) and for
> (A2 JOIN B2) are totally different from each other.  But in many
> practical cases we'll want to end up with a plan of precisely the same
> shape for all children, and the current design ignores this, expending
> both memory and CPU time to compute essentially-equivalent paths
> across all children.
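
To put rough numbers on that, a small sketch assuming every subset of
the N base relations forms a legal join (so about 2^N - N - 1 joinrels),
each needing K child joinrels' worth of paths under partition-wise join;
N and K below match the earlier OOM report (10 relations, 100
partitions):

#include <stdio.h>

int
main(void)
{
	int			nrels = 10;		/* relations joined, as in the OOM test above */
	int			nparts = 100;	/* partitions per relation */
	long		joinrels = (1L << nrels) - nrels - 1;

	/* prints: joinrels: 1013, child joinrels: 101300 */
	printf("joinrels: %ld, child joinrels with partition-wise join: %ld\n",
		   joinrels, joinrels * (long) nparts);
	return 0;
}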

I think there are going to be two kinds of partitioning use-cases.
The first is partitioning carefully hand-crafted by DBAs, so that every
partition is different from the others and so is every join between two
partitions. There will be fewer partitions, but creating paths for each
join between partitions will be crucial from a performance point of
view. Consider, for example, systems which use partitions to consolidate
results from different sources for analytical purposes, or sharding. If
we consider the various points you have listed in [1] as to why a
partition is equivalent to a table, each join between partitions is
going to have very different characteristics and thus deserves its own
set of paths. Add to that the possibility of partition pruning, or of
certain conditions affecting particular partitions, and the need for
detailed planning is evident.

The other use of partitioning is to distribute the data and/or to
quickly eliminate data by partition pruning. In that case, all
partitions of a given table will have very similar properties. There is
a large chance that we will end up with the same plans for every
partition and for joins between partitions. In such cases, I think it
suffices to create paths for just one, or maybe a handful, of the
partitions of a join and to repeat that plan for the other partitions of
the join. But in such cases it also makes sense to have a light-weight
representation for partitions, as opposed to partitions being
full-fledged tables. If we have such a light-weight representation, we
may not even need to create RelOptInfos representing joins between
partitions, or different paths for each join between partitions.

>
> One way of attacking this problem is to gang together partitions which
> are equivalent for planning purposes, as discussed in the