Re: [HACKERS] pageinspect patch, for showing tuple data

2015-09-09 Thread Michael Paquier
On Wed, Sep 9, 2015 at 8:39 PM, Nikolay Shaplov
 wrote:
> In your message of 8 September 2015 11:53:24 you wrote:
>> Hence, instead of all those complications, why not simply introducing two
>> functions that take as input the tuple data and the OID of the relation,
>> returning those bytea arrays? It seems to be that this would be a more
>> handy interface, and as this is for educational purposes I guess that the
>> addition of the overhead induced by the extra function call would be
>> acceptable.
>
> When I looked at this task I considered doing the same thing. But
> unfortunately it is not possible. (Or, to be more precise, it is possible,
> but if I did it that way, the result would be hard to use.)
> The thing is that, to properly split the tuple data into attributes, we need
> some values from the tuple header:
> t_infomask2: stores the actual number of attributes in the tuple, which may
> differ from the number implied by the tuple data alone. This is what allows
> tuples to be parsed properly after an ALTER TABLE ADD COLUMN that was done
> without a SET DEFAULT option.
> t_infomask: has a bit that indicates whether there are any null attributes
> in the tuple.
> t_bits: has a bit mask that shows which attributes were set to null.
>
> So if we move tuple data parsing into a separate function, we have to pass
> these values alongside the tuple data. This is not convenient at all.
> So I did it the way you see in the patch.

Why is it not convenient at all? Yes, you have a point: we need those
fields to be able to parse the t_data properly. Still, the possibility of
showing individual fields of a tuple as a bytea array, with either toasted
or detoasted values, is a concept completely different from simply showing
the page items, which, it seems to me, is all heap_page_items is meant to
do. Hence, as t_infomask2, t_infomask and t_bits are already available as
return fields of heap_page_items, we should simply add a function like this:
heap_page_item_parse(Oid relid, bytea data, t_infomask2 int,
t_infomask int, t_bits int, bool force_detoast, warning_mode bool)
returns bytea[]
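
To make the proposed interface a bit more tangible at the C level, here is a
hypothetical skeleton of what such a pageinspect function could look like.
The name and argument order simply follow the signature above; the actual
attribute-splitting logic is elided, and the body below is only an
illustrative stub, not code from the patch:

#include "postgres.h"

#include "catalog/pg_type.h"
#include "fmgr.h"
#include "utils/array.h"

PG_FUNCTION_INFO_V1(heap_page_item_parse);

Datum
heap_page_item_parse(PG_FUNCTION_ARGS)
{
    Oid         relid = PG_GETARG_OID(0);
    bytea      *t_data = PG_GETARG_BYTEA_P(1);
    int         t_infomask2 = PG_GETARG_INT32(2);
    int         t_infomask = PG_GETARG_INT32(3);
    text       *t_bits = PG_GETARG_TEXT_PP(4);  /* as exposed by heap_page_items */
    bool        force_detoast = PG_GETARG_BOOL(5);
    bool        warning_mode = PG_GETARG_BOOL(6);
    Datum      *attrs;
    int         nattrs;
    ArrayType  *result;

    /*
     * Here the relation's tuple descriptor would be looked up from relid and
     * t_data walked attribute by attribute, using t_infomask2, t_infomask and
     * t_bits to handle added columns and NULLs, optionally detoasting when
     * force_detoast is set and only warning (not erroring) in warning_mode.
     */
    (void) relid;
    (void) t_infomask2;
    (void) t_infomask;
    (void) t_bits;
    (void) force_detoast;
    (void) warning_mode;

    /* Placeholder: return the raw tuple data as a one-element bytea[]. */
    nattrs = 1;
    attrs = palloc(sizeof(Datum) * nattrs);
    attrs[0] = PointerGetDatum(t_data);

    result = construct_array(attrs, nattrs, BYTEAOID, -1, false, 'i');
    PG_RETURN_ARRAYTYPE_P(result);
}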

Note that the data corruption checks apply only to this function as
far as I understand, so I think that things could actually be split
into two independent patches:
1) Upgrade heap_page_items to add the tuple data as bytea.
2) Add the new function able to parse those fields appropriately.

As this code, as you rightly mentioned, is aimed mainly at educational
purposes, to help understand the page structure, we should definitely make
it as simple as possible at the code level, and it seems to me that this
calls for a separate SQL interface to control tuple data parsing as a
bytea[]. We do our users no favor by complicating the code of
pageinspect.c the way this patch does in its current state, especially if
beginners are meant to have a look at it.

>> Actually not two functions, but just one, with an extra flag to be
>> able to enforce detoasting on the field values where this can be done.
>
> Yeah, I also thought about it, but did not come to any final conclusion.
> Should we add a fourth argument that enables detoasting, instead of adding
> a separate function?

This is definitely something you want to control with a switch.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-09-09 Thread Fabien COELHO


Hello Amit,


 - T1 flush=off  27480 -> 27482 :   +0.0%
 - T1 flush=on   25214 -> 26819 :   +6.3%
 - T2 flush=off   5050 ->  6194 :  +22.6%
 - T2 flush=on    2771 ->  6110 : +120.4%


There is a clear win only in cases where sort is used together with flush;
apart from that, using sort alone doesn't show any clear advantage.


Indeed, I agree that the improvement is much smaller without flushing,
although it is still there to some extent (+0.0 & +22.6 => +11.3% on average).


Well, at least we may agree that it is "somehow significantly better" ?:-)

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Small typo in timeline.h regarding the meaning of infinity for timeline history entry

2015-09-09 Thread Michael Paquier
Hi all,

timeline.h says that the end point of a timeline history entry means
infinity when its value is 0. But that's not completely accurate; I think
that what is meant here is InvalidXLogRecPtr:
 {
 	TimeLineID	tli;
 	XLogRecPtr	begin;			/* inclusive */
-	XLogRecPtr	end;			/* exclusive, 0 means infinity */
+	XLogRecPtr	end;			/* exclusive, InvalidXLogRecPtr means
+								 * infinity */
 } TimeLineHistoryEntry;

And the code leads into this direction as well.
Regards,
-- 
Michael
diff --git a/src/include/access/timeline.h b/src/include/access/timeline.h
index 2b62a56..01437c0 100644
--- a/src/include/access/timeline.h
+++ b/src/include/access/timeline.h
@@ -26,7 +26,8 @@ typedef struct
 {
 	TimeLineID	tli;
 	XLogRecPtr	begin;			/* inclusive */
-	XLogRecPtr	end;			/* exclusive, 0 means infinity */
+	XLogRecPtr	end;			/* exclusive, InvalidXLogRecPtr means
+								 * infinity */
 } TimeLineHistoryEntry;
 
 extern List *readTimeLineHistory(TimeLineID targetTLI);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-09 Thread Michael Paquier
On Wed, Sep 9, 2015 at 7:13 PM, Alexander Korotkov wrote:
> On Wed, Sep 9, 2015 at 9:01 AM, Michael Paquier wrote:
>> The code building the target history file is a duplicate of what is done
>> with the source. Perhaps we could refactor that as a single routine in
>> pg_rewind.c?
>
>
> Do you mean the duplication between rewind_parseTimeLineHistory() and
> readTimeLineHistory()? We could put readTimeLineHistory() into common code
> with some refactoring. That is the subject of a separate patch, though.

Well, there is that :) But I was referring to this part (beware of
useless newlines in your code btw):
+   if (targettli == 1)
    {
-       TimeLineHistoryEntry *entry = &sourceHistory[i];
+       targetHistory = (TimeLineHistoryEntry *)
+           pg_malloc(sizeof(TimeLineHistoryEntry));
+       targetHistory->tli = targettli;
+       targetHistory->begin = targetHistory->end = InvalidXLogRecPtr;
+       targetNentries = 1;
+
+   }
Your patch generates a timeline history array for the target node, and
HEAD does exactly the same for the source node, so IMO your patch is
duplicating code; the only difference is that fetchFile is used for the
source history file and slurpFile for the target history file. Up to now,
the interface has been shaped by the fact that the target is always a local
instance while the source may be either local or remote, but I think that
we should refactor fetch.c so that the routines it contains no longer rely
on datadir_source and instead take a datadir argument. If datadir is NULL,
the remote code path is used; if it is not NULL, the local code path is
used. This way we can make the target node use fetchFile only, and remove
the duplication your patch is adding, as well as similar duplication in the
future. Thoughts?
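
To sketch what that refactored entry point could look like (purely an
illustration of the proposed interface, assuming the existing slurpFile()
and libpqGetFile() helpers keep roughly their current shape; this is not
patch code):

/*
 * Hypothetical unified fetch routine in fetch.c: if datadir is NULL the
 * file is fetched from the remote source over libpq, otherwise it is read
 * from the given local data directory.  Both the source and the target
 * could then go through the same call.
 */
char *
fetchFile(const char *datadir, const char *filename, size_t *filesize)
{
    if (datadir == NULL)
        return libpqGetFile(filename, filesize);        /* remote code path */
    else
        return slurpFile(datadir, filename, filesize);  /* local code path */
}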

>> Except that, the patch looks pretty neat to me. I was wondering as well:
>> what tests did you run up to now with this patch? I am attaching an updated
>> version of my test script I used for some more complex scenarios. Feel free
>> to play with it.
>
> I was running manual tests like a noob :) When you attached you bash script,
> I've switched to it.

[product_placement]The patch to improve the recovery test suite
submitted in this CF would have eased the need of bash-ing test cases
here, and we could have test cases directly included in your
patch.[/product_placement]

> I also added an additional check in findCommonAncestorTimeline(). Two
> standbys could be independently promoted and get the same new timeline ID.
> Now it is checked that timelines we assume to be the same have equal begin
> positions. The begin positions could still match by coincidence, but the
> same risk exists in the original pg_rewind.

Really? pg_rewind blocks attempts to rewind two nodes that are already on
the same timeline before even checking whether their timeline histories map
at some point or not:
/*
 * If both clusters are already on the same timeline, there's nothing to
 * do.
 */
if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
    ControlFile_source.checkPointCopy.ThisTimeLineID)
    pg_fatal("source and target cluster are on the same timeline\n");
And this seems justified to me. Imagine that you have one master with two
standbys linked to it. If both standbys are promoted to the same timeline,
you could easily replug them into the master, but I fail to see the point of
being able to replug one promoted standby into the other in this case: those
nodes have segment and history files that overlap, and an operator could
easily mess things up in such a configuration.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-09-09 Thread Jeff Janes
On Wed, Sep 9, 2015 at 12:12 PM, Andres Freund  wrote:

> On 2015-09-09 20:56:15 +0200, Fabien COELHO wrote:
> > As I wrote before, FreeBSD would be a good candidate because the
> > posix_fadvise seems much more reasonable than on Linux, and should be
> > profitable, so it would be a pity to remove it.
>
> Why do you think it's different on fbsd? Also, why is it unreasonable
> that DONTNEED removes stuff from the cache?
>


It seems kind of silly that it means "No one, even people I am not aware of
and have no right to speak for, needs this" as opposed to "I don't need
this, don't keep it around on my behalf."

Cheers,

Jeff


Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-09-09 Thread Fujii Masao
On Thu, Sep 10, 2015 at 1:08 PM, Amit Kapila  wrote:
> On Thu, Sep 10, 2015 at 9:29 AM, Fujii Masao  wrote:
>>
>> On Thu, Sep 10, 2015 at 12:49 PM, Amit Kapila 
>> wrote:
>> > On Wed, Sep 9, 2015 at 6:43 PM, Fujii Masao 
>> > wrote:
>> >>
>> >> On Fri, Sep 4, 2015 at 4:48 PM, Amit Kapila 
>> >> wrote:
>> >>
>> >
>> > - errdetail("Could not rename \"%s\" to \"%s\": %m.",
>> > + errdetail("\"%s\" could not be renamed to \"%s\": %m.",
>> >
>> > Is there any reason to change this message?
>> > I think you have changed this message to make it somewhat similar with
>> > the new message we are planning to use in this function, but I don't see
>> > that as compelling reason to change this message.
>>
>> The following part in error message style guide made me feel that's
>> better.
>> IOW, I didn't think that the previous message follows complete-sentence
>> style.
>>
>> -
>> Detail and hint messages: Use complete sentences, and end each with a
>> period.
>> -
>>
>
> I am not sure if this indicates that the previously used message was wrong.
> If we look for errdetail in the code, then there are many other similar
> messages.

Yes, but the existence of other similar messages doesn't mean that
they follow the style. We need to understand what "complete sentence" means.
I was thinking that a "complete sentence" basically needs to contain a
subject, but to be honest I'm not confident about that.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-09-09 Thread Amit Kapila
On Thu, Sep 10, 2015 at 4:16 AM, Robert Haas  wrote:
>
> On Wed, Sep 9, 2015 at 11:07 AM, Amit Kapila 
wrote:
> > On Wed, Sep 9, 2015 at 11:47 AM, Haribabu Kommi <
kommi.harib...@gmail.com>
> > wrote:
> >> With subquery, parallel scan is having some problem, please refer
below.
> >>
> >> postgres=# explain analyze select * from test01 where kinkocord not in
> >> (select kinkocord from test02 where tenpocord = '001');
> >> ERROR:  badly formatted node string "SUBPLAN :subLinkType 2
:testexpr"...
> >> CONTEXT:  parallel worker, pid 32879
> >> postgres=#
> >
> > The problem here is that readfuncs.c doesn't have support for reading
> > SubPlan nodes. I have added support for some of the nodes, but it seems
> > the SubPlan node also needs to be added.  Now I think this is okay if the
> > SubPlan is any node other than Funnel, but if the SubPlan contains a
> > Funnel, then each worker needs to spawn other workers to execute the
> > SubPlan, which I am not sure is the best way.  Another possibility could
> > be to store the results of the SubPlan in a tuplestore or some other
> > structure and then pass those to the workers, which again doesn't sound
> > promising considering we might have a hashed SubPlan for which we need to
> > build a hash table.  Yet another way could be, for such cases, to execute
> > the Filter in the master node only.
>
> IIUC, there are two separate issues here:
>

Yes.

> 1. We need to have readfuncs support for all the right plan nodes.
> Maybe we should just bite the bullet and add readfuncs support for all
> plan nodes.  But if not, we can add support for whatever we need.
>
> 2. I think it's probably a good idea - at least for now, and maybe
> forever - to avoid nesting parallel plans inside of other parallel
> plans.  It's hard to imagine that being a win in a case like this, and
> it certainly adds a lot more cases to think about.
>

I also think that avoiding nested parallel plans is a good step forward.
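
On the readfuncs side (Robert's first point), the missing reader looks
mostly mechanical to add. Below is an untested sketch of what it could look
like in src/backend/nodes/readfuncs.c, with the field list simply mirroring
_outSubPlan() in outfuncs.c; a matching "SUBPLAN" case would also be needed
in parseNodeString():

static SubPlan *
_readSubPlan(void)
{
    READ_LOCALS(SubPlan);

    READ_ENUM_FIELD(subLinkType, SubLinkType);
    READ_NODE_FIELD(testexpr);
    READ_NODE_FIELD(paramIds);
    READ_INT_FIELD(plan_id);
    READ_STRING_FIELD(plan_name);
    READ_OID_FIELD(firstColType);
    READ_INT_FIELD(firstColTypmod);
    READ_OID_FIELD(firstColCollation);
    READ_BOOL_FIELD(useHashTable);
    READ_BOOL_FIELD(unknownEqFalse);
    READ_NODE_FIELD(setParam);
    READ_NODE_FIELD(parParam);
    READ_NODE_FIELD(args);
    READ_FLOAT_FIELD(startup_cost);
    READ_FLOAT_FIELD(per_call_cost);

    READ_DONE();
}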


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-09-09 Thread Amit Kapila
On Thu, Sep 10, 2015 at 9:29 AM, Fujii Masao  wrote:
>
> On Thu, Sep 10, 2015 at 12:49 PM, Amit Kapila 
wrote:
> > On Wed, Sep 9, 2015 at 6:43 PM, Fujii Masao 
wrote:
> >>
> >> On Fri, Sep 4, 2015 at 4:48 PM, Amit Kapila 
> >> wrote:
> >>
> >
> > - errdetail("Could not rename \"%s\" to \"%s\": %m.",
> > + errdetail("\"%s\" could not be renamed to \"%s\": %m.",
> >
> > Is there any reason to change this message?
> > I think you have changed this message to make it somewhat similar with
> > the new message we are planning to use in this function, but I don't see
> > that as compelling reason to change this message.
>
> The following part in error message style guide made me feel that's
better.
> IOW, I didn't think that the previous message follows complete-sentence
style.
>
> -
> Detail and hint messages: Use complete sentences, and end each with a
period.
> -
>

I am not sure if this indicates that the previously used message was wrong.
If we look for errdetail in the code, then there are many other similar
messages.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] checkpointer continuous flushing

2015-09-09 Thread Amit Kapila
On Wed, Sep 9, 2015 at 2:31 PM, Fabien COELHO  wrote:
>
>
> Hello Amit,
>
>>> I think that we may conclude, on these run:
>>>
>>> (1) sorting seems not to harm performance, and may help a lot.
>>
>>
>> I agree with first part, but about helping a lot, I am not sure
>
>
> I'm focusing on the "sort" dimension alone, that is I'm comparing the
> average tps performance with sorting against the same test without sorting.
> There are 4 cases from your tests, if I'm not mistaken:
>
>  - T1 flush=off  27480 -> 27482 :   +0.0%
>  - T1 flush=on   25214 -> 26819 :   +6.3%
>  - T2 flush=off   5050 ->  6194 :  +22.6%
>  - T2 flush=on    2771 ->  6110 : +120.4%
>

There is a clear win only in cases where sort is used together with flush;
apart from that, using sort alone doesn't show any clear advantage.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-09-09 Thread Fujii Masao
On Thu, Sep 10, 2015 at 12:49 PM, Amit Kapila  wrote:
> On Wed, Sep 9, 2015 at 6:43 PM, Fujii Masao  wrote:
>>
>> On Fri, Sep 4, 2015 at 4:48 PM, Amit Kapila 
>> wrote:
>> >
>> > You mean to say, just try renaming tablespace_map and don't display any
>> > message whether that is successful or not-successful?
>> >
>> > I see some user inconvenience if we do this way, which is even after the
>> > backup is cancelled, on next recovery, there will be a log message
>> > indicating
>> > either rename of tablespace_map successful or unsuccessful.  Also, don't
>> > you
>> > think it is better to let user know that the tablespace_map file is
>> > successfully
>> > renamed as we do for backup_label file.  Shall we change the patch such
>> > that
>> > if backup_label is successfully renamed and renaming of tablespace_map
>> > gets failed, then display a log message to something like below:
>> >
>> > LOG:  online backup mode canceled
>> > DETAIL:  "backup_label" was renamed to "backup_label.old", could not
>> > rename
>> > "tablespace_map" to "tablespace_map.old"
>>
>> Agreed with this direction. So what about the attached patch?
>>
>
> - errdetail("Could not rename \"%s\" to \"%s\": %m.",
> + errdetail("\"%s\" could not be renamed to \"%s\": %m.",
>
> Is there any reason to change this message?
> I think you have changed this message to make it somewhat similar with
> the new message we are planning to use in this function, but I don't see
> that as compelling reason to change this message.

The following part of the error message style guide made me feel the new
wording is better. IOW, I didn't think that the previous message followed
the complete-sentence style.

-
Detail and hint messages: Use complete sentences, and end each with a period.
-

> Other than that patch looks good.

Thanks for the review!

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-09-09 Thread Amit Kapila
On Wed, Sep 9, 2015 at 6:43 PM, Fujii Masao  wrote:
>
> On Fri, Sep 4, 2015 at 4:48 PM, Amit Kapila 
wrote:
> >
> > You mean to say, just try renaming tablespace_map and don't display any
> > message whether that is successful or not-successful?
> >
> > I see some user inconvenience if we do this way, which is even after the
> > backup is cancelled, on next recovery, there will be a log message
> > indicating
> > either rename of tablespace_map successful or unsuccessful.  Also,
don't you
> > think it is better to let user know that the tablespace_map file is
> > successfully
> > renamed as we do for backup_label file.  Shall we change the patch such
that
> > if backup_label is successfully renamed and renaming of tablespace_map
> > gets failed, then display a log message to something like below:
> >
> > LOG:  online backup mode canceled
> > DETAIL:  "backup_label" was renamed to "backup_label.old", could not
rename
> > "tablespace_map" to "tablespace_map.old"
>
> Agreed with this direction. So what about the attached patch?
>

- errdetail("Could not rename \"%s\" to \"%s\": %m.",
+ errdetail("\"%s\" could not be renamed to \"%s\": %m.",

Is there any reason to change this message?
I think you have changed it to make it somewhat similar to the new message
we are planning to use in this function, but I don't see that as a
compelling reason to change this message.

Other than that patch looks good.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-09 Thread Tom Lane
David Fetter  writes:
> On Thu, Sep 10, 2015 at 01:32:10AM +0200, Andres Freund wrote:
>> I'm not convinced. Sure, it's easy, but I by now have written the
>> respective function dozens of times. Why should we force that on
>> everyone?

> +20 or so here, that being the number of times I recall offhand having
> written the function.

Were all twenty of them exactly alike?  Were they identical to Andres'
several dozen attempts?

The problem I've got with this proposal is that by the time you get to
a function that could satisfy every possible use-case, you do not have
something that is easier to use than "write your own function that
addresses just your use-case".

The only complaint I've seen in this thread that seems like a valid
deficiency is that RAISE can't deal with treating the error severity level
as a variable.  But surely we should address that as a new RAISE feature,
not by inventing a SQL wrapper that will need to reproduce every existing
RAISE feature before it can think about solving anything new.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: function parse_ident

2015-09-09 Thread Pavel Stehule
Hi

2015-09-09 21:55 GMT+02:00 Alvaro Herrera :

> Pavel Stehule wrote:
>
> > I cannot use the current SplitIdentifierString because it is designed
> > for a different purpose - it cannot separate the non-identifier part.
> > But the code is simple - and will be cleaned up.
> >  postgres=# select * from parse_ident('"AHOJ".NAZDAR[]'::text);
> > ┌───────────────┬───────┐
> > │     parts     │ other │
> > ╞═══════════════╪═══════╡
> > │ {AHOJ,nazdar} │ []    │
> > └───────────────┴───────┘
> > (1 row)
>
> Um.  Now this is really getting into much of the same territory I got
> into with the objname/objargs arrays for pg_get_object_address.  I think
> the "other" bit is a very poor solution to that.
>
> If you want to be able to parse names for all kinds of objects, you need
> a solution much more complex than this function.  I think a clean
> solution returns three sets of things; one is the primary part of the
> name, which is an array of text; the second is the secondary name, which
> is another array of text; the third is an array of TypeName.
>
> For the name of a relation, only the first of these arrays is used.  For
> the name of objects subsidiary to a relation, the first two are used
> (the first array is the name of the relation itself, and the second
> array is the name of the object; for instance a constraint name, or a
> trigger name).
>
> The array of type names is necessary because the parsing of TypeName is
> completely unlike parsing of plain names.  You need [] decorator and
> typmod.  If you consider objects such as casts, you need two TypeNames
> ("from" and "to"), hence this is an array and not a single one.  As far
> as I recall there are other object types that also need more than one
> TypeName.
>
> For the name of a function, you need the first text array, and the array
> of TypeName which are the input argument types.
>
> If you don't want to have all this complexity, I think you need to forgo
> the idea of the "other" thingy that you propose above, and just concern
> yourself with the first bits.  I don't think "AHOJ".NAZDAR[] is an
> identifier.
>

Yes, usually I don't need the "other" part, and if I do need it, I can get
it as the difference against the original string. But sometimes you don't
get a clean string - and you have to find the end of the identifier.
SplitIdentifierString only works with a separator character, and it cannot
find the end of an identifier. So a slightly modified API could look like

CREATE OR REPLACE FUNCTION parse_ident(str text, strict boolean DEFAULT
true) RETURNS text[]

which raises a "syntax error" exception for '"AHOJ".NAZDAR[]' when "strict"
is true, and returns "AHOJ".nazdar for '"AHOJ".NAZDAR[]' when "strict" is
false.

Pavel



> --
> Álvaro Herrera                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-09 Thread Pavel Stehule
2015-09-10 2:47 GMT+02:00 David Fetter :

> On Thu, Sep 10, 2015 at 01:32:10AM +0200, Andres Freund wrote:
> > On 2015-09-09 18:27:51 -0400, Robert Haas wrote:
> > > On Wed, Sep 9, 2015 at 11:37 AM, dinesh kumar 
> wrote:
> > > > Sure, it’s a clear fact that, we can implement this function
> > > > with RAISE statements.
> > >
> > > Given that, I suggest we just forget the whole thing.
> >
> > I'm not convinced. Sure, it's easy, but I by now have written the
> > respective function dozens of times. Why should we force that on
> > everyone?
>
> +20 or so here, that being the number of times I recall offhand having
> written the function.
>

I have the same opinion - it is a pretty frequently needed function.

Pavel


>
> Cheers,
> David.
> --
> David Fetter  http://fetter.org/
> Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
> Skype: davidfetter  XMPP: david.fet...@gmail.com
>
> Remember to vote!
> Consider donating to Postgres: http://www.postgresql.org/about/donate
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] Freeze avoidance of very large table.

2015-09-09 Thread Andres Freund
On 2015-09-04 23:35:42 +0100, Simon Riggs wrote:
> This looks OK. You saw that I was proposing to solve this problem a
> different way ("Summary of plans to avoid the annoyance of Freezing"),
> suggesting that we wait for a few CFs to see if a patch emerges for that -
> then fall back to this patch if it doesn't? So I am moving this patch to
> next CF.

As noted on that other thread I don't think that's a good policy, and it
seems like Robert agrees with me. So I think we should move this back to
"Needs Review".

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Do Layered Views/Relations Preserve Sort Order ?

2015-09-09 Thread Charles Sheridan

On 9/9/15 7:44 PM, David G. Johnston wrote:
On Wed, Sep 9, 2015 at 7:53 PM, Charles Sheridan wrote:


Hi All,

When there are several views defined on top of each other, are
SELECTs on views that do not specify a SORT order guaranteed to
preserve the cumulative sort order of the lower-level views ?

Is the answer true for any arbitrarily large set of layered views?

Is the answer the same if the layers of relations are a mix of
views and tables ?


The answer to any question as broad and non-specific as yours is
likely to be answered with a no.


The better question is how expensive it is to sort already sorted
data.  If it's cheap, and it likely is, then placing explicit sorting
where you care is the best solution regardless of your level of
confidence that lower-level sorting is being maintained.


Since tables are never sorted I don't get why you think they enter 
into the equation.


If one operates under the guideline that only top-layer queries
should contain ORDER BY, then your whole structure is unsound.
Presumably those queries you rely upon are or were themselves 
considered top-level queries at one point and now you are adding a 
dependent to them that they likely were never intended to consider. 
Simplification queries should not use ORDER BY unless it is necessary 
to implement their logic.  A query whose logic depends on order really 
should declare that fact.


David J.


David,  yes, I agree that sorting at the end is the highest-confidence 
approach.  I don't (yet) have a large stack of views with an assumption 
of a guaranteed underlying sort order, I'm just trying to get a better 
sense of what Postgres behavior I can reasonably expect here.


Thanks, Charles



Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-09 Thread David Fetter
On Thu, Sep 10, 2015 at 01:32:10AM +0200, Andres Freund wrote:
> On 2015-09-09 18:27:51 -0400, Robert Haas wrote:
> > On Wed, Sep 9, 2015 at 11:37 AM, dinesh kumar  
> > wrote:
> > > Sure, it’s a clear fact that, we can implement this function
> > > with RAISE statements.
> > 
> > Given that, I suggest we just forget the whole thing.
> 
> I'm not convinced. Sure, it's easy, but I by now have written the
> respective function dozens of times. Why should we force that on
> everyone?

+20 or so here, that being the number of times I recall offhand having
written the function.

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Do Layered Views/Relations Preserve Sort Order ?

2015-09-09 Thread David G. Johnston
On Wed, Sep 9, 2015 at 7:53 PM, Charles Sheridan  wrote:

> Hi All,
>
> When there are several views defined on top of each other, are SELECTs on
> views that do not specify a SORT order guaranteed to preserve the
> cumulative sort order of the lower-level views ?
>
> Is the answer true for any arbitrarily large set of layered views?
>
> Is the answer the same if the layers of relations are a mix of views and
> tables ?
>

The answer to any question as broad and non-specific as yours is likely to
be answered with a no.

The better question is how expensive it is to sort already sorted data.  If
it's cheap, and it likely is, then placing explicit sorting where you care
is the best solution regardless of your level of confidence that lower-level
sorting is being maintained.

Since tables are never sorted I don't get why you think they enter into the
equation.

If one operates under the guideline that only top-layer queries should
contain ORDER BY, then your whole structure is unsound.  Presumably those
queries you rely upon are or were themselves considered top-level queries
at one point and now you are adding a dependent to them that they likely
were never intended to consider.  Simplification queries should not use
ORDER BY unless it is necessary to implement their logic.  A query whose
logic depends on order really should declare that fact.

David J.


[HACKERS] Do Layered Views/Relations Preserve Sort Order ?

2015-09-09 Thread Charles Sheridan

Hi All,

When there are several views defined on top of each other, are SELECTs 
on views that do not specify a SORT order guaranteed to preserve the 
cumulative sort order of the lower-level views ?


Is the answer true for any arbitrarily large set of layered views?

Is the answer the same if the layers of relations are a mix of views and 
tables ?



Best, Charles


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-09 Thread Andres Freund
On 2015-09-09 18:27:51 -0400, Robert Haas wrote:
> On Wed, Sep 9, 2015 at 11:37 AM, dinesh kumar  wrote:
> > Sure, it’s a clear fact that, we can implement this function with RAISE
> > statements.
> 
> Given that, I suggest we just forget the whole thing.

I'm not convinced. Sure, it's easy, but I by now have written the
respective function dozens of times. Why should we force that on
everyone?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-09 Thread Jim Nasby

On 9/9/15 5:27 PM, Robert Haas wrote:

Sure, it’s a clear fact that, we can implement this function with RAISE
>statements.

Given that, I suggest we just forget the whole thing.


Except that you can't use a variable to control the log level in a 
plpgsql RAISE, so then you end up with a CASE statement. That perhaps 
wouldn't be so bad, until you also want to optionally report detail, 
hint, and/or errcode. So trying to create a generic wrapper around RAISE 
is decidedly non-trivial. Another option is removing those restrictions 
from RAISE, but it seems a bit silly to take the hit of firing up a 
plpgsql function for this.


So I think there is value in a SQL equivalent to RAISE. I'm not thrilled 
by piling another hack onto the horribly inadequate errlevel 
infrastructure, but at least Dinesh's "MESSAGE" idea is essentially just 
side-stepping that hole instead of digging it deeper.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-09-09 Thread Robert Haas
On Wed, Sep 9, 2015 at 11:07 AM, Amit Kapila  wrote:
> On Wed, Sep 9, 2015 at 11:47 AM, Haribabu Kommi 
> wrote:
>> With subquery, parallel scan is having some problem, please refer below.
>>
>> postgres=# explain analyze select * from test01 where kinkocord not in
>> (select kinkocord from test02 where tenpocord = '001');
>> ERROR:  badly formatted node string "SUBPLAN :subLinkType 2 :testexpr"...
>> CONTEXT:  parallel worker, pid 32879
>> postgres=#
>
> The problem here is that readfuncs.c doesn't have support for reading
> SubPlan nodes. I have added support for some of the nodes, but it seems
> the SubPlan node also needs to be added.  Now I think this is okay if the
> SubPlan is any node other than Funnel, but if the SubPlan contains a
> Funnel, then each worker needs to spawn other workers to execute the
> SubPlan, which I am not sure is the best way.  Another possibility could
> be to store the results of the SubPlan in a tuplestore or some other
> structure and then pass those to the workers, which again doesn't sound
> promising considering we might have a hashed SubPlan for which we need to
> build a hash table.  Yet another way could be, for such cases, to execute
> the Filter in the master node only.

IIUC, there are two separate issues here:

1. We need to have readfuncs support for all the right plan nodes.
Maybe we should just bite the bullet and add readfuncs support for all
plan nodes.  But if not, we can add support for whatever we need.

2. I think it's probably a good idea - at least for now, and maybe
forever - to avoid nesting parallel plans inside of other parallel
plans.  It's hard to imagine that being a win in a case like this, and
it certainly adds a lot more cases to think about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-09 Thread Robert Haas
On Wed, Sep 9, 2015 at 11:37 AM, dinesh kumar  wrote:
> I am admitting here that, I don’t know the real use case behind this
> proposal in our TODO list. I thought, having ereport wrapper at SQL level,
> gives a default debugging behavior for the end users, and this is the only
> real use case I see.
>
...
> Sure, it’s a clear fact that, we can implement this function with RAISE
> statements.

Given that, I suggest we just forget the whole thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Modernizing contrib/tablefunc

2015-09-09 Thread Joe Conway
On 09/09/2015 03:03 PM, David Fetter wrote:
> Folks,
> 
> While doing some research for the upcoming (UN)PIVOT proposal, I
> noticed that there were some hairy and no-longer-needed bits in the
> regression tests for tablefunc, which I have shaved off with the
> attached patch.
> 
> What say?

Is the entire patch just eliminating '' by using dollar quoting? I'm not
against the idea, but it hardly seems worth the effort.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: [HACKERS] jsonb_concat: make sure we always return a non-scalar value

2015-09-09 Thread Jim Nasby

On 9/9/15 12:03 PM, Oskari Saarenmaa wrote:

andrew=# select '[1]'::jsonb || '{"a":"b"}';
     ?column?
------------------
 [1, {"a": "b"}]


It looks wrong, but I'm not sure what's right in that case.  I think
it'd make sense to just restrict concatenation to object || object,
array || array and array || scalar and document that.  I doubt many
people expect their objects to turn into arrays if they accidentally
concatenate an array into it.  Alternatively the behavior could depend
on the order of arguments for concatenation, array || anything -> array,
object || array -> error.


That definitely doesn't sound like a good default.

It might be useful to have a concat function that would concatenate
anything into an array. But if we don't provide one by default, users
could always create their own with json_typeof() and json_build_array().

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Modernizing contrib/tablefunc

2015-09-09 Thread David Fetter
Folks,

While doing some research for the upcoming (UN)PIVOT proposal, I
noticed that there were some hairy and no-longer-needed bits in the
regression tests for tablefunc, which I have shaved off with the
attached patch.

What say?

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
diff --git a/contrib/tablefunc/expected/tablefunc.out b/contrib/tablefunc/expected/tablefunc.out
index fffadc6..6a43906 100644
--- a/contrib/tablefunc/expected/tablefunc.out
+++ b/contrib/tablefunc/expected/tablefunc.out
@@ -14,7 +14,7 @@ SELECT avg(normal_rand)::int FROM normal_rand(100, 250, 0.2);
 --
 CREATE TABLE ct(id int, rowclass text, rowid text, attribute text, val text);
 \copy ct from 'data/ct.data'
-SELECT * FROM crosstab2('SELECT rowid, attribute, val FROM ct where rowclass = 
''group1'' and (attribute = ''att2'' or attribute = ''att3'') ORDER BY 1,2;');
+SELECT * FROM crosstab2($$SELECT rowid, attribute, val FROM ct where rowclass 
= 'group1' and (attribute = 'att2' or attribute = 'att3') ORDER BY 1,2;$$);
  row_name | category_1 | category_2 
 --++
  test1| val2   | val3
@@ -22,7 +22,7 @@ SELECT * FROM crosstab2('SELECT rowid, attribute, val FROM ct 
where rowclass = '
   | val10  | val11
 (3 rows)
 
-SELECT * FROM crosstab3('SELECT rowid, attribute, val FROM ct where rowclass = 
''group1'' and (attribute = ''att2'' or attribute = ''att3'') ORDER BY 1,2;');
+SELECT * FROM crosstab3($$SELECT rowid, attribute, val FROM ct where rowclass 
= 'group1' and (attribute = 'att2' or attribute = 'att3') ORDER BY 1,2;$$);
  row_name | category_1 | category_2 | category_3 
 --+++
  test1| val2   | val3   | 
@@ -30,7 +30,7 @@ SELECT * FROM crosstab3('SELECT rowid, attribute, val FROM ct 
where rowclass = '
   | val10  | val11  | 
 (3 rows)
 
-SELECT * FROM crosstab4('SELECT rowid, attribute, val FROM ct where rowclass = 
''group1'' and (attribute = ''att2'' or attribute = ''att3'') ORDER BY 1,2;');
+SELECT * FROM crosstab4($$SELECT rowid, attribute, val FROM ct where rowclass 
= 'group1' and (attribute = 'att2' or attribute = 'att3') ORDER BY 1,2;$$);
  row_name | category_1 | category_2 | category_3 | category_4 
 --++++
  test1| val2   | val3   || 
@@ -38,7 +38,7 @@ SELECT * FROM crosstab4('SELECT rowid, attribute, val FROM ct 
where rowclass = '
   | val10  | val11  || 
 (3 rows)
 
-SELECT * FROM crosstab2('SELECT rowid, attribute, val FROM ct where rowclass = 
''group1'' ORDER BY 1,2;');
+SELECT * FROM crosstab2($$SELECT rowid, attribute, val FROM ct where rowclass 
= 'group1' ORDER BY 1,2;$$);
  row_name | category_1 | category_2 
 --++
  test1| val1   | val2
@@ -46,7 +46,7 @@ SELECT * FROM crosstab2('SELECT rowid, attribute, val FROM ct 
where rowclass = '
   | val9   | val10
 (3 rows)
 
-SELECT * FROM crosstab3('SELECT rowid, attribute, val FROM ct where rowclass = 
''group1'' ORDER BY 1,2;');
+SELECT * FROM crosstab3($$SELECT rowid, attribute, val FROM ct where rowclass 
= 'group1' ORDER BY 1,2;$$);
  row_name | category_1 | category_2 | category_3 
 --+++
  test1| val1   | val2   | val3
@@ -54,7 +54,7 @@ SELECT * FROM crosstab3('SELECT rowid, attribute, val FROM ct 
where rowclass = '
   | val9   | val10  | val11
 (3 rows)
 
-SELECT * FROM crosstab4('SELECT rowid, attribute, val FROM ct where rowclass = 
''group1'' ORDER BY 1,2;');
+SELECT * FROM crosstab4($$SELECT rowid, attribute, val FROM ct where rowclass 
= 'group1' ORDER BY 1,2;$$);
  row_name | category_1 | category_2 | category_3 | category_4 
 --++++
  test1| val1   | val2   | val3   | val4
@@ -62,49 +62,49 @@ SELECT * FROM crosstab4('SELECT rowid, attribute, val FROM 
ct where rowclass = '
   | val9   | val10  | val11  | val12
 (3 rows)
 
-SELECT * FROM crosstab2('SELECT rowid, attribute, val FROM ct where rowclass = 
''group2'' and (attribute = ''att1'' or attribute = ''att2'') ORDER BY 1,2;');
+SELECT * FROM crosstab2($$SELECT rowid, attribute, val FROM ct where rowclass 
= 'group2' and (attribute = 'att1' or attribute = 'att2') ORDER BY 1,2;$$);
  row_name | category_1 | category_2 
 --++
  test3| val1   | val2
  test4| val4   | val5
 (2 rows)
 
-SELECT * FROM crosstab3('SELECT rowid, attribute, val FROM ct where rowclass = 
''group2'' and (attribute = ''att1'' or attribute = ''att2'') ORDER BY 1,2;');
+SELECT * FROM crosstab3($$SELECT rowid, attribute, val FROM ct where

Re: [HACKERS] Raising our compiler requirements for 9.6

2015-09-09 Thread Andres Freund
On 2015-08-27 11:31:44 -0400, Tom Lane wrote:
> Needs a bit of copy-editing in places, but +1 overall.

Heres a slightly expanded version of this. I tried to do some of the
copy-editing, but I'm sure a look from a native speaker won't hurt.

Greetings,

Andres Freund
diff --git a/doc/src/sgml/sources.sgml b/doc/src/sgml/sources.sgml
index d6461ec..66a4629 100644
--- a/doc/src/sgml/sources.sgml
+++ b/doc/src/sgml/sources.sgml
@@ -851,4 +851,109 @@ BETTER: unrecognized node type: 42
 
   
 
+  
+   Miscellaneous Coding Conventions
+
+   
+C Standard
+
+ Code in PostgreSQL should only rely on language
+ features available in the C89 standard. That means a conforming
+ C89 compiler has to be able to compile postgres, at least aside
+ from a few platform dependent pieces. Features from later
+ revisions of the C standard or compiler-specific features can be
+ used, if a fallback is provided.
+
+
+ For example static inline and
+ _StaticAssert() are currently used, even
+ though they are from newer revisions of the C standard. If not
+ available we respectively fall back to defining the functions
+ without inline, and to using a C89 compatible replacement that
+ performs the same checks, but emits rather cryptic messages.
+
+   
+
+   
+Function-Like Macros and Inline Functions
+
+ Both macros with arguments and static inline
+ functions may be used. The latter are preferable if there are
+ multiple-evaluation hazards when written as a macro, as is e.g.
+ the case with
+
+#define Max(x, y)   ((x) > (y) ? (x) : (y))
+
+ or when the macro would be very long. In other cases it's only
+ possible to use macros, or at least easier, for example because
+ expressions of various types need to be passed to the macro.
+
+
+ When the definition of an inline function references symbols
+ (i.e. variables, functions) that are only available as part of the
+ backend, the function may not be visible when included from frontend
+ code.
+
+#ifndef FRONTEND
+static inline MemoryContext
+MemoryContextSwitchTo(MemoryContext context)
+{
+MemoryContext old = CurrentMemoryContext;
+
+CurrentMemoryContext = context;
+return old;
+}
+#endif   /* FRONTEND */
+
+ In this example CurrentMemoryContext, which is only
+ available in the backend, is referenced and the function is thus
+ hidden with a #ifndef FRONTEND. This rule
+ exists because some compilers emit references to symbols
+ contained in inline functions even if the function is not used.
+
+   
+
+   
+Writing Signal Handlers
+
+ To be suitable to run inside a signal handler, code has to be
+ written very carefully. The fundamental problem is that, unless
+ blocked, a signal handler can interrupt code at any time. If code
+ inside the signal handler uses the same state as code outside,
+ chaos may ensue. As an example consider what happens if a signal
+ handler tries to acquire a lock that's already held in the
+ interrupted code.
+
+
+ Barring special arrangements, code in signal handlers may only
+ call async-signal-safe functions (as defined in POSIX) and access
+ variables of type volatile sig_atomic_t. A few
+ functions in postgres are also deemed signal safe, importantly
+ SetLatch().
+
+
+ In most cases signal handlers should do nothing more than note
+ that a signal has arrived, and wake up code running outside of
+ the handler using a latch. An example of such a handler is the
+ following:
+
+static void
+handle_sighup(SIGNAL_ARGS)
+{
+int save_errno = errno;
+
+got_SIGHUP = true;
+SetLatch(MyLatch);
+
+errno = save_errno;
+}
+
+ errno is saved and restored because
+ SetLatch() might change it. If that were not done,
+ interrupted code that's currently inspecting errno might see the wrong
+ value.
+
+   
+
+  
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-09 Thread Kevin Grittner
Robert Haas  wrote:

> I think the problem we should be trying to solve is: Given a set
> of server IPs, connect to one that is up.
>
> I believe this comes up in several different scenarios.
>
> Example #1: [single server; changing IP address gracefully]
>
> Example #2: [xDB/BDR client uses local master if up; else a remote]
>
> Example #3: [SR/HS with changing primary]

For all of those it is clear that we do not need (or want!)
heartbeat, STONITH, fencing, etc. to be handled by the connector.
If the above are the sorts of problems we are trying to solve, a
very simple solution is the best.  I know you outlined several; I'm
not sure it would matter all that much which one we used -- any
would work and someone should Just Do It.
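
For context, the workaround available to clients today is simply to loop
over a list of candidate conninfo strings until one answers. A minimal
sketch using plain libpq (the host names are made up, and nothing here is
from the proposed patch):

#include <stdio.h>
#include "libpq-fe.h"

/* Try each conninfo string in turn and return the first live connection. */
static PGconn *
connect_first_available(const char **conninfos, int n)
{
    int         i;

    for (i = 0; i < n; i++)
    {
        PGconn     *conn = PQconnectdb(conninfos[i]);

        if (PQstatus(conn) == CONNECTION_OK)
            return conn;

        fprintf(stderr, "could not connect to \"%s\": %s",
                conninfos[i], PQerrorMessage(conn));
        PQfinish(conn);
    }
    return NULL;
}

/*
 * Example (hypothetical hosts):
 *     const char *candidates[] = {
 *         "host=pg1.example.com dbname=app",
 *         "host=pg2.example.com dbname=app"
 *     };
 *     PGconn *conn = connect_first_available(candidates, 2);
 */

Whatever syntax is chosen, the point would be for libpq to do the equivalent
of this loop internally, given a list of server addresses.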

> I'm sure there are more.

Ay, there's the rub.  Some people seemed to be suggesting that this
should be far more than what you describe above.  That would, IMO,
be a mistake.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-09-09 Thread Fabien COELHO



It would replace what is currently an array.


It'd still be one afterwards.
[...]
extract/reinsert is actually O(1).


Hm, strange. I probably did not understand the heap structure you're
suggesting at all. No big deal.



[...] Why would a heap as I've described it require that?


Hmmm... The heap does *not* require anything, the *balancing* requires 
this property.



[...] There's no "proved and heavily tested code" touched here.


I've proved and heavily tested the submitted patch based on an array, which
you want to replace with some heap, so I think that my point stands.


Moreover, I do not see a clear benefit in changing the data structure.


So I would prefer to keep the code as is, that is pretty straightforward,
and wait for a strong incentive before doing anything fancier.


I find the proposed code not particularly pretty, so I don't really buy
the straightforwardness argument.


No big deal. From my point of view, the data structure change you're 
suggesting does not bring significant value, so there is no good reason to 
do it.


If you want to submit another patch, this is free software, please 
proceed.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: function parse_ident

2015-09-09 Thread Alvaro Herrera
Pavel Stehule wrote:

> I cannot use the current SplitIdentifierString because it is designed for
> a different purpose - it cannot separate the non-identifier part. But the
> code is simple - and will be cleaned up.
> 
>  postgres=# select * from parse_ident('"AHOJ".NAZDAR[]'::text);
> ┌───────────────┬───────┐
> │     parts     │ other │
> ╞═══════════════╪═══════╡
> │ {AHOJ,nazdar} │ []    │
> └───────────────┴───────┘
> (1 row)

Um.  Now this is really getting into much of the same territory I got
into with the objname/objargs arrays for pg_get_object_address.  I think
the "other" bit is a very poor solution to that.

If you want to be able to parse names for all kinds of objects, you need
a solution much more complex than this function.  I think a clean
solution returns three sets of things; one is the primary part of the
name, which is an array of text; the second is the secondary name, which
is another array of text; the third is an array of TypeName.

For the name of a relation, only the first of these arrays is used.  For
the name of objects subsidiary to a relation, the first two are used
(the first array is the name of the relation itself, and the second
array is the name of the object; for instance a constraint name, or a
trigger name).

The array of type names is necessary because the parsing of TypeName is
completely unlike parsing of plain names.  You need [] decorator and
typmod.  If you consider objects such as casts, you need two TypeNames
("from" and "to"), hence this is an array and not a single one.  As far
as I recall there are other object types that also need more than one
TypeName.

For the name of a function, you need the first text array, and the array
of TypeName which are the input argument types.

If you don't want to have all this complexity, I think you need to forgo
the idea of the "other" thingy that you propose above, and just concern
yourself with the first bits.  I don't think "AHOJ".NAZDAR[] is an
identifier.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-09-09 Thread Fabien COELHO


As I wrote before, FreeBSD would be a good candidate because the 
posix_fadvise seems much more reasonable than on Linux, and should be 
profitable, so it would be a pity to remove it.


Why do you think it's different on fbsd? Also, why is it unreasonable
that DONTNEED removes stuff from the cache?


Yep, I agree that this part is a drawback, obviously, but at least there is
also some advantage: my understanding is that buffers are actually pushed
towards the disk when calling posix_fadvise with DONTNEED on FreeBSD, so
in-buffer tests should see better performance, while out-of-buffer in-memory
tests would probably be degraded, as Amit's tests showed on Linux. As an
admin, I can choose, provided I know whether the run fits in buffers or not.


On Linux, either the call is ignored (if the page has not been written yet)
or the page is bluntly removed, so it has either no effect or a bad effect,
basically.


So I think that the current off default when running with posix_fadvise is
reasonable, and in some cases turning it on can probably provide better
performance stability, especially for in-buffer runs.
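
For reference, the hint under discussion boils down to something like the
call below, issued after a range of a file has been written out (a
simplified sketch, not the patch's actual flushing code):

#include <fcntl.h>

/*
 * Tell the OS that the given range of the file is not needed anymore.
 * What exactly happens to cached (and still dirty) pages is platform
 * dependent, which is the crux of the discussion above.
 */
static void
hint_dontneed(int fd, off_t offset, off_t nbytes)
{
#if defined(POSIX_FADV_DONTNEED)
    (void) posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
#endif
}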


Now, frankly, I do not care much about FreeBSD or Windows, so I'm fine
with dropping posix_fadvise if this is a blocker.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-09-09 Thread Andres Freund
On 2015-09-09 21:29:12 +0200, Fabien COELHO wrote:
> >Imagine a binaryheap.h style heap over a structure like (tablespaceid,
> >progress, progress_inc, nextbuf) where the comparator compares the progress.
> 
> It would replace what is currently an array.

It'd still be one afterwards.

> The balancing code needs to enumerate all tablespaces in a round-robin
> way so as to ensure that all tablespaces are given some attention,
> otherwise you could have a balance on two tablespaces but others could
> be left out. The array makes this property straightforward.

Why would a heap as I've described it require that?

> >>Anyway, I think it would complicate the code significantly (compared to the
> >>straightforward array)
> >
> >I doubt it. I mean instead of your GetNext you'd just do:
> >   next_tblspc = DatumGetPointer(binaryheap_first(heap));
> >   if (next_tblspc == 0)
> >   return 0;
> >   next_tblspc.progress += next_tblspc.progress_slice;
> >   binaryheap_replace_first(PointerGetDatum(next_tblspc));
> >
> >   return next_tblspc.nextbuf++;
> 
> Compared to the array, this tree approach would require ln(Ntablespace)
> work to extract and reinsert the tablespace in progress, so there is no
> complexity advantage.

extract/reinsert is actually O(1).


> >progress_slice is the number of buffers in the tablespace divided by the
> >number of total buffers, to avoid doing any sort of expensive math in
> >the more frequently executed path.
> 
> If there are many buffers, I'm not too sure about rounding issues and the
> like, so the current approach with a rational seems more secure.

Meh. The amount of imbalance introduced by rounding won't matter.

> >>ISTM that using a tablespace in the sorting would reduce the complexity to
> >>ln(NBuffers) * Ntablespace for finding the boundaries, and then Nbuffers *
> >>(Ntablespace/Ntablespace) = NBuffers for scanning, at the expense of more
> >>memory and code complexity.
> >
> >Afaics finding the boundaries can be done as part of the enumeration of
> >tablespaces in BufferSync(). That code needs to be moved, but that's not
> >too bad. I don't see the code be that much more complicated?
> 
> Hmmm. You are proposing to replace proved and heavily tested code with a
> more complex tree data structure distributed quite differently around the
> source, with no very clear benefit.

There's no "proved and heavily tested code" touched here.

> So I would prefer to keep the code as is, that is pretty straightforward,
> and wait for a strong incentive before doing anything fancier.

I find the proposed code not particularly pretty, so I don't really buy
the straightforwardness argument.
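
To make the shape of what I have in mind more concrete, here is a minimal,
untested sketch on top of the existing lib/binaryheap.h API; the struct and
function names are made up for illustration only:

#include "postgres.h"

#include "lib/binaryheap.h"

/* Per-tablespace checkpoint progress, one entry per tablespace. */
typedef struct TableSpaceCkptEntry
{
    Oid         tblspc_oid;
    double      progress;       /* work done so far, scaled */
    double      progress_slice; /* increment per buffer written */
    int         nextbuf;        /* next index into the sorted buffer array */
} TableSpaceCkptEntry;

/*
 * binaryheap.h keeps the element the comparator ranks highest at the top,
 * so invert the comparison to always surface the tablespace with the least
 * progress.
 */
static int
ts_progress_cmp(Datum a, Datum b, void *arg)
{
    TableSpaceCkptEntry *ta = (TableSpaceCkptEntry *) DatumGetPointer(a);
    TableSpaceCkptEntry *tb = (TableSpaceCkptEntry *) DatumGetPointer(b);

    if (ta->progress < tb->progress)
        return 1;
    if (ta->progress > tb->progress)
        return -1;
    return 0;
}

/* Pick the next buffer to write, keeping tablespaces balanced. */
static int
next_buf_to_write(binaryheap *heap)
{
    TableSpaceCkptEntry *ts;

    if (binaryheap_empty(heap))
        return -1;              /* all done */

    ts = (TableSpaceCkptEntry *) DatumGetPointer(binaryheap_first(heap));
    ts->progress += ts->progress_slice;
    binaryheap_replace_first(heap, PointerGetDatum(ts));

    return ts->nextbuf++;
}

The comparator would be handed to binaryheap_allocate() when the
per-tablespace entries are set up in BufferSync(), and a fully written-out
version would also drop a tablespace's entry with binaryheap_remove_first()
once its buffers are exhausted, but the above should convey the structure.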

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-09-09 Thread Fabien COELHO


Hello Andres,

Wouldn't it be just as easy to put this logic into the checkpointing 
code?


Not sure it would simplify anything, because the checkpointer currently
knows about buffers but flushing is about files, which are hidden from
view.


It'd not really simplify things, but it'd keep it local.


Ok, it would be local, but it would also mean that the checkpointer would
have to deal explicitly with files, whereas it currently does not have
to.


I think that the current buffer/file boundary is, on engineering principle,
a good one, so I tried to break it as little as possible to enable the
feature, and I wanted to avoid having to do a buffer-to-file translation
twice, once in the checkpointer and once when writing the buffer.



* Wouldn't a binary heap over the tablespaces + progress be nicer?


I'm not sure where it would fit exactly.


Imagine a binaryheap.h style heap over a structure like (tablespaceid,
progress, progress_inc, nextbuf) where the comparator compares the progress.


It would replace what is currently an array. The balancing code needs to 
enumerate all tablespaces in a round-robin way so as to ensure that all 
tablespaces are given some attention, otherwise you could have a balance 
on two tablespaces but others could be left out. The array makes this 
property straightforward.




Anyway, I think it would complicate the code significantly (compared to the
straightforward array)


I doubt it. I mean instead of your GetNext you'd just do:
   next_tblspc = DatumGetPointer(binaryheap_first(heap));
   if (next_tblspc == 0)
   return 0;
   next_tblspc.progress += next_tblspc.progress_slice;
   binaryheap_replace_first(PointerGetDatum(next_tblspc));

   return next_tblspc.nextbuf++;


Compared to the array, this tree approach would require ln(Ntablespace)
work to extract and reinsert the tablespace under progress, so there is no
complexity advantage.


Moreover, given that in most cases there are 1 or 2 tablespaces, a tree 
structure is really on the heavy side.



progress_slice is the number of buffers in the tablespace divided by the
number of total buffers, to avoid doing any sort of expensive math in
the more frequently executed path.


If there are many buffers, I'm not too sure about rounding issues and the 
like, so the current approach with a rational seems more secure.



[...] I'm not seeing where you'd need an extra pointer?


Indeed, I misunderstood.


[...] What for would you need to bsearch?


To find the tablespace boundaries in the sorted buffer array in 
log(NBuffers) * Ntablespace, instead of NBuffers.



I think that the current approach is ok as the number of tablespaces
should be small.


Right that's often the case.


Yep.


ISTM that including the tablespace in the sort would reduce the complexity to
ln(NBuffers) * Ntablespace for finding the boundaries, and then Ntablespace *
(NBuffers/Ntablespace) = NBuffers for scanning, at the expense of more
memory and code complexity.


Afaics finding the boundaries can be done as part of the enumeration of
tablespaces in BufferSync(). That code needs to be moved, but that's not
too bad. I don't see the code being that much more complicated?


Hmmm. You are proposing to replace proved and heavily tested code with a
more complex tree data structure distributed quite differently around the
source, with no very clear benefit.


So I would prefer to keep the code as is, which is pretty straightforward,
and wait for a strong incentive before doing anything fancier. ISTM that
there are other places in pg that need attention more than further tweaking
of this patch.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-09-09 Thread Andres Freund
On 2015-09-09 20:56:15 +0200, Fabien COELHO wrote:
> As I wrote before, FreeBSD would be a good candidate because its
> posix_fadvise behavior seems much more reasonable than Linux's, and should be
> beneficial, so it would be a pity to remove it.

Why do you think it's different on fbsd? Also, why is it unreasonable
that DONTNEED removes stuff from the cache?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Latent cache flush hazard in RelationInitIndexAccessInfo

2015-09-09 Thread Tom Lane
Some stuff Salesforce has been doing turned up the fact that
RelationInitIndexAccessInfo is not safe against a relcache flush on
its target index.  Specifically, the failure goes like this:

* RelationInitIndexAccessInfo loads up relation->rd_indextuple.

* Within one of the subsequent catalog fetches, eg the pg_am read, a
relcache flush occurs (this can be forced with CLOBBER_CACHE_ALWAYS).

* RelationClearRelation keeps the relcache entry, since it has
positive refcount, but doesn't see a reason to keep the index-related
fields.

* RelationInitIndexAccessInfo dumps core when it tries to fetch
fields out of relation->rd_indextuple, which has been reset to NULL.

Now, it turns out this scenario cannot happen in the standard Postgres
code as it stands today (if it could, we'd have seen it in the buildfarm's
CLOBBER_CACHE_ALWAYS testing).  The reason for that is that
RelationInitIndexAccessInfo is actually only called on newly-minted
Relation structs that are not yet installed into the relcache hash table;
hence RelationCacheInvalidate can't find them to zap them.  It can happen
in Salesforce's modified code though, and it's not hard to imagine that
future changes in the community version might expose the same hazard.
(For one reason, the current technique implies that an error halfway
through relcache load results in leaking the Relation struct and
subsidiary data; we might eventually decide that's not acceptable.)

We could just ignore the problem until/unless it becomes real for us,
but now that I've figured it out I'm inclined to do something before
the knowledge disappears again.

The specific reason there's a problem is that there's a disconnect between
RelationClearRelation's test for whether a relcache entry has valid index
info (it checks whether rd_indexcxt != NULL) and the order of operations
in RelationInitIndexAccessInfo (creating the index context is not the
first thing it does).  Given that if RelationClearRelation thinks the
info is valid, what it does is call RelationReloadIndexInfo which needs
rd_index to be valid, I'm thinking the best fix is to leave the order of
operations in RelationInitIndexAccessInfo alone and instead make the
"relcache entry has index info" check be "rd_index != NULL".  There might
need to be some additional work to keep RelationReloadIndexInfo from
setting rd_isvalid = true too soon.

Thoughts?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-09-09 Thread Fabien COELHO



(3) posix_fadvise on Linux is a bad idea... the good news is that it
is not needed there:-) How good or bad an idea it is on other system
is an open question...


I don't know what the best way to verify that is; if somebody else has
access to such a m/c, please help to get that verified.


Why wouldn't we just leave it out then? Putting it in when the one 
platform we've tried it on shows a regression makes no sense.  We 
shouldn't include it and then remove it if someone can prove it's bad; 
we should only include it in the first place if we have good benchmarks 
showing that it is good.


Does anyone have a big Windows box they can try this on?


Just a box with a disk would be enough, it does not need to be big!

As I wrote before, FreeBSD would be a good candidate because its
posix_fadvise behavior seems much more reasonable than Linux's, and should be
beneficial, so it would be a pity to remove it.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 答复:[HACKERS] 答复:[HACKERS] 答复:[HACKERS] about fsync in CLOG buffer write

2015-09-09 Thread 周正中(德歌)
Hi, sorry, that was my misunderstanding.  I had not read SlruSelectLRUPage
carefully before.
/* See if page already has a buffer assigned */
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
    if (shared->page_number[slotno] == pageno &&
        shared->page_status[slotno] != SLRU_PAGE_EMPTY)
        return slotno;
}

/*
 * If we find any EMPTY slot, just select that one. Else choose a
 * victim page to replace.  We normally take the least recently used
 * valid page, but we will never take the slot containing
 * latest_page_number, even if it appears least recently used.  We
 * will select a slot that is already I/O busy only if there is no
 * other choice: a read-busy slot will not be least recently used once
 * the read finishes, and waiting for an I/O on a write-busy slot is
 * inferior to just picking some other slot.  Testing shows the slot
 * we pick instead will often be clean, allowing us to begin a read at
 * once.
 *
 * Normally the page_lru_count values will all be different and so
 * there will be a well-defined LRU page.  But since we allow
 * concurrent execution of SlruRecentlyUsed() within
 * SimpleLruReadPage_ReadOnly(), it is possible that multiple pages
 * acquire the same lru_count values.  In that case we break ties by
 * choosing the furthest-back page.
 *
 * Notice that this next line forcibly advances cur_lru_count to a
 * value that is certainly beyond any value that will be in the
 * page_lru_count array after the loop finishes.  This ensures that
 * the next execution of SlruRecentlyUsed will mark the page newly
 * used, even if it's for a page that has the current counter value.
 * That gets us back on the path to having good data when there are
 * multiple pages with the same lru_count.
 */
cur_count = (shared->cur_lru_count)++;
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
    int         this_delta;
    int         this_page_number;

    if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
        return slotno;
    this_delta = cur_count - shared->page_lru_count[slotno];
    if (this_delta < 0)
    {
        /*
         * Clean up in case shared updates have caused cur_count
         * increments to get "lost".  We back off the page counts,
         * rather than trying to increase cur_count, to avoid any
         * question of infinite loops or failure in the presence of
         * wrapped-around counts.
         */
        shared->page_lru_count[slotno] = cur_count;
        this_delta = 0;
    }
    this_page_number = shared->page_number[slotno];
    if (this_page_number == shared->latest_page_number)
        continue;
    if (shared->page_status[slotno] == SLRU_PAGE_VALID)
    {
        if (this_delta > best_valid_delta ||
            (this_delta == best_valid_delta &&
             ctl->PagePrecedes(this_page_number,
                               best_valid_page_number)))
        {
            bestvalidslot = slotno;
            best_valid_delta = this_delta;
            best_valid_page_number = this_page_number;
        }
    }
    else
    {
        if (this_delta > best_invalid_delta ||
            (this_delta == best_invalid_delta &&
             ctl->PagePrecedes(this_page_number,
                               best_invalid_page_number)))
        {
            bestinvalidslot = slotno;
            best_invalid_delta = this_delta;
            best_invalid_page_number = this_page_number;
        }
    }
}

/*
 * If all pages (except possibly the latest one) are I/O busy, we'll
 * have to wait for an I/O to complete and then retry.  In that
 * unhappy case, we choose to wait for the I/O on the least recently
 * used slot, on the assumption that it was likely initiated first of
 * all the I/Os in progress and may therefore finish first.
 */
if (best_valid_delta < 0)
{
    SimpleLruWaitIO(ctl, bestinvalidslot);
    continue;
}

/*
 * If the selected page is clean, we're set.
 */
if (!shared->page_dirty[bestvalidslot])
    return bestvalidslot;

------------------
From: Andres Freund
Sent: Wednesday, September 9, 2015 02:28
To: 周正中(德歌)
Cc: Amit Kapila, 张广舟(明虚), pgsql-hackers, 范孝剑(康贤), 曾文旌(义从), 窦贤明(执白), 萧少聪(铁庵), 陈新坚(惧留孙)
Subject: Re: [HACKERS] 答复:[HACKERS] 答复:[HACKERS] about fsync in CLOG buffer write
Hi,

On 2015-09-08 15:58:26 +0800, 周正中(德歌) wrote:
> postgres@digoal-> cat 7.sql
> select txid_current();
> 
> postgres@digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 
>10
> About 32K tps.
> progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183
> progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127

So you're benchmarking how fast you can assign txids. Is that actually
something meaningful? If you have other writes interleaved you'll write
much more WAL, so there'll be checkpoints and such.

FWIW, if you measure something realistic and there's checkpoints,
there'll be fewer fsyncs if you increase the slru buffer size - as
there'll often be clean buffers due to the checkpoint having written
them out.

Greetings,

Andres Freund


Re: [HACKERS] jsonb_concat: make sure we always return a non-scalar value

2015-09-09 Thread Oskari Saarenmaa

09.09.2015, 18:53, Andrew Dunstan wrote:

On 09/08/2015 09:54 AM, Andrew Dunstan wrote:

On 09/05/2015 02:47 AM, Oskari Saarenmaa wrote:

There was a long thread about concatenating jsonb objects to each
other, but that discussion didn't touch concatenating other types.
Currently jsonb_concat always just returns the other argument if one
of arguments is considered empty.  This causes surprising behavior
when concatenating scalar values to empty arrays:

os=# select '[]'::jsonb || '1'::jsonb;
1

os=# select '[]'::jsonb || '[1]'::jsonb;
 [1]

os=# select '[]'::jsonb || '1'::jsonb || '2'::jsonb;
 [1, 2]

os=# select '0'::jsonb || '1'::jsonb;
 [0, 1]

os=# select '{"x": "y"}'::jsonb || '[1]'::jsonb;
 [{"x": "y"}, 1]

os=# select '{"x": "y"}'::jsonb || '1'::jsonb;
ERROR:  invalid concatenation of jsonb objects

Attached a patch to fix and test this.  Also added a test case for
concatenating two scalar values which currently produces an array.
I'm not sure that behavior makes sense, but didn't want to change
that in this patch as I guess someone could consider that feature
useful.




This looks correct. Barring objection I'll apply this shortly.



Actually, I'm not sure the test is sufficient here. It looks to me like
we should only be taking this fast path if one operand is an empty array
and the other is a non-scalar array.

Otherwise we get things like this (second case is wrong, I think):

andrew=# select '[]'::jsonb || '"a"';
 ?column?
----------
 ["a"]

andrew=# select '[]'::jsonb || '{"a":"b"}';
  ?column?
------------
 {"a": "b"}

andrew=# select '[1]'::jsonb || '{"a":"b"}';
    ?column?
-----------------
 [1, {"a": "b"}]


It looks wrong, but I'm not sure what's right in that case.  I think 
it'd make sense to just restrict concatenation to object || object, 
array || array and array || scalar and document that.  I doubt many 
people expect their objects to turn into arrays if they accidentally 
concatenate an array into it.  Alternatively the behavior could depend 
on the order of arguments for concatenation, array || anything -> array, 
object || array -> error.


/ Oskari


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH v2] GSSAPI encryption support

2015-09-09 Thread Robbie Harwood
Michael Paquier  writes:

> On Wed, Sep 9, 2015 at 4:12 AM, Robbie Harwood wrote:
>> Michael Paquier writes:
>> As promised, here's a V2 to address your issues with comments.  I
>> haven't heard back on the issues you found in testing, so no other
>> changes are present.
>
> Well, the issue is still here: login through gssapi fails with your
> patch, not with HEAD. This patch is next on my review list by the way
> so I'll see what I can do about it soon even if I am in the US for
> Postgres Open next week. Still, how did you test it? I am just
> creating a KDC by myself, setting up a valid credential with kinit,
> and after setting up Postgres for this purpose the protocol
> communication just fails.

My KDC is setup through freeIPA; I create a service for postgres,
acquire a keytab, set it in the config file, and fire up the server.  It
should go without saying that this is working for me, which is why I
asked you for more information so I could try to debug.  I wrote a post
on this back in June when this was still in development:
http://mivehind.net/page/view-page-slug/16/postgres-kerberos


signature.asc
Description: PGP signature


Re: [HACKERS] jsonb_concat: make sure we always return a non-scalar value

2015-09-09 Thread Andrew Dunstan



On 09/08/2015 09:54 AM, Andrew Dunstan wrote:



On 09/05/2015 02:47 AM, Oskari Saarenmaa wrote:
There was a long thread about concatenating jsonb objects to each 
other, but that discussion didn't touch concatenating other types. 
Currently jsonb_concat always just returns the other argument if one 
of arguments is considered empty.  This causes surprising behavior 
when concatenating scalar values to empty arrays:


os=# select '[]'::jsonb || '1'::jsonb;
1

os=# select '[]'::jsonb || '[1]'::jsonb;
 [1]

os=# select '[]'::jsonb || '1'::jsonb || '2'::jsonb;
 [1, 2]

os=# select '0'::jsonb || '1'::jsonb;
 [0, 1]

os=# select '{"x": "y"}'::jsonb || '[1]'::jsonb;
 [{"x": "y"}, 1]

os=# select '{"x": "y"}'::jsonb || '1'::jsonb;
ERROR:  invalid concatenation of jsonb objects

Attached a patch to fix and test this.  Also added a test case for
concatenating two scalar values which currently produces an array.
I'm not sure that behavior makes sense, but didn't want to change 
that in this patch as I guess someone could consider that feature 
useful. 




This looks correct. Barring objection I'll apply this shortly.



Actually, I'm not sure the test is sufficient here. It looks to me like 
we should only be taking this fast path if one operand is an empty array 
and the other is a non-scalar array.


Otherwise we get things like this (second case is wrong, I think):

   andrew=# select '[]'::jsonb || '"a"';
    ?column?
   ----------
    ["a"]

   andrew=# select '[]'::jsonb || '{"a":"b"}';
     ?column?
   ------------
    {"a": "b"}

   andrew=# select '[1]'::jsonb || '{"a":"b"}';
        ?column?
   -----------------
    [1, {"a": "b"}]
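
For concreteness, a minimal sketch of the tightened fast path being suggested,
i.e. only short-circuiting when one operand is an empty array and the other is
a non-scalar array; the macro names are the existing ones from jsonb.h, but the
exact placement inside jsonb_concat() is assumed here for illustration only:

    /* jb1 and jb2 are the two (detoasted) Jsonb arguments */
    if (JB_ROOT_IS_ARRAY(jb1) && JB_ROOT_COUNT(jb1) == 0 &&
        JB_ROOT_IS_ARRAY(jb2) && !JB_ROOT_IS_SCALAR(jb2))
        PG_RETURN_JSONB(jb2);
    else if (JB_ROOT_IS_ARRAY(jb2) && JB_ROOT_COUNT(jb2) == 0 &&
             JB_ROOT_IS_ARRAY(jb1) && !JB_ROOT_IS_SCALAR(jb1))
        PG_RETURN_JSONB(jb1);
    /* otherwise fall through to the normal iterator-based concatenation */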


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-09 Thread Tomas Vondra

On 09/09/2015 03:55 PM, Robert Haas wrote:

On Tue, Sep 8, 2015 at 5:02 PM, Tomas Vondra
 wrote:

Also, I'm not sure what other places do you have in mind (could you list
some examples?) but I'd bet we limit the allocation to 1GB because of the
palloc() limit and not because of fear of over-estimates.


I don't really think those two things are different from each other.
The palloc() limit is a means of enforcing a general policy of
limiting all allocations to 1GB except in places where we've made a
very conscious decision to allow a specific exception.  This limit
happens to dovetail nicely with the varlena size limit, so in many
cases it is the exactly correct limit just for that reason.  But even
when, as here, that's not at issue, it's still a useful limit, because
there are many ways that some garbage value can get passed to palloc
-- bad planner estimates, corrupted tuples, bugs in other parts of our
code.  And at least on my old MacBook Pro (I haven't tested the
current one), passing a sufficiently-large value to malloc() causes a
kernel panic.  That's probably a particularly bad bug, but there are
lots of systems where "accidentally" allocating an unreasonable amount
of space will have all kinds of unpleasant consequences.  So, I
believe that  palloc()'s limit improves the overall stability of the
system considerably even if it causes some occasional annoyance.


I'm not really buying this. The 1GB limit has nothing to do with platform
limits; it's there exactly to make it varlena-like (which has exactly
the same limit), and because it allows using a 32-bit int to track all the
bytes. Neither of these is relevant here.


It has nothing to do with malloc() limits on various platforms, and if
there really are such limits that we think we should worry about, we
should probably address those properly, not by band-aiding all the
various places independently.


And most importantly, these platform limits would apply to both the 
initial allocation and to the subsequent resize. It's utterly useless to 
just "fix" the initial allocation and then allow failure when we try to 
resize the hash table.



Most of the time, you can just palloc() and not worry too much about
whether you're going to blow up the machine: you won't, because you
aren't going to allocate more than 1GB.  Any place that wants to
allocate more than that needs to be someplace where we can be pretty
sure that we're not going to accidentally allocate some completely
unreasonable amount of memory, like say 1TB.  Nothing in this
discussion convinces me that this is such a place.  Note that


We're not going to allocate a completely unreasonable amount of memory, 
because there already are some checks in place.


Firstly, you can't really get buckets larger than ~25% of work_mem,
because the bucket pointer is only 8B, while the HJ tuple has 16B plus the
data (IIRC). For wider tuples the relative size of the buckets further decreases.


Secondly, we limit the number of buckets to INT_MAX, so about 16GB
(because buckets are just pointers). No matter how awful an estimate you
get (or how insanely high you set work_mem) you can't exceed this.



tuplesort.c and tuplestore.c, the only existing callers of
repalloc_huge, only allocate such large amounts of memory when they
actually have enough tuples to justify it - it is always based on the
actual number of tuples, never an estimate.  I think that would be a
sound principle here, too.  Resizing the hash table to such a large
size based on the actual load factor is very reasonable; starting with
such a large size seems less so.  Admittedly, 512MB is an arbitrary
point: and if it so happened that the limit was 256MB or 1GB or 128MB
or even 2GB I wouldn't advocate for changing it just for fun. But
you're saying we should just remove that limit altogether, and I think
that's clearly unreasonable.  Do you really want to start out with a
TB or even PB-sized hash table when the actual number of tuples is,
say, one?  That may sound crazy, but I've seen enough bad query plans
to know that, yes, we are sometimes off by nine orders of magnitude.
This is not a hypothetical problem.


No, I'm not saying anything like that - I actually explicitly stated
that I'm not against such a change (further restricting the initial hash
table size), if someone takes the time to do a bit of testing and
provide some numbers.


Moreover, as I explained, there already are limits in place (25% of
work_mem or 16GB, whichever is lower), so I don't really see the bugfix
as unreasonable.


Maybe if we decide to lift this restriction (using int64 to address the 
buckets, which removes the 16GB limit) this issue will get much more 
pressing. But I guess hash tables handling 2B buckets will be enough for 
the near future.



More importantly, removing the cap on the allocation size makes the
problem a lot worse.  You might be sad if a bad planner estimate
causes the planner to allocate 1GB when 64MB would have been enough,
but on modern systems it is not likely to be an enormous problem.

Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-09 Thread dinesh kumar
HI Robert,

On Wed, Sep 9, 2015 at 8:30 PM, Robert Haas  wrote:

> On Wed, Jul 22, 2015 at 9:56 PM, dinesh kumar 
> wrote:
> >> The real question is why the existing functionality in plpgsql isn't
> >> sufficient.  Somebody who wants a "log from SQL" function can easily
> >> write a simple plpgsql function that does exactly what they want,
> >> with no more or fewer bells-n-whistles than they need.  If we try
> >> to create a SQL function that does all that, it's likely to be a mess
> >> to use, even with named arguments.
> >>
> >> I'm not necessarily against the basic idea, but I think inventing
> >> something that actually offers an increment in usability compared
> >> to the existing alternative is going to be harder than it sounds.
> >>
> >
> > I agree with your inputs. We can build  pl/pgsql function as alternative
> for
> > this.
> >
> > My initial proposal, and implementation was, logging messages to log file
> > irrespectively of our log settings. I was not sure we can do this with
> some
> > pl/perlu. And then, I started working on our to do item,
> > ereport, wrapper callable from SQL, and found it can be useful to have a
> > direct function call with required log level.
>
> But, why?
>
I admit that I don’t know the real use case behind this
proposal in our TODO list. I thought that having an ereport wrapper at the SQL
level gives a default debugging behavior for end users, and this is the only
real use case I see.


> I just took a look at the latest patch and I can't see why it's any
> better than just using PL/pgsql's RAISE statement.
>
Sure, it’s true that we can implement this function with RAISE
statements.

Thanks in advance for your guidance.

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

-- 

Regards,
Dinesh
manojadinesh.blogspot.com


Re: [HACKERS] [GENERAL] Feature Request: bigtsvector

2015-09-09 Thread Bruce Momjian
On Wed, Sep  9, 2015 at 06:14:28PM +0300, Ildus Kurbangaliev wrote:
> On Wed, 9 Sep 2015 10:52:02 -0400
> Bruce Momjian  wrote:
> 
> > On Wed, Jun 17, 2015 at 07:58:21AM +0200, CPT wrote:
> > > Hi all;
> > > 
> > > We are running a multi-TB bioinformatics system on PostgreSQL and
> > > use a denormalized schema in
> > > places with a lot of tsvectors aggregated together for centralized
> > > searching.  This is
> > > very important to the performance of the system.  These aggregate
> > > many documents (sometimes tens of thousands), many of which contain
> > > large numbers of references to other documents.  It isn't uncommon
> > > to have tens of thousands of lexemes.  The tsvectors hold mixed
> > > document id and natural language search information (all of which
> > > comes in from the same documents).
> > > 
> > > Recently we have started hitting the 1MB limit on tsvector size.  We
> > > have found it possible to
> > > patch PostgreSQL to make the tsvector larger but this changes the
> > > on-disk layout.  How likely is
> > > it that either the tsvector size could be increased in future
> > > versions to allow for vectors up to toastable size (1GB logical)?  I
> > > can't imagine we are the only ones with such a problem.  Since, I
> > > think, changing the on-disk layout might not be such a good idea,
> > > maybe it would be worth considering having a new bigtsvector type?
> > > 
> > > Btw, we've been very impressed with the extent that PostgreSQL has
> > > tolerated all kinds of loads we have thrown at it.
> > 
> > Can anyone on hackers answer this question from June?
> > 
> 
> Hi, I'm working on a patch now that removes this limit with no (or only
> small) changes to the on-disk layout. I think it'll be ready during this
> month.

Oh, great, thanks.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Feature Request: bigtsvector

2015-09-09 Thread Ildus Kurbangaliev
On Wed, 9 Sep 2015 10:52:02 -0400
Bruce Momjian  wrote:

> On Wed, Jun 17, 2015 at 07:58:21AM +0200, CPT wrote:
> > Hi all;
> > 
> > We are running a multi-TB bioinformatics system on PostgreSQL and
> > use a denormalized schema in
> > places with a lot of tsvectors aggregated together for centralized
> > searching.  This is
> > very important to the performance of the system.  These aggregate
> > many documents (sometimes tens of thousands), many of which contain
> > large numbers of references to other documents.  It isn't uncommon
> > to have tens of thousands of lexemes.  The tsvectors hold mixed
> > document id and natural language search information (all of which
> > comes in from the same documents).
> > 
> > Recently we have started hitting the 1MB limit on tsvector size.  We
> > have found it possible to
> > patch PostgreSQL to make the tsvector larger but this changes the
> > on-disk layout.  How likely is
> > it that either the tsvector size could be increased in future
> > versions to allow for vectors up to toastable size (1GB logical)?  I
> > can't imagine we are the only ones with such a problem.  Since, I
> > think, changing the on-disk layout might not be such a good idea,
> > maybe it would be worth considering having a new bigtsvector type?
> > 
> > Btw, we've been very impressed with the extent that PostgreSQL has
> > tolerated all kinds of loads we have thrown at it.
> 
> Can anyone on hackers answer this question from June?
> 

Hi, I'm working on a patch now that removes this limit with no (or only
small) changes to the on-disk layout. I think it'll be ready during this
month.


Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com  
The Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] 答复:[HACKERS] 答复:[HACKERS] about fsync in CLOG buffer write

2015-09-09 Thread Andres Freund
On 2015-09-09 10:46:47 -0400, Robert Haas wrote:
> Well, if you're filling ~1 clog page per second, you're doing ~1 fsync
> per second too.  Or if you are not, then you are thrashing the
> progressively smaller and smaller number of clean slots ever-harder
> until no clean pages remain and you're forced to fsync something -
> probably, a bunch of things all at once.

And you're also triggering a lot of other fsyncs. At the very least
you're pretty constantly flushing the WAL, even with synchronous_commit
= off.

> > Like Andres, I'd want to see a more realistic problem case before
> > expending a lot of work here.
> 
> I think the question here isn't whether the problem case is realistic
> - I am quite sure that a pgbench workload is - but rather how much of
> a problem it actually causes.  I'm very sure that individual pgbench
> backends experience multi-second stalls as a result of this.

The fsync causes that? That seems odd. I mean we release the slru
control lock for writing out pages, don't we?

To me it seems the fsyncs are a red herring here, and the problems are,
uh, much bigger. ISTM that cache replacement locking granularity et al
have a much bigger effect than a fsync every now and then.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-09-09 Thread Amit Kapila
On Wed, Sep 9, 2015 at 8:09 PM, Robert Haas  wrote:
>
> On Wed, Sep 9, 2015 at 2:17 AM, Haribabu Kommi 
wrote:
> > Also, the number of workers (16) shown in the explain analyze plan is
> > not what is actually allotted, because in my configuration I set
> > max_worker_processes to only 8. I feel the plan should show the
> > allotted workers, not the planned workers.
> > If the query execution takes time because of a lack of workers and the
> > plan shows 16 workers, the user may think that the query is slow even
> > with 16 workers, but actually it is not.
>
> I would expect EXPLAIN should show the # of workers planned, and
> EXPLAIN ANALYZE should show both the planned and actual values.
>

Sounds sensible, will look into doing that way.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Seq Scan

2015-09-09 Thread Amit Kapila
On Wed, Sep 9, 2015 at 11:47 AM, Haribabu Kommi 
wrote:
>

> With subquery, parallel scan is having some problem, please refer below.
>
>
> postgres=# explain analyze select * from test01 where kinkocord not in
> (select kinkocord from test02 where tenpocord = '001');
> ERROR:  badly formatted node string "SUBPLAN :subLinkType 2 :testexpr"...
> CONTEXT:  parallel worker, pid 32879
> postgres=#
>

The problem here is that readfuncs.c doesn't have support for reading
SubPlan nodes. I have added support for some of the nodes, but it seems
the SubPlan node also needs to be added.  Now I think this is okay if the
SubPlan's plan is anything other than a Funnel, but if the SubPlan contains
a Funnel, then each worker needs to spawn other workers to execute the
SubPlan, which I am not sure is the best way.  Another possibility could be
to store the results of the SubPlan in a tuplestore or some other way and
then pass those to the workers, which again doesn't sound like a promising
way considering we might have a hashed SubPlan for which we need to build a
hashtable.  Yet another way could be, for such cases, to execute the Filter
in the master node only.
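
For reference, read support for a node follows the existing macro pattern in
readfuncs.c; a sketch of what it could look like for SubPlan, with the field
list taken from primnodes.h (it would also need to be hooked into
parseNodeString(), and this is only an illustration, not the eventual patch):

    static SubPlan *
    _readSubPlan(void)
    {
        READ_LOCALS(SubPlan);

        READ_ENUM_FIELD(subLinkType, SubLinkType);
        READ_NODE_FIELD(testexpr);
        READ_NODE_FIELD(paramIds);
        READ_INT_FIELD(plan_id);
        READ_STRING_FIELD(plan_name);
        READ_OID_FIELD(firstColType);
        READ_INT_FIELD(firstColTypmod);
        READ_OID_FIELD(firstColCollation);
        READ_BOOL_FIELD(useHashTable);
        READ_BOOL_FIELD(unknownEqFalse);
        READ_NODE_FIELD(setParam);
        READ_NODE_FIELD(parParam);
        READ_NODE_FIELD(args);
        READ_FLOAT_FIELD(startup_cost);
        READ_FLOAT_FIELD(per_call_cost);

        READ_DONE();
    }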


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Counting lines correctly in psql help displays

2015-09-09 Thread Andrew Dunstan



On 09/09/2015 10:27 AM, Tom Lane wrote:

Andrew Dunstan  writes:

On 09/05/2015 12:55 PM, Tom Lane wrote:

Or we could just give up and replace the counts by INT_MAX, forcing use
of the pager unless you've turned it off.  All of those outputs are long
enough now that it's hard to believe there are any common screen layouts
where you don't end up invoking the pager anyway.  (usage() is 60 lines,
the others are more.)  This is probably the reason why we've seldom
noticed they're wrong --- it barely matters anymore.

One way or the other I think it's past time to get out of the business
of maintaining these counts.  I'm willing to do the work of using a
PQExpBuffer if people think it's worth the trouble to have an accurate
count, but it may not be worth the code space.

I'm not terribly happy about the INT_MAX idea. Counting lines in a
PGExpBuffer seems OK. That way we could honor pager_min_lines, I hope.

TBH, I'm not detecting enough concern about this issue to make it worth
doing more than replacing the counts by INT_MAX.  Nobody has stepped up
and said "yeah, my terminal window is 100 lines high and I'll be really
annoyed if \? invokes the pager unnecessarily".  I plan to just do the
three-line fix and move on.





Do people really use terminals without scrollbars for serious work any 
more? Personally I'm in favor of forcing the pager less, not more. But 
I'm not going to make a fuss, I'm just surprised.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-09 Thread Robert Haas
On Wed, Jul 22, 2015 at 9:56 PM, dinesh kumar  wrote:
>> The real question is why the existing functionality in plpgsql isn't
>> sufficient.  Somebody who wants a "log from SQL" function can easily
>> write a simple plpgsql function that does exactly what they want,
>> with no more or fewer bells-n-whistles than they need.  If we try
>> to create a SQL function that does all that, it's likely to be a mess
>> to use, even with named arguments.
>>
>> I'm not necessarily against the basic idea, but I think inventing
>> something that actually offers an increment in usability compared
>> to the existing alternative is going to be harder than it sounds.
>>
>
> I agree with your inputs. We can build  pl/pgsql function as alternative for
> this.
>
> My initial proposal, and implementation was, logging messages to log file
> irrespectively of our log settings. I was not sure we can do this with some
> pl/perlu. And then, I started working on our to do item,
> ereport, wrapper callable from SQL, and found it can be useful to have a
> direct function call with required log level.

But, why?

I just took a look at the latest patch and I can't see why it's any
better than just using PL/pgsql's RAISE statement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Autonomous Transaction is back

2015-09-09 Thread Merlin Moncure
On Wed, Sep 9, 2015 at 9:04 AM, Robert Haas  wrote:
> On Sun, Sep 6, 2015 at 1:56 AM, Noah Misch  wrote:
>> What design principle(s) have you been using to decide how autonomous
>> transactions should behave?
>
> I have to admit to a complete lack of principle.  I'm quite frightened
> of what this is going to need from the lock manager, and I'm trying to
> wriggle out of having to do things there that are going to be nastily
> hard.  My wriggling isn't going very well, though.

Hm.  Here is current dblink behavior:

postgres=# create table l (id int);
CREATE TABLE
postgres=# insert into l values(1);
INSERT 0 1
postgres=# update l set id =2 where id = 1;
UPDATE 1
Time: 0.595 ms
postgres=# select dblink('', 'update l set id = 3 where id = 1');


Does the lock manager really need to be extended to address this
case?  pg_locks pretty clearly explains what's happening, via:
postgres=# select locktype, transactionid, pid, granted from pg_locks
where not granted;
   locktype│ transactionid │  pid  │ granted
───┼───┼───┼─
 transactionid │ 88380 │ 20543 │ f

and

postgres=# select locktype, transactionid, pid, granted from pg_locks
where transactionid = 88380;
   locktype│ transactionid │  pid  │ granted
───┼───┼───┼─
 transactionid │ 88380 │ 20543 │ f
 transactionid │ 88380 │ 19022 │ t

If pg_locks and/or pg_stat_activity were extended with a 'parent_pid'
column, a userland query could terminate affected backends with a join
against that column wherever a lock is ungranted.  Naturally there's a lot
more to it than that; you'd want to issue an appropriate cancellation
message and various other things.  I suppose I'm probably missing
something, but I'm not clear on why the lock manager needs to be
overhauled to deal with this case when we can just scan the current
structures, assuming we can a) manage the child pid independently of the
parent pid and b) easily figure out who is the parent of whom.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Feature Request: bigtsvector

2015-09-09 Thread Bruce Momjian
On Wed, Jun 17, 2015 at 07:58:21AM +0200, CPT wrote:
> Hi all;
> 
> We are running a multi-TB bioinformatics system on PostgreSQL and
> use a denormalized schema in
> places with a lot of tsvectors aggregated together for centralized
> searching.  This is
> very important to the performance of the system.  These aggregate
> many documents (sometimes tens of thousands), many of which contain
> large numbers of references to other documents.  It isn't uncommon
> to have tens of thousands of lexemes.  The tsvectors hold mixed
> document id and natural language search information (all of which
> comes in from the same documents).
> 
> Recently we have started hitting the 1MB limit on tsvector size.  We
> have found it possible to
> patch PostgreSQL to make the tsvector larger but this changes the
> on-disk layout.  How likely is
> it that either the tsvector size could be increased in future
> versions to allow for vectors up to toastable size (1GB logical)?  I
> can't imagine we are the only ones with such a problem.  Since, I
> think, changing the on-disk layout might not be such a good idea,
> maybe it would be worth considering having a new bigtsvector type?
> 
> Btw, we've been very impressed with the extent that PostgreSQL has
> tolerated all kinds of loads we have thrown at it.

Can anyone on hackers answer this question from June?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2015-09-09 Thread Robert Haas
On Tue, Sep 8, 2015 at 11:31 PM, Amit Kapila  wrote:
>> (3) posix_fadvise on Linux is a bad idea... the good news is that it
>> is not needed there:-) How good or bad an idea it is on other system
>> is an open question...
>
> I don't know what the best way to verify that is; if somebody else has
> access to such a m/c, please help to get that verified.

Why wouldn't we just leave it out then?  Putting it in when the one
platform we've tried it on shows a regression makes no sense.  We
shouldn't include it and then remove it if someone can prove it's bad;
we should only include it in the first place if we have good
benchmarks showing that it is good.

Does anyone have a big Windows box they can try this on?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] Re: [HACKERS] 答复:[HACKERS] 答复:[HACKERS] about fsync in CLOG buffer write

2015-09-09 Thread Robert Haas
On Wed, Sep 9, 2015 at 10:35 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> ... How often such a workload actually has to replace a *dirty* clog
>> buffer obviously depends on how often you checkpoint, but if you're
>> getting ~28k TPS you can completely fill 32 clog buffers (1 million
>> transactions) in less than 40 seconds, and you're probably not
>> checkpointing nearly that often.
>
> But by the same token, at that kind of transaction rate, no clog page is
> actively getting dirtied for more than a couple of seconds.  So while it
> might get swapped in and out of the SLRU arena pretty often after that,
> this scenario seems unconvincing as a source of repeated fsyncs.

Well, if you're filling ~1 clog page per second, you're doing ~1 fsync
per second too.  Or if you are not, then you are thrashing the
progressively smaller and smaller number of clean slots ever-harder
until no clean pages remain and you're forced to fsync something -
probably, a bunch of things all at once.

> Like Andres, I'd want to see a more realistic problem case before
> expending a lot of work here.

I think the question here isn't whether the problem case is realistic
- I am quite sure that a pgbench workload is - but rather how much of
a problem it actually causes.  I'm very sure that individual pgbench
backends experience multi-second stalls as a result of this.  What
I'm not sure about is how frequently it happens, and how much of an
effect it has on overall latency.  I think it would be worth someone's
time to try to write some good instrumentation code here and figure
that out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-09-09 Thread Robert Haas
On Wed, Sep 9, 2015 at 2:17 AM, Haribabu Kommi  wrote:
> Also, the number of workers (16) shown in the explain analyze plan is
> not what is actually allotted, because in my configuration I set
> max_worker_processes to only 8. I feel the plan should show the
> allotted workers, not the planned workers.
> If the query execution takes time because of a lack of workers and the
> plan shows 16 workers, the user may think that the query is slow even
> with 16 workers, but actually it is not.

I would expect EXPLAIN should show the # of workers planned, and
EXPLAIN ANALYZE should show both the planned and actual values.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [HACKERS] 答复:[HACKERS] 答复:[HACKERS] about fsync in CLOG buffer write

2015-09-09 Thread Tom Lane
Robert Haas  writes:
> ... How often such a workload actually has to replace a *dirty* clog
> buffer obviously depends on how often you checkpoint, but if you're
> getting ~28k TPS you can completely fill 32 clog buffers (1 million
> transactions) in less than 40 seconds, and you're probably not
> checkpointing nearly that often.

But by the same token, at that kind of transaction rate, no clog page is
actively getting dirtied for more than a couple of seconds.  So while it
might get swapped in and out of the SLRU arena pretty often after that,
this scenario seems unconvincing as a source of repeated fsyncs.

Like Andres, I'd want to see a more realistic problem case before
expending a lot of work here.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Counting lines correctly in psql help displays

2015-09-09 Thread Tom Lane
Andrew Dunstan  writes:
> On 09/05/2015 12:55 PM, Tom Lane wrote:
>> Or we could just give up and replace the counts by INT_MAX, forcing use
>> of the pager unless you've turned it off.  All of those outputs are long
>> enough now that it's hard to believe there are any common screen layouts
>> where you don't end up invoking the pager anyway.  (usage() is 60 lines,
>> the others are more.)  This is probably the reason why we've seldom
>> noticed they're wrong --- it barely matters anymore.
>> 
>> One way or the other I think it's past time to get out of the business
>> of maintaining these counts.  I'm willing to do the work of using a
>> PQExpBuffer if people think it's worth the trouble to have an accurate
>> count, but it may not be worth the code space.

> I'm not terribly happy about the INT_MAX idea. Counting lines in a 
> PGExpBuffer seems OK. That way we could honor pager_min_lines, I hope.

TBH, I'm not detecting enough concern about this issue to make it worth
doing more than replacing the counts by INT_MAX.  Nobody has stepped up
and said "yeah, my terminal window is 100 lines high and I'll be really
annoyed if \? invokes the pager unnecessarily".  I plan to just do the
three-line fix and move on.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] 答复:[HACKERS] 答复:[HACKERS] about fsync in CLOG buffer write

2015-09-09 Thread Robert Haas
On Tue, Sep 8, 2015 at 2:28 PM, Andres Freund  wrote:
> On 2015-09-08 15:58:26 +0800, 周正中(德歌) wrote:
>> postgres@digoal-> cat 7.sql
>> select txid_current();
>>
>> postgres@digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 
>> 10
>> About 32K tps.
>> progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183
>> progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127
>
> So you're benchmarking how fast you can assign txids. Is that actually
> something meaningful? If you have other writes interleaved you'll write
> much more WAL, so there'll be checkpoints and such.
>
> FWIW, if you measure something realistic and there's checkpoints,
> there'll be fewer fsyncs if you increase the slru buffer size - as
> there'll often be clean buffers due to the checkpoint having written
> them out.

But I think it's not very hard to come up with a workload where 32
clog buffers isn't enough, and you end up waiting for backends to
fsync().  Suppose you have a pgbench workload with N tuples.  We
update tuples at random, so sometimes we hit one that's just recently
been updated, and other times we hit one that hasn't been updated for
many transactions.  At 32k transactions per page, each transaction's
chance of updating a tuple whose existing xmin is on the most recent
clog page is 32k/N.  The chance of hitting a tuple whose existing xmin
is on the next page back is (1 - 32k/N) * 32k/N.  The chance of
hitting a tuple whose existing xmin is at least X pages prior to the
most recent one is (1 - 32k/N)^X.  So, how many tuples do we need in
order for each update to have a 50% chance of hitting a tuple that is
at least 32 pages back?

(1 - 32k/N)^32 = .5
1 - 32k/N = 0.9785720620877
32k/N = 0.0214279379122999
N = 32k/0.0214279379122999
N = 1529218.54329206

...or in other words, scale factor 16.  At scale factors >= 1044, the
chance that the next update hits an xmin more than 32 clog buffers
back is > 99%.  So any test of this sort causes extremely rapid clog
page eviction - basically every transaction is going to request a
buffer that's probably not cached, and as soon as it's done with it,
some other transaction will evict it to bring in a different buffer
that's not cached.
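
As a quick standalone check of the arithmetic above (not part of any patch;
the constants are just CLOG_XACTS_PER_PAGE, the default number of clog
buffers, and pgbench's 100,000 accounts per scale factor):

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        const double xacts_per_page = 32768.0; /* CLOG_XACTS_PER_PAGE */
        const double cached_pages = 32.0;      /* default clog SLRU buffers */

        /*
         * N such that a random update has a 50% chance of touching an xmin
         * more than 32 clog pages back: (1 - 32k/N)^32 = 0.5
         */
        double      miss_base = pow(0.5, 1.0 / cached_pages);
        double      nrows = xacts_per_page / (1.0 - miss_base);

        printf("N = %.0f rows, i.e. pgbench scale factor ~%.0f\n",
               nrows, ceil(nrows / 100000.0));
        return 0;
    }

This prints N = 1529219 rows, pgbench scale factor ~16, matching the figures
above.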

How often such a workload actually has to replace a *dirty* clog
buffer obviously depends on how often you checkpoint, but if you're
getting ~28k TPS you can completely fill 32 clog buffers (1 million
transactions) in less than 40 seconds, and you're probably not
checkpointing nearly that often.

I'm not entirely sure how much fsync absorption for SLRU buffers will
help, but I think it would be worth trying.  Back when I was spending
more time on this area, I saw some evidence that those fsyncs did
cause at least some latency spikes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [patch] Proposal for \rotate in psql

2015-09-09 Thread Pavel Stehule
>
>
> \x doesn't exactly rotate it either.  \x puts the column headers down
> the side instead of across the top, but it doesn't put the rows across
> the top instead of down the side.  Rather, each row is listed in a
> separate chunk.


true, it is rotation per one row. I was wrong.


> This feature is doing something else again.  I've
> actually never seen this particular transformation anywhere except for
> Microsoft Excel's pivot tables, which I still find confusing.
>




>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] Autonomous Transaction is back

2015-09-09 Thread Robert Haas
On Sun, Sep 6, 2015 at 1:56 AM, Noah Misch  wrote:
> What design principle(s) have you been using to decide how autonomous
> transactions should behave?

I have to admit to a complete lack of principle.  I'm quite frightened
of what this is going to need from the lock manager, and I'm trying to
wriggle out of having to do things there that are going to be nastily
hard.  My wriggling isn't going very well, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [patch] Proposal for \rotate in psql

2015-09-09 Thread Robert Haas
On Tue, Sep 8, 2015 at 4:58 PM, Pavel Stehule  wrote:
> 2015-09-08 22:55 GMT+02:00 Daniel Verite :
>>
>> Pavel Stehule wrote:
>>
>> > rotate ~ sounds like transpose matrix, what is not true in this case.
>
> for me the relation rotation is exactly what \x does

\x doesn't exactly rotate it either.  \x puts the column headers down
the side instead of across the top, but it doesn't put the rows across
the top instead of down the side.  Rather, each row is listed in a
separate chunk.  This feature is doing something else again.  I've
actually never seen this particular transformation anywhere except for
Microsoft Excel's pivot tables, which I still find confusing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-09 Thread Robert Haas
On Tue, Sep 8, 2015 at 5:02 PM, Tomas Vondra
 wrote:
> Also, I'm not sure what other places do you have in mind (could you list
> some examples?) but I'd bet we limit the allocation to 1GB because of the
> palloc() limit and not because of fear of over-estimates.

I don't really think those two things are different from each other.
The palloc() limit is a means of enforcing a general policy of
limiting all allocations to 1GB except in places where we've made a
very conscious decision to allow a specific exception.  This limit
happens to dovetail nicely with the varlena size limit, so in many
cases it is the exactly correct limit just for that reason.  But even
when, as here, that's not at issue, it's still a useful limit, because
there are many ways that some garbage value can get passed to palloc
-- bad planner estimates, corrupted tuples, bugs in other parts of our
code.  And at least on my old MacBook Pro (I haven't tested the
current one), passing a sufficiently-large value to malloc() causes a
kernel panic.  That's probably a particularly bad bug, but there are
lots of systems where "accidentally" allocating an unreasonable amount
of space will have all kinds of unpleasant consequences.  So, I
believe that  palloc()'s limit improves the overall stability of the
system considerably even if it causes some occasional annoyance.

Most of the time, you can just palloc() and not worry too much about
whether you're going to blow up the machine: you won't, because you
aren't going to allocate more than 1GB.  Any place that wants to
allocate more than that needs to be someplace where we can be pretty
sure that we're not going to accidentally allocate some completely
unreasonable amount of memory, like say 1TB.  Nothing in this
discussion convinces me that this is such a place.  Note that
tuplesort.c and tuplestore.c, the only existing callers of
repalloc_huge, only allocate such large amounts of memory when they
actually have enough tuples to justify it - it is always based on the
actual number of tuples, never an estimate.  I think that would be a
sound principle here, too.  Resizing the hash table to such a large
size based on the actual load factor is very reasonable; starting with
such a large size seems less so.  Admittedly, 512MB is an arbitrary
point: and if it so happened that the limit was 256MB or 1GB or 128MB
or even 2GB I wouldn't advocate for changing it just for fun.  But
you're saying we should just remove that limit altogether, and I think
that's clearly unreasonable.  Do you really want to start out with a
TB or even PB-sized hash table when the actual number of tuples is,
say, one?  That may sound crazy, but I've seen enough bad query plans
to know that, yes, we are sometimes off by nine orders of magnitude.
This is not a hypothetical problem.

>> More importantly, removing the cap on the allocation size makes the
>> problem a lot worse.  You might be sad if a bad planner estimate
>> causes the planner to allocate 1GB when 64MB would have been enough,
>> but on modern systems it is not likely to be an enormous problem.  If
>> a similar mis-estimation causes the planner to allocate 16GB rather
>> than 1GB, the opportunity for you to be sad is magnified pretty
>> considerably.  Therefore, I don't really see the over-estimation bug
>> fix as being separate from this one.
>
> Perhaps. But if you want to absolutely prevent such sadness then maybe you
> should not set work_mem that high?

I think that's a red herring for a number of reasons.  One, the
allocation for the hash buckets is only a small portion of the total
memory.  Two, the fact that you are OK with the hash table growing to
a certain size does not mean that you want it to start out that big on
the strength of a frequently-flawed planner estimate.

> Anyway, I do see this as a rather orthogonal problem - an independent
> improvement, mostly unrelated to the bugfix. Even if we decide to redesign
> it like this (and I'm not particularly opposed to that, assuming someone
> takes the time to measure how expensive the additional resize actually is),
> we'll still have to fix the repalloc().
>
> So I still fail to see why we shouldn't apply this fix.

In all seriousness, that is fine.  I respect your opinion; I'm just
telling you mine, which happens to be different.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dependency between bgw_notify_pid and bgw_flags

2015-09-09 Thread Robert Haas
On Wed, Sep 9, 2015 at 8:14 AM, Ashutosh Bapat
 wrote:
> PFA patch with improved test module and fix for a bug.
>
> bgworker_sigusr1_handler() should set the latch when set_latch_on_sigusr1 is
> true, similar to procsignal_sigusr1_handler(). Without this fix, if a
> background worker without DATABASE_CONNECTION flag calls
> WaitForBackgroundWorker*() functions, those functions wait indefinitely as
> the latch is never set upon receipt of SIGUSR1.

This is hardly an insurmountable problem given that they can replace
bgworker_sigusr1_handler() with another handler whenever they like.
It might be a good idea anyway, but not without saving and restoring
errno.
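
For illustration, a minimal sketch of a handler doing what Ashutosh describes
while saving and restoring errno as requested; the handler name is made up
here, and set_latch_on_sigusr1 and MyLatch are the existing globals referenced
in this thread:

    static void
    worker_sigusr1_handler(SIGNAL_ARGS)
    {
        int         save_errno = errno;

        if (set_latch_on_sigusr1)
            SetLatch(MyLatch);

        errno = save_errno;
    }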

>> Thanks.  I don't think this test module is suitable for commit for a
>> number of reasons, including the somewhat hackish use of exit(0)
>> instead of proper error reporting
> I have changed this part of code.

Maybe I should have been more clear: I don't really want a test module
for this.  Yeah, there was a bug, but we fixed it, and I don't see
that it's going to come back.  We might have a different one, but this
module is so special-purpose that it won't catch that.  If there are
some +1s to the idea of this test module from other people then I am
not unwilling to reconsider, but my opinion is that this is the wrong
thing to spend time on.  If we had some plan to build out a test
framework here that would really thoroughly exercise these code paths,
that might be valuable -- I'm not opposed to the idea of trying to
have better test coverage of the bgworker code.  But I just don't see
that this particular module does enough to make it worthwhile.

Considering that, I'm not really excited about spending a lot of time
on review right now, but FWIW I still think the error handling in here
is weak.  Why elog() and not ereport()?  Yeah, this is not user-facing
stuff exactly because it's just a test module, but where else do we
use elog() like this?  Why elog(WARNING) and proc_exit(1) instead of
elog(FATAL) or elog(PANIC)? The messages don't follow the style
guidelines either, like "found launcher in unexpected state (expected
status %d, got %d), exiting." -- that doesn't look anything like our
usual style for reporting errors.
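
For comparison, something along these lines would be closer to the usual
conventions (only a sketch; the particular errcode chosen here is an
assumption):

#include "postgres.h"

/*
 * Short primary message, with the specifics in errdetail() as a complete
 * sentence.
 */
static void
report_unexpected_launcher_state(int expected_status, int got_status)
{
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("launcher is in unexpected state"),
             errdetail("Expected status %d, got status %d.",
                       expected_status, got_status)));
}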

>> , the fact that you didn't integrate
>> it into the Makefile structure properly
>
> What was missing? I am able to make {check,clean,install} from the
> directory. I can also make -C  check from the repository's root.

make check-world won't run it, right?  So there would be no BF coverage.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-09-09 Thread Fujii Masao
On Fri, Sep 4, 2015 at 4:48 PM, Amit Kapila  wrote:
> On Thu, Sep 3, 2015 at 6:07 PM, Fujii Masao  wrote:
>>
>> On Tue, Aug 4, 2015 at 12:15 PM, Amit Kapila 
>> wrote:
>> > On Mon, Aug 3, 2015 at 7:44 PM, Fujii Masao 
>> > wrote:
>> >> ISTM that we can
>> >> see that the online backup mode has already been canceled if
>> >> backup_label
>> >> file
>> >> is successfully removed whether tablespace_map file remains or not. No?
>> >>
>> >
>> > I think what we should do is display the successful cancellation
>> > message
>> > only when both the files are renamed.
>>
>> Please imagine the case where backup_label was successfully renamed
>> but tablespace_map was not. Even in this case, I think that we can see
>> that the backup mode was canceled because the remaining tablespace_map
>> file will be ignored in the subsequent recovery.
>
> Right.
>
>>
>> So we should emit
>> the successful cancellation message when backup_label is renamed
>> whether tablespace_map is successfully renamed or not?
>>
>
> You mean to say, just try renaming tablespace_map and don't display any
> message whether that is successful or not-successful?
>
> I see some user inconvenience if we do it this way, which is that even after
> the backup is cancelled, on the next recovery there will be a log message
> indicating that the rename of tablespace_map was successful or unsuccessful.
> Also, don't you think it is better to let the user know that the
> tablespace_map file was successfully renamed, as we do for the backup_label
> file.  Shall we change the patch such that
> if backup_label is successfully renamed and renaming of tablespace_map
> gets failed, then display a log message to something like below:
>
> LOG:  online backup mode canceled
> DETAIL:  "backup_label" was renamed to "backup_label.old", could not rename
> "tablespace_map" to "tablespace_map.old"

Agreed with this direction. So what about the attached patch?

Regards,

-- 
Fujii Masao
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 127bc58..c2a1d51 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10911,32 +10911,32 @@ CancelBackup(void)
 {
 	struct stat stat_buf;
 
-	/* if the file is not there, return */
+	/* if the backup_label file is not there, return */
 	if (stat(BACKUP_LABEL_FILE, &stat_buf) < 0)
 		return;
 
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(BACKUP_LABEL_OLD);
 
-	if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) == 0)
-	{
-		ereport(LOG,
-(errmsg("online backup mode canceled"),
- errdetail("\"%s\" was renamed to \"%s\".",
-		   BACKUP_LABEL_FILE, BACKUP_LABEL_OLD)));
-	}
-	else
+	if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
 	{
 		ereport(WARNING,
 (errcode_for_file_access(),
  errmsg("online backup mode was not canceled"),
- errdetail("Could not rename \"%s\" to \"%s\": %m.",
+ errdetail("\"%s\" could not be renamed to \"%s\": %m.",
 		   BACKUP_LABEL_FILE, BACKUP_LABEL_OLD)));
+		return;
 	}
 
 	/* if the tablespace_map file is not there, return */
 	if (stat(TABLESPACE_MAP, &stat_buf) < 0)
+	{
+		ereport(LOG,
+(errmsg("online backup mode canceled"),
+ errdetail("\"%s\" was renamed to \"%s\".",
+		   BACKUP_LABEL_FILE, BACKUP_LABEL_OLD)));
 		return;
+	}
 
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(TABLESPACE_MAP_OLD);
@@ -10945,15 +10945,19 @@ CancelBackup(void)
 	{
 		ereport(LOG,
 (errmsg("online backup mode canceled"),
- errdetail("\"%s\" was renamed to \"%s\".",
-		   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
+ errdetail("\"%s\" and \"%s\" were renamed to "
+		   "\"%s\" and \"%s\", respectively.",
+		   BACKUP_LABEL_FILE, TABLESPACE_MAP,
+		   BACKUP_LABEL_OLD, TABLESPACE_MAP_OLD)));
 	}
 	else
 	{
 		ereport(WARNING,
 (errcode_for_file_access(),
- errmsg("online backup mode was not canceled"),
- errdetail("Could not rename \"%s\" to \"%s\": %m.",
+ errmsg("online backup mode canceled"),
+ errdetail("\"%s\" was renamed to \"%s\", "
+		   "but \"%s\" could not be renamed to \"%s\": %m.",
+		   BACKUP_LABEL_FILE, BACKUP_LABEL_OLD,
 		   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 	}
 }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Summary of plans to avoid the annoyance of Freezing

2015-09-09 Thread Greg Stark
On Sun, Sep 6, 2015 at 1:25 PM, Andres Freund  wrote:
> My vote is that we should try to get freeze maps into 9.6 - that seems
> more realistic given that we have a patch right now. Yes, it might end
> up being superflous churn, but it's rather localized. I think around
> we've put off significant incremental improvements off with the promise
> of more radical stuff too often.

Superfluous churn in the code isn't too bad. But superfluous churn in
data formats might be a bit more scary. Would we be able to handle
pg_upgrade from a database with or without a freezemap? Would you have
to upgrade once to add the freezemap then again to remove it?


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql: Improve logging of TAP tests.

2015-09-09 Thread Stephen Frost
Noah,

* Noah Misch (n...@leadboat.com) wrote:
> On Tue, Sep 08, 2015 at 02:58:36PM -0400, Stephen Frost wrote:
> > * Andrew Dunstan (and...@dunslane.net) wrote:
> > > Improve logging of TAP tests.
> > 
> > [...]
> > 
> > This broke 'make check' for REL9_4_STABLE with --enable-tap-tests
> > because it added a reference to 'with_temp_install' but didn't actually
> > define it.
> 
> The corresponding commits for HEAD (1ea0620) and 9.5 (fa4a4df) added just an
> "rm" invocation to that Makefile.  Commit ef57b98 had no occasion to do more;
> I suspect a merge accident.  Best to revert the extra change:
> 
> --- a/src/Makefile.global.in
> +++ b/src/Makefile.global.in
> @@ -357,5 +357,7 @@ endef
>  define prove_check
>  rm -rf $(CURDIR)/tmp_check/log
> -cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) 
> PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' $(PROVE) 
> $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
> +$(MKDIR_P) tmp_check/log
> +$(MAKE) -C $(top_builddir) DESTDIR='$(CURDIR)'/tmp_check/install install 
> >'$(CURDIR)'/tmp_check/log/install.log 2>&1
> +cd $(srcdir) && TESTDIR='$(CURDIR)' 
> PATH="$(CURDIR)/tmp_check/install$(bindir):$$PATH" $(call 
> add_to_path,$(ld_library_path_var),$(CURDIR)/tmp_check/install$(libdir)) 
> top_builddir='$(CURDIR)/$(top_builddir)' PGPORT='6$(DEF_PGPORT)' $(PROVE) 
> $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
>  endef

Yup, reverting mine and applying the above appears to work based on my
testing.

Patch attached for review.  Barring objections, I'll commit this in a
few hours.

Thanks!

Stephen
From 78a6aa0e7db6b8b4ac731119932b2878450972df Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Wed, 9 Sep 2015 08:31:35 -0400
Subject: [PATCH] Revert ed47666 and part of ef57b98

This reverts ed47666, which ended up adding a second temporary
installation for all 'make check' runs, and reverts the part of
ef57b98 that changed the TAP 'prove_check' section, which appears to
have been unintentional to begin with on this branch.

Analysis and patch by Noah.
---
 src/Makefile.global.in | 22 +++---
 1 file changed, 3 insertions(+), 19 deletions(-)

diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index bea05ad..eaa3ec4 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -59,7 +59,6 @@ endif
 endif
 else # not PGXS
 vpath_build = @vpath_build@
-abs_top_builddir = @abs_top_builddir@
 abs_top_srcdir = @abs_top_srcdir@
 
 ifneq ($(vpath_build),yes)
@@ -317,19 +316,6 @@ BZIP2	= bzip2
 
 # Testing
 
-check: temp-install
-
-.PHONY: temp-install
-temp-install:
-ifndef NO_TEMP_INSTALL
-ifeq ($(MAKELEVEL),0)
-	rm -rf '$(abs_top_builddir)'/tmp_install
-	$(MKDIR_P) '$(abs_top_builddir)'/tmp_install/log
-	$(MAKE) -C '$(top_builddir)' DESTDIR='$(abs_top_builddir)'/tmp_install install >'$(abs_top_builddir)'/tmp_install/log/install.log 2>&1
-endif
-	$(if $(EXTRA_INSTALL),for extra in $(EXTRA_INSTALL); do $(MAKE) -C '$(top_builddir)'/$$extra DESTDIR='$(abs_top_builddir)'/tmp_install install >>'$(abs_top_builddir)'/tmp_install/log/install.log || exit; done)
-endif
-
 PROVE = @PROVE@
 PG_PROVE_FLAGS = -I $(top_srcdir)/src/test/perl/
 PROVE_FLAGS = --verbose
@@ -344,10 +330,6 @@ define ld_library_path_var
 $(if $(filter $(PORTNAME),darwin),DYLD_LIBRARY_PATH,$(if $(filter $(PORTNAME),aix),LIBPATH,LD_LIBRARY_PATH))
 endef
 
-define with_temp_install
-PATH="$(abs_top_builddir)/tmp_install$(bindir):$$PATH" $(call add_to_path,$(ld_library_path_var),$(abs_top_builddir)/tmp_install$(libdir))
-endef
-
 ifeq ($(enable_tap_tests),yes)
 
 define prove_installcheck
@@ -356,7 +338,9 @@ endef
 
 define prove_check
 rm -rf $(CURDIR)/tmp_check/log
-cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
+$(MKDIR_P) tmp_check/log
+$(MAKE) -C $(top_builddir) DESTDIR='$(CURDIR)'/tmp_check/install install >'$(CURDIR)'/tmp_check/log/install.log 2>&1
+cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(CURDIR)/tmp_check/install$(bindir):$$PATH" $(call add_to_path,$(ld_library_path_var),$(CURDIR)/tmp_check/install$(libdir)) top_builddir='$(CURDIR)/$(top_builddir)' PGPORT='6$(DEF_PGPORT)' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
 endef
 
 else
-- 
1.9.1





Re: [HACKERS] Summary of plans to avoid the annoyance of Freezing

2015-09-09 Thread Robert Haas
On Sun, Sep 6, 2015 at 8:25 AM, Andres Freund  wrote:
> On 2015-08-10 07:03:02 +0100, Simon Riggs wrote:
>> I was previously a proponent of (2) as a practical way forwards, but my
>> proposal here today is that we don't do anything further on 2) yet, and
>> seek to make progress on 5) instead.
>>
>> If 5) fails to bring a workable solution by the Jan 2016 CF then we commit
>> 2) instead.
>>
>> If Heikki wishes to work on (5), that's good. Otherwise, I think its
>> something I can understand and deliver by 1 Jan, though likely for 1 Nov CF.
>
> I highly doubt that we can get either variant into 9.6 if we only start
> to seriously review them by then. Heikki's lsn ranges patch essentially
> was a variant of 5) and it ended up being a rather complicated patch. I
> don't think using an explicit epoch is going to be that much simpler.
>
> So I think we need to decide now.
>
> My vote is that we should try to get freeze maps into 9.6 - that seems
> more realistic given that we have a patch right now. Yes, it might end
> up being superflous churn, but it's rather localized. I think around
> we've put off significant incremental improvements off with the promise
> of more radical stuff too often.

I strongly support that plan.  I think it's unlikely we're going to
have something better ready in time for 9.6, and freeze maps by
themselves would bring enormous relief to many people.  And the worst
thing that happens is that we rip this back out again because it isn't
needed, which isn't really going to be a lot of work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dependency between bgw_notify_pid and bgw_flags

2015-09-09 Thread Ashutosh Bapat
On Wed, Sep 2, 2015 at 2:02 AM, Robert Haas  wrote:

> On Mon, Aug 31, 2015 at 8:01 AM, Ashutosh Bapat
>  wrote:
> >> Thanks.  It needs testing though to see if it really works as
> >> intended.  Can you look into that?
> >
> > PFA the patch containing your code changes + test module. See if that
> meets
> > your expectations.
>
>
PFA patch with improved test module and fix for a bug.

bgworker_sigusr1_handler() should set the latch when set_latch_on_sigusr1
is true, similar to procsignal_sigusr1_handler(). Without this fix, if a
background worker without DATABASE_CONNECTION flag calls
WaitForBackgroundWorker*() functions, those functions wait indefinitely as
the latch is never set upon receipt of SIGUSR1.


> Thanks.  I don't think this test module is suitable for commit for a
> number of reasons, including the somewhat hackish use of exit(0)
> instead of proper error reporting


I have changed this part of code.


> , the fact that you didn't integrate
> it into the Makefile structure properly


What was missing? I am able to make {check,clean,install} from the
directory. I can also make -C  check from the repository's root.


> and the fact that without the
> postmaster.c changes it hangs forever instead of causing a test
> failure.


Changed this too. The SQL level function test_bgwnotify() now errors out if
it doesn't receive a notification within a specific time.


> But it's sufficient to show that the code changes have the
> intended effect.
>

Looking at the kind of bugs I am getting, we should commit this test
module. Let me know your comments and I will fix them.


> I've committed this and back-patched it to 9.5, but not further.  It's
> a bug fix, but it's also a refactoring exercise, so I'd rather not
> push it into released branches.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 68c9505..fbbf97e 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -556,20 +556,23 @@ bgworker_die(SIGNAL_ARGS)
  * Standard SIGUSR1 handler for unconnected workers
  *
  * Here, we want to make sure an unconnected worker will at least heed
  * latch activity.
  */
 static void
 bgworker_sigusr1_handler(SIGNAL_ARGS)
 {
 	int			save_errno = errno;
 
+	if (set_latch_on_sigusr1)
+		SetLatch(MyLatch);
+
 	latch_sigusr1_handler();
 
 	errno = save_errno;
 }
 
 /*
  * Start a new background worker
  *
  * This is the main entry point for background worker, to be called from
  * postmaster.
diff --git a/src/test/modules/test_bgwnotify/Makefile b/src/test/modules/test_bgwnotify/Makefile
new file mode 100644
index 000..6931c09
--- /dev/null
+++ b/src/test/modules/test_bgwnotify/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_bgwnotify/Makefile
+
+MODULE_big = test_bgwnotify
+OBJS = test_bgwnotify.o $(WIN32RES)
+PGFILEDESC = "test_bgwnotify - example use of background worker notification infrastructure"
+
+EXTENSION = test_bgwnotify
+DATA = test_bgwnotify--1.0.sql
+
+REGRESS = test_bgwnotify
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_bgwnotify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_bgwnotify/expected/test_bgwnotify.out b/src/test/modules/test_bgwnotify/expected/test_bgwnotify.out
new file mode 100644
index 000..b7e1b65
--- /dev/null
+++ b/src/test/modules/test_bgwnotify/expected/test_bgwnotify.out
@@ -0,0 +1,11 @@
+CREATE EXTENSION test_bgwnotify;
+--
+-- We're checking that the operations complete without crashing or hanging and
+-- that none of their internal sanity tests fail.
+--
+SELECT test_bgwnotify();
+ test_bgwnotify 
+
+ t
+(1 row)
+
diff --git a/src/test/modules/test_bgwnotify/sql/test_bgwnotify.sql b/src/test/modules/test_bgwnotify/sql/test_bgwnotify.sql
new file mode 100644
index 000..e448ef0
--- /dev/null
+++ b/src/test/modules/test_bgwnotify/sql/test_bgwnotify.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_bgwnotify;
+
+--
+-- We're checking that the operations complete without crashing or hanging and
+-- that none of their internal sanity tests fail.
+--
+SELECT test_bgwnotify();
diff --git a/src/test/modules/test_bgwnotify/test_bgwnotify--1.0.sql b/src/test/modules/test_bgwnotify/test_bgwnotify--1.0.sql
new file mode 100644
index 000..f3423bc
--- /dev/null
+++ b/src/test/modules/test_bgwnotify/test_bgwnotify--1.0.sql
@@ -0,0 +1,7 @@
+/* src/test/modules/test_bgwnotify/test_bgwnotify--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_bgwnotify" to load this file. \quit
+
+CREATE FUNCTION test_bgwnotify() RETURNS pg_catalog.bool STRICT
+	AS 

Re: [HACKERS] pageinspect patch, for showing tuple data

2015-09-09 Thread Nikolay Shaplov
В письме от 8 сентября 2015 11:53:24 Вы написали:
> On Sat, Sep 5, 2015 at 1:05 AM, Nikolay Shaplov wrote:
> > В письме от 4 сентября 2015 14:58:29 пользователь Michael Paquier написал:
> > > Documentation is missing, that would be good to have to understand what
> > > each function is intended to do.
> > 
> > I was going to add documentation when this patch is committed, or at least
> > approved for commit, so I would know for sure that the function definition
> > would not change and I would not have to rewrite it again and again.
> 
> I doubt that this patch would be committed without documentation, this is a
> requirement for any patch.
Ok. I can do it. 

> > So if it is possible I would write documentation and test when this patch
> > is
> > already approved.
> 
> I'm fine with that. Let's discuss its shape and come up with an agreement.
> It would have been good to post test cases of what this stuff actually does
> though; people reviewing this thread are limited to guessing at what it is
> trying to achieve just by reading the code.

To test the non-detoasted functions, one should do:

CREATE EXTENSION pageinspect;

create table test (a int, b smallint,c varchar);
insert into test VALUES
(-1,2,'111'),
(2,null,'222'),
(3,8,'333'),
(null,16,null);

Then 

# select * from heap_page_items(get_raw_page('test', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |     t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------
  1 |   8152 |        1 |     34 |    763 |      0 |        0 | (0,1)  |           3 |       2050 |     24 |        |       | \x020009313131
  2 |   8120 |        1 |     32 |    763 |      0 |        0 | (0,2)  |           3 |       2051 |     24 | 1010   |       | \x020009323232
  3 |   8080 |        1 |     34 |    763 |      0 |        0 | (0,3)  |           3 |       2050 |     24 |        |       | \x030008000933
  4 |   8048 |        1 |     26 |    763 |      0 |        0 | (0,4)  |           3 |       2049 |     24 | 0100   |       | \x1000

or 

# select * from heap_page_item_attrs(get_raw_page('test', 0),'test'::regclass);
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |             t_attrs
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+---------------------------------
  1 |   8152 |        1 |     34 |    763 |      0 |        0 | (0,1)  |           3 |       2050 |     24 |        |       | {"\\x","\\x0200","\\x09313131"}
  2 |   8120 |        1 |     32 |    763 |      0 |        0 | (0,2)  |           3 |       2051 |     24 | 1010   |       | {"\\x0200",NULL,"\\x09323232"}
  3 |   8080 |        1 |     34 |    763 |      0 |        0 | (0,3)  |           3 |       2050 |     24 |        |       | {"\\x0300","\\x0800","\\x0933"}
  4 |   8048 |        1 |     26 |    763 |      0 |        0 | (0,4)  |           3 |       2049 |     24 | 0100   |       | {NULL,"\\x1000",NULL}
(4 rows)

For the detoast function you should do something like this:

CREATE EXTENSION pageinspect;

create table test2 (c varchar);

insert into test2 VALUES ('++');
insert into test2 VALUES (repeat('+',2100));

Then  heap_page_item_attrs will show you compressed attr data:

# select * from heap_page_item_attrs(get_raw_page('test2', 0),'test2'::regclass);
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_attrs
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+---------------
  1 |   8160 |        1 |     27 |    768 |      0 |        0 | (0,1)  |           1 |       2050 |     24 |        |       | {"\\x072b2b"}
  2 |   8096 |        1 |     59 |    769 |      0 |        0 | (0,2)  |           1 |       2050 |     24 |        |       | {"\\x8e003408fe2b0f01ff0f01ff0f01ff0f01ff0f01ff0f01ff0f01ff010f01aa"}
(2 rows)

and heap_page_item_detoast_attrs will show you the original data:

# select * from heap_page_item_detoast_attrs(get_raw_page('test2', 0),'test2'::regclass);
[will not paste the output here, it is too long; see it for yourself]


> That's actually where
> documentation, even a draft of documentation helps a lot in the review to
> see if what is expected by the developer matches what the code actually
> does.
> 
> > Code has some whitespaces.
> > I've found and removed some. Hope that was all of them...
> 
> Yeah, it looks that you took completely rid of them.
> 
> In details, t

Re: [HACKERS] Waits monitoring

2015-09-09 Thread Alexander Korotkov
On Tue, Sep 8, 2015 at 8:01 PM, Robert Haas  wrote:

> On Mon, Sep 7, 2015 at 7:33 AM, Ildus Kurbangaliev
>  wrote:
> >> > Ildus, could you please review Amit & Robert's patch?
> >> [1] -
> >>
> http://www.postgresql.org/message-id/caa4ek1kdec1tm5ya9gkv85vtn4qqsrxzkjru-tu70g_tl1x...@mail.gmail.com
> > About [1] and [2]. They are slightly conflicting, but if [1] is
> > going to be committed I can easily use it in [2]. And it will simplify
> > my patch, so I have no objections here. And the same time [3] can be
> > significantly simplified and improved on top of [1] and [2].
>
> Great - let's try to deal with [1] first, then.
>
> Does anyone wish to object to me committing that part?
>

No objections from me.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-09 Thread Alexander Korotkov
On Wed, Sep 9, 2015 at 9:01 AM, Michael Paquier 
wrote:

>
>
> On Wed, Sep 9, 2015 at 3:27 AM, Alexander Korotkov
>  wrote:
>
>> On Tue, Sep 8, 2015 at 10:28 AM, Michael Paquier
>> wrote:
>>
>>> I am planning to do as well a detailed code review rather soon.
>>>
>>
>> Good, thanks.
>>
>
> When testing a bit more complex structures, it occurred to me that we may
> want as well to use as a source node a standby. For example based on the
> case of my cluster above:
>  master (5432)
>   /  \
>  1 (5433)   2 (5434)
>  |
>  3 (5435)
> If I try for example to rebuild the cluster as follows there will be
> failures:
> 1) Rewind with source = 3, target = 2
> 2) Start 3 and 2
> 3) Shutdown 2
> 3) Rewind source = 2, target = 1, failure with:
> source data directory must be shut down cleanly
>
> It seems to me that we should allow a node that has been shutdowned in
> recovery to be used as a source for rewind as well, as follows:
> -   if (datadir_source && ControlFile_source.state != DB_SHUTDOWNED)
> +   if (datadir_source &&
> +   ControlFile_source.state != DB_SHUTDOWNED &&
> +   ControlFile_source.state != DB_SHUTDOWNED_IN_RECOVERY)
> pg_fatal("source data directory must be shut down
> cleanly\n");
> At least your patch justifies in my eyes such a change.
>

The source is not required to be out of recovery if we specify it by a
connection string. This is why I think DB_SHUTDOWNED_IN_RECOVERY is OK when
we specify the source by data directory. Included in the attached patch.

 /*
> + * Find minimum from two xlog pointers assuming invalid pointer is
> greatest
> + * possible pointer.
> + */
> +static XLogRecPtr
> +xlPtrMin(XLogRecPtr a, XLogRecPtr b)
> +{
> +   if (XLogRecPtrIsInvalid(a))
> +   return b;
> +   else if (XLogRecPtrIsInvalid(b))
> +   return a;
> +   else
> +   return Min(a, b);
> +}
> This is true, as timeline.h says so; still, I think that it would be better
> to mention that this assumption is held in this header file, or
> simply that timeline history entries at the top have their end position set
> to InvalidXLogRecPtr, which is a synonym of infinity.
>

Agree. Comment is adjusted.


> The code building the target history file is a duplicate of what is done
> with the source. Perhaps we could refactor that as a single routine in
> pg_rewind.c?
>

Do you mean the duplication between rewind_parseTimeLineHistory()
and readTimeLineHistory()? We could put readTimeLineHistory() into common
code with some refactoring. This is the subject of a separate patch, though.

Except that, the patch looks pretty neat to me. I was wondering as well:
> what tests did you run up to now with this patch? I am attaching an updated
> version of my test script I used for some more complex scenarios. Feel free
> to play with it.
>

I was running manual tests like a noob :) When you attached your bash
script, I switched to it.

I also added an additional check in findCommonAncestorTimeline(). Two standbys
could be independently promoted and get the same new timeline ID. Now it is
checked that the timelines we assume to be the same have equal starting points.
The starting points could still match by coincidence, but the same risk exists
in the original pg_rewind.
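
For reference, the extra check boils down to something of this shape (a sketch
with an illustrative function name, not the exact code from the patch;
logging.h here is pg_rewind's own header):

#include "postgres_fe.h"

#include "access/timeline.h"

#include "logging.h"            /* pg_rewind's pg_fatal() */

/*
 * When the source and target histories both contain an entry with the same
 * timeline ID, also require that the two copies of the timeline begin at the
 * same LSN before treating them as the same timeline.
 */
static void
sanity_check_common_timeline(TimeLineHistoryEntry *a, TimeLineHistoryEntry *b)
{
    if (a->tli == b->tli && a->begin != b->begin)
        pg_fatal("timeline %u starts at different locations in source and target\n",
                 a->tli);
}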

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg-rewind-target-switch-4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minor code improvements to create_foreignscan_plan/ExecInitForeignScan

2015-09-09 Thread Etsuro Fujita
On 2015/07/10 18:40, Etsuro Fujita wrote:
> To save cycles, I modified create_foreignscan_plan so that it detects
> whether any system columns are requested if scanning a base relation.
> Also, I revised other code there a little bit.

Attached is an updated version of the patch.  The previous version
contained changes to ExecInitForeignScan, but I've dropped that part, as
discussed before.

Best regards,
Etsuro Fujita
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***
*** 2058,2066  create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
  	RelOptInfo *rel = best_path->path.parent;
  	Index		scan_relid = rel->relid;
  	Oid			rel_oid = InvalidOid;
- 	Bitmapset  *attrs_used = NULL;
- 	ListCell   *lc;
- 	int			i;
  
  	Assert(rel->fdwroutine != NULL);
  
--- 2058,2063 
***
*** 2120,2155  create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
  	}
  
  	/*
! 	 * Detect whether any system columns are requested from rel.  This is a
! 	 * bit of a kluge and might go away someday, so we intentionally leave it
! 	 * out of the API presented to FDWs.
! 	 *
! 	 * First, examine all the attributes needed for joins or final output.
! 	 * Note: we must look at reltargetlist, not the attr_needed data, because
! 	 * attr_needed isn't computed for inheritance child rels.
  	 */
! 	pull_varattnos((Node *) rel->reltargetlist, rel->relid, &attrs_used);
! 
! 	/* Add all the attributes used by restriction clauses. */
! 	foreach(lc, rel->baserestrictinfo)
  	{
! 		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
  
! 		pull_varattnos((Node *) rinfo->clause, rel->relid, &attrs_used);
! 	}
  
! 	/* Now, are any system columns requested from rel? */
! 	scan_plan->fsSystemCol = false;
! 	for (i = FirstLowInvalidHeapAttributeNumber + 1; i < 0; i++)
! 	{
! 		if (bms_is_member(i - FirstLowInvalidHeapAttributeNumber, attrs_used))
  		{
! 			scan_plan->fsSystemCol = true;
! 			break;
  		}
- 	}
  
! 	bms_free(attrs_used);
  
  	return scan_plan;
  }
--- 2117,2161 
  	}
  
  	/*
! 	 * If we're scanning a base relation, detect whether any system columns
! 	 * are requested from the rel.  (Irrelevant if scanning a join relation.)
! 	 * This is a bit of a kluge and might go away someday, so we intentionally
! 	 * leave it out of the API presented to FDWs.
  	 */
! 	scan_plan->fsSystemCol = false;
! 	if (scan_relid > 0)
  	{
! 		Bitmapset  *attrs_used = NULL;
! 		ListCell   *lc;
! 		int			i;
  
! 		/*
! 		 * First, examine all the attributes needed for joins or final output.
! 		 * Note: we must look at reltargetlist, not the attr_needed data,
! 		 * because attr_needed isn't computed for inheritance child rels.
! 		 */
! 		pull_varattnos((Node *) rel->reltargetlist, scan_relid, &attrs_used);
  
! 		/* Add all the attributes used by restriction clauses. */
! 		foreach(lc, rel->baserestrictinfo)
  		{
! 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
! 
! 			pull_varattnos((Node *) rinfo->clause, scan_relid, &attrs_used);
  		}
  
! 		/* Now, are any system columns requested from rel? */
! 		for (i = FirstLowInvalidHeapAttributeNumber + 1; i < 0; i++)
! 		{
! 			if (bms_is_member(i - FirstLowInvalidHeapAttributeNumber, attrs_used))
! 			{
! scan_plan->fsSystemCol = true;
! break;
! 			}
! 		}
! 
! 		bms_free(attrs_used);
! 	}
  
  	return scan_plan;
  }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-09 Thread Shulgin, Oleksandr
On Wed, Sep 9, 2015 at 10:22 AM, Shulgin, Oleksandr <
oleksandr.shul...@zalando.de> wrote:

> On Wed, Sep 9, 2015 at 8:36 AM, Pavel Stehule 
> wrote:
>
>>
>>> Please find attached v4.
>>>
>>
>> It is better
>>
>
> One important thing to notice, and it probably deserves a code comment, is
> that any modification of the slot fields done by the info sender backend
> must be done before actually sending the message via the shared memory queue
> (or detaching it, if there's nothing to be sent).  Otherwise it is possible
> that the is_processed flag will be set on a slot that has already been
> reused by the receiving backend.
>

Sorry, the latest version did contain some debugging messages I didn't
intend to send.  I've removed them and also added a comment about the above
synchronization problem.

--
Alex
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 32ac58f..a2b757c 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -139,6 +139,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, CmdStatusShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -243,6 +244,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	ReplicationOriginShmemInit();
 	WalSndShmemInit();
 	WalRcvShmemInit();
+	CmdStatusShmemInit();
 
 	/*
 	 * Set up other modules that need some shared memory space
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 0abde43..e637be1 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -26,6 +26,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/cmdstatus.h"
 
 
 /*
@@ -296,6 +297,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	if (CheckProcSignal(PROCSIG_CMD_STATUS_INFO))
+		HandleCmdStatusInfoInterrupt();
+
 	if (set_latch_on_sigusr1)
 		SetLatch(MyLatch);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d917af3..1a5b03c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -67,6 +67,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/cmdstatus.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2991,6 +2992,9 @@ ProcessInterrupts(void)
 
 	if (ParallelMessagePending)
 		HandleParallelMessages();
+
+	if (CmdStatusInfoPending)
+		ProcessCmdStatusInfoRequest();
 }
 
 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 3ed0b44..2c8687c 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -18,7 +18,7 @@ endif
 # keep this list arranged alphabetically or it gets to be a mess
 OBJS = acl.o arrayfuncs.o array_expanded.o array_selfuncs.o \
 	array_typanalyze.o array_userfuncs.o arrayutils.o ascii.o \
-	bool.o cash.o char.o date.o datetime.o datum.o dbsize.o domains.o \
+	bool.o cash.o char.o cmdstatus.o date.o datetime.o datum.o dbsize.o domains.o \
 	encode.o enum.o expandeddatum.o \
 	float.o format_type.o formatting.o genfile.o \
 	geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
diff --git a/src/backend/utils/adt/cmdstatus.c b/src/backend/utils/adt/cmdstatus.c
new file mode 100644
index 000..40df6e1
--- /dev/null
+++ b/src/backend/utils/adt/cmdstatus.c
@@ -0,0 +1,640 @@
+/*-
+ *
+ * cmdstatus.c
+ *	  Definitions for pg_cmdstatus function.
+ *
+ * Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/cmdstatus.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+
+#include "access/htup_details.h"
+#include "commands/explain.h"
+#include "lib/stringinfo.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shm_mq.h"
+#include "storage/shm_toc.h"
+#include "tcop/dest.h"
+#include "tcop/pquery.h"
+#include "utils/builtins.h"
+#include "utils/cmdstatus.h"
+
+
+typedef enum {
+	CMD_STATUS_REQUEST_EXPLAIN = 1,
+	CMD_STATUS_REQUEST_QUERY_TEXT = 2,
+	CMD_STATUS_REQUEST_PROGRESS_TAG = 3,
+	CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE = 4
+} CmdStatusInfoRequestType;
+
+#define CMD_STATUS_MAX_REQUEST CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE
+
+typedef enum {
+	CMD_STATUS_RESULT_FAILURE = -1,
+	CMD_STATUS_RESULT_SUCCESS = 0,
+	CMD_STATUS_RESULT_BACKEND_IDLE,
+	CMD_STATUS_RESULT_NO_COMMAND_TAG
+} CmdStatusInfoResultC

Re: [HACKERS] checkpointer continuous flushing

2015-09-09 Thread Fabien COELHO


Hello Amit,


I think that we may conclude, on these runs:

(1) sorting seems not to harm performance, and may help a lot.


I agree with the first part, but about helping a lot, I am not sure


I'm focussing on the "sort" dimension alone, that is, I'm comparing the
average tps performance with sorting against the same test without sorting.
There are 4 cases from your tests, if I'm not mistaken:


 - T1 flush=off  27480 -> 27482 :   +0.0%
 - T1 flush=on   25214 -> 26819 :   +6.3%
 - T2 flush=off   5050 ->  6194 :  +22.6%
 - T2 flush=on    2771 ->  6110 : +120.4%

The average improvement induced by sort=on is +50%; if you do not agree on
"a lot", maybe we can agree on "significantly" :-)


based on the tests conducted by me, among all the runs, it has shown an
improvement in average TPS in one case, and that too with a dip in the number
of times the TPS is below 10.



(2) Linux flushing with sync_file_range may degrade the raw tps average a
little in some cases, but definitely improves performance stability
(always 100% availability when on!).


Agreed, I think the benefit is quite clear, but it would be better if we try
to do some more tests for the cases (data fits in shared_buffers) where
we saw a small regression, just to make sure that the regression is small.


I've already reported a lot of tests (several hundred hours on two
different hosts), and I did not have such a dip, but the hardware was more
"usual" or "casual", really different from your runs.


If you can run more tests, great!

I think that the main safeguard to handle the (small) uncertainty is to 
keep gucs to control these features.



(3) posix_fadvise on Linux is a bad idea... the good news is that it
is not needed there :-) How good or bad an idea it is on other systems
is an open question...


I don't know the best way to verify that; if somebody else has
access to such a machine, please help to get that verified.


Yep. There have been such calls on this thread, which were not very
effective up to now.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-09 Thread Pavel Stehule
Two notices:
>
> 1. The communication mechanism can be used more widely than only for this
> purpose. We can introduce a SendInfoHook - and it can be used for any
> custom probes - memory, cpu, ...
>

Not sure if for CPU you can get any more insight than an external tool like
top(1) will provide.  For the memory, it might be indeed useful in some way
(memory context inspection?)

CPU is probably nonsense - sorry - but there are other possibilities for
how to use it: simply checking some internal states of extensions without
log processing or gdb, etc.


>
> So there's certainly a space for generalization.  Rename it as
> pg_backend_info()?
>

It is a good name, so why not?

>
> 2. With your support for explain of nested queries we have all what we
>> need for integration auto_explain to core.
>>
>
> Well, I don't quite see how that follows.  What I'm really after is
> bringing instrumentation support to this, so that not only plan could be
> extracted, but also the rows/loops/times accumulated thus far during the
> query run.
>

It is possible, but it is not a necessary step - and it is a premature
question at this moment.

Regards

Pavel


>
> --
> Alex
>
>


Re: [HACKERS] pgbench progress with timestamp

2015-09-09 Thread Fabien COELHO



In the sgml, second should be plural in 'intead of the number of second
since the'.  And 'intead' should be 'instead'.


Ok.


--progress-timestamp use a Unix-like epoch timestamp for progress
reporting

but that is getting pretty long.


Indeed. I've done:

  --progress-timestamp   use Unix epoch timestamps in ms for progress

Here is a v2.

--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 2517a3a..dabb1ae 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -429,6 +429,18 @@ pgbench  options  dbname
  
 
  
+  --progress-timestamp
+  
+   
+When showing progress (option -P), use a millisecond
+timestamp (Unix epoch) instead of the number of seconds since the
+beginning of the run.
+This helps compare logs generated by various tools.
+   
+  
+ 
+
+ 
   -r
   --report-latencies
   
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 30e8d2a..f750f62 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -165,6 +165,7 @@ bool		use_quiet;			/* quiet logging onto stderr */
 int			agg_interval;		/* log aggregates instead of individual
  * transactions */
 int			progress = 0;		/* thread progress report every this seconds */
+bool		progress_timestamp = false;
 int			progress_nclients = 0;		/* number of clients for progress
 		 * report */
 int			progress_nthreads = 0;		/* number of threads for progress
@@ -389,6 +390,7 @@ usage(void)
 		   "  -v, --vacuum-all vacuum all four standard tables before tests\n"
 		   "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
 		   "  --sampling-rate=NUM  fraction of transactions to log (e.g. 0.01 for 1%%)\n"
+		   "  --progress-timestamp use Unix epoch timestamps in ms for progress\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug  print debugging output\n"
 	  "  -h, --host=HOSTNAME  database server host or socket directory\n"
@@ -2774,6 +2776,7 @@ main(int argc, char **argv)
 		{"aggregate-interval", required_argument, NULL, 5},
 		{"rate", required_argument, NULL, 'R'},
 		{"latency-limit", required_argument, NULL, 'L'},
+		{"progress-timestamp", no_argument, NULL, 6},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3110,6 +3113,10 @@ main(int argc, char **argv)
 }
 #endif
 break;
+			case 6:
+progress_timestamp = true;
+benchmarking_option_set = true;
+break;
 			default:
 fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 exit(1);
@@ -3748,6 +3755,7 @@ threadRun(void *arg)
 			sqlat,
 			lag,
 			stdev;
+char		tbuf[64];
 
 /*
  * Add up the statistics of all threads.
@@ -3780,10 +3788,16 @@ threadRun(void *arg)
 stdev = 0.001 * sqrt(sqlat - 100.0 * latency * latency);
 lag = 0.001 * (lags - last_lags) / (count - last_count);
 
+if (progress_timestamp)
+	sprintf(tbuf, "%.03f s",
+			INSTR_TIME_GET_MILLISEC(now_time) / 1000.0);
+else
+	sprintf(tbuf, "%.1f s", total_run);
+
 fprintf(stderr,
-		"progress: %.1f s, %.1f tps, "
-		"lat %.3f ms stddev %.3f",
-		total_run, tps, latency, stdev);
+		"progress: %s, %.1f tps, lat %.3f ms stddev %.3f",
+		tbuf, tps, latency, stdev);
+
 if (throttle_delay)
 {
 	fprintf(stderr, ", lag %.3f ms", lag);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-09 Thread Shulgin, Oleksandr
On Wed, Sep 9, 2015 at 8:36 AM, Pavel Stehule 
wrote:

>
>> Please find attached v4.
>>
>
> It is better
>

One important thing to notice, and it probably deserves a code comment, is
that any modification of the slot fields done by the info sender backend must
be done before actually sending the message via the shared memory queue (or
detaching it, if there's nothing to be sent).  Otherwise it is possible that
the is_processed flag will be set on a slot that has already been reused
by the receiving backend.

I could hit that problem rather easily while doing

select pg_cmdstatus(pid,$1) from pg_stat_activity where
pid<>pg_backend_pid() \watch 1

and running pgbench with >= 8 connections in the background.  The observed
effect is that after a few successful runs, the pg_cmdstatus() never
returns because the next signaled backend loops through the slots and
doesn't find an unprocessed one (because it was marked as processed by the
backend that was responding just before it).
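
In code form, the sender side has to keep roughly this ordering (a sketch with
a made-up slot struct; the point is only that the slot updates come before the
send/detach):

#include "postgres.h"

#include "storage/shm_mq.h"

/* Illustrative slot layout; the real one lives in cmdstatus.c. */
typedef struct
{
    bool        is_processed;
    int         target_pid;
} CmdStatusSlot;

static void
send_cmd_status_reply(CmdStatusSlot *slot, shm_mq *mq, shm_mq_handle *mqh,
                      const char *payload, Size len)
{
    /* 1. finish all updates to the shared slot first ... */
    slot->is_processed = true;

    /* 2. ... and only then hand the message (or nothing) to the requester */
    if (len > 0)
        (void) shm_mq_send(mqh, len, payload, false);
    else
        shm_mq_detach(mq);
}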

Two notices:
>
> 1. The communication mechanism can be used more widely than only for this
> purpose. We can introduce a SendInfoHook - and it can be used for any
> custom probes - memory, cpu, ...
>

Not sure if for CPU you can get any more insight than an external tool like
top(1) will provide.  For the memory, it might be indeed useful in some way
(memory context inspection?)

So there's certainly a space for generalization.  Rename it as
pg_backend_info()?

2. With your support for explain of nested queries we have all what we need
> for integration auto_explain to core.
>

Well, I don't quite see how that follows.  What I'm really after is
bringing instrumentation support to this, so that not only plan could be
extracted, but also the rows/loops/times accumulated thus far during the
query run.

--
Alex


Re: [HACKERS] Odd/undocumented precedence of concatenation operator

2015-09-09 Thread Shay Rojansky
>
> It is expected, and documented.  (It's also different in 9.5, see
>
> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c6b3c939b7e0f1d35f4ed4996e71420a993810d2
> )
>

Ah, thanks!


> > If nothing else, it seems that the concatenation operator should be
> listed
> > on the operator precedence table at
> >
> http://www.postgresql.org/docs/9.4/static/sql-syntax-lexical.html#SQL-PRECEDENCE-TABLE
>
> Both >= and || fall into the "any other operator" case, no?
>

I somehow missed that in the table, assuming that >= would be somewhere
with > and =. Thanks again.


Re: [HACKERS] proposal: function parse_ident

2015-09-09 Thread Pavel Stehule
next iteration - fixed bug in parsing UTF8 chars, enhanced error messages.

Regards

Pavel
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index b3b78d2..75ea33a
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***
*** 1707,1712 
--- 1707,1727 

 
  
+  parse_ident
+ 
+ parse_ident(qualified_identifier text,
+ OUT parts text[], OUT other text)
+
+record
+Split qualified identifier to array parts.
+
+parse_ident('"SomeSchema".someTable')
+("{SomeSchema,sometable}",)
+   
+ 
+   
+
+ 
   pg_client_encoding
  
  pg_client_encoding()
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
new file mode 100644
index c0495d9..59f1e3e
*** a/src/backend/utils/adt/misc.c
--- b/src/backend/utils/adt/misc.c
***
*** 21,32 
--- 21,35 
  #include 
  
  #include "access/sysattr.h"
+ #include "access/htup_details.h"
  #include "catalog/catalog.h"
+ #include "catalog/namespace.h"
  #include "catalog/pg_tablespace.h"
  #include "catalog/pg_type.h"
  #include "commands/dbcommands.h"
  #include "funcapi.h"
  #include "miscadmin.h"
+ #include "parser/scansup.h"
  #include "parser/keywords.h"
  #include "postmaster/syslogger.h"
  #include "rewrite/rewriteHandler.h"
***
*** 38,43 
--- 41,47 
  #include "utils/ruleutils.h"
  #include "tcop/tcopprot.h"
  #include "utils/acl.h"
+ #include "utils/array.h"
  #include "utils/builtins.h"
  #include "utils/timestamp.h"
  
*** pg_column_is_updatable(PG_FUNCTION_ARGS)
*** 598,600 
--- 602,759 
  
  	PG_RETURN_BOOL((events & REQ_EVENTS) == REQ_EVENTS);
  }
+ 
+ 
+ /*
+  * This simple parser utility are compatible with lexer implementation,
+  * used only in parse_ident function
+  */
+ static bool
+ is_ident_start(unsigned char c)
+ {
+ 	if (c == '_')
+ 		return true;
+ 	if (isalpha(c))
+ 		return true;
+ 
+ 	if (c >= 0200 && c <= 0377)
+ 		return true;
+ 
+ 	return false;
+ }
+ 
+ static bool
+ is_ident_cont(unsigned char c)
+ {
+ 	if (isdigit(c))
+ 		return true;
+ 
+ 	return is_ident_start(c);
+ }
+ 
+ /*
+  * parse_ident - decompose text identifier to parts of
+  * identifiers and the rest (doesn't follow a identifier rule).
+  */
+ Datum
+ parse_ident(PG_FUNCTION_ARGS)
+ {
+ 	text		*qualname = PG_GETARG_TEXT_PP(0);
+ 	char		*qualname_str;
+ 	Datum		values[2];
+ 	bool		nulls[2];
+ 	TupleDesc	tupdesc;
+ 	ArrayBuildState *astate = NULL;
+ 	char	*nextp;
+ 
+ 	qualname_str = text_to_cstring(qualname);
+ 	nextp = qualname_str;
+ 
+ 	/* Prepare result tuple desc */
+ 	tupdesc = CreateTemplateTupleDesc(2, false);
+ 	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "parts",
+ 	TEXTARRAYOID, -1, 0);
+ 	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "other",
+ 	TEXTOID, -1, 0);
+ 
+ 	BlessTupleDesc(tupdesc);
+ 
+ 	nulls[0] = false;
+ 	nulls[1] = true;
+ 
+ 	/* skip leading whitespace */
+ 	while (isspace((unsigned char) *nextp))
+ 		nextp++;
+ 
+ 	do
+ 	{
+ 		char		*curname;
+ 		char		*endp;
+ 		bool		missing_ident;
+ 
+ 		missing_ident = true;
+ 
+ 		if (*nextp == '\"')
+ 		{
+ 			curname = nextp + 1;
+ 			for (;;)
+ 			{
+ endp = strchr(nextp + 1, '\"');
+ if (endp == NULL)
+ 	ereport(ERROR,
+ 		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 		 errmsg("unclosed double quotes")));
+ if (endp[1] != '\"')
+ 	break;
+ memmove(endp, endp + 1, strlen(endp));
+ nextp = endp;
+ 			}
+ 			nextp = endp + 1;
+ 			*endp = '\0';
+ 
+ 			if (endp - curname == 0)
+ ereport(ERROR,
+ 	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 	 errmsg("identifier should not be empty")));
+ 
+ 			astate = accumArrayResult(astate,
+ CStringGetTextDatum(curname), false,
+ 		TEXTOID, CurrentMemoryContext);
+ 			missing_ident = false;
+ 		}
+ 		else
+ 		{
+ 			if (is_ident_start((unsigned char) *nextp))
+ 			{
+ char *downname;
+ int	len;
+ text	*part;
+ 
+ curname = nextp++;
+ while (is_ident_cont((unsigned char) *nextp))
+ 	nextp++;
+ 
+ len = nextp - curname;
+ 
+ downname = downcase_truncate_identifier(curname, len, false);
+ part = cstring_to_text_with_len(downname, len);
+ astate = accumArrayResult(astate,
+ 	PointerGetDatum(part), false,
+ 			TEXTOID, CurrentMemoryContext);
+ missing_ident = false;
+ 			}
+ 		}
+ 
+ 		if (missing_ident)
+ 			ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+  errmsg("missing identifier after \".\" symbol")));
+ 
+ 		while (isspace((unsigned char) *nextp))
+ 			nextp++;
+ 
+ 		if (*nextp == '.')
+ 		{
+ 			nextp++;
+ 			while (isspace((unsigned char) *nextp))
+ nextp++;
+ 			continue;
+ 		}
+ 		else if (*nextp == '\0')
+ 		{
+ 			break;
+ 		}
+ 		else
+ 		{
+ 			values[1] = CStringGetTextDatum(nextp);
+ 			nulls[1] = false;
+ 			break;
+ 		}
+ 	} while (true);