Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-07-07 Thread Michael Paquier
On Fri, Jul 8, 2016 at 2:00 PM, Andrew Borodin  wrote:

(Please don't top-post on threads, this is annoying)

> One more thing came to my mind:
> WAL records written by GiST before the patch will corrupt a GiST index
> after the patch if replayed.
> Is that a problem? Maybe it's prevented at some higher level?

If a feature changes the shape of WAL records, XLOG_PAGE_MAGIC is
bumped to prevent any problems.
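
As a minimal sketch of that mechanism (paraphrased, not the actual
xlogreader code): every WAL page header carries an xlp_magic field, and a
server whose compiled-in XLOG_PAGE_MAGIC differs refuses to replay the
page at all, so pre-patch records are rejected rather than misinterpreted.

    #include "postgres.h"
    #include "access/xlog_internal.h"

    static bool
    wal_page_magic_ok(XLogPageHeader hdr)
    {
        /* WAL written under a different WAL-format version is refused */
        return hdr->xlp_magic == XLOG_PAGE_MAGIC;
    }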
-- 
Michael




Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

2016-07-07 Thread Andrew Borodin
One more thing came to my mind:
WAL records written by GiST before the patch will corrupt a GiST index
after the patch if replayed.
Is that a problem? Maybe it's prevented at some higher level?

Best regards, Andrey Borodin, Octonica & Ural Federal University.

2016-07-06 22:11 GMT+05:00 Andrew Borodin :
> Here is a new patch. Updates:
> 1. Uses MAXALIGN to allocate space on the page
> 2. Uses ItemIdSetNormal in case the ItemId was not normal for some
> reason before the call
> 3. Improved comments and var names
>
> Best regards, Andrey Borodin, Octonica & Ural Federal University.
>
> 2016-07-05 17:54 GMT+05:00 Andrew Borodin :
>> Here is a patch with updated WAL behavior.
>>
>> I'm not certain about MAXALIGN for the size of the appended tuple.
>> Currently, if anyone calls PageIndexTupleOverwrite() with an unaligned
>> tuple, it will ereport(ERROR).
>> I suspect that it should align the size instead.
>>
>> Also, I suspect that PageIndexTupleOverwrite() should make a call to
>> ItemIdSetNormal() so that it behaves as a generic function and not just
>> as a private part of GiST.
>>
>> Best regards, Andrey Borodin, Octonica & Ural Federal University.
>>
>> 2016-07-04 22:59 GMT+05:00 Andrew Borodin :
>>> Thank you, Alvaro.
>>>
>>> Yes, now I see. I'll update gistRedoPageUpdateRecord() to work
>>> consistently with the patched gistplacetopage().
>>> Also, I've noticed that PageAddItem uses the MAXALIGN() macro to calculate
>>> tuple size. So, PageIndexTupleOverwrite should behave the same.
>>>
>>> About PageIndexDeleteNoCompact() in BRIN: I do not see why
>>> PageIndexDeleteNoCompact() would not be a good solution instead of
>>> PageIndexTupleOverwrite. Will it mark the tuple as unused until the next
>>> PageAddItem reuses its place?
>>>
>>> Best regards, Andrey Borodin, Octonica & Ural Federal University.
>>>
>>> 2016-07-04 22:16 GMT+05:00 Alvaro Herrera :
 Andrew Borodin wrote:
> Thank you, Amit.
>
> Currently, if WAL reconstruct page in another order it won't be a problem.

 We require that replay of WAL leads to identical bytes than the
 original.  Among other things, this enables use of a tool that verifies
 that WAL is working correctly simply by comparing page images.  So
 please fix it instead of just verifying that this works for GiST.

 By the way, BRIN indexes have a need of this operation too.  The current
 approach is to call PageIndexDeleteNoCompact followed by PageAddItem.
 I suppose it would be beneficial to use your new routine there too.

 --
 Álvaro Herrerahttp://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
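
The alignment and line-pointer questions above amount to something like the
following sketch (illustrative only, not the committed
PageIndexTupleOverwrite; the function and variable names are placeholders):

    #include "postgres.h"
    #include "storage/bufpage.h"

    static void
    overwrite_tuple_sketch(Page page, OffsetNumber offnum,
                           Item newtup, Size newsize)
    {
        ItemId  itemId = PageGetItemId(page, offnum);
        Size    alignedSize = MAXALIGN(newsize);   /* align, don't ereport */

        /* ... memmove the page contents to resize the old slot ... */

        /* force the slot to LP_NORMAL so callers outside GiST can use it */
        ItemIdSetNormal(itemId, ItemIdGetOffset(itemId), alignedSize);
        memcpy(PageGetItem(page, itemId), newtup, newsize);
    }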




Re: [HACKERS] \timing interval

2016-07-07 Thread Pavel Stehule
2016-07-08 0:13 GMT+02:00 Peter Geoghegan :

> On Thu, Jul 7, 2016 at 2:52 PM, Corey Huinker 
> wrote:
> > Wouldn't it be great if we had a way of printing timing in more human
> > friendly formats?
>
> Yes, it would. I've thought about doing this myself. So, +1 to the idea
> from me.
>

+1

Pavel


>
>
> --
> Peter Geoghegan
>


Re: [HACKERS] can we optimize STACK_DEPTH_SLOP

2016-07-07 Thread Tom Lane
I wrote:
> Based on these numbers, I'd have no fear of reducing STACK_DEPTH_SLOP
> to 256KB on x86_64.  It would sure be good to check things on some
> other architectures, though ...

I went to the work of doing the same test on a PPC Mac:

 182 Stack   [   8192K/ 40K]
  25 Stack   [   8192K/ 48K]
   2 Stack   [   8192K/ 56K]
  11 Stack   [   8192K/ 60K]
   5 Stack   [   8192K/ 64K]
   2 Stack   [   8192K/108K]
   1 Stack   [   8192K/576K]
   1 Stack   [   8192K/   2056K]

The last number here is "resident pages", not "dirty pages", because
this older version of OS X doesn't provide the latter.  Still, the
numbers seem to track pretty well with the ones I got on x86_64.
Which is a bit odd when you think about it: a 32-bit machine ought
to consume less stack space because pointers are narrower.

Also on my old HPPA dinosaur:

  40  addr 0x7b03a000, length 8, physical pages 8, type STACK
 166  addr 0x7b03a000, length 10, physical pages 9, type STACK
  26  addr 0x7b03a000, length 12, physical pages 11, type STACK
  16  addr 0x7b03a000, length 14, physical pages 13, type STACK
   1  addr 0x7b03a000, length 15, physical pages 13, type STACK
   1  addr 0x7b03a000, length 16, physical pages 15, type STACK
   2  addr 0x7b03a000, length 28, physical pages 27, type STACK
   1  addr 0x7b03a000, length 190, physical pages 190, type STACK
   1  addr 0x7b03a000, length 514, physical pages 514, type STACK

As best I can tell, "length" is the nominal virtual space for the stack,
and "physical pages" is the actually allocated/resident space, both
measured in 4K pages.  So that again matches pretty well, although the
stack-efficiency of the recursive regex functions seems to get worse with
each new case I look at.

However ... the thread here
https://www.postgresql.org/message-id/flat/21563.1289064886%40sss.pgh.pa.us
says that depending on your choice of compiler and optimization level,
IA64 can be 4x to 5x worse for stack space than x86_64, even after
spotting it double the memory allocation to handle its two separate
stacks.  I don't currently have access to an IA64 machine to check.

Based on what I'm seeing so far, really 100K ought to be more than plenty
of slop for most architectures, but I'm afraid to go there for IA64.

Also, there might be some more places like tzload() that are putting
unreasonably large variables on the stack, but that the regression tests
don't exercise (I've not tested anything replication-related, for
example).

Bottom line: I propose that we keep STACK_DEPTH_SLOP at 512K for IA64
but reduce it to 256K for everything else.
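
For context, the constant matters because max_stack_depth may only claim
the process stack rlimit minus the slop; roughly (paraphrasing the GUC
check in tcop/postgres.c, not quoting it verbatim):

    long    newval_bytes = newval_kb * 1024L;
    long    stack_rlimit = get_stack_depth_rlimit();

    if (stack_rlimit > 0 && newval_bytes > stack_rlimit - STACK_DEPTH_SLOP)
        return false;   /* leave the slop free for error recovery */

so shrinking STACK_DEPTH_SLOP directly raises the usable max_stack_depth
under a given ulimit.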

regards, tom lane




Re: [HACKERS] A Modest Upgrade Proposal

2016-07-07 Thread Alvaro Herrera
Tom Lane wrote:
> Simon Riggs  writes:

> > pg_am has existed for decades without supporting DDL
> 
> That argument has been obsoleted by events ;-) ... and in any case, the
> reason we went without CREATE ACCESS METHOD for so long was not that we
> encouraged "INSERT INTO pg_am" but that non-core index AMs were
> effectively unsupported anyway, until we thought of a reasonable way to
> let them generate WAL.  Without the WAL stumbling block, I'm sure we would
> have built CREATE ACCESS METHOD long ago.

Note that the alternative to DDL-based replication handling is not
INSERT INTO pg_replication, but a function-based interface such as
SELECT pg_replication_node_create(foo, bar); so there's no need to
hardcode catalog definitions; nor is there a need to skip backup-ability
of logical replication config: pg_dump support can be added by having it
output function calls -- not catalog INSERTs!

The only difference between DDL and no DDL is that a function-based
interface can be added with a few pg_proc.h entries, whereas the DDL
stuff requires gram.y additions, new nodes, etc.
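
For illustration, dump output under such an interface might look like this
(the function names here are hypothetical, not a proposed or existing API):

    -- hypothetical pg_dump output for a function-based interface
    SELECT pg_replication_node_create('node1', 'host=10.0.0.5 port=5432');
    SELECT pg_replication_set_add_table('default_set',
                                        'public.orders'::regclass);

Such calls restore cleanly like any other SQL, which is the backup-ability
point being made above.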

-- 
Álvaro Herrera        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] A Modest Upgrade Proposal

2016-07-07 Thread Tom Lane
Simon Riggs  writes:
> On 8 July 2016 at 02:41, Robert Haas  wrote:
>> DDL is our standard way of getting things into the system catalogs.
>> We have no system catalog metadata that is intended to be populated by
>> any means other than DDL.

> pg_am has existed for decades without supporting DDL

That argument has been obsoleted by events ;-) ... and in any case, the
reason we went without CREATE ACCESS METHOD for so long was not that we
encouraged "INSERT INTO pg_am" but that non-core index AMs were
effectively unsupported anyway, until we thought of a reasonable way to
let them generate WAL.  Without the WAL stumbling block, I'm sure we would
have built CREATE ACCESS METHOD long ago.  It's just way too hard to deal
with issues like cross-version changes otherwise.

> and we have gone to
> great lengths over many years to allow catalog tables to be
> inserted/updated/deleted by normal SQL rather than DDL, so not all catalog
> access is via DDL.

But *all* of that is on "if you break it you get to keep both pieces"
terms.  In particular we do not promise the catalogs will remain stable
across versions, so that inserts or updates on catalogs are very likely
to break in new versions.  I think that all such operations should be
understood as emergency procedures, not recommended standard practice.

> One of my examples was full text search and it does have
> DDL, but that was an anti-example; all the feedback I have is that it was
> much easier to use before it had DDL and that forcing it to use DDL pretty
> much killed it for most users.

That's just unsupported FUD.  I would say that most of the problems we've
had with text search DDL came from the fact that previously people had
done things in other ways and transitioning was hard.  That experience
doesn't make me want to repeat it; but building a feature that's supposed
to be managed by direct catalog updates is precisely repeating that
mistake.

I'm okay with things like replication configuration being managed outside
the system catalogs entirely (as indeed they are now).  But if a feature
has a catalog, it should have DDL to manipulate the catalog.  Direct SQL
on a catalog should *never* become standard operating procedure.

regards, tom lane




Re: [HACKERS] A Modest Upgrade Proposal

2016-07-07 Thread Simon Riggs
On 8 July 2016 at 02:41, Robert Haas  wrote:


> > Personally, I'm in the group of people that don't see the need for DDL.
> > There are already many successful features that don't utilize DDL, such
> > as backup, advisory locks, and some features that use DDL that don't
> > really need to, such as LISTEN/NOTIFY, full text search, etc. Also note
> > that both Oracle and SQLServer have moved away from DDL in favour of
> > function APIs, and most NoSQL databases and almost all languages prefer
> > functional interfaces over parsed text languages, so I don't see a huge
> > industry revival for DDL as a means of specifying things.
>
> DDL is our standard way of getting things into the system catalogs.
> We have no system catalog metadata that is intended to be populated by
> any means other than DDL.  If you want to add a column to a table, you
> say ALTER TABLE .. ADD COLUMN.  If you want to add a column to an
> extension, you say ALTER EXTENSION .. ADD TABLE.   If you want to add
> an option to a foreign table, you say ALTER FOREIGN TABLE .. OPTIONS
> (ADD ..).  Therefore, I think it is entirely reasonable and obviously
> consistent with existing practice that if you want to add a table to a
> replication set, you should write ALTER REPLICATION SET .. ADD TABLE.
> I don't understand why logical replication should be the one feature
> that departs from the way that all of our other features work.  Sure,
> we have other features that do not involve DDL, but (1) one of your
> examples is full text search, which of course does have DDL, and was
> moved from an interface that did not involve DDL to one that did
> because the latter is better and (2) your other examples don't involve
> defining catalog contents, which makes them apples-to-oranges
> comparisons.
>

pg_am has existed for decades without supporting DDL and we have gone to
great lengths over many years to allow catalog tables to be
inserted/updated/deleted by normal SQL rather than DDL, so not all catalog
access is via DDL. One of my examples was full text search and it does have
DDL, but that was an anti-example; all the feedback I have is that it was
much easier to use before it had DDL and that forcing it to use DDL pretty
much killed it for most users.

Anyway, backups and replication slots don't use DDL because they need to
work on standbys. So if you are arguing in favour of forcing logical
replication to never work on standbys, I'm interested in why that
restriction is useful and sensible, especially since we already agreed that
a failover mechanism for use of logical replication on standbys was
desirable. It seems likely that we're discussing this at too high a level
and that we each see things the other does not.


> The burden of proof isn't on me to demonstrate why this feature "needs
> DDL"; it's on you to explain why replication-related operations that
> establish persistent database state don't need to behave just like all
> other commands.  Really, where this jumped the shark for me is when
> you argued that this stuff didn't even need pg_dump support.  Come on.
> This feature doesn't get a pass from handling all of the things that
> every existing similar feature needs to deal with.


As I already said, I accept that there needs to be some way to back up
replication config; the meeting continued after you left.

I note also that replication slots aren't backed up by pg_dump; I see an
analogy here and think that at least some parts of logical replication will
be similar and not require DDL at all, just as slots do not.

pg_dump support doesn't require DDL, in any case, nor is it certain yet
that pg_dump is the right utility for backup.

The main point I see is that the user interface mechanisms have very little
to do with DDL or not. Having a command called ALTER REPLICATION SLOT or a
function called pg_alter_replication_slot() makes little real difference to
a user.

We have much to discuss in terms of security, the way it should work, and
what options to support; a sidetrack into syntax isn't warranted at this
early stage. Please let's discuss those important things first, then return
to whether DDL makes sense or not; it may do, or may not, or more likely
we'll find which parts need DDL and which do not.

-- 
Simon Riggs        http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] A Modest Upgrade Proposal

2016-07-07 Thread Simon Riggs
On 8 July 2016 at 02:41, Robert Haas  wrote:

> On Thu, Jul 7, 2016 at 7:15 PM, Simon Riggs  wrote:
> > Yes, I ran the unconference session. It was a shame you weren't able to
> stay
> > for the whole discussion.
>
> I thought I sat through, at least, most of it, but you barely gave
> anyone else a chance to talk, which kind of misses the point of an
> unconference.  The portion which I attended was not about how to move
> the development of the feature forward, but just involved describing
> it.  I thought it was a shame that the time wasn't used better.


I think the problem was that I gave everybody an even shot at commenting,
rather than focusing on a few key developers.

There were twenty people actively involved in that discussion.


> > We all agreed that an in-core solution was desirable, if only for wider
> > adoption.
>
> Yep.
>
> > About half the people wanted DDL and about half the people didn't. When
> > we discussed why we wanted DDL there weren't any answers apart from the
> > thought that we want to be able to back up the replication
> > configurations, which seemed to be possible with or without DDL. Any
> > such backup would need to be easily removed from the objects themselves,
> > to avoid external dependencies on making recovery work.
>
> I really don't think that's accurate.  There might have been 50% of
> people who thought that not having DDL was acceptable, but I think
> there were very few people who found it preferable.


Without being in the room, it's kinda hard for you to know, right?


> > Chris Browne finally summed it up by saying we could wait on having DDL
> > until some time later, once we've decided on things like how we configure
> > it, how we secure it and what/how to store it in the catalog. "We could
> > probably live without DDL in the first version."
>
> Right.  In other words, DDL would be desirable, but he'd be willing to
> live without it if that somehow made things easier.  But it really
> doesn't.  Adding new DDL commands is not particularly difficult.
>
> > Personally, I'm in the group of people that don't see the need for DDL.
>


> The burden of proof isn't on me to demonstrate why this feature "needs
> DDL"; it's on you to explain why replication-related operations that
> establish persistent database state don't need to behave just like all
> other commands.  Really, where this jumped the shark for me is when
> you argued that this stuff didn't even need pg_dump support.  Come on.
> This feature doesn't get a pass from handling all of the things that
> every existing similar feature needs to deal with.


I don't agree, not least because I wasn't the only one saying it.

-- 
Simon Riggs        http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] pg_xlogfile_name_offset() et al and recovery

2016-07-07 Thread Robert Haas
On Thu, Jul 7, 2016 at 5:26 AM, Simon Riggs  wrote:
> The behaviour of that function is defined in backbranches, so I suggest we
> should not alter that now.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] A Modest Upgrade Proposal

2016-07-07 Thread Robert Haas
On Thu, Jul 7, 2016 at 7:15 PM, Simon Riggs  wrote:
> Yes, I ran the unconference session. It was a shame you weren't able to stay
> for the whole discussion.

I thought I sat through, at least, most of it, but you barely gave
anyone else a chance to talk, which kind of misses the point of an
unconference.  The portion which I attended was not about how to move
the development of the feature forward, but just involved describing
it.  I thought it was a shame that the time wasn't used better.

> We all agreed that an in-core solution was desirable, if only for wider
> adoption.

Yep.

> About half the people wanted DDL and about half the people didn't. When we
> discussed why we wanted DDL there weren't any answers apart from the thought
> that we want to be able to back up the replication configurations, which
> seemed to be possible with or without DDL. Any such backup would need to be
> easily removed from the objects themselves, to avoid external dependencies
> on making recovery work.

I really don't think that's accurate.  There might have been 50% of
people who thought that not having DDL was acceptable, but I think
there were very few people who found it preferable.

> Chris Browne finally summed it up by saying we could wait on having DDL
> until some time later, once we've decided on things like how we configure
> it, how we secure it and what/how to store it in the catalog. "We could
> probably live without DDL in the first version."

Right.  In other words, DDL would be desirable, but he'd be willing to
live without it if that somehow made things easier.  But it really
doesn't.  Adding new DDL commands is not particularly difficult.

> Personally, I'm in the group of people that don't see the need for DDL.
> There are already many successful features that don't utilize DDL, such as
> backup, advisory locks, and some features that use DDL that don't really
> need to, such as LISTEN/NOTIFY, full text search, etc. Also note that both
> Oracle and SQLServer have moved away from DDL in favour of function APIs,
> and most NoSQL databases and almost all languages prefer functional
> interfaces over parsed text languages, so I don't see a huge industry
> revival for DDL as a means of specifying things.

DDL is our standard way of getting things into the system catalogs.
We have no system catalog metadata that is intended to be populated by
any means other than DDL.  If you want to add a column to a table, you
say ALTER TABLE .. ADD COLUMN.  If you want to add a column to an
extension, you say ALTER EXTENSION .. ADD TABLE.   If you want to add
an option to a foreign table, you say ALTER FOREIGN TABLE .. OPTIONS
(ADD ..).  Therefore, I think it is entirely reasonable and obviously
consistent with existing practice that if you want to add a table to a
replication set, you should write ALTER REPLICATION SET .. ADD TABLE.
I don't understand why logical replication should be the one feature
that departs from the way that all of our other features work.  Sure,
we have other features that do not involve DDL, but (1) one of your
examples is full text search, which of course does have DDL, and was
moved from an interface that did not involve DDL to one that did
because the latter is better and (2) your other examples don't involve
defining catalog contents, which makes them apples-to-oranges
comparisons.

The burden of proof isn't on me to demonstrate why this feature "needs
DDL"; it's on you to explain why replication-related operations that
establish persistent database state don't need to behave just like all
other commands.  Really, where this jumped the shark for me is when
you argued that this stuff didn't even need pg_dump support.  Come on.
This feature doesn't get a pass from handling all of the things that
every existing similar feature needs to deal with.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] A Modest Upgrade Proposal

2016-07-07 Thread Simon Riggs
On 8 July 2016 at 01:47, Joshua D. Drake  wrote:


> * Long running transaction
>

And of course you can't run any transactions at all during pg_upgrade, not
just long running ones.

-- 
Simon Riggs        http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] pg_xlogfile_name_offset() et al and recovery

2016-07-07 Thread Simon Riggs
On 8 July 2016 at 00:43, Michael Paquier  wrote:

> On Thu, Jul 7, 2016 at 6:26 PM, Simon Riggs  wrote:
> > The behaviour of that function is defined in backbranches, so I suggest
> we
> > should not alter that now.
>
> Hm..
>
> > What we can do is have another function that makes it clearer that the
> > answer is variable.
> >pg_xlogfile_name_offset(offset, timelineid)
> > always succeeds on standby but needs to be told what timelineid to use
>
> This clearly has value; I have wanted to override the timeline number
> a couple of times now. I have always done so with a custom function,
> and I've seen many people doing it at the application level, but it is a
> weird design to have the 1-argument version fail for a node in recovery
> and the 2-argument version succeed. To be honest, I'd rather have
> everything consistent, with the same function name and the behavior
> clearly documented.


But it's been like that for some time; by agreement, we don't try to change
the past, just improve the future. I wish it had not been done that way,
but it was. Some things are backpatchable, others not. If you get others to
agree, I'd be fine with it; I'm just trying to explain what I understand to
be the limits of what can be changed.

What I do agree with is the need to fix something in 9.6 here. Releasing a
new feature with an obvious flaw helps nobody, so we must have a function
of some kind that allows you to find the filename for the start and stop
points of backup from a standby.


> > then we have another function
> >   pg_last_xact_replay_timeline() that allows you to find out the current
> > timeline
>
> It would be good to have an equivalent pg_current_xlog_timeline as
> well that can run on nodes not in recovery. I agree that having a
> separate function makes more sense, as with its equivalents for WAL
> positions.


Agreed. I prefer your name for that function.


> > The actual problem here is somewhat more convoluted, because we would
> like
> > to know the timeline of the basebackup at start and stop. That
> information
> > is not easily available from the backup label data returned by
> > pg_stop_backup().
> > We would like to do something like this...
> >
> > select pg_xlogfile_name_offset(l.lsn, l.tli) from pg_stop_backup(false)
> l;
> >
> > So I suggest we add another column to the output of pg_stop_backup() to
> > return the tli value directly.
>
> This would be useful as well, that's less custom-parsing logic for
> applications based on the segment name in backup_label.
>

Cool


> Note that I am not personally planning to write a patch for all that,
> but anyone is welcome to pick up this task!
>

I'm happy to do that.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] A Modest Upgrade Proposal

2016-07-07 Thread Simon Riggs
On 8 July 2016 at 01:47, Joshua D. Drake  wrote:


> It is true that something like pg_logical doesn't suffer from those three
> things but it does suffer from others:
>
> * No DDL - Agreed, not "required" but certainly a very nice
> feature.
>
> * Lack of simplicity
>
> Users like simple. It is one of the key reasons there is a migration to
> the cloud: simplicity. Everything from scaling, to pricing, to
> provisioning, etc...
>

Well, you can't run DDL during pg_upgrade either. I've never seen a
solution that supported that, and if it did, it would certainly violate the
"simple" rule you advocate.

Simplicity is key, I agree. But that's just a user interface feature, not a
comment on what's underneath the covers. pg_upgrade is not simple and is
never likely to be so, under the covers.

Anyway, I'm cool if you don't want to use it, for a while or forever.
Options are good.

-- 
Simon Riggs        http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] A Modest Upgrade Proposal

2016-07-07 Thread Joshua D. Drake

On 07/07/2016 05:14 PM, Simon Riggs wrote:
>> I would much rather see more brain power put into pg_upgrade or in
>> place upgrades than logical replication (as an upgrade solution).
>
> Why is that?


First, let me state that I don't have a problem with logical replication
as an upgrade solution. I have used one form or another many times. I
have also used pg_upgrade, and I will choose pg_upgrade over replication
(even pg_logical, which is reasonably simple) every single time I can.
*KISS* is the mantra.


I certainly think logical replication has an absolute place (especially 
if upgrading from something like 9.2 -> 9.5). I just don't think it is 
as useful (generally) as a solid pg_upgrade or in-place upgrade solution.


We have had logical replication as a solution for over a decade. First
there was Slony, then Londiste, and then others. They all suffered from
various issues and limitations.


* Horrible overhead
* Long running transaction
* Need for lots of extra space

It is true that something like pg_logical doesn't suffer from those 
three things but it does suffer from others:


* No DDL - Agreed, not "required" but certainly a very nice feature.

* Lack of simplicity

Users like simple. It is one of the key reasons there is a migration to
the cloud: simplicity. Everything from scaling, to pricing, to
provisioning, etc...


If I take a step back and say to myself, "What would *really* rock in 
terms of PostgreSQL upgrades?" The answer is pretty simple:


apt-get update; apt-get upgrade;
service postgresql upgrade;

Which would pass a flag to "insert technology here" that started
PostgreSQL in a mode that told it, "Hey, you are going to need to check
a few things and probably modify a few things before you enter 'ready
for transactions'."


I am fully aware that what I am saying is not easy. There are a whole 
ton of issues (what if we are replicating to a slave?).


Anyway, that's why. I am by far more a consultant than an engineer now,
and I can only relay what I run into when I speak at conferences or with
clients.


Sincerely,

JD


--
Command Prompt, Inc.  http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.




Re: [HACKERS] A Modest Upgrade Proposal

2016-07-07 Thread Simon Riggs
On 8 July 2016 at 00:48, Joshua D. Drake  wrote:

> On 07/07/2016 01:10 PM, Robert Haas wrote:
>
>> On Mon, May 16, 2016 at 8:52 PM, David Fetter  wrote:
>>
>>> In light of the above, it is perfectly reasonable to require, at least
>>> temporarily, setting up duplicate storage, or another node.
>>>
>>
>> pg_upgrade does that, kinda.  I'd like to have something better, but
>> in the absence of that, I think it's quite wrong to think about
>> deprecating it, even if we had logical replication fully integrated
>> into core today.  Which we by no means do.
>>
>
> I would much rather see more brain power put into pg_upgrade or in place
> upgrades than logical replication (as an upgrade solution).


Why is that?

-- 
Simon Riggs        http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Issue with bgworker, SPI and pgstat_report_stat

2016-07-07 Thread Michael Paquier
On Fri, Jul 8, 2016 at 3:06 AM, Andres Freund  wrote:
> On 2016-07-07 14:04:36 -0400, Robert Haas wrote:
>> On Thu, Jul 7, 2016 at 1:52 PM, Julien Rouhaud
>>  wrote:
>> > Should a bgworker modifying data have to call pgstat_report_stat() to
>> > avoid this problem? I don't find any documentation suggesting it, and it
>> > seems that worker_spi (used as a template for this bgworker and, I
>> > suppose, a lot of other ones) is also affected.
>>
>> That certainly seems like the simplest fix.  Not sure if there's a better 
>> one.
>
> I think a better fix would be to unify the startup & error handling
> code. We have way too many slightly diverging copies. But that's a bigger
> task, so I'm not protesting against making a more localized fix.

It seems to me that the only fix is to have the bgworker call
pgstat_report_stat() itself and not mess with the in-core backend code.
Personally, I have always thought of a bgworker as an independent
process, so when it wants to do something, even mimicking normal
backends, it should do it by itself. Take the example of SIGHUP
processing: if the bgworker does not call ProcessConfigFile(), no
parameter updates will happen in its context.
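
A sketch of that self-contained pattern, modeled loosely on worker_spi
(the got_sighup flag and naptime value are illustrative names, set up
elsewhere in the worker):

    for (;;)
    {
        WaitLatch(MyLatch,
                  WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                  naptime);
        ResetLatch(MyLatch);

        if (got_sighup)                     /* set by the SIGHUP handler */
        {
            got_sighup = false;
            ProcessConfigFile(PGC_SIGHUP);  /* reload config ourselves */
        }

        /* ... SPI work that inserts/updates/deletes ... */

        pgstat_report_stat(false);  /* flush stats so autovacuum notices */
    }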
-- 
Michael




Re: [HACKERS] A Modest Upgrade Proposal

2016-07-07 Thread Joshua D. Drake

On 07/07/2016 01:10 PM, Robert Haas wrote:
> On Mon, May 16, 2016 at 8:52 PM, David Fetter  wrote:
>> In light of the above, it is perfectly reasonable to require, at least
>> temporarily, setting up duplicate storage, or another node.
>
> pg_upgrade does that, kinda.  I'd like to have something better, but
> in the absence of that, I think it's quite wrong to think about
> deprecating it, even if we had logical replication fully integrated
> into core today.  Which we by no means do.

I would much rather see more brain power put into pg_upgrade or in place
upgrades than logical replication (as an upgrade solution).


JD

--
Command Prompt, Inc.  http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.




Re: [HACKERS] pg_xlogfile_name_offset() et al and recovery

2016-07-07 Thread Michael Paquier
On Thu, Jul 7, 2016 at 6:26 PM, Simon Riggs  wrote:
> The behaviour of that function is defined in backbranches, so I suggest we
> should not alter that now.

Hm..

> What we can do is have another function that makes it clearer that the
> answer is variable.
>pg_xlogfile_name_offset(offset, timelineid)
> always succeeds on standby but needs to be told what timelineid to use

This clearly has value; I have wanted to override the timeline number
a couple of times now. I have always done so with a custom function,
and I've seen many people doing it at the application level, but it is a
weird design to have the 1-argument version fail for a node in recovery
and the 2-argument version succeed. To be honest, I'd rather have
everything consistent, with the same function name and the behavior
clearly documented.

> then we have another function
>   pg_last_xact_replay_timeline() that allows you to find out the current
> timeline

It would be good to have an equivalent pg_current_xlog_timeline as
well that can run on nodes not in recovery. I agree that having a
separate function makes more sense, as with its equivalents for WAL
positions.

> The actual problem here is somewhat more convoluted, because we would like
> to know the timeline of the basebackup at start and stop. That information
> is not easily available from the backup label data returned by
> pg_stop_backup().
> We would like to do something like this...
>
> select pg_xlogfile_name_offset(l.lsn, l.tli) from pg_stop_backup(false) l;
>
> So I suggest we add another column to the output of pg_stop_backup() to
> return the tli value directly.

This would be useful as well, that's less custom-parsing logic for
applications based on the segment name in backup_label.

Note that I am not personally planning to write a patch for all that,
but anyone is welcome to pick up this task!
-- 
Michael




Re: [HACKERS] Header and comments describing routines in incorrect shape in visibilitymap.c

2016-07-07 Thread Michael Paquier
On Fri, Jul 8, 2016 at 1:18 AM, Tom Lane  wrote:
> Alvaro Herrera  writes:
>> Regarding the first hunk, I don't like these INTERFACE sections too
>> much; they get seriously outdated over the time and aren't all that
>> helpful anyway.  See the one on heapam.c for example.  I'd rather get
>> rid of that one instead of patching it.  The rest, of course, merit
>> revision.
>
> +1, as long as we make sure that any useful info therein gets migrated
> to the per-function header comments rather than dropped.  If there's
> anything that doesn't seem to fit naturally in any per-function comment,
> maybe move it into the file header comment?

OK, that removes comment duplication. Also, what about replacing
"bit(s)" with "one or more bits" in the comments where appropriate?
That's bikeshedding, but that's what this patch is about.
-- 
Michael




Re: [HACKERS] A Modest Upgrade Proposal

2016-07-07 Thread Simon Riggs
On 7 July 2016 at 21:10, Robert Haas  wrote:


> pg_upgrade does that, kinda.  I'd like to have something better, but
> in the absence of that, I think it's quite wrong to think about
> deprecating it, even if we had logical replication fully integrated
> into core today.  Which we by no means do.
>

I don't see any problem with extending pg_upgrade to use logical
replication features under the covers.

It seems very smooth to be able to just say

   pg_upgrade --online

and then specify whatever other parameters that requires.

It would be much easier to separate out that as a use-case so we can be
sure we get that in 10.0, even if nothing else lands.

-- 
Simon Riggs        http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] A Modest Upgrade Proposal

2016-07-07 Thread Simon Riggs
On 7 July 2016 at 21:01, Robert Haas  wrote:

> On Mon, May 16, 2016 at 9:22 PM, Alvaro Herrera
>  wrote:
> > David Fetter wrote:
> >> As a relatively (to our users) minor course correction, I would like
> >> to propose the following:
> >
> >> - Develop a logical upgrade path as a part of the (Yay! Sexy!) logical
> >>   replication that's already in large part built.
> >>
> >> This path would, of course, run either locally or across a network,
> >> and be testable in both cases.
> >
> > This is one use case that pglogical intends to fulfill.  If you're able
> > to contribute to that project, I'm sure many would appreciate it.  Right
> > now the hottest question seems to be: is this something that should be
> > an extension, or should it be part of core with its own set of DDL etc?
> > The current patch is geared towards the former, so if the community at
> > large prefers to have it as the latter and would oppose the former, now
> > is the time to speak up so that the course can be corrected.
>
> There was an unconference session on this topic at PGCon and quite a
> number of people there stated that they found DDL to be an ease-of-use
> feature and wanted to have it.
>

Yes, I ran the unconference session. It was a shame you weren't able to
stay for the whole discussion.

We all agreed that an in-core solution was desirable, if only for wider
adoption.

About half the people wanted DDL and about half the people didn't. When we
discussed why we wanted DDL there weren't any answers apart from the thought
that we want to be able to back up the replication configurations, which
seemed to be possible with or without DDL. Any such backup would need to be
easily removed from the objects themselves, to avoid external dependencies
on making recovery work.

Chris Browne finally summed it up by saying we could wait on having DDL
until some time later, once we've decided on things like how we configure
it, how we secure it and what/how to store it in the catalog. "We could
probably live without DDL in the first version."

Personally, I'm in the group of people that don't see the need for DDL.
There are already many successful features that don't utilize DDL, such as
backup, advisory locks, and some features that use DDL that don't really
need to, such as LISTEN/NOTIFY, full text search, etc. Also note that both
Oracle and SQLServer have moved away from DDL in favour of function APIs,
and most NoSQL databases and almost all languages prefer functional
interfaces over parsed text languages, so I don't see a huge industry
revival for DDL as a means of specifying things.

-- 
Simon Riggs        http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] SELECT DISTINCT never uses an index?

2016-07-07 Thread Tom Lane
Robert Haas  writes:
> The alternative worth considering is presumably something like:

> GroupAggregate
> -> Index Only Scan on grue_size

> Scanning an entire index in order is pretty expensive, but it seems
> possible that this could be faster than the Seq Scan, especially on a
> table with other wide columns, because then the index might be a lot
> smaller than the table.  Even if the index traversal generates some
> random I/O, if it's sufficiently smaller than the table you will still
> come out ahead.  I'm not positive that the planner will actually
> consider this plan,

Of course it does.  Simple example in the regression database:

regression=# explain select distinct unique1 from tenk1;
                                        QUERY PLAN
---------------------------------------------------------------------------------------------
 Unique  (cost=0.29..295.29 rows=10000 width=4)
   ->  Index Only Scan using tenk1_unique1 on tenk1  (cost=0.29..270.29 rows=10000 width=4)
(2 rows)

I think though that this depends on being an IOS, with a fairly wide and
all-visible table, in order for the cost estimate to come out cheaper
than a seqscan.  If you disable IOS then the planner's second choice is
a seqscan:

regression=# set enable_indexonlyscan to 0;
SET
regression=# explain select distinct unique1 from tenk1;
                           QUERY PLAN
-----------------------------------------------------------------
 HashAggregate  (cost=483.00..583.00 rows=10000 width=4)
   Group Key: unique1
   ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=4)
(3 rows)

A whole-table plain indexscan, or IOS with any significant number of heap
probes needed, is not going to be preferred over a seqscan because of the
amount of random I/O it implies.

> We're probably missing a few tricks on queries of this type. If the
> index-traversal machinery had a mechanism to skip quickly to the next
> distinct value, that could be used here:

Yeah, I suspect Bill was imagining that that sort of plan could be
used; but it requires execution machinery we have not got.

regards, tom lane




Re: [HACKERS] MVCC overheads

2016-07-07 Thread Simon Riggs
On 7 July 2016 at 20:50, Pete Stevenson  wrote:

> Hi Simon -
>
> Thanks for the note. I think it's fair to say that I didn't provide enough
> context, so let me try and elaborate on my question.
>
> I agree, MVCC is a benefit. The research angle here is about enabling MVCC
> with hardware offload. Since I didn't explicitly mention it, the offload I
> refer to will respect all consistency guarantees also.
>
> It is the case that for the database to implement MVCC it must provide
> consistent read to multiple different versions of data, i.e. depending on
> the version used at transaction start. I'm not an expert on postgresql
> internals, but this must have some cost. I think the cost related to MVCC
> guarantees can roughly be categorized as: creating new versions (linking
> them in), version checking on read, garbage collecting old versions, and
> then there is an additional cost that I am interested in (again not
> claiming it is unnecessary in any sense) but there is a cost to generating
> the log.
>
> Thanks, by the way, for the warning about lab vs. reality. That's why I'm
> asking this question here. I want to keep the hypothetical tagged as such,
> but find defensible and realistic metrics where those exist, i.e. in this
> instance, we do have a database that can use MVCC. It should be possible to
> figure out how much work goes into maintaining that property.
>

PostgreSQL uses a no-overwrite storage mechanism, so any additional row
versions are maintained in the same table alongside other rows. The MVCC
actions are mostly mixed in with other aspects of the storage, so there is
not much that is isolated enough to offload.

Oracle has a different mechanism that does isolate changed row versions
into a separate data structure, so it would be much more amenable to
offload than PostgreSQL in its current form.

Maybe look at SLRUs (clog etc) as a place to offload something?

-- 
Simon Riggs        http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] \timing interval

2016-07-07 Thread Peter Geoghegan
On Thu, Jul 7, 2016 at 2:52 PM, Corey Huinker  wrote:
> Wouldn't it be great if we had a way of printing timing in more human
> friendly formats?

Yes, it would. I've thought about doing this myself. So, +1 to the idea from me.


-- 
Peter Geoghegan




Re: [HACKERS] SELECT DISTINCT never uses an index?

2016-07-07 Thread David Rowley
On 8 July 2016 at 09:49, Robert Haas  wrote:
> We're probably missing a few tricks on queries of this type. If the
> index-traversal machinery had a mechanism to skip quickly to the next
> distinct value, that could be used here: walk up the btree until you
> find a page that contains keyspace not equal to the current key, then
> walk back down until you find the first leaf page that contains such a
> value.  That would potentially let you step over large chunks of the
> index without actually examining all the leaf pages, which for a query
> like this seems like it could be a big win.

Thomas Munro did take some initial steps to implementing this a few years ago:
https://www.postgresql.org/message-id/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] delta relations in AFTER triggers

2016-07-07 Thread Robert Haas
On Fri, May 13, 2016 at 2:37 PM, Kevin Grittner  wrote:
> On Fri, May 13, 2016 at 1:02 PM, David Fetter  wrote:
>> On Thu, Jan 22, 2015 at 02:41:42PM +, Kevin Grittner wrote:
>
>>> [ideas on how to pass around references to ephemeral relations]
>>
>> [almost 17 months later]
>>
>> It seems like now is getting close to the time to get this into
>> master.  The patch might have suffered from some bit rot, but the
>> design so far seems sound.
>>
>> What say?
>
> I had a talk with Tom in Brussels about this.  As I understood it,
> he wasn't too keen on the suggestion by Heikki (vaguely
> sorta-endorsed by Robert) of passing ephemeral named relations such
> as these tuplestores around in the structures currently used for
> parameter values.  He intuitively foresaw the types of problems I
> had run into trying to use the same structure to pass a relation
> (with structure and rows and columns of data) as is used to pass,
> say, an integer.

See, the thing that disappoints me about this is that I think we were
pretty close to having the ParamListInfo-based approach working.  The
thing I liked about that approach is that we already know that any
place where you can provide parameters for a query, there will also be
an opportunity to provide a ParamListInfo.  And I think that
parameterizing a query by an ephemeral table is conceptually similar
to parameterizing it by a scalar value.  If we invent a new mechanism,
it's got to reach all of those same places in the code.

One other comment that I would make here is that I think that it's
important, however we pass the data, that the scope of the tuplestores
is firmly lexical and not dynamic.  We need to make sure that we don't
set some sort of context via global variables that might get used for
some query other than the one to which it's intended to apply.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] SELECT DISTINCT never uses an index?

2016-07-07 Thread Thomas Munro
On Fri, Jul 8, 2016 at 9:49 AM, Robert Haas  wrote:
> On Thu, Jul 7, 2016 at 4:56 PM, Bill Moran  wrote:
>> SELECT DISTINCT size FROM grue;
>>
>> Always does a seq scan on Postgres 9.5.2. (Yes, I know we're
>> a patch behind, the upgrade is on the schedule) on
>> Ubuntu 14.
>>
>> I would expect it to be possible, and significantly more
>> efficient to do an index scan for that query.
>
> [...]
>
> We're probably missing a few tricks on queries of this type. If the
> index-traversal machinery had a mechanism to skip quickly to the next
> distinct value, that could be used here: walk up the btree until you
> find a page that contains keyspace not equal to the current key, then
> walk back down until you find the first leaf page that contains such a
> value.  That would potentially let you step over large chunks of the
> index without actually examining all the leaf pages, which for a query
> like this seems like it could be a big win.

FWIW I messed around with prototyping this idea here:

https://www.postgresql.org/message-id/cadlwmxwalk8npzqdnrqipnrzanic7nxykynrkzo_vxyr8en...@mail.gmail.com

I hope to return to that and some related ideas eventually as I learn
more about the relevant areas of the source code, if someone doesn't
beat me to it.

https://wiki.postgresql.org/wiki/Loose_indexscan shows a recursive CTE
that does the same thing at a higher level.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] [PATCH] add option to pg_dumpall to exclude tables from the dump

2016-07-07 Thread Robert Haas
On Fri, Apr 22, 2016 at 6:42 AM, Juergen Hannappel
 wrote:
> A new option -T --exclude-table for pg_dumpall. This option is then
> passed through to pg_dump, which really does the work.
> This feature can be used to exclude large tables that are known not
> to change from a database backup dump so that only the changing parts
> of the database are dumped.
>
> Signed-off-by: Juergen Hannappel 

This seems like it could be useful.  Please add it to the
currently-open CommitFest so it gets reviewed at some point:

https://commitfest.postgresql.org/action/commitfest_view/open

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] \timing interval

2016-07-07 Thread Corey Huinker
\timing is great.
\timing is helpful.
\timing has made me really good at mentally estimating numbers modulo
360.

Wouldn't it be great if we had a way of printing timing in more human
friendly formats?

Attached is a patch that allows the following (new/interesting bits in
bold):

# \timing off
Timing is off.
# select pg_sleep(1);
 pg_sleep 
----------
 
(1 row)

# \timing
Timing is on.
# select pg_sleep(1);
 pg_sleep 
----------
 
(1 row)

Time: 1002.959 ms
*# \timing interval*
*Timing is interval.*
# select pg_sleep(1);
 pg_sleep 
----------
 
(1 row)

*Time: 00:00:01.003*
# \timing
Timing is off.



As demonstrated, "interval" toggles to "off". There is no way to toggle to
"interval".

I'm pretty flexible on how something like this gets invoked. We could leave
timing alone and create a format variable. We could actually leverage the
pre-existing interval-to-string code, etc.
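
(As one data point for the interval-to-string idea: the server already
renders a millisecond count this way:

    SELECT 1002.959 * interval '1 millisecond';
    -- => 00:00:01.002959

The patch below instead formats the value client-side in psql.)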

Note: the current patch includes no doc changes. I figured I'd wait to do
that until this patch or another gains some traction.
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 3f2cebf..1e6686f 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1516,12 +1516,23 @@ exec_command(const char *cmd,
 			char	   *opt = psql_scan_slash_option(scan_state,
 													 OT_NORMAL, NULL, false);
 
if (opt)
-   pset.timing = ParseVariableBool(opt, "\\timing");
+   if (strcmp(opt,"interval") == 0)
+   pset.timing = PSQL_TIMING_INTERVAL;
+   else if (ParseVariableBool(opt, "\\timing"))
+   pset.timing = PSQL_TIMING_ON;
+   else
+   pset.timing = PSQL_TIMING_OFF;
else
-   pset.timing = !pset.timing;
+   if (pset.timing == PSQL_TIMING_OFF)
+   pset.timing = PSQL_TIMING_ON;
+   else
+   pset.timing = PSQL_TIMING_OFF;
+
if (!pset.quiet)
{
-   if (pset.timing)
+   if (pset.timing == PSQL_TIMING_INTERVAL)
+   puts(_("Timing is interval."));
+   else if (pset.timing == PSQL_TIMING_ON)
puts(_("Timing is on."));
else
puts(_("Timing is off."));
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 2450b9c..c79843f 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -587,6 +587,47 @@ PSQLexec(const char *query)
return res;
 }
 
+/*
+ * PrintTimingInterval
+ *
+ * Output timing in an interval format
+ */
+static void
+PrintTimingInterval(double elapsed_msec)
+{
+   double seconds = elapsed_msec / 1000.0;
+   int minutes;
+   int hours;
+   int days;
+
+   if (seconds < 60.0)
+   {
+   printf(_("Time: 00:00:%06.3f\n"), seconds);
+   return;
+   }
+
+   minutes = (int)seconds / 60;
+   seconds = seconds - (60.0 * minutes);
+   if (minutes < 60)
+   {
+   printf(_("Time: 00:%02d:%06.3f\n"), minutes, seconds);
+   return;
+   }
+
+   hours = minutes / 60;
+   minutes = minutes % 60;
+
+   if (hours < 24)
+   {
+   printf(_("Time: %2d:%02d:%06.3f\n"), hours, minutes, seconds);
+   return;
+   }
+
+   days = hours / 24;
+   hours = hours % 24;
+
+   printf(_("Time: %d, %2d:%02d:%06.3f\n"), days, hours, minutes, seconds);
+}
 
 /*
  * PSQLexecWatch
@@ -613,7 +654,7 @@ PSQLexecWatch(const char *query, const printQueryOpt *opt)
 
SetCancelConn();
 
-   if (pset.timing)
+   if (pset.timing != PSQL_TIMING_OFF)
INSTR_TIME_SET_CURRENT(before);
 
res = PQexec(pset.db, query);
@@ -626,7 +667,7 @@ PSQLexecWatch(const char *query, const printQueryOpt *opt)
return 0;
}
 
-   if (pset.timing)
+   if (pset.timing != PSQL_TIMING_OFF)
{
INSTR_TIME_SET_CURRENT(after);
INSTR_TIME_SUBTRACT(after, before);
@@ -677,7 +718,9 @@ PSQLexecWatch(const char *query, const printQueryOpt *opt)
fflush(pset.queryFout);
 
/* Possible microtiming output */
-   if (pset.timing)
+   if (pset.timing == PSQL_TIMING_INTERVAL)
+   PrintTimingInterval(elapsed_msec);
+   else if (pset.timing == PSQL_TIMING_ON)
printf(_("Time: %.3f ms\n"), elapsed_msec);
 
return 1;
@@ -1228,7 +1271,7 @@ SendQuery(const char *query)
instr_time  before,
after;
 
-   if (pset.timing)
+   if (pset.timing != PSQL_TIMING_OFF)
INSTR_TIME_SET_CURRENT(before);
 
results = 

Re: [HACKERS] SELECT DISTINCT never uses an index?

2016-07-07 Thread Robert Haas
On Thu, Jul 7, 2016 at 4:56 PM, Bill Moran  wrote:
> Take the following table as an example:
>
> CREATE TABLE grue (
>  id SERIAL PRIMARY KEY,
>  size VARCHAR(255)
> );
> CREATE INDEX grue_size ON grue(size);
>
> Now insert approximately eleventy jillion rows, but ensure
> that there are only about 20 distinct values for size.
>
> SELECT DISTINCT size FROM grue;
>
> Always does a seq scan on Postgres 9.5.2. (Yes, I know we're
> a patch behind, the upgrade is on the schedule) on
> Ubuntu 14.
>
> I would expect it to be possible, and significantly more
> efficient to do an index scan for that query.

Well, there are two possible ways to implement uniqueness: hashing and
grouping.  You haven't included the EXPLAIN ANALYZE output so it's
hard to be sure which sort of plan you are getting, but my guess is
that you are getting a HashAggregate.  So basically what it's doing
is:

1. Read a row.
2. Check the value in the hash table to see if we've seen it already.
3. If not, add it.
4. Go to step 1.

If you've got to visit the whole table anyway, doing it in sequential
order is smart, so this plan sounds reasonably good.

The alternative worth considering is presumably something like:

GroupAggregate
-> Index Only Scan on grue_size

Scanning an entire index in order is pretty expensive, but it seems
possible that this could be faster than the Seq Scan, especially on a
table with other wide columns, because then the index might be a lot
smaller than the table.  Even if the index traversal generates some
random I/O, if it's sufficiently smaller than the table you will still
come out ahead.  I'm not positive that the planner will actually
consider this plan, but it seems like it should; you might be able to
persuade it to do so by setting enable_hashagg=false, which would let
you check whether it's actually faster.
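
That experiment, using Bill's table from the original message, is just:

    SET enable_hashagg = off;
    EXPLAIN ANALYZE SELECT DISTINCT size FROM grue;
    RESET enable_hashagg;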

We're probably missing a few tricks on queries of this type. If the
index-traversal machinery had a mechanism to skip quickly to the next
distinct value, that could be used here: walk up the btree until you
find a page that contains keyspace not equal to the current key, then
walk back down until you find the first leaf page that contains such a
value.  That would potentially let you step over large chunks of the
index without actually examining all the leaf pages, which for a query
like this seems like it could be a big win.
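
Until the executor grows that skip machinery, the same access pattern can
be emulated in SQL with the recursive-CTE "loose index scan" trick (see the
PostgreSQL wiki); against Bill's table it would look roughly like:

    WITH RECURSIVE t AS (
        (SELECT size FROM grue ORDER BY size LIMIT 1)
      UNION ALL
        SELECT (SELECT g.size FROM grue g
                WHERE g.size > t.size
                ORDER BY g.size LIMIT 1)
        FROM t
        WHERE t.size IS NOT NULL
    )
    SELECT size FROM t WHERE size IS NOT NULL;

Each recursive step is a single index probe for the next distinct value,
so with ~20 distinct sizes it touches ~20 index descents instead of every
row.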

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] SELECT DISTINCT never uses an index?

2016-07-07 Thread Bill Moran

Take the following table as an example:

CREATE TABLE grue (
 id SERIAL PRIMARY KEY,
 size VARCHAR(255)
);
CREATE INDEX grue_size ON grue(size);

Now insert approximately eleventy jillion rows, but ensure
that there are only about 20 distinct values for size.

SELECT DISTINCT size FROM grue;

Always does a seq scan on Postgres 9.5.2. (Yes, I know we're
a patch behind, the upgrade is on the schedule) on
Ubuntu 14.

I would expect it to be possible, and significantly more
efficient, to do an index scan for that query. Is this
a bug, an optimization that is simply waiting for someone
to take the time to implement, or is there some underlying
reason why this isn't possible that I'm not seeing?

And, yes, I know that's not a normalized table and that
properly normalizing it makes the problem disappear. And
yes, this is repeatable (I'm working with about 6 tables
with similar structure that all exhibit the same
behavior) and yes I've done ANALYZE and VACUUM and the
behavior doesn't change.

-- 
Bill Moran


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange explain in upstream - subplan 1 twice - is it bug?

2016-07-07 Thread Robert Haas
On Thu, Jul 7, 2016 at 4:13 PM, Pavel Stehule  wrote:
> 2016-07-07 21:57 GMT+02:00 Robert Haas :
>> On Wed, Jun 1, 2016 at 7:29 AM, Pavel Stehule 
>> wrote:
>> > Hi
>> >
>> > When I tested some queries, I found strange plan
>> >
>> > postgres=# explain analyze select s.nazev, o.nazev, o.pocet_obyvatel
>> > from
>> > (select nazev, array(select id from obce_pocet_obyvatel where okresy.id
>> > =
>> > okres_id order by pocet_obyvatel desc limit 3) as obceids from okresy) s
>> > join obce_pocet_obyvatel o on o.id = ANY(obceids) order by 1, 3 desc;
>>
>> The EXPLAIN plan you posted certainly looks weird, since I wouldn't
>> expect SubPlan 1 to be displayed twice, but I'm wondering if it's a
>> display artifact rather than an actual defect in the plan.
>>
>> Just out of curiosity, what does the output look like with FORMAT JSON
>> or similar?
>
> The test case was wrong; the view "obce_pocet_obyvatel" is necessary:
>
> create view obce_pocet_obyvatel as select id, okres_id, nazev, pocet_muzu +
> pocet_zen as pocet_obyvatel from obce;
>
> But the result is same (explain is ok, explain analyze is broken):

Hmm, so it looks like the subplan is somehow getting into its parent's
subPlan list twice.  I guess ExecInitExpr must reach that subplan
twice via two different paths, but I'm not quite sure how that's
happening.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Floating point comparison inconsistencies of the geometric types

2016-07-07 Thread Robert Haas
On Wed, Jun 1, 2016 at 10:27 AM, Tom Lane  wrote:
> As I understand it, the key problem is that tests like "is point on line"
> would basically never succeed except in the most trivial cases, because of
> roundoff error.  That's not very nice, and it might cascade to larger
> problems like object-containment tests failing unexpectedly.  We would
> need to go through all the geometric operations and figure out where that
> kind of gotcha is significant and what we can do about it.  Seems like a
> fair amount of work :-(.  If somebody's willing to do that kind of
> investigation, then sure, but I don't think just blindly removing these
> macros is going to lead to anything good.

Yeah, it does seem to need some research.
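
For concreteness, the slop at issue comes from the EPSILON-based
comparisons in src/backend/utils/adt/geo_ops.c (currently 1.0E-06); a
sketch of the resulting behavior:

-- line '{A,B,C}' means Ax + By + C = 0, so '{0,1,0}' is the line y = 0
SELECT point '(0, 0.0000005)' <@ line '{0,1,0}';  -- true: off the line, but within EPSILON
SELECT point '(0, 0.5)' <@ line '{0,1,0}';        -- false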

> Also, I suppose this means that Robert promises not to make any of his
> usual complaints about breaking compatibility?  Because we certainly
> would be.

Pot, meet Mr. Kettle!

Obviously, the inconvenience caused by any backward incompatibility
has to be balanced against the fact that the new behavior is
presumably better.  But I stridently object to the accusation that of
the two of us I'm the one more concerned with backward-compatibility.
There may be some instances where I've had a more conservative
judgement than you about breaking user-facing stuff, but you've
blocked dozens of changes to the C API that would have enabled
meaningful extension development on the grounds that somebody might
complain when a future release changes the API!  I think behavior
changes that users will notice are of vastly greater significance than
those which will only be observed by developers.

In this particular case, I think that the current behavior is pretty
stupid, and that the built-in geometric types are barely used,
possibly because they have stupid behavior.  So I would be willing to
bet on a well-thought-out change in this area coming out to a net
positive.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Comment typo in _readExtensibleNode()

2016-07-07 Thread Robert Haas
On Fri, May 6, 2016 at 3:26 AM, Amit Langote
 wrote:
> Attached fixes a minor typo comment in _readExtensibleNode().
>
> s/skip: extnodename/skip :extnodename/g
>
> I was confused at first as to why there is a skip extnodename on one line
> and get extnodename right on the next line as the comments say, :)

Thanks, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange explain in upstream - subplan 1 twice - is it bug?

2016-07-07 Thread Pavel Stehule
2016-07-07 21:57 GMT+02:00 Robert Haas :

> On Wed, Jun 1, 2016 at 7:29 AM, Pavel Stehule 
> wrote:
> > Hi
> >
> > When I tested some queries, I found strange plan
> >
> > postgres=# explain analyze select s.nazev, o.nazev, o.pocet_obyvatel from
> > (select nazev, array(select id from obce_pocet_obyvatel where okresy.id
> =
> > okres_id order by pocet_obyvatel desc limit 3) as obceids from okresy) s
> > join obce_pocet_obyvatel o on o.id = ANY(obceids) order by 1, 3 desc;
>
> The EXPLAIN plan you posted certainly looks weird, since I wouldn't
> expect SubPlan 1 to be displayed twice, but I'm wondering if it's a
> display artifact rather than an actual defect in the plan.
>
> Just out of curiosity, what does the output look like with FORMAT JSON
> or similar?
>

The test case was wrong; the view "obce_pocet_obyvatel" is necessary:

create view obce_pocet_obyvatel as select id, okres_id, nazev, pocet_muzu +
pocet_zen as pocet_obyvatel from obce;

But the result is same (explain is ok, explain analyze is broken):

                                  QUERY PLAN
──────────────────────────────────────────────────────────────────────────────
 [
   {
     "Plan": {
       "Node Type": "Sort",
       "Startup Cost": 1599.86,
       "Total Cost": 1601.79,
       "Plan Rows": 769,
       "Plan Width": 24,
       "Actual Startup Time": 9.525,
       "Actual Total Time": 9.547,
       "Actual Rows": 227,
       "Actual Loops": 1,
       "Sort Key": ["okresy.nazev", "((obce.pocet_muzu + obce.pocet_zen)) DESC"],
       "Sort Method": "quicksort",
       "Sort Space Used": 44,
       "Sort Space Type": "Memory",
       "Plans": [
         {
           "Node Type": "Nested Loop",
           "Parent Relationship": "Outer",
           "Join Type": "Inner",
           "Startup Cost": 13.95,
           "Total Cost": 1563.00,
           "Plan Rows": 769,
           "Plan Width": 24,
           "Actual Startup Time": 0.212,
           "Actual Total Time": 8.991,
           "Actual Rows": 227,
           "Actual Loops": 1,
           "Plans": [
             {
               "Node Type": "Seq Scan",
               "Parent Relationship": "Outer",
               "Relation Name": "okresy",
               "Alias": "okresy",
               "Startup Cost": 0.00,
               "Total Cost": 1.77,
               "Plan Rows": 77,
               "Plan Width": 17,
               "Actual Startup Time": 0.016,
               "Actual Total Time": 0.042,
               "Actual Rows": 77,
               "Actual Loops": 1
             },
             {

Re: [HACKERS] A Modest Upgrade Proposal

2016-07-07 Thread Robert Haas
On Mon, May 16, 2016 at 8:52 PM, David Fetter  wrote:
> In light of the above, it is perfectly reasonable to require, at least
> temporarily, setting up duplicate storage, or another node.
>
> I am aware that some cases exist where this is not possible, but I
> don't think we should twist ourselves into pretzels to accommodate a
> tiny minority of our users, which my experience in the field leads me
> to believe is the case.

So, on the one hand, I agree that logical replication is a great way
to facilitate major version upgrades.  On the other hand, I think it's
completely wrong to suppose that only a tiny minority of people can't
use it.  In some cases, hardware availability is definitely an issue.
But even when people have the hardware, being able to cleanly do a
cutover from one master to another is not necessarily something people
are set up to do.  Getting that to work well requires more brainpower
than many users are willing to give to their database.  A lot of
people want to just shut the database down, upgrade it, and start it
back up.

pg_upgrade does that, kinda.  I'd like to have something better, but
in the absence of that, I think it's quite wrong to think about
deprecating it, even if we had logical replication fully integrated
into core today.  Which we by no means do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A Modest Upgrade Proposal

2016-07-07 Thread Robert Haas
On Mon, May 16, 2016 at 9:22 PM, Alvaro Herrera
 wrote:
> David Fetter wrote:
>> As a relatively (to our users) minor course correction, I would like
>> to propose the following:
>
>> - Develop a logical upgrade path as a part of the (Yay! Sexy!) logical
>>   replication that's already in large part built.
>>
>> This path would, of course, run either locally or across a network,
>> and be testable in both cases.
>
> This is one use case that pglogical intends to fulfill.  If you're able
> to contribute to that project, I'm sure many would appreciate it.  Right
> now the hottest question seems to be: is this something that should be
> an extension, or should it be part of core with its own set of DDL etc?
> The current patch is geared towards the former, so if the community at
> large prefers to have it as the latter and would oppose the former, now
> is the time to speak up so that the course can be corrected.

There was an unconference session on this topic at PGCon and quite a
number of people there stated that they found DDL to be an ease-of-use
feature and wanted to have it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange explain in upstream - subplan 1 twice - is it bug?

2016-07-07 Thread Robert Haas
On Wed, Jun 1, 2016 at 7:29 AM, Pavel Stehule  wrote:
> Hi
>
> When I tested some queries, I found strange plan
>
> postgres=# explain analyze select s.nazev, o.nazev, o.pocet_obyvatel from
> (select nazev, array(select id from obce_pocet_obyvatel where okresy.id =
> okres_id order by pocet_obyvatel desc limit 3) as obceids from okresy) s
> join obce_pocet_obyvatel o on o.id = ANY(obceids) order by 1, 3 desc;

The EXPLAIN plan you posted certainly looks weird, since I wouldn't
expect SubPlan 1 to be displayed twice, but I'm wondering if it's a
display artifact rather than an actual defect in the plan.

Just out of curiosity, what does the output look like with FORMAT JSON
or similar?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MVCC overheads

2016-07-07 Thread Pete Stevenson
Hi Simon -

Thanks for the note. I think it's fair to say that I didn't provide enough
context, so let me try and elaborate on my question.

I agree, MVCC is a benefit. The research angle here is about enabling MVCC
with hardware offload. Since I didn't explicitly mention it, the offload I
refer to will respect all consistency guarantees also.

For the database to implement MVCC, it must provide consistent reads
against multiple different versions of data, depending on the snapshot
taken at transaction start. I'm not an expert on postgresql internals,
but this must have some cost. I think the costs related to the MVCC
guarantees can roughly be categorized as: creating new versions (linking
them in), version checking on read, garbage collecting old versions, and
finally the cost of generating the log, which is the one I am most
interested in (again, not claiming any of it is unnecessary).

Thanks, by the way, for the warning about lab vs. reality. That's why I'm
asking this question here. I want to keep the hypothetical tagged as such,
but find defensible and realistic metrics where those exist, i.e. in this
instance, we do have a database that can use MVCC. It should be possible to
figure out how much work goes into maintaining that property.
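
As one crude but defensible starting point for the WAL component, the
volume a workload generates (though not the CPU time spent producing it)
can be measured from psql; a sketch using the 9.6-era function names:

SELECT pg_current_xlog_insert_location() AS before_lsn \gset
-- ... run the workload of interest ...
SELECT pg_xlog_location_diff(pg_current_xlog_insert_location(),
                             :'before_lsn') AS wal_bytes_generated;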

Thank you,
Pete



On Thu, Jul 7, 2016 at 11:10 AM, Simon Riggs  wrote:

> On 7 July 2016 at 17:45, Pete Stevenson  wrote:
>
>> Hi postgresql hackers -
>>
>> I would like to find some analysis (published work, blog posts) on the
>> overheads affiliated with the guarantees provided by MVCC isolation. More
>> specifically, assuming the current workload is CPU bound (as opposed to IO)
>> what is the CPU overhead of generating the WAL, the overhead of version
>> checking and version creation, and of garbage collecting old and
>> unnecessary versions? For what it’s worth, I am working on a research
>> project where it is envisioned that some of this work can be offloaded.
>>
>
> MVCC is a benefit, not an overhead. To understand that you should compare
> MVCC with a system that performs S2PL.
>
> If you're thinking that somehow consistency isn't important, I'd hope that
> you also consider some way to evaluate the costs associated with
> inconsistent and incorrect results in applications, or other architectural
> restrictions imposed to make that possible. It's easy to make assumptions
> in the lab that don't work in the real world.
>
> --
> Simon Riggs                   http://www.2ndQuadrant.com/
> 
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: [HACKERS] gettimeofday is at the end of its usefulness?

2016-07-07 Thread Andres Freund
On 2016-07-07 14:43:31 -0400, Robert Haas wrote:
> On Tue, Jun 14, 2016 at 4:27 PM, Jim Nasby  wrote:
> > Semi-related: someone (Robert I think) recently mentioned investigating
> > "vectorized" executor nodes, where multiple tuples would be processed in one
> > shot. If we had that presumably the explain penalty would be a moot point.
> 
> Yeah, both Andres and I are interested in that, and I think he's
> actively working on it.  It would be quite neat if this had the effect
> of reducing EXPLAIN ANALYZE's overhead to something trivial.

I am, and it does reduce the overhead. Depends on the type of plan
though. Index nestloops e.g. don't benefit on the inner side.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gettimeofday is at the end of its usefulness?

2016-07-07 Thread Robert Haas
On Tue, Jun 14, 2016 at 4:27 PM, Jim Nasby  wrote:
> Semi-related: someone (Robert I think) recently mentioned investigating
> "vectorized" executor nodes, where multiple tuples would be processed in one
> shot. If we had that presumably the explain penalty would be a moot point.

Yeah, both Andres and I are interested in that, and I think he's
actively working on it.  It would be quite neat if this had the effect
of reducing EXPLAIN ANALYZE's overhead to something trivial.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MVCC overheads

2016-07-07 Thread Simon Riggs
On 7 July 2016 at 17:45, Pete Stevenson  wrote:

> Hi postgresql hackers -
>
> I would like to find some analysis (published work, blog posts) on the
> overheads affiliated with the guarantees provided by MVCC isolation. More
> specifically, assuming the current workload is CPU bound (as opposed to IO)
> what is the CPU overhead of generating the WAL, the overhead of version
> checking and version creation, and of garbage collecting old and
> unnecessary versions? For what it’s worth, I am working on a research
> project where it is envisioned that some of this work can be offloaded.
>

MVCC is a benefit, not an overhead. To understand that you should compare
MVCC with a system that performs S2PL.

If you're thinking that somehow consistency isn't important, I'd hope that
you also consider some way to evaluate the costs associated with
inconsistent and incorrect results in applications, or other architectural
restrictions imposed to make that possible. It's easy to make assumptions
in the lab that don't work in the real world.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2016-07-07 Thread Robert Haas
On Wed, Jun 15, 2016 at 3:25 AM, Ashutosh Bapat
 wrote:
> Amit Langote is working on supporting declarative partitioning in PostgreSQL
> [1]. I have started working on supporting partition-wise join. This mail
> describes very high level design and some insight into the performance
> improvements.
>
> An equi-join between two partitioned tables can be broken down into
> pair-wise join between their partitions. This technique is called
> partition-wise join. Partition-wise joins between similarly partitioned
> tables with equi-join condition can be efficient because [2]
> 1. Each provably non-empty partition-wise join is smaller. All such joins
> collectively might be more efficient than the join between their parents.
> 2. Such joins are able to exploit properties of partitions like indexes,
> their storage etc.
> 3. An N-way partition-wise join may have different efficient join orders
> compared to the efficient join order between the parent tables.
>
> A partition-wise join is processed in following stages [2], [3].
> 1. Applicability testing: This phase checks if the join conditions match the
> partitioning scheme. A partition-wise join is efficient if there is an
> equi-join on the partition keys. E.g. join between tables R and S
> partitioned by columns a and b resp. can be broken down into partition-wise
> joins if there exists a join condition is R.a = S.b. Or in other words the
> number of provably non-empty partition-wise joins is O(N) where N is the
> number of partitions.
>
> 2. Matching: This phase determines which joins between the partitions of R
> and S can potentially produce tuples in the join and prunes empty joins
> between partitions.
>
> 3. Clustering: This phase aims at reducing the number of partition-wise
> joins by clubbing together partitions from joining relations. E.g. clubbing
> multiple partitions from either of the partitioned relations which can join
> to a single partition from the other partitioned relation.
>
> 4. Path/plan creation: This phase creates multiple paths for each
> partition-wise join. It also creates Append path/s representing the union of
> partition-wise joins.
>
> The work here focuses on a subset of use-cases discussed in [2]. It only
> considers partition-wise join for join between similarly partitioned tables
> with same number of partitions with same properties, thus producing at most
> as many partition-wise joins as there are partitions. It should be possible
> to apply partition-wise join technique (with some special handling for OUTER
> joins) if both relations have some extra partitions with non-overlapping
> partition conditions, apart from the matching partitions. But I am not
> planning to implement this optimization in the first cut.

I haven't reviewed this code yet due to being busy with 9.6, but I
think this is a very important query planner improvement with the
potential for big wins on queries involving large amounts of data.

Suppose we have a pair of equi-partitioned tables.  Right now, if we
choose to perform a hash join, we'll have to build a giant hash table
with all of the rows from every inner partition and then probe it for
every row in every outer partition.  If there are few enough inner
rows that the resultant hash table still fits in work_mem, this is
somewhat inefficient but not terrible - but if it causes us to have to
batch the hash join where we otherwise would not need to do so, then
it really sucks.  Similarly, if we decide to merge-join each pair of
partitions, a partitionwise join may be able to use an internal sort
on some or all partitions whereas if we had to deal with all of the
data at the same time we'd need an external sort, possibly multi-pass.
And if we choose a nested loop, say over an inner index-scan, we do
O(outer rows) index probes with this optimization but O(outer rows *
inner partitions) index probes without it.
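
To make that concrete, here is a sketch of the intended rewrite, assuming
the declarative partitioning syntax from Amit Langote's patch set (names
and exact syntax are illustrative only):

CREATE TABLE r (a int, r_payload text) PARTITION BY RANGE (a);
CREATE TABLE r1 PARTITION OF r FOR VALUES FROM (0) TO (1000);
CREATE TABLE r2 PARTITION OF r FOR VALUES FROM (1000) TO (2000);
-- s is partitioned identically on column b

SELECT * FROM r JOIN s ON r.a = s.b;
-- could then be planned as an Append of (r1 JOIN s1) and (r2 JOIN s2),
-- each small join free to pick its own hash, merge, or nestloop strategy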

In addition, parallel query can benefit significantly from this kind
of optimization.  Tom recently raised the case of an appendrel where
every child has a parallel-safe path but not every child has a partial
path; currently, we can't go parallel in that case, but it's easy to
see that we could handle it by scheduling the appendrel's children
across a pool of workers.  If we had this optimization, that sort of
thing would be much more likely to be useful, because it could create
appendrels where each member is an N-way join between equipartitioned
tables.  That's particularly important right now because of the
restriction that a partial path must be driven by a Parallel SeqScan,
but even after that restriction is lifted it's easy to imagine that
the effective degree of parallelism for a single index scan may be
limited - so this kind of thing may significantly increase the number
of workers that a given query can use productively.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Bug in batch tuplesort memory CLUSTER case (9.6 only)

2016-07-07 Thread Peter Geoghegan
On Thu, Jul 7, 2016 at 10:51 AM, Robert Haas  wrote:
> Thanks for testing.  I've committed this patch, breaking off one
> unrelated bit into a separate commit.

Thank you. To be clear, I still intend to follow up with a CLUSTER
external sort test case, as outlined to Noah.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-07-07 Thread Robert Haas
On Thu, Jul 7, 2016 at 2:04 PM, Andres Freund  wrote:
>> Yeah, that's true, but I'm having a bit of trouble imagining exactly how
>> we end up with corruption that actually matters.  I guess a torn page
>> could do it.
>
> I think Noah pointed out a bad scenario: If we crash after putting the
> xid in the page header, but before WAL logging, the xid could get reused
> after the crash. By a different transaction. And suddenly the row isn't
> visible anymore, after the reused xid commits...

Oh, wow.  Yikes.  OK, so I guess we should try to back-patch the fix, then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] primary_conninfo missing from pg_stat_wal_receiver

2016-07-07 Thread Robert Haas
On Thu, Jun 30, 2016 at 10:24 AM, Alvaro Herrera
 wrote:
> Also, actually, I see no reason for the conninfo to be shown differently
> regardless of a connection being already established.  If we show the
> conninfo that the server is trying to use, it might be easier to
> diagnose a problem.  In short, I think this is all misconceived (mea
> culpa) and that we should have two conninfo members in that struct as
> initially proposed, one obfuscated and the other not.

Seriously!

The whole problem here is being created by trying to use the same
field for two different purposes:

1. The string that should actually be used for connections.
2. The sanitized version that should be exposed to the user.

If you try to use the same variable to store two different values,
both bugs and confusion may result.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issue with bgworker, SPI and pgstat_report_stat

2016-07-07 Thread Andres Freund
On 2016-07-07 14:04:36 -0400, Robert Haas wrote:
> On Thu, Jul 7, 2016 at 1:52 PM, Julien Rouhaud
>  wrote:
> > Should a bgworker modifying data have to call pgstat_report_stat() to
> > avoid this problem? I don't find any documentation suggesting it, and it
> > seems that worker_spi (used as a template for this bgworker and I
> > suppose a lot of other ones) is also affected.
> 
> That certainly seems like the simplest fix.  Not sure if there's a better one.

I think a better fix would be to unify the startup & error handling
code. We have way to many slightly diverging copies. But that's a bigger
task, so I'm not protesting against making a more localized fix.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-07-07 Thread Andres Freund
On 2016-07-07 14:01:05 -0400, Robert Haas wrote:
> On Thu, Jul 7, 2016 at 1:58 PM, Andres Freund  wrote:
> > On 2016-07-07 10:37:15 -0400, Robert Haas wrote:
> >> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund  wrote:
> >> > Hm. We can't easily do that in the back-patched version; because a
> >> > standby won't know to check for the flag.  That's kinda ok, since we
> >> > don't yet need to clear all-visible at that point of
> >> > heap_update. But that better means we don't do so on the master either.
> >>
> >> Is there any reason to back-patch this in the first place?
> >
> > It seems not unlikely that this has caused corruption in the past; and
> > that we chalked it up to hardware corruption or something. Both toasting
> > and file extension frequently take extended amounts of time under load,
> > the window for crashing in the wrong moment isn't small...
> 
> Yeah, that's true, but I'm having a bit of trouble imagining exactly how
> we end up with corruption that actually matters.  I guess a torn page
> could do it.

I think Noah pointed out a bad scenario: If we crash after putting the
xid in the page header, but before WAL logging, the xid could get reused
after the crash. By a different transaction. And suddenly the row isn't
visible anymore, after the reused xid commits...


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issue with bgworker, SPI and pgstat_report_stat

2016-07-07 Thread Robert Haas
On Thu, Jul 7, 2016 at 1:52 PM, Julien Rouhaud
 wrote:
> While investigating on a bloat issue with a colleague, we found that if
> a bgworker executes some queries with SPI, the statistic changes will
> never be reported, since pgstat_report_stat() is only called in regular
> backends.
>
> In our case, the bgworker is the only process inserting and deleting a
> large amount of data on some tables, so the autovacuum never tried to do
> any maintenance on these tables.

Ouch.

> Should a bgworker modifying data have to call pgstat_report_stat() to
> avoid this problem? I don't find any documentation suggesting it, and it
> seems that worker_spi (used as a template for this bgworker and I
> suppose a lot of other ones) is also affected.

That certainly seems like the simplest fix.  Not sure if there's a better one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-07-07 Thread Robert Haas
On Thu, Jul 7, 2016 at 1:58 PM, Andres Freund  wrote:
> On 2016-07-07 10:37:15 -0400, Robert Haas wrote:
>> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund  wrote:
>> > Hm. We can't easily do that in the back-patched version; because a
>> > standby won't know to check for the flag.  That's kinda ok, since we
>> > don't yet need to clear all-visible at that point of
>> > heap_update. But that better means we don't do so on the master either.
>>
>> Is there any reason to back-patch this in the first place?
>
> It seems not unlikely that this has caused corruption in the past; and
> that we chalked it up to hardware corruption or something. Both toasting
> and file extension frequently take extended amounts of time under load,
> the window for crashing in the wrong moment isn't small...

Yeah, that's true, but I'm having a bit of trouble imagining exactly how
we end up with corruption that actually matters.  I guess a torn page
could do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-07-07 Thread Robert Haas
On Wed, Jul 6, 2016 at 3:29 AM, Kyotaro HORIGUCHI
 wrote:
> This seems to be a good opportunity to show this patch. The
> attched patch set does async execution of foreignscan
> (postgres_fdw) on the Robert's first infrastructure, with some
> modification.

Cool.

> ExecAsyncWaitForNode can get into an infinite wait through recursive
> calls of ExecAsyncWaitForNode caused by ExecProcNode being called from
> async-unaware nodes. Such recursive calls cause a wait on
> already-ready nodes.

Hmm, that's annoying.

> I solved that in the patch set by allocating a separate
> async-execution context for every async-execution subtree that is
> made by ExecProcNode, instead of one async-exec context for the
> whole execution tree. This works fine, but the way of switching
> contexts seems ugly.  This may also be solved by making
> ExecAsyncWaitForNode return when there is no node to wait on, even
> if the waiting node is not ready. That might keep the async-exec
> context (state) simpler, so I'll try it.

I think you should instead try to make ExecAsyncWaitForNode properly reentrant.

> Does the ParallelWorkerSetLatchesForGroup use mutex or semaphore
> or something like that instead of latches?

Why would it do that?

>> BTW, we also need to benchmark those changes to add the parent
>> pointers and change the return conventions and see if they have any
>> measurable impact on performance.
>
> I have to bring you some bad news.
>
> With the attached patch, an append over four foreign scans on one
> (local) server performs about 10% faster, and about twice as fast
> for three or four foreign scans on separate foreign servers
> (connections), but only when compiled with -O0. I found that it
> gains hopelessly little from compiler optimization, while the
> unpatched version gets faster.

Two things:

1. That's not the scenario I'm talking about.  I'm concerned about
making sure that query plans that don't use asynchronous execution
don't get slower.

2. I have to believe that's a defect in your implementation rather
than something intrinsic, or maybe your test scenario is bad.  It's
very hard - really impossible -  to believe that all queries involving
FDW pushdown are locally CPU-bound.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-07-07 Thread Andres Freund
On 2016-07-07 10:37:15 -0400, Robert Haas wrote:
> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund  wrote:
> > Hm. We can't easily do that in the back-patched version; because a
> > standby won't know to check for the flag.  That's kinda ok, since we
> > don't yet need to clear all-visible at that point of
> > heap_update. But that better means we don't do so on the master either.
> 
> Is there any reason to back-patch this in the first place?

It seems not unlikely that this has caused corruption in the past; and
that we chalked it up to hardware corruption or something. Both toasting
and file extension frequently take extended amounts of time under load,
the window for crashing in the wrong moment isn't small...

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in batch tuplesort memory CLUSTER case (9.6 only)

2016-07-07 Thread Robert Haas
On Thu, Jul 7, 2016 at 3:34 AM, Noah Misch  wrote:
>> Are you satisfied that I have adequately described steps to reproduce?
>
> I can confirm that (after 62 minutes) your test procedure reached SIGSEGV
> today and then completed successfully with your patch.

Thanks for testing.  I've committed this patch, breaking off one
unrelated bit into a separate commit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Issue with bgworker, SPI and pgstat_report_stat

2016-07-07 Thread Julien Rouhaud
Hello,

While investigating on a bloat issue with a colleague, we found that if
a bgworker executes some queries with SPI, the statistic changes will
never be reported, since pgstat_report_stat() is only called in regular
backends.

In our case, the bgworker is the only process inserting and deleting a
large amount of data on some tables, so the autovacuum never tried to do
any maintenance on these tables.
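
One way to see the symptom from a regular backend while such a worker is
running (a sketch; the table name is hypothetical):

-- counters stay at zero because the worker never reports its activity
SELECT relname, n_tup_ins, n_tup_del, n_dead_tup, last_autovacuum
FROM pg_stat_all_tables
WHERE relname = 'worker_maintained_table';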

Should a bgworker modifying data have to call pgstat_report_stat() to
avoid this problem? I don't find any documentation suggesting it, and it
seems that worker_spi (used as a template for this bgworker and I
suppose a lot of other ones) is also affected.

If yes, I think at least worker_spi should be fixed (patched attached).

Regards.

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index 314e371..7c9a3eb 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -292,6 +292,7 @@ worker_spi_main(Datum main_arg)
 		SPI_finish();
 		PopActiveSnapshot();
 		CommitTransactionCommand();
+		pgstat_report_stat(false);
 		pgstat_report_activity(STATE_IDLE, NULL);
 	}
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Don't include MMAP DSM's transient files in basebackup

2016-07-07 Thread Andres Freund
On 2016-07-07 08:49:18 +0100, Magnus Hagander wrote:
> On Thursday, July 7, 2016, Andres Freund  wrote:
> 
> > On 2016-07-06 16:46:53 +0300, Oskari Saarenmaa wrote:
> > The files are not useful when restoring a backup and would be
> > automatically deleted on startup anyway.  Also deleted an out-of-date
> > comment in dsm.c.
> >
> > No arguments against the change. But I am wondering how you ran into
> > this - the mmap dsm backend really should pretty much never be used...
> >
> >
> Unfortunately I think that's still what happens by default if you use
> pg_upgradecluster on debian :(

Ugh. Sounds like that should better be fixed before 9.6. It'll make
parallel query seriously slower.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] MVCC overheads

2016-07-07 Thread Pete Stevenson
Hi postgresql hackers -

I would like to find some analysis (published work, blog posts) on the 
overheads affiliated with the guarantees provided by MVCC isolation. More 
specifically, assuming the current workload is CPU bound (as opposed to IO) 
what is the CPU overhead of generating the WAL, the overhead of version 
checking and version creation, and of garbage collecting old and unnecessary 
versions? For what it’s worth, I am working on a research project where it is 
envisioned that some of this work can be offloaded.

Thank you,
Pete Stevenson



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PARALLEL SAFE/UNSAFE question

2016-07-07 Thread Tom Lane
Satoshi Nagayasu  writes:
> I was trying to write my own parallel aggregation functions on 9.6beta2.
> During that, I found somewhat confusing behavior with the SAFE/UNSAFE options.

> Once a PARALLEL UNSAFE function f1_unsafe() is wrapped by
> a PARALLEL SAFE function f1_safe_unsafe(), calling f1_safe_unsafe()
> produces a parallel execution plan even though it implicitly calls
> the PARALLEL UNSAFE function f1_unsafe().

> Is this intentional?

Yes.  If you mismark the parallel safety of your functions, the
consequences are on your own head.  The system can't be expected to
look inside functions to figure out what their actual behavior is.
(See halting problem.)

As things stand, we actually trust the parallel safety marking of
an aggregate itself, though in principle we could look at the markings
of the component functions instead.  Again, the expectation is that
the programmer will take care with safety markings.
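
A sketch of the mismarking in question, using the function names from the
report:

CREATE FUNCTION f1_unsafe() RETURNS int
    LANGUAGE sql PARALLEL UNSAFE AS 'SELECT 1';
-- the SAFE label below is simply taken at face value by the planner:
CREATE FUNCTION f1_safe_unsafe() RETURNS int
    LANGUAGE sql PARALLEL SAFE AS 'SELECT f1_unsafe()';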

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] EXPLAIN ANALYZE for parallel query doesn't report the SortMethod information.

2016-07-07 Thread Fujii Masao
On Fri, Jul 8, 2016 at 12:55 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Jul 7, 2016 at 10:07 AM, Tom Lane  wrote:
>>> Presumably the instrumentation data needed for that is not getting
>>> returned from the worker to the leader.
>
>> Yes.
>
>> ...
>> I'm not sure about the rest of you, but I'd kind of like to finish
>> this release and start working on the next one.
>
> Agreed.  We should make sure that the possible omissions from EXPLAIN
> output are adequately documented, but actually fixing that seems like
> material for a future release cycle.

Fair enough.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] can we optimize STACK_DEPTH_SLOP

2016-07-07 Thread Tom Lane
I found out that pmap can give much more fine-grained results than I was
getting before, if you give it the -x flag and then pay attention to the
"dirty" column rather than the "nominal size" column.  That gives a
reliable indication of how much stack space the process ever actually
touched, with resolution apparently 4KB on my machine.

I redid my measurements with commit 62c8421e8 applied, and now get results
like this for one run of the standard regression tests:

$ grep '\[ stack \]' postmaster.log  | sort -k 4n | uniq -c
137 7fff0f615000      84      36      36 rw---   [ stack ]
 21 7fff0f615000      84      40      40 rw---   [ stack ]
  4 7fff0f615000      84      44      44 rw---   [ stack ]
 20 7fff0f615000      84      48      48 rw---   [ stack ]
  8 7fff0f615000      84      52      52 rw---   [ stack ]
  2 7fff0f615000      84      56      56 rw---   [ stack ]
 10 7fff0f615000      84      60      60 rw---   [ stack ]
  3 7fff0f615000      84      64      64 rw---   [ stack ]
  3 7fff0f615000      84      68      68 rw---   [ stack ]
  2 7fff0f615000      84      72      72 rw---   [ stack ]
  1 7fff0f612000      96      76      76 rw---   [ stack ]
  2 7fff0f60e000     112     112     112 rw---   [ stack ]
  1 7fff0f5e         296     296     296 rw---   [ stack ]
  1 7fff0f427000    2060    2060    2060 rw---   [ stack ]

The rightmost numeric column is the "dirty KB in region" column, and 36KB
is the floor established by the postmaster.  (It looks like selecting
timezone is still the largest stack-space hog in that, but it's no longer
enough to make me want to do something about it.)  So now we're seeing
some cases that exceed that floor, which is good.  regex and errors are
still the outliers, as expected.

Also, I found that on OS X "vmmap -dirty" could produce results comparable
to pmap, so here's the numbers for the same test case on current OS X:

 154  Stack  8192K    36K  2
   5  Stack  8192K    40K  2
  11  Stack  8192K    44K  2
   6  Stack  8192K    48K  2
  11  Stack  8192K    52K  2
   7  Stack  8192K    56K  2
   8  Stack  8192K    60K  2
   2  Stack  8192K    64K  2
   2  Stack  8192K    68K  2
   4  Stack  8192K    72K  2
   1  Stack  8192K    76K  2
   2  Stack  8192K   108K  2
   1  Stack  8192K   384K  2
   1  Stack  8192K  2056K  2

(The "virtual" stack size seems to always be the same as ulimit -s,
ie 8MB by default, on this platform.)  This is good confirmation
that the actual stack consumption is pretty stable across different
compilers, though it looks like OS X's version of clang is a bit
more stack-wasteful for the regex recursion.

Based on these numbers, I'd have no fear of reducing STACK_DEPTH_SLOP
to 256KB on x86_64.  It would sure be good to check things on some
other architectures, though ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Header and comments describing routines in incorrect shape in visibilitymap.c

2016-07-07 Thread Tom Lane
Alvaro Herrera  writes:
> Regarding the first hunk, I don't like these INTERFACE sections too
> much; they get seriously outdated over the time and aren't all that
> helpful anyway.  See the one on heapam.c for example.  I'd rather get
> rid of that one instead of patching it.  The rest, of course, merit
> revision.

+1, as long as we make sure that any useful info therein gets migrated
to the per-function header comments rather than dropped.  If there's
anything that doesn't seem to fit naturally in any per-function comment,
maybe move it into the file header comment?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] EXPLAIN ANALYZE for parallel query doesn't report the SortMethod information.

2016-07-07 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jul 7, 2016 at 10:07 AM, Tom Lane  wrote:
>> Presumably the instrumentation data needed for that is not getting
>> returned from the worker to the leader.

> Yes.

> ...
> I'm not sure about the rest of you, but I'd kind of like to finish
> this release and start working on the next one.

Agreed.  We should make sure that the possible omissions from EXPLAIN
output are adequately documented, but actually fixing that seems like
material for a future release cycle.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel query and temp_file_limit

2016-07-07 Thread Robert Haas
On Tue, Jul 5, 2016 at 3:59 PM, Peter Geoghegan  wrote:
> On Tue, Jul 5, 2016 at 12:58 PM, Tom Lane  wrote:
>> Perhaps we could change the wording of temp_file_limit's description
>> from "space that a session can use" to "space that a process can use"
>> to help clarify this?
>
> That's all that I was looking for, really.

OK, done that way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cluster on NAS and data center.

2016-07-07 Thread Robert Haas
On Tue, Jul 5, 2016 at 7:46 AM, Krzysztof Kaczkowski
 wrote:
> Thanks to the emails in this thread, we have achieved what we wanted. This is what we’ve done:
>
> Compilation (for 32 bit version, initdb with locale=C):
>
> CFLAGS="-mx32 -fexcess-precision=standard -O2"
> CXXFLAGS="-mx32"
>
> configure --without-zlib --disable-float8-byval --without-readline
> --host=x86_64-linux-gnux32
>
> We also have installed libx32.
>
> Right now we received cluster that is fully manageable from both systems.
> Anyone see something dangerous with this compilation?

That seems to cover a fair amount of territory, but it might be unwise
to assume that there are no other issues.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-07-07 Thread Robert Haas
On Thu, Jul 7, 2016 at 10:53 AM, Alvaro Herrera
 wrote:
> Robert Haas wrote:
>> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund  wrote:
>> > Hm. We can't easily do that in the back-patched version; because a
>> > standby won't know to check for the flag.  That's kinda ok, since we
>> > don't yet need to clear all-visible at that point of
>> > heap_update. But that better means we don't do so on the master either.
>>
>> Is there any reason to back-patch this in the first place?
>
> Wasn't this determined to be a pre-existing bug?  I think the
> probability of occurrence has increased, but it's still possible in
> earlier releases.  I wonder if there are unexplained bugs that can be
> traced down to this.
>
> I'm not really following this (sorry about that) but I wonder if (in
> back branches) the failure to propagate in case the standby wasn't
> updated can cause actual problems.  If it does, maybe it'd be a better
> idea to have a new WAL record type instead of piggybacking on lock
> tuple.  Then again, apparently the probability of this bug is low enough
> that we shouldn't sweat over it ... Moreso considering Robert's apparent
> opinion that perhaps we shouldn't backpatch at all in the first place.
>
> In any case I would like to see much more commentary in the patch next
> to the new XLHL flag.  It's sufficiently different from the rest that it
> deserves it, IMO.

There are two issues being discussed on this thread.  One of them is a
new issue in 9.6: heap_lock_tuple needs to clear the all-frozen bit in
the freeze map even though it does not clear all-visible.  The one
that's actually a preexisting bug is that we can start to update a
tuple without WAL-logging anything and then release the page lock in
order to go perform TOAST insertions.  At this point, other backends
(on the master) will see this tuple as in the process of being updated
because xmax has been set and ctid has been made to point back to the
same tuple.

I'm guessing that if the UPDATE goes on to complete, any discrepancy
between the master and the standby is erased by the replay of the WAL
record covering the update itself.  I haven't checked that, but it
seems like that WAL record must set both xmax and ctid appropriately
or we'd be in big trouble.  The infomask bits are in play too, but
presumably the update's WAL is going to set those correctly also.  So
in this case I don't think there's really any issue for the standby.
Or for the master, either: it may technically be true the tuple is not
all-visible any more, but the only backend that could potentially fail
to see it is the one performing the update, and no user code can run
in the middle of toast_insert_or_update, so I think we're OK.

On the other hand, if the UPDATE aborts, there's now a persistent
difference between the master and standby: the infomask, xmax, and
ctid of the tuple may differ.  I don't know whether that could cause
any problem.  It's probably a very rare case, because there aren't all
that many things that will cause us to abort in the middle of
inserting TOAST tuples.  Out of disk space comes to mind, or maybe
some kind of corruption that throws an elog().

As far as back-patching goes, the question is whether it's worth the
risk.  Introducing new WAL logging at this point could certainly cause
performance problems if nothing else, never mind the risk of
garden-variety bugs.  I'm not sure it's worth taking that risk in
released branches for the sake of a bug which has existed for a decade
without anybody finding it until now.  I'm not going to argue strongly
for that position, but I think it's worth thinking about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Header and comments describing routines in incorrect shape in visibilitymap.c

2016-07-07 Thread Alvaro Herrera
Michael Paquier wrote:
> Hi all,
> 
> I just bumped into a couple of things in visibilitymap.c:
> - visibilitymap_clear clears all bits of a visibility map, but its header
> comment says the contrary
> - The header of visibilitymap.c mentions correctly "a bit" when
> referring to setting them, but when clearing, it should say that all
> bits are cleared.
> - visibilitymap_set can set multiple bits
> - visibilitymap_pin can be called to set up more than 1 bit.
> 
> This can be summed by the patch attached.

Regarding the first hunk, I don't like these INTERFACE sections too
much; they get seriously outdated over the time and aren't all that
helpful anyway.  See the one on heapam.c for example.  I'd rather get
rid of that one instead of patching it.  The rest, of course, merit
revision.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-07-07 Thread Alvaro Herrera
Robert Haas wrote:
> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund  wrote:
> > Hm. We can't easily do that in the back-patched version; because a
> > standby won't know to check for the flag.  That's kinda ok, since we
> > don't yet need to clear all-visible at that point of
> > heap_update. But that better means we don't do so on the master either.
> 
> Is there any reason to back-patch this in the first place?

Wasn't this determined to be a pre-existing bug?  I think the
probability of occurrence has increased, but it's still possible in
earlier releases.  I wonder if there are unexplained bugs that can be
traced down to this.

I'm not really following this (sorry about that) but I wonder if (in
back branches) the failure to propagate in case the standby wasn't
updated can cause actual problems.  If it does, maybe it'd be a better
idea to have a new WAL record type instead of piggybacking on lock
tuple.  Then again, apparently the probability of this bug is low enough
that we shouldn't sweat over it ... Moreso considering Robert's apparent
opinion that perhaps we shouldn't backpatch at all in the first place.

In any case I would like to see much more commentary in the patch next
to the new XLHL flag.  It's sufficiently different from the rest that it
deserves it, IMO.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-07-07 Thread Robert Haas
On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund  wrote:
> Hm. We can't easily do that in the back-patched version; because a
> standby won't know to check for the flag.  That's kinda ok, since we
> don't yet need to clear all-visible at that point of
> heap_update. But that better means we don't do so on the master either.

Is there any reason to back-patch this in the first place?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] EXPLAIN ANALYZE for parallel query doesn't report the SortMethod information.

2016-07-07 Thread Robert Haas
On Thu, Jul 7, 2016 at 10:07 AM, Tom Lane  wrote:
> Amit Kapila  writes:
>> On Thu, Jul 7, 2016 at 1:23 PM, Fujii Masao  wrote:
>>> I found $SUBJECT while trying to test parallel queries. Is this a bug?
>
> Presumably the instrumentation data needed for that is not getting
> returned from the worker to the leader.

Yes.

> I would bet there's a lot
> of other plan-node-specific data that doesn't work either.

That's probably true, too.  Generally, what's going to happen here is
that if you have a true parallel query plan, any of this sort of
subsidiary information is going to reflect what the leader did, but
not what the workers did.  If the leader did nothing, as in the case
of force_parallel_mode, then EXPLAIN ANALYZE will show the same thing
that it would have shown if that node had never executed.
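
A minimal reproduction sketch (table and column names are placeholders):

SET force_parallel_mode = on;
EXPLAIN ANALYZE SELECT * FROM some_table ORDER BY some_col;
-- the Sort runs in the worker, so its "Sort Method" line is missing
-- from the leader's EXPLAIN ANALYZE output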

>> I think this can never happen for force_parallel_mode TO off, because
>> we don't generate a gather on top of sort node.  The reason why we are
>> able to push Sort below gather, because it is marked as parallel_safe
>> (create_sort_path).  I think we should not mark it as parallel_safe.
>
> That seems rather ridiculous.  An oversight in managing EXPLAIN data
> is not a sufficient reason to cripple parallel query.

+1.

Fixing this is actually somewhat difficult.  The parallel query stuff
does handle propagating the common instrumentation information from
the workers back to the leader, but the EXPLAIN ANALYZE output can depend
in arbitrary ways on the final executor state tree, which is, of
course, unshared, and which is also not something we can propagate
between backends since executor state nodes don't have (and can't
really support) serialization and deserialization functions.  I think
we can eventually fix this by teaching individual nodes to store the
relevant information in dynamic shared memory rather than
backend-local memory when parallel query is in use: the
Estimate/InitializeDSM callbacks already give the nodes a chance to
obtain control in the right places, except that right now they're only
invoked for parallel-aware nodes.  I think, though, that it will take
more development than we want to undertake at this point in the
cycle.

I'm not sure about the rest of you, but I'd kind of like to finish
this release and start working on the next one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] EXPLAIN ANALYZE for parallel query doesn't report the SortMethod information.

2016-07-07 Thread Amit Kapila
On Thu, Jul 7, 2016 at 7:37 PM, Tom Lane  wrote:
> Amit Kapila  writes:
>> On Thu, Jul 7, 2016 at 1:23 PM, Fujii Masao  wrote:
>>> I found $SUBJECT while trying to test parallel queries. Is this a bug?
>
> Presumably the instrumentation data needed for that is not getting
> returned from the worker to the leader.  I would bet there's a lot
> of other plan-node-specific data that doesn't work either.
>
>> I think this can never happen with force_parallel_mode set to off, because
>> we don't generate a Gather on top of the Sort node.  The reason we are
>> able to push Sort below Gather is that it is marked as parallel_safe
>> (create_sort_path).  I think we should not mark it as parallel_safe.
>
> That seems rather ridiculous.  An oversight in managing EXPLAIN data
> is not a sufficient reason to cripple parallel query.
>

I am analyzing exactly that point, and you seem to be right that we
have missed propagating some information.  We have taken care of
propagating the instrumentation information back to the leader, but it
seems there are other things that need to be taken care of in that area.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] EXPLAIN ANALYZE for parallel query doesn't report the SortMethod information.

2016-07-07 Thread Tom Lane
Amit Kapila  writes:
> On Thu, Jul 7, 2016 at 1:23 PM, Fujii Masao  wrote:
>> I found $SUBJECT while trying to test parallel queries. Is this a bug?

Presumably the instrumentation data needed for that is not getting
returned from the worker to the leader.  I would bet there's a lot
of other plan-node-specific data that doesn't work either.

> I think this can never happen with force_parallel_mode set to off, because
> we don't generate a Gather on top of the Sort node.  The reason we are
> able to push Sort below Gather is that it is marked as parallel_safe
> (create_sort_path).  I think we should not mark it as parallel_safe.

That seems rather ridiculous.  An oversight in managing EXPLAIN data
is not a sufficient reason to cripple parallel query.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Disable WAL completely - Performance and Persistency research

2016-07-07 Thread Michael Paquier
On Thu, Jul 7, 2016 at 5:01 PM, Netanel Katzburg  wrote:
> 1. Disable the WAL by not writing anything to the xlog directory. I don't
> care about recovery/fault tolerance or PITR/replication etc. at the moment.
> I'm aware that the WAL and checkpoints are bound together in many ways and
> are crucial for PG core features.
> Any guidance on how to do so would be appreciated :)

WAL insertion routines are in xloginsert.c. Did you try to play with those?
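
For instance, the crudest possible experiment is to short-circuit
XLogInsert() itself. A minimal sketch, assuming a throwaway research
build of 9.6 (this deliberately breaks recovery, replication and PITR):

    /* In src/backend/access/transam/xloginsert.c -- research hack only. */
    XLogRecPtr
    XLogInsert(RmgrId rmid, uint8 info)
    {
        /*
         * Discard whatever data the caller registered and hand back a
         * fake, monotonically advancing LSN so that callers which store
         * or compare the result keep working.
         */
        static XLogRecPtr fake_lsn = 1;

        XLogResetInsertion();
        return fake_lsn++;
    }
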
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Disable WAL completely - Performance and Persistency research

2016-07-07 Thread Netanel Katzburg
Hi All,

As part of my master's at TAU, I'm currently conducting some research
regarding new persistent memory technology.
I'm using PG for this research and would like to better understand some of
the performance bottlenecks.
For this reason I'm trying to disable the WAL completely, using some hacks
on the source code and compiling my own version.

So what I'm actually looking for, is some guidance about a simple way to:

1. Disable the WAL by not writing anything to the xlog directory. I don't
care about recovery/fault tolerance or PITR/replication etc. at the moment.
I'm aware that the WAL and checkpoints are bound together in many ways and
are crucial for PG core features.
I tried treating all tables as "unlogged" tables by changing the
RelationNeedsWAL macro, as well as the "needs_wal" parameter in storage.c
(a sketch of that change appears below). But I got no performance benefit,
so I guess this was the wrong place to change.

2. Remove the locking around WAL files - I don't care about corrupted
files at the moment; I just want to see what the maximum performance
benefit is that I can get without lock contention.

Any guidance on how to do so would be appreciated :)

Kind regards,
Netanel
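
For reference, the macro override described in point 1 would look
roughly like the sketch below (the stock 9.6 definition is quoted in the
comment). Note that commit records, and the WAL flushes forced at commit
time, are not guarded by RelationNeedsWAL, which may be why this change
alone showed no benefit:

    /* src/include/utils/rel.h -- experimental override, not for production.
     *
     * Stock definition:
     *   #define RelationNeedsWAL(relation) \
     *       ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
     *
     * Forcing it to false makes every relation behave as if unlogged.
     */
    #define RelationNeedsWAL(relation) false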


Re: [HACKERS] EXPLAIN ANALYZE for parallel query doesn't report the SortMethod information.

2016-07-07 Thread Amit Kapila
On Thu, Jul 7, 2016 at 1:23 PM, Fujii Masao  wrote:
> Hi,
>
> I found $SUBJECT while trying to test parallel queries. Is this a bug?
>
>
> In non-parallel mode, EXPLAIN ANALYZE reports the information about
> Sort Method as follows.
>
> =# EXPLAIN ANALYZE SELECT * FROM pgbench_accounts ORDER BY bid;
> QUERY PLAN
> ---
>  Sort  (cost=180739.34..183239.34 rows=100 width=97) (actual
> time=1501.342..1836.057 rows=100 loops=1)
>Sort Key: bid
>Sort Method: external sort  Disk: 104600kB
>->  Seq Scan on pgbench_accounts  (cost=0.00..26394.00 rows=100
> width=97) (actual time=0.013..179.315 rows=100 loops=1)
>
>
> However, in parallel mode, it's not reported, as follows.
>
> =# SET force_parallel_mode TO on;
> =# EXPLAIN ANALYZE SELECT * FROM pgbench_accounts ORDER BY bid;
>QUERY
> PLAN
> -
>  Gather  (cost=181739.34..284239.34 rows=100 width=97) (actual
> time=1507.138..2394.028 rows=100 loops=1)
>Workers Planned: 1
>Workers Launched: 1
>Single Copy: true
>->  Sort  (cost=180739.34..183239.34 rows=100 width=97) (actual
> time=1503.112..1901.117 rows=100 loops=1)
>  Sort Key: bid
>  ->  Seq Scan on pgbench_accounts  (cost=0.00..26394.00
> rows=100 width=97) (actual time=0.021..181.079 rows=100
> loops=1)
>

I think this can never happen with force_parallel_mode set to off, because
we don't generate a Gather on top of the Sort node.  The reason we are
able to push Sort below Gather is that it is marked as parallel_safe
(create_sort_path).  I think we should not mark it as parallel_safe.
Will investigate some more and send a patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlogfile_name_offset() et al and recovery

2016-07-07 Thread Simon Riggs
On 7 July 2016 at 08:26, Michael Paquier  wrote:

> On Thu, Jul 7, 2016 at 3:59 PM, Amit Langote
>  wrote:
> > While reading the thread "BUG #14230: Wrong timeline returned by
> > pg_stop_backup on a standby", I came to know that ThisTimeLineID is
> > invalid on standby.  And because pg_xlogfile_name_offset() uses the same
> > to compute its result, it makes sense to prevent it from being used on a
> > standby.
>
> To be honest, I have always found that this restriction was hard to
> justify on a function that basically performs a static calculation.


I know it's annoying behaviour, but this is not an immutable function, i.e.
not a static calculation.
The timeline is not so much invalid as variable over time.

The behaviour of that function is defined in backbranches, so I suggest we
should not alter that now.

What we can do is have another function that makes it clearer that the
answer is variable:
   pg_xlogfile_name_offset(offset, timelineid)
which always succeeds on standby but needs to be told what timelineid to use.

Then we have another function,
   pg_last_xact_replay_timeline()
which allows you to find out the current timeline.


The actual problem here is somewhat more convoluted, because we would like
to know the timeline of the basebackup at start and stop. That information
is not easily available from the backup label data returned by
pg_stop_backup().
We would like to do something like this...

select pg_xlogfile_name_offset(l.lsn, l.tli) from pg_stop_backup(false) l;

So I suggest we add another column to the output of pg_stop_backup() to
return the tli value directly.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] pg_xlogfile_name_offset() et al and recovery

2016-07-07 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 07 Jul 2016 17:26:15 +0900, Kyotaro HORIGUCHI 
 wrote in 
<20160707.172615.232806225.horiguchi.kyot...@lab.ntt.co.jp>
> > As a result, this relaxation adds more good than bad, and I
> > agree with it.
> > 
> > In addition, I'd like to propose adding a description (or a
> > notice) of this restriction to the documentation, namely that the
> > timeline part of the file name may be wrong. Will post soon.

The following sentence will be added by this patch. Does it make sense?

| Note that these two functions use the current timeline ID to
| construct the file name. They therefore give wrong names for
| transaction log locations that belong to other timelines, which can
| happen at run time on a replication standby when the standby itself
| or one of its upstream servers is promoted.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index baef80d..451f0ec 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17796,7 +17796,11 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
 these functions return the name of the preceding transaction log file.
 This is usually the desired behavior for managing transaction log archiving
 behavior, since the preceding file is the last one that currently
-needs to be archived.
+needs to be archived. Note that these two functions use the current
+timeline ID to construct the file name. They therefore give wrong names
+for transaction log locations that belong to other timelines, which can
+happen at run time on a replication standby when the standby itself or
+one of its upstream servers is promoted.

 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Set sgml-basic-offset to 1 in .dir-locals.el

2016-07-07 Thread Dagfinn Ilmari Mannsåker
Peter Eisentraut  writes:

> On 7/6/16 4:52 AM, Dagfinn Ilmari Mannsåker wrote:
>> This keeps the indentation consistent when editing the documentation
>> using Emacs.
>
> Unfortunately, sgml-basic-offset is not a "safe" variable, so if we put
> it into .dir-locals.el, then users will always be bothered with a
> warning about that.

Ah, I see I've added it to safe-local-variables in my .emacs.  I must
have done that a while back (as evidenced by the date in the patch) and
forgotten about it.  Oh well, never mind.

-- 
- Twitter seems more influential [than blogs] in the 'gets reported in
  the mainstream press' sense at least.   - Matt McLeod
- That'd be because the content of a tweet is easier to condense down
  to a mainstream media article.  - Calle Dybedahl


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlogfile_name_offset() et al and recovery

2016-07-07 Thread Kyotaro HORIGUCHI
I wrote something slightly wrong.

At Thu, 07 Jul 2016 17:20:52 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20160707.172052.213690356.horiguchi.kyot...@lab.ntt.co.jp>
> > > To be honest, I have always found that this restriction was hard to
> > > justify on a function that basically performs a static calculation. So
> > > what about removing this error and using GetXLogReplayRecPtr() to fetch
> > > the timeline ID? As per the attached patch:
> > > =# select * from pg_xlogfile_name_offset(pg_last_xlog_replay_location());
> > > file_name | file_offset
> > > --+-
> > >  00010003 |2192
> > > (1 row)
> > > 
> > > The same applies of course to pg_xlogfile_name():
> > > =# select pg_xlogfile_name(pg_last_xlog_replay_location());
> > >  pg_xlogfile_name
> > > --
> > >  00010003
> > > (1 row)
> > 
> > +1 to enabling these functions' usage on standby and the patch.
> 
> Strongly agreed. At first I thought that using the replay-loc TLI is
> bogus, but ThisTimeLineID for non-recovery time is also bogus to
> almost the same degree.
> 
> > =# select * from pg_xlogfile_name_offset('0/100'::pg_lsn);
> > file_name | file_offset 
> > --+-
> >  0003 | 256
> 
> The rest of the "almost" is master's timeline evolution. A master

That made no sense; it should be the following:

| The rest of the "almost" is the upstreams' timeline evolution
| and the standby's own promotion. A master


> won't get timeline evolution during runtime but a standby can do.
> 
> As a result, this relaxation adds more good than bad, and I
> agree with it.
> 
> In addition, I'd like to propose adding a description (or a
> notice) of this restriction to the documentation, namely that the
> timeline part of the file name may be wrong. Will post soon.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlogfile_name_offset() et al and recovery

2016-07-07 Thread Kyotaro HORIGUCHI
At Thu, 7 Jul 2016 16:48:57 +0900, Amit Langote  
wrote in <3649f1e4-ac74-841e-6ae5-cfe0022aa...@lab.ntt.co.jp>
> On 2016/07/07 16:26, Michael Paquier wrote:
> > On Thu, Jul 7, 2016 at 3:59 PM, Amit Langote
> >  wrote:
> >> While reading the thread "BUG #14230: Wrong timeline returned by
> >> pg_stop_backup on a standby", I came to know that ThisTimeLineID is
> >> invalid on standby.  And because pg_xlogfile_name_offset() uses the same
> >> to compute its result, it makes sense to prevent it from being used on a
> >> standby.
> > 
> > To be honest, I have always found that this restriction was hard to
> > justify on a function that basically performs a static calculation. So
> > what about removing this error and using GetXLogReplayRecPtr() to fetch
> > the timeline ID? As per the attached patch:
> > =# select * from pg_xlogfile_name_offset(pg_last_xlog_replay_location());
> > file_name | file_offset
> > --+-
> >  00010003 |2192
> > (1 row)
> > 
> > The same applies of course to pg_xlogfile_name():
> > =# select pg_xlogfile_name(pg_last_xlog_replay_location());
> >  pg_xlogfile_name
> > --
> >  00010003
> > (1 row)
> 
> +1 to enabling these functions' usage on standby and the patch.

Strongly agreed. At first I thought that using the replay-loc TLI is
bogus, but ThisTimeLineID for non-recovery time is also bogus to
almost the same degree.

> =# select * from pg_xlogfile_name_offset('0/100'::pg_lsn);
> file_name | file_offset 
> --+-
>  0003 | 256

The rest of the "almost" is master's timeline evolution. A master
won't get timeline evolution during runtime but a standby can do.

As a result, this relaxation adds more good than bad, and I
agree with it.

In addition, I'd like to propose adding a description (or a
notice) of this restriction to the documentation, namely that the
timeline part of the file name may be wrong. Will post soon.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Header and comments describing routines in incorrect shape in visibilitymap.c

2016-07-07 Thread Kyotaro HORIGUCHI
At Thu, 7 Jul 2016 16:48:00 +0900, Michael Paquier  
wrote in 
> On Thu, Jul 7, 2016 at 4:41 PM, Kyotaro HORIGUCHI
>  wrote:
> > So, the 'pinned' is either not necessary here or should also be added
> > to _clear.  I think the former is preferable, since it is already
> > stated in the comments for the functions and seems a bit too detailed
> > to be put here.
> >
> > - * visibilitymap_set - set a bit in a previously pinned page
> > + * visibilitymap_set - set bits in the visibility map
> 
> As far as I know, it is possible to set one or two bits,

That's right. 

> so I would
> think that using parentheses makes more sense. Similarly, when pinning a
> page, the caller may want to set just one of the two bits available.
> That's the choice I am trying to outline here.

I'm not strongly opposed to the parentheses. That's fine with me.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] EXPLAIN ANALYZE for parallel query doesn't report the SortMethod information.

2016-07-07 Thread Fujii Masao
Hi,

I found $SUBJECT while trying to test parallel queries. Is this a bug?


In non-parallel mode, EXPLAIN ANALYZE reports the information about
Sort Method as follows.

=# EXPLAIN ANALYZE SELECT * FROM pgbench_accounts ORDER BY bid;
QUERY PLAN
---
 Sort  (cost=180739.34..183239.34 rows=100 width=97) (actual
time=1501.342..1836.057 rows=100 loops=1)
   Sort Key: bid
   Sort Method: external sort  Disk: 104600kB
   ->  Seq Scan on pgbench_accounts  (cost=0.00..26394.00 rows=100
width=97) (actual time=0.013..179.315 rows=100 loops=1)


However, in parallel mode, it's not reported, as follows.

=# SET force_parallel_mode TO on;
=# EXPLAIN ANALYZE SELECT * FROM pgbench_accounts ORDER BY bid;
   QUERY
PLAN
-
 Gather  (cost=181739.34..284239.34 rows=100 width=97) (actual
time=1507.138..2394.028 rows=100 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   ->  Sort  (cost=180739.34..183239.34 rows=100 width=97) (actual
time=1503.112..1901.117 rows=100 loops=1)
 Sort Key: bid
 ->  Seq Scan on pgbench_accounts  (cost=0.00..26394.00
rows=100 width=97) (actual time=0.021..181.079 rows=100
loops=1)

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlogfile_name_offset() et al and recovery

2016-07-07 Thread Amit Langote
On 2016/07/07 16:26, Michael Paquier wrote:
> On Thu, Jul 7, 2016 at 3:59 PM, Amit Langote
>  wrote:
>> While reading the thread "BUG #14230: Wrong timeline returned by
>> pg_stop_backup on a standby", I came to know that ThisTimeLineID is
>> invalid on standby.  And because pg_xlogfile_name_offset() uses the same
>> to compute its result, it makes sense to prevent it from being used on a
>> standby.
> 
> To be honest, I have always found that this restriction was hard to
> justify on a function that basically performs a static calculation. So
> what about removing this error and using GetXLogReplayRecPtr() to fetch
> the timeline ID? As per the attached patch:
> =# select * from pg_xlogfile_name_offset(pg_last_xlog_replay_location());
> file_name | file_offset
> --+-
>  00010003 |2192
> (1 row)
> 
> The same applies of course to pg_xlogfile_name():
> =# select pg_xlogfile_name(pg_last_xlog_replay_location());
>  pg_xlogfile_name
> --
>  00010003
> (1 row)

+1 to enabling these functions' usage on standby and the patch.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Don't include MMAP DSM's transient files in basebackup

2016-07-07 Thread Magnus Hagander
On Thursday, July 7, 2016, Andres Freund  wrote:

> On 2016-07-06 16:46:53 +0300, Oskari Saarenmaa wrote:
> > The files are not useful when restoring a backup and would be
> automatically
> > deleted on startup anyway.  Also deleted an out-of-date comment in dsm.c.
>
> No arguments against the change. But I am wondering how you ran into
> this - the mmap dsm backend really should pretty much never be used...
>
>
Unfortunately I think that's still what happens by default if you use
pg_upgradecluster on Debian :(

//Magnus



-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Header and comments describing routines in incorrect shape in visibilitymap.c

2016-07-07 Thread Michael Paquier
On Thu, Jul 7, 2016 at 4:41 PM, Kyotaro HORIGUCHI
 wrote:
> So, the 'pinned' is either not necessary here or should also be added
> to _clear.  I think the former is preferable, since it is already
> stated in the comments for the functions and seems a bit too detailed
> to be put here.
>
> - * visibilitymap_set - set a bit in a previously pinned page
> + * visibilitymap_set - set bits in the visibility map

As far as I know, it is possible to set one or two bits, so I would
think that using parentheses makes more sense. Similarly, when pinning a
page, the caller may want to set just one of the two bits available.
That's the choice I am trying to outline here.
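
For reference, the two bits in question are defined in
src/include/access/visibilitymap.h in 9.6:

    #define VISIBILITYMAP_ALL_VISIBLE   0x01
    #define VISIBILITYMAP_ALL_FROZEN    0x02
    #define VISIBILITYMAP_VALID_BITS    0x03    /* OR of all valid bits */
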
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in batch tuplesort memory CLUSTER case (9.6 only)

2016-07-07 Thread Peter Geoghegan
On Thu, Jul 7, 2016 at 12:34 AM, Noah Misch  wrote:
>> How do you feel about adding testing to tuplesort.c not limited to
>> hitting this bug (when Valgrind memcheck is used)?
>
> Sounds great, but again, not in the patch fixing this bug.

I'll work on a minimal CLUSTER testcase within the next day or two,
and post a revision. Separately, I'll propose a patch that further
expands tuplesort test coverage. This will add coverage for hash
tuplesorts, as well as coverage of external sorts that require
multiple passes.

>> Are you satisfied that I have adequately described steps to reproduce?
>
> I can confirm that (after 62 minutes) your test procedure reached SIGSEGV
> today and then completed successfully with your patch.

Thanks for going to the trouble of confirming that the test procedure
causes a segmentation fault, and that my patch appears to fix the
issue.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Header and comments describing routines in incorrect shape in visibilitymap.c

2016-07-07 Thread Kyotaro HORIGUCHI
Hello,

At Wed, 6 Jul 2016 11:28:19 +0900, Michael Paquier  
wrote in 
> I just bumped into a couple of things in visibilitymap.c:
> - visibilitymap_clear clears all bits of a visibility map, its header
> comment mentions the contrary
> - The header of visibilitymap.c mentions correctly "a bit" when
> referring to setting them, but when clearing, it should say that all
> bits are cleared.
> - visibilitymap_set can set multiple bits
> - visibilitymap_pin can be called to set up more than 1 bit.
> 
> This can be summed up by the attached patch.

Thank you, I must have been careless in reviewing.


Although some of these are not directly related to the 'a bit'
correction, I have some comments on the edited wording.


- * visibilitymap_pin - pin a map page for setting a bit
+ * visibilitymap_pin - pin a map page for setting bit(s)

Are the parentheses needed? Also, pinning is needed not only
for setting bits but also for clearing them. How about the following?

- * visibilitymap_pin - pin a map page for setting a bit
+ * visibilitymap_pin - pin a map page for changing bits


- * visibilitymap_set - set a bit in a previously pinned page
+ * visibilitymap_set - set bit(s) in a previously pinned page

So, the 'pinned' is either not necessary here or should also be added
to _clear.  I think the former is preferable, since it is already
stated in the comments for the functions and seems a bit too detailed
to be put here.

- * visibilitymap_set - set a bit in a previously pinned page
+ * visibilitymap_set - set bits in the visibility map


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b472d31..7985c1d 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -11,10 +11,10 @@
  *	  src/backend/access/heap/visibilitymap.c
  *
  * INTERFACE ROUTINES
- *		visibilitymap_clear  - clear a bit in the visibility map
- *		visibilitymap_pin	 - pin a map page for setting a bit
+ *		visibilitymap_clear  - clear all bits in the visibility map
+ *		visibilitymap_pin	 - pin a map page for changing bits
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
- *		visibilitymap_set	 - set a bit in a previously pinned page
+ *		visibilitymap_set	 - set bits in the visibility map
  *		visibilitymap_get_status - get status of bits
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_truncate	- truncate the visibility map
@@ -34,7 +34,7 @@
  * might not be true.
  *
  * Clearing visibility map bits is not separately WAL-logged.  The callers
- * must make sure that whenever a bit is cleared, the bit is cleared on WAL
+ * must make sure that whenever bits are cleared, the bits are cleared on WAL
  * replay of the updating operation as well.
  *
  * When we *set* a visibility map during VACUUM, we must write WAL.  This may
@@ -78,8 +78,8 @@
  * When a bit is set, the LSN of the visibility map page is updated to make
  * sure that the visibility map update doesn't get written to disk before the
  * WAL record of the changes that made it possible to set the bit is flushed.
- * But when a bit is cleared, we don't have to do that because it's always
- * safe to clear a bit in the map from correctness point of view.
+ * But when bits are cleared, we don't have to do that because it's always
+ * safe to clear bits in the map from correctness point of view.
  *
  *-
  */
@@ -195,9 +195,9 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
 }
 
 /*
- *	visibilitymap_pin - pin a map page for setting a bit
+ *	visibilitymap_pin - pin a map page for setting bit(s)
  *
- * Setting a bit in the visibility map is a two-phase operation. First, call
+ * Setting bit(s) in the visibility map is a two-phase operation. First, call
  * visibilitymap_pin, to pin the visibility map page containing the bit for
  * the heap page. Because that can require I/O to read the map page, you
  * shouldn't hold a lock on the heap page while doing that. Then, call

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in batch tuplesort memory CLUSTER case (9.6 only)

2016-07-07 Thread Noah Misch
This PostgreSQL 9.6 open item is past due for a status update.

On Sat, Jul 02, 2016 at 08:47:20PM -0700, Peter Geoghegan wrote:
> On Sat, Jul 2, 2016 at 3:17 PM, Robert Haas  wrote:
> > In the interest of clarity, I was not intending to say that there
> > should be a regression test in the patch.  I was intending to say that
> > there should be a test case with the bug report.  I'm not opposed to
> > adding a regression test, and I like the idea of attempting to do so
> > while requiring only a relatively small amount of data by changing
> > maintenance_work_mem, but that wasn't the target at which I was
> > aiming.  Nevertheless, carry on.
> 
> How do you feel about adding testing to tuplesort.c not limited to
> hitting this bug (when Valgrind memcheck is used)?

Sounds great, but again, not in the patch fixing this bug.

> Are you satisfied that I have adequately described steps to reproduce?

I can confirm that (after 62 minutes) your test procedure reached SIGSEGV
today and then completed successfully with your patch.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlogfile_name_offset() et al and recovery

2016-07-07 Thread Michael Paquier
On Thu, Jul 7, 2016 at 3:59 PM, Amit Langote
 wrote:
> While reading the thread "BUG #14230: Wrong timeline returned by
> pg_stop_backup on a standby", I came to know that ThisTimeLineID is
> invalid on standby.  And because pg_xlogfile_name_offset() uses the same
> to compute its result, it makes sense to prevent it from being used on a
> standby.

To be honest, I have always found that this restriction was hard to
justify on a function that basically performs a static calculation. So
what about removing this error and using GetXLogReplayRecPtr() to fetch
the timeline ID? As per the attached patch:
=# select * from pg_xlogfile_name_offset(pg_last_xlog_replay_location());
file_name | file_offset
--+-
 00010003 |2192
(1 row)

The same applies of course to pg_xlogfile_name():
=# select pg_xlogfile_name(pg_last_xlog_replay_location());
 pg_xlogfile_name
--
 00010003
(1 row)
-- 
Michael
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index baef80d..46ca93e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17796,7 +17796,9 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
 these functions return the name of the preceding transaction log file.
 This is usually the desired behavior for managing transaction log archiving
 behavior, since the preceding file is the last one that currently
-needs to be archived.
+needs to be archived. The timeline ID used to build the log file
+name refers to the current timeline on a master server, and to the
+timeline currently being replayed on a server in recovery.

 

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 33383b4..91f8e03 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -464,12 +464,7 @@ pg_xlogfile_name_offset(PG_FUNCTION_ARGS)
 	TupleDesc	resultTupleDesc;
 	HeapTuple	resultHeapTuple;
 	Datum		result;
-
-	if (RecoveryInProgress())
-		ereport(ERROR,
-(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- errmsg("recovery is in progress"),
- errhint("pg_xlogfile_name_offset() cannot be executed during recovery.")));
+	TimeLineID	tli;
 
 	/*
 	 * Construct a tuple descriptor for the result row.  This must match this
@@ -483,11 +478,16 @@ pg_xlogfile_name_offset(PG_FUNCTION_ARGS)
 
 	resultTupleDesc = BlessTupleDesc(resultTupleDesc);
 
+	if (RecoveryInProgress())
+		(void) GetXLogReplayRecPtr(&tli);
+	else
+		tli = ThisTimeLineID;
+
 	/*
 	 * xlogfilename
 	 */
 	XLByteToPrevSeg(locationpoint, xlogsegno);
-	XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno);
+	XLogFileName(xlogfilename, tli, xlogsegno);
 
 	values[0] = CStringGetTextDatum(xlogfilename);
 	isnull[0] = false;
@@ -520,15 +520,15 @@ pg_xlogfile_name(PG_FUNCTION_ARGS)
 	XLogSegNo	xlogsegno;
 	XLogRecPtr	locationpoint = PG_GETARG_LSN(0);
 	char		xlogfilename[MAXFNAMELEN];
+	TimeLineID	tli;
 
 	if (RecoveryInProgress())
-		ereport(ERROR,
-(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- errmsg("recovery is in progress"),
-		 errhint("pg_xlogfile_name() cannot be executed during recovery.")));
+		(void) GetXLogReplayRecPtr(&tli);
+	else
+		tli = ThisTimeLineID;
 
 	XLByteToPrevSeg(locationpoint, xlogsegno);
-	XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno);
+	XLogFileName(xlogfilename, tli, xlogsegno);
 
 	PG_RETURN_TEXT_P(cstring_to_text(xlogfilename));
 }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Don't include MMAP DSM's transient files in basebackup

2016-07-07 Thread Andres Freund
Hi,

On 2016-07-06 16:46:53 +0300, Oskari Saarenmaa wrote:
> The files are not useful when restoring a backup and would be automatically
> deleted on startup anyway.  Also deleted an out-of-date comment in dsm.c.

No arguments against the change. But I am wondering how you ran into
this - the mmap dsm backend really should pretty much never be used...

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Don't include MMAP DSM's transient files in basebackup

2016-07-07 Thread Michael Paquier
On Wed, Jul 6, 2016 at 10:46 PM, Oskari Saarenmaa  wrote:
> The files are not useful when restoring a backup and would be automatically
> deleted on startup anyway.

The same could be said about pg_snapshots.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-07-07 Thread Masahiko Sawada
Thank you for reviewing!

On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund  wrote:
> On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote:
>> diff --git a/src/backend/access/heap/heapam.c 
>> b/src/backend/access/heap/heapam.c
>> index 57da57a..fd66527 100644
>> --- a/src/backend/access/heap/heapam.c
>> +++ b/src/backend/access/heap/heapam.c
>> @@ -3923,6 +3923,17 @@ l2:
>>
>>   if (need_toast || newtupsize > pagefree)
>>   {
>> + /*
>> +  * To prevent data corruption due to updating old tuple by
>> +  * other backends after released buffer
>
> That's not really the reason, is it? The prime problem is crash safety /
> replication. The row-lock we're faking (by setting xmax to our xid)
> prevents concurrent updates until this backend dies.

Fixed.

>> , we need to emit that
>> +  * xmax of old tuple is set and clear visibility map bits if
>> +  * needed before releasing buffer. We can reuse xl_heap_lock
>> +  * for this purpose. It should be fine even if we crash midway
>> +  * from this section and the actual updating one later, since
>> +  * the xmax will appear to come from an aborted xid.
>> +  */
>> + START_CRIT_SECTION();
>> +
>>   /* Clear obsolete visibility flags ... */
>>   oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
>>   oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
>> @@ -3936,6 +3947,46 @@ l2:
>>   /* temporarily make it look not-updated */
>>   oldtup.t_data->t_ctid = oldtup.t_self;
>>   already_marked = true;
>> +
>> + /* Clear PD_ALL_VISIBLE flags */
>> + if (PageIsAllVisible(BufferGetPage(buffer)))
>> + {
>> + all_visible_cleared = true;
>> + PageClearAllVisible(BufferGetPage(buffer));
>> + visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
>> + vmbuffer);
>> + }
>> +
>> + MarkBufferDirty(buffer);
>> +
>> + if (RelationNeedsWAL(relation))
>> + {
>> + xl_heap_lock xlrec;
>> + XLogRecPtr recptr;
>> +
>> + /*
>> +  * For logical decoding we need combocids to properly decode the
>> +  * catalog.
>> +  */
>> + if (RelationIsAccessibleInLogicalDecoding(relation))
>> + log_heap_new_cid(relation, &oldtup);
>
> Hm, I don't see that being necessary here. Row locks aren't logically
> decoded, so there's no need to emit this here.

Fixed.

>
>> + /* Clear PD_ALL_VISIBLE flags */
>> + if (PageIsAllVisible(page))
>> + {
>> + Buffer  vmbuffer = InvalidBuffer;
>> + BlockNumber block = BufferGetBlockNumber(*buffer);
>> +
>> + all_visible_cleared = true;
>> + PageClearAllVisible(page);
>> + visibilitymap_pin(relation, block, &vmbuffer);
>> + visibilitymap_clear(relation, block, vmbuffer);
>> + }
>> +
>
> That clears all-visible unnecessarily, we only need to clear all-frozen.
>

Fixed.

>
>> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record)
>>   }
>>   HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
>>   HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
>> +
>> + /* The visibility map needs to be cleared */
>> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0)
>> + {
>> + RelFileNode rnode;
>> + Buffer  vmbuffer = InvalidBuffer;
>> + BlockNumber blkno;
>> + Relationreln;
>> +
>> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
>> + reln = CreateFakeRelcacheEntry(rnode);
>> +
>> + visibilitymap_pin(reln, blkno, &vmbuffer);
>> + visibilitymap_clear(reln, blkno, vmbuffer);
>> + PageClearAllVisible(page);
>> + }
>> +
>
>
>>   PageSetLSN(page, lsn);
>>   MarkBufferDirty(buffer);
>>   }
>> diff --git a/src/include/access/heapam_xlog.h 
>> b/src/include/access/heapam_xlog.h
>> index a822d0b..41b3c54 100644
>> --- a/src/include/access/heapam_xlog.h
>> +++ b/src/include/access/heapam_xlog.h
>> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info
>>  #define XLHL_XMAX_EXCL_LOCK  0x04
>>  #define XLHL_XMAX_KEYSHR_LOCK0x08
>>  #define XLHL_KEYS_UPDATED0x10
>> +#define XLHL_ALL_VISIBLE_CLEARED 0x20
>
> Hm. We can't easily do that in the back-patched version, because a
> standby won't know to check for the flag. That's kinda ok, since we
> don't yet need to clear all-visible at that point of heap_update. But
> that better means we don't do so on the master either.

Re: [HACKERS] pg_xlogfile_name_offset() et al and recovery

2016-07-07 Thread Amit Langote
On 2016/05/19 17:34, Amit Langote wrote:
> Currently in HEAD and 9.6, one can issue a non-exclusive backup on
> standby, so this is OK:
> 
> select pg_is_in_recovery();
>  pg_is_in_recovery
> ---
>  t
> (1 row)
> 
> select pg_start_backup('sby-bkp-test', 'f', 'f');
>  pg_start_backup
> -
>  0/5000220
> (1 row)
> 
> However the following happens:
> 
> select pg_xlogfile_name_offset(pg_start_backup('sby-bkp-test', 'f', 'f'));
> ERROR:  recovery is in progress
> HINT:  pg_xlogfile_name_offset() cannot be executed during recovery.
> 
> Should this restriction be relaxed or am I missing something?

Answering my own question:

While reading the thread "BUG #14230: Wrong timeline returned by
> pg_stop_backup on a standby", I came to know that ThisTimeLineID is
invalid on standby.  And because pg_xlogfile_name_offset() uses the same
to compute its result, it makes sense to prevent it from being used on a
standby.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

2016-07-07 Thread Michael Paquier
On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini
 wrote:
> After further analysis, the issue is that we retrieve the starttli from
> the ControlFile structure, but it was using ThisTimeLineID when writing
> the backup label.
>
> I've attached a very simple patch that fixes it.

ThisTimeLineID is always set to 0 on purpose on a standby, so we
cannot rely on it (well, it is set temporarily when recycling old
segments). During recovery, when parsing the backup_label file, there is
no actual use of the start segment name, so that's only a cosmetic
change. But surely it would be better to get that fixed, because
that's useful for debugging.

While looking at your patch, I thought that it would have been
tempting to use GetXLogReplayRecPtr() to get the timeline ID when in
recovery, but what we really want to know here is the timeline of the
last REDO pointer, which is starttli, and that's more consistent with
the fact that we use startpoint when writing the backup_label file. In
short, +1 for this fix.
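
For illustration, the tempting alternative would have been a call like
the minimal sketch below, but the TLI it reports is that of the last
record replayed, which can already differ from the timeline of the
backup's starting REDO point after an upstream promotion:

    TimeLineID  replaytli;

    /* TLI of the last replayed record -- not necessarily starttli. */
    (void) GetXLogReplayRecPtr(&replaytli);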

I am adding this to the list of open items, CCing Magnus, whose
commit for non-exclusive backups is at the origin of this defect.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers