Re: [HACKERS] Freeze avoidance of very large table.
On 4/20/15 4:13 PM, Bruce Momjian wrote:

On Mon, Apr 20, 2015 at 03:58:19PM -0500, Jim Nasby wrote: I also think there's better ways we could handle *all* our cleanup work. Tuples have a definite lifespan, and there's potentially a lot of efficiency to be gained if we could track tuples through their stages of life... but I don't see any easy ways to do that.

See the TODO list: https://wiki.postgresql.org/wiki/Todo

  o Avoid the requirement of freezing pages that are infrequently modified

Right, but do you have a proposal for how that would actually happen? Perhaps I'm mis-understanding you, but it sounded like you were opposed to this patch because it doesn't do anything to avoid the need to freeze. My point is that no one has any good ideas on how to avoid freezing, and I think it's a safe bet that any ideas people do come up with there will be a lot more invasive than a FrozenMap is.

Didn't you think any of the TODO threads had workable solutions?

I didn't realize there were threads there. The first three are discussion around the idea of eliminating the need to freeze based on a page already being all visible. No patches.

http://www.postgresql.org/message-id/ca+tgmoaemnolzmvbb8gvy69na8zw9bwpiz9+tlz-lnabozi...@mail.gmail.com has a WIP patch that goes the route of using a tuple flag to indicate frozen, but also raises a lot of concerns about visibility, because it means we'd stop using FrozenXID. That impacts a large amount of code. There were some followup patches as well as a bunch of discussion of how to make it visible that a tuple was frozen or not. That thread died in January 2014.

The fifth thread is XID to LSN mapping. AFAICT this has a significant drawback in that it breaks page compatibility, meaning no pg_upgrade. It ends 5/14/2014 with this comment:

Well, Heikki was saying on another thread that he had kind of gotten cold feet about this, so I gather he's not planning to pursue it. Not sure if I understood that correctly. If so, I guess it depends on whether someone else can pick it up, but we might first want to establish why he got cold feet and how worrying those problems seem to other people. - http://www.postgresql.org/message-id/CA+TgmoYoN8LzSuaffUaEkyV8Mhv1wi=zlbxq3vofezno1db...@mail.gmail.com

So work was done on two alternative approaches, and then abandoned. Both of those approaches might still be valid, but seem to need more work. They're also higher risk because they're changing MVCC at a very fundamental level.

As I mentioned, I think there's a lot better stuff we could be doing about tuple lifetime, but there's no easy fixes to be had. This patch solves a problem today, using a concept that's now well proven (visibility map). If we had something more sophisticated being developed then I'd be inclined not to pursue this patch, but that's not the case. Perhaps others can elaborate on where those two patches are at...

don't expect adding an additional file per relation will be zero cost --- added over the lifetime of 200M transactions, I question if this approach would be a win.

Can you elaborate on this? I don't see how the number of transactions would come into play, but the overhead here is not large; the FrozenMap would be the same size as the VM map, which is 1/64,000th as large as the heap. So a 64G table means a 1M FM. That doesn't seem very expensive.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
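To make the size claim above concrete, the arithmetic works out as follows in a small standalone C sketch (an illustration, not code from the patch; it assumes the default 8 kB block size and one map bit per heap page):

#include <stdio.h>

int main(void)
{
	/* One map bit tracks one 8 kB heap page, so one map byte tracks
	 * 8 heap pages = 64 kB of heap; the map is heap_size / 65536. */
	const long long heap_bytes = 64LL * 1024 * 1024 * 1024;	/* 64 GB table */
	long long heap_pages = heap_bytes / 8192;			/* default BLCKSZ */
	long long map_bytes = heap_pages / 8;				/* one bit per page */

	printf("heap pages: %lld, frozen map: %lld bytes (%lld MB)\n",
		   heap_pages, map_bytes, map_bytes / (1024 * 1024));
	return 0;
}

This prints 8388608 heap pages and a 1 MB map, matching the 64G-table figure above (1/65,536 is the exact ratio behind the rounded "1/64,000th").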
Re: [HACKERS] Freeze avoidance of very large table.
Re: [HACKERS] Freeze avoidance of very large table.
On 4/20/15 2:45 AM, Sawada Masahiko wrote:

Current patch adds new source file src/backend/access/heap/frozenmap.c which is quite similar to visibilitymap.c. They have similar code but are separated for now. I'll do refactoring of this source code, like adding a bitmap.c, if needed.

My feeling is we'd definitely want this refactored; it looks to be a whole lot of duplicated code. But before working on that we should get consensus that a FrozenMap is a good idea. Are there any meaningful differences between the two, besides the obvious name changes? I think there's also a bunch of XLOG stuff that could be refactored too...

Also, when skipping vacuum by visibility map, we can skip at least SKIP_PAGES_THRESHOLD consecutive pages, but such mechanism is not in frozen map.

That's probably something else that can be factored out, since it's basically the same logic. I suspect we just need to adjust some of the checks so we're looking at both FM and VM at the same time.

Other comments...

It would be nice if we didn't need another page bit for FM; do you see any reasonable way that could happen?

+	 * If we didn't pin the visibility (and frozen) map page and the page has
+	 * become all visible (and frozen) while we were busy locking the buffer,
+	 * or during some subsequent window during which we had it unlocked,
+	 * we'll have to unlock and re-lock, to avoid holding the buffer lock
+	 * across an I/O. That's a bit unfortunate, especially since we'll now
+	 * have to recheck whether the tuple has been locked or updated under us,
+	 * but hopefully it won't happen very often.
	 */

s/(and frozen)/ or frozen/

+ * Reply XLOG_HEAP3_FROZENMAP record.

s/Reply/Replay/

+	/*
+	 * XLogReplayBufferExtended locked the buffer. But frozenmap_set
+	 * will handle locking itself.
+	 */
+	LockBuffer(fmbuffer, BUFFER_LOCK_UNLOCK);

Doesn't this create a race condition?

Are you sure the bit in finish_heap_swap() is safe? If so, we should add the same for the visibility map too (it certainly better be all visible if it's frozen...)

+	/*
+	 * Current block is all-visible.
+	 * If frozen map represents that it's all frozen and this
+	 * function is called for freezing tuples, we can skip to
+	 * vacuum block.
+	 */

I would state this as "Even if scan_all is true, we can skip blocks that are marked as frozen."

+	if (frozenmap_test(onerel, blkno, fmbuffer) && scan_all)

I suspect it's faster to reverse those tests (scan_all && frozenmap_test())... but why do we even need to look at scan_all? AFAICT if a block is frozen we can skip it unconditionally.

+	/*
+	 * If the un-frozen tuple is remaining in current page and
+	 * current page is marked as ALL_FROZEN, we should clear it.
+	 */

That needs to NEVER happen. If it does then we're going to consider tuples as visible/frozen that shouldn't be. We should probably throw an error here, because it means the heap is now corrupted. At the minimum it needs to be an Assert().

Note that I haven't reviewed all the logic in detail at this point. If this ends up being refactored it'll be a lot easier to spot logic problems, so I'll hold off on that for now.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
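For the "should never happen" case above, the hardening suggested could look something like this fragment (a sketch only, reusing variable names from the patch under review; all_frozen_in_fm and tuple_needs_freeze are hypothetical placeholders for the patch's actual state, and the real wording and placement would be up to the author):

	/*
	 * Sketch: treat an un-frozen tuple on a page the frozen map claims
	 * is all-frozen as corruption, instead of silently clearing the bit.
	 */
	if (all_frozen_in_fm && tuple_needs_freeze)
		ereport(ERROR,
				(errcode(ERRCODE_DATA_CORRUPTED),
				 errmsg("found unfrozen tuple on page %u of \"%s\", which is marked all-frozen",
						blkno, RelationGetRelationName(onerel))));

An Assert() would catch this in development builds, but an ereport(ERROR) also protects production systems, where asserts are compiled out.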
Re: [HACKERS] Turning off HOT/Cleanup sometimes
Bruce Momjian wrote:

On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote: This seems simple to implement: keep two counters, where the second one is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE, reset the first counter so that further 5 pages will get HOT pruned. 5% seems a bit high though. (In Simon's design, SOME_MAX_VALUE is essentially +infinity.)

This would tend to dirty non-sequential heap pages --- it seems best to just clean as many as we are supposed to, then skip the rest, so we can write sequential dirty pages to storage.

Keep in mind there's a disconnect between dirtying a page and writing it to storage. A page could remain dirty for a long time in the buffer cache. This writing of sequential pages would occur at checkpoint time only, which seems the wrong thing to optimize. If some other process needs to evict pages to make room to read some other page in, surely it's going to try one page at a time, not write many sequential dirty pages.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?
On Mon, Apr 20, 2015 at 4:11 PM, Bruce Momjian br...@momjian.us wrote: Slightly improved patch applied.

Is it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Freeze avoidance of very large table.
On Mon, Apr 20, 2015 at 03:58:19PM -0500, Jim Nasby wrote:

I also think there's better ways we could handle *all* our cleanup work. Tuples have a definite lifespan, and there's potentially a lot of efficiency to be gained if we could track tuples through their stages of life... but I don't see any easy ways to do that.

See the TODO list: https://wiki.postgresql.org/wiki/Todo

  o Avoid the requirement of freezing pages that are infrequently modified

Right, but do you have a proposal for how that would actually happen? Perhaps I'm mis-understanding you, but it sounded like you were opposed to this patch because it doesn't do anything to avoid the need to freeze. My point is that no one has any good ideas on how to avoid freezing, and I think it's a safe bet that any ideas people do come up with there will be a lot more invasive than a FrozenMap is.

Didn't you think any of the TODO threads had workable solutions? And don't expect adding an additional file per relation will be zero cost --- added over the lifetime of 200M transactions, I question if this approach would be a win.

--
Bruce Momjian  br...@momjian.us        http://momjian.us
EnterpriseDB                           http://enterprisedb.com

  + Everyone has their own god. +
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:

Bruce Momjian wrote: I think the limit has to be in terms of a percentage of the table size. For example, if we do one SELECT on a table with all non-dirty pages, it would be good to know that 5% of the pages were pruned --- that tells me that another 19 SELECTs will totally prune the table, assuming no future writes.

This seems simple to implement: keep two counters, where the second one is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE, reset the first counter so that further 5 pages will get HOT pruned. 5% seems a bit high though. (In Simon's design, SOME_MAX_VALUE is essentially +infinity.)

This would tend to dirty non-sequential heap pages --- it seems best to just clean as many as we are supposed to, then skip the rest, so we can write sequential dirty pages to storage.

--
Bruce Momjian  br...@momjian.us        http://momjian.us
EnterpriseDB                           http://enterprisedb.com

  + Everyone has their own god. +
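As a sketch of the two-counter scheme being discussed (an illustration in plain C, not from any patch; PRUNE_LIMIT and the SOME_MAX_VALUE figure are placeholder values):

typedef struct PruneBudget
{
	int		pruned;		/* pages HOT-pruned in the current cycle */
	int		skipped;	/* pages we skipped cleanup in */
} PruneBudget;

#define PRUNE_LIMIT		5		/* "further 5 pages will get HOT pruned" */
#define SOME_MAX_VALUE	1000	/* skipped pages before resetting */

/* Decide per page: prune now, or defer cleanup to a later visitor/vacuum? */
static bool
should_prune(PruneBudget *budget)
{
	if (budget->pruned < PRUNE_LIMIT)
	{
		budget->pruned++;
		return true;
	}
	if (++budget->skipped >= SOME_MAX_VALUE)
	{
		/* window elapsed: let the next PRUNE_LIMIT pages be pruned again */
		budget->pruned = 0;
		budget->skipped = 0;
	}
	return false;
}

Setting SOME_MAX_VALUE to effectively infinity reproduces Simon's design (prune a fixed number of pages per scan, then stop), while a finite value spreads the pruning work across the table over time, which is the behavior Bruce is objecting to above.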
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Apr 20, 2015 at 3:28 PM, Jeff Janes jeff.ja...@gmail.com wrote: But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while the user waits, which is fundamentally VACUUM's duty to do in the background? If there are a handful of very hot pages, then it makes sense not to wait for vacuum to get to them. And that is what a block-count limit does.

I think that's a fundamental mischaracterization of the problem. As soon as you define this as vacuum's problem, then of course it makes no sense to prune in the foreground, ever. But if you define the problem as "get the best overall system performance", then it clearly DOES sometimes make sense to prune in the foreground, as benchmark results upthread demonstrate.

The fact is that on a workload like pgbench - and it doesn't have to be exactly pgbench, just any kind of workload where there are lots of changes to the table - vacuum can at any given time be pruning at most one page of the table. That is because only one vacuum process can be running in a given table at one time, and it can't be doing two things at once. But there can be many processes doing inserts, updates, or deletes on that table, as many as whatever you have max_connections set to. There can easily be dozens even on a well-configured system; on a poorly configured system, there could be hundreds. It seems obvious that if you can have dozens or hundreds of processes creating garbage and at most one process cleaning it up, there will be cases where you get further and further behind.

Now, it might well be that the right solution to that problem is to allow multiple vacuum processes in the same database, or add background workers to help with opportunistic HOT-pruning of pages so it doesn't get done in the foreground. Fine. But as of today, on a heavily-modified table, the ONLY way that we can possibly remove junk from the table as fast as we're creating junk is if the backends touching the table do some of the work.

Now, Simon is making the argument that it should be good enough to have people *modifying* the table help with the cleanup rather than imposing that load on the people who are only *reading* it, and that's not a dumb argument, but there are still cases where that strategy loses - specifically, where the table churn has stopped or paused, but autovacuum hasn't run yet. If you're going to do 1 sequential scan of the table and then go home for the day, HOT-pruning is dumb even in that case. If you're going to do 1000 sequential scans of that table in a row, HOT-pruning may very well be smart. There's no guarantee that the table has met the autovacuum threshold, but HOT-pruning it could well be a win anyway. Or it might be a loss. You can make any policy here look smart or dumb by picking a particular workload, and you don't even have to invent crazy things that will never happen in real life to do it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 14, 2015 at 7:02 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Apr 14, 2015 at 6:22 PM, Peter Geoghegan p...@heroku.com wrote: Why is that good?

We did discuss this before. I've recapped some of what I believe to be the most salient points below. I think that people were all too quick to dismiss the idea of a wall time interval playing some role here (at least as a defense against correlated references, as a correlated reference period). I suppose that that's because it doesn't fit with an intuition that says that that kind of interval ought to be derived algebraically - magic delay settings are considered suspect.

Yep, Tom gave that reason here: http://www.postgresql.org/message-id/11258.1397673...@sss.pgh.pa.us

But there was also this point from Andres - gettimeofday is not free: http://www.postgresql.org/message-id/20140416075307.gc3...@awork2.anarazel.de

And this point from me - this can degrade to random eviction under high pressure: http://www.postgresql.org/message-id/ca+tgmoayuxr55zueapp6d2xbyicjwacc9myyn5at4tindsj...@mail.gmail.com

You'll notice that my proposal avoids all three of those objections.

All I'm saying is that you shouldn't dismiss the idea without trying it out properly. The LRU-K paper, from the early 1990s, recommends this, and gettimeofday() calls were a lot more expensive back then. I'm sure that there is a way to overcome these issues if it turns out to be worth it (by amortizing gettimeofday() calls by driving it from an auxiliary process like the bgwriter, for example). In fact, I'm almost certain there is, because at least one other major database system uses just such a reference period.

Back when ARC (or was it 2Q?) was committed before being reverted shortly thereafter, there was a similar idea actually implemented, but with XIDs preventing correlated references (which the LRU-K paper also hints at). I think that an actual delay is more robust than that, though. Even though this correlated reference delay is all I implemented with my prototype, it's just one piece of the puzzle.

--
Peter Geoghegan
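To illustrate the amortization idea mentioned above (driving the clock from an auxiliary process like the bgwriter), here is a standalone C sketch, not from any patch: one thread refreshes a coarse timestamp that hot paths can read without making a gettimeofday() call of their own.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static _Atomic long coarse_now_usec;	/* published coarse clock */

static void *
clock_updater(void *arg)
{
	(void) arg;
	for (;;)
	{
		struct timeval tv;

		gettimeofday(&tv, NULL);
		atomic_store(&coarse_now_usec, tv.tv_sec * 1000000L + tv.tv_usec);
		usleep(10 * 1000);		/* refresh every 10 ms */
	}
	return NULL;
}

/* Hot path: one atomic load instead of a syscall per buffer touch. */
static long
cheap_now(void)
{
	return atomic_load(&coarse_now_usec);
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, clock_updater, NULL);
	usleep(20 * 1000);			/* let the updater publish a value */
	printf("coarse time: %ld usec\n", cheap_now());
	return 0;
}

A 10 ms granularity is far finer than any plausible correlated reference period, so the loss of precision costs nothing for this use case, while the syscall overhead drops out of the eviction path entirely.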
Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0
On Sun, Apr 19, 2015 at 9:37 PM, Peter Geoghegan p...@heroku.com wrote: Attached patch, V3.4, implements what I believe you and Heikki have in mind here.

Any plans to look at this, Svenne? You are signed up as a reviewer on the commitfest app. A review of the documentation, and interactions with other features like inheritance, updatable views and postgres_fdw would be useful at this point. Obviously I've gone to a lot of effort to document how things fit together at a high level on the UPSERT wiki page, but these aspects have not been commented on all that much.

Thanks
--
Peter Geoghegan
Re: [HACKERS] Freeze avoidance of very large table.
On 4/20/15 2:09 PM, Bruce Momjian wrote:

On Mon, Apr 20, 2015 at 01:59:17PM -0500, Jim Nasby wrote: On 4/20/15 1:48 PM, Bruce Momjian wrote: On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote: Attached WIP patch adds Frozen Map which enables us to avoid whole table vacuuming even when full scan is required: preventing XID wraparound failures. Frozen Map is a bitmap with one bit per heap page, and quite similar to Visibility Map. A set bit means that all tuples on heap page are completely frozen, therefore we don't need to do vacuum freeze that page. A bit is set when vacuum (or autovacuum) figures out that all tuples on corresponding heap page are completely frozen, and a bit is cleared when INSERT and UPDATE (only new heap page) are executed.

So, this patch avoids reading the all-frozen pages if it has not been modified since the last VACUUM FREEZE? Since it is already frozen, the running VACUUM FREEZE will not modify the page or generate WAL, so is it really worth maintaining a new per-page bitmap just to avoid the sequential scan of tables every 200M transactions? I would like to see us reduce the need for VACUUM FREEZE, rather than go in this direction.

How would you propose we do that? I also think there's better ways we could handle *all* our cleanup work. Tuples have a definite lifespan, and there's potentially a lot of efficiency to be gained if we could track tuples through their stages of life... but I don't see any easy ways to do that.

See the TODO list: https://wiki.postgresql.org/wiki/Todo

  o Avoid the requirement of freezing pages that are infrequently modified

Right, but do you have a proposal for how that would actually happen? Perhaps I'm mis-understanding you, but it sounded like you were opposed to this patch because it doesn't do anything to avoid the need to freeze. My point is that no one has any good ideas on how to avoid freezing, and I think it's a safe bet that any ideas people do come up with there will be a lot more invasive than a FrozenMap is.

While not perfect, a FrozenMap is something we can do today, without a lot of effort, and it will provide definite value for any tables that have a good amount of frozen pages. Without performance testing, we don't know what good actually looks like, but we can't test without a patch (which we now have). Assuming performance numbers look good I think it would be folly to reject this patch in the hopes that eventually we'll have some way to avoid the need to freeze.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Apr 20, 2015 at 09:56:20PM +0100, Simon Riggs wrote:

On 20 April 2015 at 20:28, Jeff Janes jeff.ja...@gmail.com wrote: But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while the user waits, which is fundamentally VACUUM's duty to do in the background?

Agreed. I don't see a % as giving us anything at all. The idea is that we want to turn an O(N) problem for one query into an O(1) task.

The use case I see for this is when there is a mixed workload. There is one select which reads the entire table, and hundreds of thousands of selects/updates/inserts that don't, and of course vacuum comes along every now and then and does its thing. Why should the one massive SELECT have horrible performance just because it was run right before autovacuum would have kicked in instead of right after it finished?

+1

You can +1 all you want, but if you ignore the specific workloads I mentioned, you are not going to get much traction.

--
Bruce Momjian  br...@momjian.us        http://momjian.us
EnterpriseDB                           http://enterprisedb.com

  + Everyone has their own god. +
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 20 April 2015 at 20:28, Jeff Janes jeff.ja...@gmail.com wrote:

But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while the user waits, which is fundamentally VACUUM's duty to do in the background?

Agreed. I don't see a % as giving us anything at all. The idea is that we want to turn an O(N) problem for one query into an O(1) task.

The use case I see for this is when there is a mixed workload. There is one select which reads the entire table, and hundreds of thousands of selects/updates/inserts that don't, and of course vacuum comes along every now and then and does its thing. Why should the one massive SELECT have horrible performance just because it was run right before autovacuum would have kicked in instead of right after it finished?

+1

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?
On Mon, Apr 20, 2015 at 05:04:14PM -0400, Robert Haas wrote: On Mon, Apr 20, 2015 at 4:11 PM, Bruce Momjian br...@momjian.us wrote: Slightly improved patch applied. Is it?

The patch has a slightly modified 'if' statement to check a constant before calling a function, and uses 'else if':

< +	if (!interpretOidsOption(stmt->options, true) && cxt.hasoids)
---
> +	if (cxt.hasoids && !interpretOidsOption(stmt->options, true))
47c57
< +	if (interpretOidsOption(stmt->options, true) && !cxt.hasoids)
---
> +	else if (!cxt.hasoids && interpretOidsOption(stmt->options, true))

I realize the change is subtle.

--
Bruce Momjian  br...@momjian.us        http://momjian.us
EnterpriseDB                           http://enterprisedb.com

  + Everyone has their own god. +
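The point of putting the flag test first relies on C's short-circuit evaluation of &&: when the cheap left-hand operand already decides the outcome, the function on the right is never called. A trivial standalone illustration (not related to the patch itself):

#include <stdbool.h>
#include <stdio.h>

static int calls;

static bool
expensive_check(void)
{
	calls++;			/* stands in for interpretOidsOption() */
	return true;
}

int main(void)
{
	bool hasoids = false;

	/* Flag first: expensive_check() is skipped when hasoids is false. */
	if (hasoids && !expensive_check())
		puts("unreachable for this input");

	printf("expensive_check() ran %d times\n", calls);	/* prints 0 */
	return 0;
}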
Re: [HACKERS] parallel mode and parallel contexts
On Thu, Mar 19, 2015 at 11:13 AM, Robert Haas robertmh...@gmail.com wrote:

Here is yet another version of this patch. In addition to the fixes mentioned above, this version includes some minor rebasing around recent commits, and also better handling of the case where we discover that we cannot launch workers after all. This can happen because (1) dynamic_shared_memory_type=none, (2) the maximum number of DSM segments supported by the system configuration are already in use, or (3) the user creates a parallel context with nworkers=0. In any of those cases, the system will now create a backend-private memory segment instead of a dynamic shared memory segment, and will skip steps that don't need to be done in that case. This is obviously an undesirable scenario. If we choose a parallel sequential scan, we want it to launch workers and really run in parallel. Hopefully, in case (1) or case (3), we will avoid choosing a parallel plan in the first place, but case (2) is pretty hard to avoid completely, as we have no idea what other processes may or may not be doing with dynamic shared memory segments ... and, in any case, degrading to non-parallel execution beats failing outright.

I see that you're using git format-patch to generate this. But the patch is only patch 1/4. Is that intentional? Where are the other pieces? I think that the parallel seqscan patch, and the assessing parallel safety patch are intended to fit together with this patch, but I can't find a place where there is a high level overview explaining just how they fit together (I notice Amit's patch has an #include "access/parallel.h", which is here, but that wasn't trivial to figure out). I haven't been paying too much attention to this patch series.

--
Peter Geoghegan
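The degrade-gracefully behavior Robert describes boils down to a pattern like the following fragment (a sketch, not the patch's actual code; the function name and layout are invented, though dsm_create(), its DSM_CREATE_NULL_IF_MAXSEGMENTS flag, dsm_segment_address(), and palloc0() are existing backend APIs):

/*
 * Try to set up shared memory for parallelism; fall back to
 * backend-private memory when no DSM segment can be obtained.
 */
static void *
create_parallel_state(Size size, int nworkers, bool *is_shared)
{
	if (nworkers > 0 && dynamic_shared_memory_type != DSM_IMPL_NONE)
	{
		dsm_segment *seg = dsm_create(size, DSM_CREATE_NULL_IF_MAXSEGMENTS);

		if (seg != NULL)
		{
			*is_shared = true;
			return dsm_segment_address(seg);
		}
	}
	/* No workers possible: plain backend-local allocation instead. */
	*is_shared = false;
	return palloc0(size);
}

The caller then runs the same executor code against either address; the only difference is that in the fallback case no workers are launched, so the query silently runs serially rather than erroring out.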
Re: [HACKERS] Freeze avoidance of very large table.
On 04/20/2015 02:13 PM, Bruce Momjian wrote: Didn't you think any of the TODO threads had workable solutions? And don't expect adding an additional file per relation will be zero cost --- added over the lifetime of 200M transactions, I question if this approach would be a win.

Well, the only real way to test that is a prototype, no?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Parallel Seq Scan
On 2015-04-21 AM 03:29, Robert Haas wrote: On Wed, Apr 8, 2015 at 3:38 AM, Amit Langote wrote: On 08-04-2015 PM 12:46, Amit Kapila wrote: Going forward, I think we can improve the same if we decide not to shutdown parallel workers till postmaster shutdown once they are started and then just allocate them during executor-start phase.

I wonder if it makes sense to invent the notion of a global pool of workers with configurable number of workers that are created at postmaster start and destroyed at shutdown and requested for use when a query uses parallelizable nodes.

Short answer: Yes, but not for the first version of this feature.

Longer answer: We can't actually very reasonably have a global pool of workers so long as we retain the restriction that a backend connected to one database cannot subsequently disconnect from it and connect to some other database instead. However, it's certainly a good idea to reuse the same workers for subsequent operations on the same database, especially if they are also by the same user. At the very minimum, it would be good to reuse the same workers for subsequent operations within the same query, instead of destroying the old ones and creating new ones. Notwithstanding the obvious value of all of these ideas, I don't think we should do any of them for the first version of this feature. This is too big a thing to get perfect on the first try.

Agreed. Perhaps, Amit has worked (is working) on reusing the same workers for subsequent operations within the same query.

Thanks,
Amit
Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW
Hi,

At Mon, 20 Apr 2015 16:40:52 +0900, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote in 5534ad84.3020...@lab.ntt.co.jp:

On 2015/04/17 13:16, Amit Langote wrote: On 17-04-2015 PM 12:35, Etsuro Fujita wrote: (2) that might cause the problem of associating subplans' update information with subplans' scan information, pointed out by Tom [1].

Having realized how it really works now, my +1 to Foreign Modifying Scan for cases of pushed down update as suggested by Albe Laurenz. I guess it would be signaled by the proposed ForeignScan.CmdType being CMD_UPDATE / CMD_DELETE (/CMD_INSERT).

Thanks for the opinion! I think that that is an idea. However, I'd like to propose to rename Foreign Update (Foreign Delete) of ModifyTable simply to Update (Delete) not only because (1) that solves the duplication problem but also because (2) ISTM that is consistent with the non-inherited updates in both of the non-pushed-down-update case and the pushed-down-update case. Here are examples for (2).

An Update node without Foreign that runs Remote SQL looks to me somewhat unusual. It seems to me that the problem is Foreign Updates for ModifyTable that do nothing eventually. Even though I don't understand this fully, especially what Foreign Update for ModifyTable does when Foreign Update in place of Foreign Scan finished substantial work, I think the ForeignUpdate nodes should be removed during planning if they are really ineffective, or such Foreign Updates should be renamed or provided with some explanation in explain output if they do anything or are unremovable for some reason.

If removed it looks like,

| =# explain verbose update p set b = b + 1;
|                                   QUERY PLAN
| ------------------------------------------------------------------------------
|  Update on public.p  (cost=0.00..360.08 rows=4311 width=14)
|    Update on public.p
|    ->  Seq Scan on public.p  (cost=0.00..0.00 rows=1 width=14)
|          Output: p.a, (p.b + 1), p.ctid
|    ->  Foreign Update on public.ft1  (cost=100.00..180.04 rows=2155 width=14)
|          Remote SQL: UPDATE public.t1 SET b = (b + 1)
|    ->  Foreign Update on public.ft2  (cost=100.00..180.04 rows=2155 width=14)
|          Remote SQL: UPDATE public.t2 SET b = (b + 1)

regards,

* Inherited and non-inherited updates for the non-pushed-down case:

postgres=# explain verbose update parent set c1 = trunc(random() * 9 + 1)::int;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Update on public.parent  (cost=0.00..452.06 rows=5461 width=6)
   Update on public.parent
   Update on public.ft1
     Remote SQL: UPDATE public.t1 SET c1 = $2 WHERE ctid = $1
   Update on public.ft2
     Remote SQL: UPDATE public.t2 SET c1 = $2 WHERE ctid = $1
   ->  Seq Scan on public.parent  (cost=0.00..0.01 rows=1 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, parent.ctid
   ->  Foreign Scan on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, ft1.ctid
         Remote SQL: SELECT ctid FROM public.t1 FOR UPDATE
   ->  Foreign Scan on public.ft2  (cost=100.00..226.03 rows=2730 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, ft2.ctid
         Remote SQL: SELECT ctid FROM public.t2 FOR UPDATE
(14 rows)

postgres=# explain verbose update ft1 set c1 = trunc(random() * 9 + 1)::int;
                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Update on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
   Remote SQL: UPDATE public.t1 SET c1 = $2 WHERE ctid = $1
   ->  Foreign Scan on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, ctid
         Remote SQL: SELECT ctid FROM public.t1 FOR UPDATE
(5 rows)

* Inherited and non-inherited updates for the pushed-down case:

postgres=# explain verbose update parent set c1 = c1 + 1;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Update on public.parent  (cost=0.00..376.59 rows=4819 width=10)
   Update on public.parent
   Update on public.ft1
   Update on public.ft2
   ->  Seq Scan on public.parent  (cost=0.00..0.00 rows=1 width=10)
         Output: (parent.c1 + 1), parent.ctid
   ->  Foreign Update on public.ft1  (cost=100.00..188.29 rows=2409 width=10)
         Remote SQL: UPDATE public.t1 SET c1 = (c1 + 1)
   ->  Foreign Update on public.ft2  (cost=100.00..188.29 rows=2409 width=10)
         Remote SQL: UPDATE public.t2 SET c1 = (c1 + 1)
(10 rows)
Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)
Kaigai-san,

I reviewed the Custom/Foreign join API patch again after writing a patch of join push-down support for postgres_fdw.

On 2015/03/26 10:51, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

Or bottom of make_join_rel(). IMO build_join_rel() is responsible for just building (or searching from a list) a RelOptInfo for given relids. After that make_join_rel() calls add_paths_to_joinrel() with appropriate arguments per join type to generate actual Paths that implement the join. make_join_rel() is called only once for particular relid combination, and there are SpecialJoinInfo and restrictlist (conditions specified in JOIN-ON and WHERE), so it seems promising for FDW cases.

I like that idea, but I think we will have a complex hook signature; it won't remain as simple as hook(root, joinrel). Signature of the hook (or the FDW API handler) would be like this:

typedef void (*GetForeignJoinPaths_function) (PlannerInfo *root,
											  RelOptInfo *joinrel,
											  RelOptInfo *outerrel,
											  RelOptInfo *innerrel,
											  JoinType jointype,
											  SpecialJoinInfo *sjinfo,
											  List *restrictlist);

This is very similar to add_paths_to_joinrel(), but lacks semifactors and extra_lateral_rels. semifactors can be obtained with compute_semi_anti_join_factors(), and extra_lateral_rels can be constructed from root->placeholder_list as add_paths_to_joinrel() does. From the viewpoint of postgres_fdw, jointype and restrictlist are necessary to generate a SELECT statement, so it would require most work done in make_join_rel again if the signature was hook(root, joinrel). sjinfo will be necessary for supporting SEMI/ANTI joins, but currently it is not in the scope of postgres_fdw. I guess that other FDWs require at least jointype and restrictlist.

The attached patch adds GetForeignJoinPaths call on make_join_rel() only when 'joinrel' is actually built and both of child relations are managed by same FDW driver, prior to any other built-in join paths. I adjusted the hook definition a little bit, because jointype can be reproduced using SpecialJoinInfo. Right?

Yes, it can be derived from the expression below:

	jointype = sjinfo ? sjinfo->jointype : JOIN_INNER;

Probably, it will solve the original concern towards multiple calls of FDW handler in case when it tries to replace an entire join subtree with a foreign-scan on the result of remote join query. How about your opinion?

AFAIS it’s well-balanced about calling count and available information. New FDW API GetForeignJoinPaths is called only once for a particular combination of join, such as (A join B join C). Before considering all joins in a join level (number of relations contained in the join tree), possible join combinations of lower join level are considered recursively. As Tom pointed out before, say imagine a case like ((huge JOIN large) LEFT JOIN small), expensive path in lower join level might be

Here, please let me summarize the changes in the patch as the result of my review.

* Add set_join_pathlist_hook_type in add_paths_to_joinrel

This hook is intended to provide a chance to add one or more CustomPaths for an actual join combination. If the join is reversible, the hook is called for both A * B and B * A. This is different from the FDW API but it seems fine because FDWs should have chances to process the join at a more abstract level than CSPs. Parameters are same as hash_inner_and_outer, so they would be enough for hash-like or nestloop-like methods. I’m not sure whether mergeclause_list is necessary as a parameter or not. It’s information for merge join which is generated when enable_mergejoin is on and the join is not FULL OUTER. Does some CSP need it for processing a join in its own way? Then it must be in the parameter list because select_mergejoin_clauses is static so it’s not accessible from external modules.

The timing of the hooking, after considering all built-in path types, seems fine because some of CSPs might want to use built-in paths as a template or a source. One concern is in the document of the hook function. “Implementing Custom Paths” says:

A custom scan provider will be also able to add paths by setting the following hook, to replace built-in join paths by custom-scan that performs as if a scan on preliminary joined relations, which is called after the core code has generated what it believes to be the complete and correct set of access paths for the join.

I think “replace” would mislead readers that CSP can remove or edit existing built-in paths listed in RelOptInfo#pathlist or linked from cheapest_foo. IIUC CSP can just add paths for the join relation, and the planner chooses it if it’s the cheapest.

* Add new FDW API
[HACKERS] installcheck missing in src/bin/pg_rewind/Makefile
Hi all,

As mentioned in $subject, the TAP tests of pg_rewind are currently not run by buildfarm machines as the buildfarm client uses installcheck to run the tests in src/bin. A patch is attached to correct the problem.

Regards,
--
Michael

diff --git a/src/bin/pg_rewind/Makefile b/src/bin/pg_rewind/Makefile
index e3400f5..98213c4 100644
--- a/src/bin/pg_rewind/Makefile
+++ b/src/bin/pg_rewind/Makefile
@@ -49,3 +49,6 @@ clean distclean maintainer-clean:
 
 check:
 	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
Re: [HACKERS] Logical Decoding follows timelines
On Fri, Feb 13, 2015 at 4:57 PM, Michael Paquier wrote: Moved patch to CF 2015-02 to not lose track of it, also because it does not seem it received a proper review.

This patch does not apply anymore, so attached is a rebased version. The comments mentioned here have not been addressed: http://www.postgresql.org/message-id/54a7bf61.9080...@vmware.com

Also, what kind of tests have been done? Logical decoding cannot be used while a node is in recovery.

Regards,
--
Michael

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 4a20569..3036ce6 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -216,7 +216,8 @@ static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, Transac
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
-
+static XLogRecPtr GetLatestRequestPtr(void);
+static TimeLineID ReadSendTimeLine(TimeLineID tli);
 
 /* Initialize walsender process before entering the main command loop */
 void
@@ -535,8 +536,6 @@ StartReplication(StartReplicationCmd *cmd)
 
 	if (cmd->timeline != 0)
 	{
-		XLogRecPtr	switchpoint;
-
 		sendTimeLine = cmd->timeline;
 		if (sendTimeLine == ThisTimeLineID)
 		{
@@ -545,18 +544,13 @@ StartReplication(StartReplicationCmd *cmd)
 		}
 		else
 		{
-			List	   *timeLineHistory;
-
 			sendTimeLineIsHistoric = true;
 
 			/*
 			 * Check that the timeline the client requested for exists, and
 			 * the requested start location is on that timeline.
 			 */
-			timeLineHistory = readTimeLineHistory(ThisTimeLineID);
-			switchpoint = tliSwitchPoint(cmd->timeline, timeLineHistory,
-										 &sendTimeLineNextTLI);
-			list_free_deep(timeLineHistory);
+			(void) ReadSendTimeLine(cmd->timeline);
 
 			/*
 			 * Found the requested timeline in the history. Check that
@@ -576,8 +570,8 @@ StartReplication(StartReplicationCmd *cmd)
 			 * that's older than the switchpoint, if it's still in the same
 			 * WAL segment.
 			 */
-			if (!XLogRecPtrIsInvalid(switchpoint) &&
-				switchpoint < cmd->startpoint)
+			if (!XLogRecPtrIsInvalid(sendTimeLineValidUpto) &&
+				sendTimeLineValidUpto < cmd->startpoint)
 			{
 				ereport(ERROR,
 						(errmsg("requested starting point %X/%X on timeline %u is not in this server's history",
 								cmd->timeline),
 						 errdetail("This server's history forked from timeline %u at %X/%X.",
 								   cmd->timeline,
-								   (uint32) (switchpoint >> 32),
-								   (uint32) (switchpoint))));
+								   (uint32) (sendTimeLineValidUpto >> 32),
+								   (uint32) (sendTimeLineValidUpto))));
 			}
-			sendTimeLineValidUpto = switchpoint;
 		}
 	}
 	else
@@ -928,6 +921,8 @@ static void
 StartLogicalReplication(StartReplicationCmd *cmd)
 {
 	StringInfoData buf;
+	XLogRecPtr	FlushPtr;
+	List	   *timeLineHistory;
 
 	/* make sure that our requirements are still fulfilled */
 	CheckLogicalDecodingRequirements();
@@ -940,6 +935,8 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	 * Force a disconnect, so that the decoding code doesn't need to care
 	 * about an eventual switch from running in recovery, to running in a
 	 * normal environment. Client code is expected to handle reconnects.
+	 * This covers the race condition where we are promoted half way
+	 * through starting up.
 	 */
 	if (am_cascading_walsender && !RecoveryInProgress())
 	{
@@ -948,6 +945,14 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 		walsender_ready_to_stop = true;
 	}
 
+	if (am_cascading_walsender)
+	{
+		/* this also updates ThisTimeLineID */
+		FlushPtr = GetStandbyFlushRecPtr();
+	}
+	else
+		FlushPtr = GetFlushRecPtr();
+
 	WalSndSetState(WALSNDSTATE_CATCHUP);
 
 	/* Send a CopyBothResponse message, and start streaming */
@@ -974,6 +979,24 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	logical_startptr = MyReplicationSlot->data.restart_lsn;
 
 	/*
+	 * Find the timeline for the start location, or throw an error.
+	 *
+	 * Logical replication relies upon replication slots. Each slot has a
+	 * single timeline history baked into it, so this should be easy.
+	 */
+	timeLineHistory = readTimeLineHistory(ThisTimeLineID);
+	sendTimeLine = tliOfPointInHistory(logical_startptr, timeLineHistory);
+	if (sendTimeLine != ThisTimeLineID)
+	{
+		sendTimeLineIsHistoric = true;
+		sendTimeLineValidUpto = tliSwitchPoint(sendTimeLine, timeLineHistory,
+											   &sendTimeLineNextTLI);
+	}
+	list_free_deep(timeLineHistory);
+
+	streamingDoneSending = streamingDoneReceiving = false;
+
+	/*
 	 * Report the location after which we'll send out further commits as the
 	 * current sentPtr.
 	 */
@@ -2179,93 +2202,10 @@ XLogSendPhysical(void)
 		return;
 	}
 
-	/* Figure out how far we can safely send the WAL. */
-	if (sendTimeLineIsHistoric)
-	{
-		/*
-		 * Streaming an old timeline that's in this server's history, but is
-		 * not the one we're currently inserting or
Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?
On Thu, Apr 9, 2015 at 05:33:20PM -0400, Bruce Momjian wrote:

On Thu, Apr 9, 2015 at 12:32:23PM -0300, Alvaro Herrera wrote: Bruce Momjian wrote: Should this be listed in the release notes as a backward-incompatibility?

Isn't this a backpatchable bug fix?

Uh, I don't think so. I think users are used to the existing behavior and changing it on them will cause more harm than good. Also, we have had zero field reports about this problem.

The updated attached patch handles cases where the default_with_oids = true.

Slightly improved patch applied.

--
Bruce Momjian  br...@momjian.us        http://momjian.us
EnterpriseDB                           http://enterprisedb.com

  + Everyone has their own god. +
Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)
On 4/20/15 4:40 AM, Sawada Masahiko wrote:

Thank you for updating the patch. One question regarding the v7 (or later) patch is: what is the difference between OBJECT logging and SESSION logging?

In brief, session logging can log anything that happens in a user session while object logging only targets DML and SELECT on selected relations. The pg_audit.log_relation setting is intended to mimic the prior behavior of pg_audit before object logging was added. In general, I would not expect pg_audit.log = 'read, write' to be used with pg_audit.role. In other words, session logging of DML and SELECT should probably not be turned on at the same time as object logging is in use. Object logging is intended to be a fine-grained version of pg_audit.log = 'read, write' (one or both).

I used v9 patch with pg_audit.log_relation = on, and got quite similar but different log as follows.

=# select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: OBJECT,8,1,READ,SELECT,TABLE,public.hoge,select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: SESSION,8,1,READ,SELECT,TABLE,public.hoge,select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: OBJECT,8,1,READ,SELECT,TABLE,public.bar,select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: SESSION,8,1,READ,SELECT,TABLE,public.bar,select * from hoge, bar where hoge.col = bar.col;

The behaviour of SESSION log is similar to OBJECT log now, and SESSION log per session (i.e., pg_audit.log_relation = off) is also similar to 'log_statement = all'. (enhancing log_statement might be enough) So I couldn't understand the need for SESSION logging.

Session logging is quite different from 'log_statement = all'. See: http://www.postgresql.org/message-id/552323b2.8060...@pgmasters.net and/or the "Why pg_audit?" section of the pg_audit documentation. I agree that it may make sense in the future to merge session logging into log_statements, but for now it does provide important additional functionality for creating audit logs.

Regards,
--
- David Steele
da...@pgmasters.net
Re: [HACKERS] alternative compression algorithms?
On 04/20/15 05:07, Andres Freund wrote:

Hi,

On 2015-04-19 22:51:53 +0200, Tomas Vondra wrote: The reason why I'm asking about this is the multivariate statistics patch - while optimizing the planning overhead, I realized that considerable amount of time is spent decompressing the statistics (serialized as bytea), and using an algorithm with better decompression performance (lz4 comes to mind) would help a lot. The statistics may be a few tens/hundreds kB, and in the planner every millisecond counts.

I've a hard time believing that a different compression algorithm is going to be the solution for that. Decompressing, quite possibly repeatedly, several hundreds of kb during planning isn't going to be fun. Even with a good, fast, compression algorithm.

Sure, it's not an ultimate solution, but it might help a bit. I do have other ideas how to optimize this, but in the planner every millisecond counts. Looking at 'perf top' and seeing pglz_decompress() in top 3.

regards
Tomas
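For anyone wanting to test the lz4 idea, a standalone round-trip sketch against the stock liblz4 API looks like this (an illustration, not a proposed patch; compile with -llz4, and in a real experiment the source buffer would be the serialized statistics bytea rather than a string):

#include <lz4.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	const char src[] = "multivariate statistics payload ... imagine ~100 kB here";
	int src_len = (int) sizeof(src);
	int bound = LZ4_compressBound(src_len);
	char *comp = malloc(bound);
	char *back = malloc(src_len);

	/* compress, then decompress, as the planner would on each use */
	int comp_len = LZ4_compress_default(src, comp, src_len, bound);
	int n = LZ4_decompress_safe(comp, back, comp_len, src_len);

	printf("%d -> %d bytes, roundtrip %s\n", src_len, comp_len,
		   (n == src_len && memcmp(src, back, src_len) == 0) ? "ok" : "FAILED");
	free(comp);
	free(back);
	return 0;
}

Timing LZ4_decompress_safe() against pglz_decompress() on a representative statistics blob would directly answer whether the decompression shows up in 'perf top' because of the algorithm or because of how often it is called.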
Re: [HACKERS] Freeze avoidance of very large table.
On Tue, Apr 7, 2015 at 11:22 AM, Sawada Masahiko sawada.m...@gmail.com wrote:

On Tue, Apr 7, 2015 at 7:53 AM, Jim Nasby jim.na...@bluetreble.com wrote: On 4/6/15 5:18 PM, Greg Stark wrote: Only I would suggest thinking of it in terms of two orthogonal boolean flags rather than three states. It's easier to reason about whether a table has a specific property than trying to control a state machine in a predefined pathway. So I would say the two flags are:

  READONLY: guarantees nothing can be dirtied
  ALLFROZEN: guarantees no unfrozen tuples are present

In practice you can't have the latter without the former since vacuum can't know everything is frozen unless it knows nobody is inserting. But perhaps there will be cases in the future where that's not true.

I'm not so sure about that. There's a logical state progression here (see below). ISTM it's easier to just enforce that in one place instead of a bunch of places having to check multiple conditions. But, I'm not wed to a single field.

Incidentally there are a number of other optimisations I've had in mind that are only possible on frozen read-only tables:

1) Compression: compress the pages and pack them one after the other. Build a new fork with offsets for each page.

2) Automatic partition elimination where the statistics track the minimum and maximum value per partition (and number of tuples) and treat them as implicit constraints. In particular it would magically make read only empty parent partitions be excluded regardless of the where clause.

AFAICT neither of those actually requires ALLFROZEN, no? You'll need to uncompact and re-compact for #1 when you actually freeze (which maybe isn't worth it), but freezing isn't absolutely required. #2 would only require that everything in the relation is visible; not frozen. I think there's value here to having an ALLVISIBLE state as well as ALLFROZEN.

Based on many suggestions, I'm going to deal with FM at first as one patch. It would be a simple mechanism similar to the VM, in the first patch.

- Each bit of FM represents a single page
- The bit is set only by vacuum
- The bit is un-set by inserting and updating and deleting

Second, I'll deal with a simple read-only table and 2 states, Read/Write (default) and ReadOnly, as one patch. ISTM that having the Frozen state needs more discussion. A read-only table just allows us to disable any updating of the table, and it's controlled by a read-only flag pg_class has. And the DDL command which changes these states is like ALTER TABLE SET READ ONLY, or READ WRITE. Also as Alvaro suggested, the read-only table affects not only freezing the table but also performance optimization. I'll consider including them when I deal with the read-only table.

Attached WIP patch adds Frozen Map which enables us to avoid whole table vacuuming even when full scan is required: preventing XID wraparound failures.

Frozen Map is a bitmap with one bit per heap page, and quite similar to Visibility Map. A set bit means that all tuples on heap page are completely frozen, therefore we don't need to do vacuum freeze that page. A bit is set when vacuum (or autovacuum) figures out that all tuples on corresponding heap page are completely frozen, and a bit is cleared when INSERT and UPDATE (only new heap page) are executed.

Current patch adds new source file src/backend/access/heap/frozenmap.c which is quite similar to visibilitymap.c. They have similar code but are separated for now. I'll do refactoring of this source code, like adding a bitmap.c, if needed.

Also, when skipping vacuum by visibility map, we can skip at least SKIP_PAGES_THRESHOLD consecutive pages, but such mechanism is not in frozen map.

Please give me feedbacks.

Regards,

---
Sawada Masahiko

diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index b83d496..53f07fd 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/heap
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
+OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o frozenmap.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/frozenmap.c b/src/backend/access/heap/frozenmap.c
new file mode 100644
index 0000000..6e64cb8
--- /dev/null
+++ b/src/backend/access/heap/frozenmap.c
@@ -0,0 +1,567 @@
+/*-------------------------------------------------------------------------
+ *
+ * frozenmap.c
+ *	  bitmap for tracking frozen heap tuples
+ *
+ * Portions Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/heap/frozenmap.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/frozenmap.h"
+#include "access/heapam_xlog.h"
+#include "access/xlog.h"
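To make the map mechanics concrete, here is a standalone sketch of the addressing math such a one-bit-per-page map uses (an illustration only; the real frozenmap.c in the patch stores the bits in buffered 8 kB pages and handles pinning, locking, and WAL, all of which this ignores):

#include <stdbool.h>
#include <stdint.h>

#define MAP_BYTE(blk)	((blk) / 8)		/* which map byte holds this page's bit */
#define MAP_BIT(blk)	((blk) % 8)		/* which bit within that byte */

static bool
fm_test(const uint8_t *map, uint32_t heapblk)
{
	return (map[MAP_BYTE(heapblk)] >> MAP_BIT(heapblk)) & 1;
}

/* set by vacuum once every tuple on the page is known frozen */
static void
fm_set(uint8_t *map, uint32_t heapblk)
{
	map[MAP_BYTE(heapblk)] |= (uint8_t) (1 << MAP_BIT(heapblk));
}

/* cleared whenever an INSERT/UPDATE puts a new, unfrozen tuple on the page */
static void
fm_clear(uint8_t *map, uint32_t heapblk)
{
	map[MAP_BYTE(heapblk)] &= (uint8_t) ~(1 << MAP_BIT(heapblk));
}

The correctness-critical invariant is the one Jim calls out upthread: fm_set() may only run when the all-frozen claim is actually true, and every writer that can invalidate the claim must call fm_clear() first, since vacuum will skip any page whose bit is set.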
Re: [HACKERS] Replication identifiers, take 4
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:

I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of replication identifier -> replication origin to the patch. Or even transaction origin.

Sounds good to me.

Greetings,

Andres Freund
Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)
On Thu, Apr 16, 2015 at 2:34 AM, David Steele da...@pgmasters.net wrote:

On 4/15/15 11:30 AM, Sawada Masahiko wrote: On Wed, Apr 15, 2015 at 10:52 AM, Sawada Masahiko sawada.m...@gmail.com wrote: I tested v8 patch with CURSOR case I mentioned before, and got segmentation fault again. Here are log messages in my environment,

=# select test();
LOG: server process (PID 29730) was terminated by signal 11: Segmentation fault
DETAIL: Failed process was running: select test();
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
FATAL: the database system is in recovery mode

I investigated this problem and inform you my result here. When we execute the test() function I mentioned, three stack items in total are stored into auditEventStack. Two of them are freed by stack_pop() -> stack_free() (i.e., stack_free() is called by stack_pop()). One of them is freed by PortalDrop() -> ... -> MemoryContextDeleteChildren() -> ... -> stack_free(). And it is freed at the same time as deleting the pg_audit memory context, and the stack will be completely empty. But after freeing all items, the finish_xact_command() function could call PortalDrop(), so the ExecutorEnd() function could be called again. pg_audit has ExecutorEnd_hook, so postgres tries to free that item... SEGV.

In my environment, the following change fixes it.

--- pg_audit.c.org	2015-04-15 14:21:07.541866525 +0900
+++ pg_audit.c	2015-04-15 11:36:53.758877339 +0900
@@ -1291,7 +1291,7 @@
 	standard_ExecutorEnd(queryDesc);
 
 	/* Pop the audit event off the stack */
-	if (!internalStatement)
+	if (!internalStatement && auditEventStack != NULL)
 	{
 		stack_pop(auditEventStack->stackId);
 	}

It might be a good idea to add an Assert() before calling stack_pop(). I'm not sure this change is exactly correct, but I hope this information helps you.

I appreciate you taking the time to look - this is the same conclusion I came to. I also hardened another area that I thought might be vulnerable. I've seen this problem with explicit cursors before (and fixed it in another place a while ago). The memory context that is current in ExecutorStart is freed before ExecutorEnd is called only in this case as far as I can tell. I'm not sure this is very consistent behavior.

I have attached patch v9 which fixes this issue as you suggested, but I'm not completely satisfied with it. It seems like there could be an unintentional pop from the stack in a case of deeper nesting. This might not be possible but it's hard to disprove. I'll think about it some more, but meanwhile this patch addresses the present issue.

Thank you for updating the patch. One question regarding the v7 (or later) patch is: what is the difference between OBJECT logging and SESSION logging? I used v9 patch with pg_audit.log_relation = on, and got quite similar but different log as follows.

=# select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: OBJECT,8,1,READ,SELECT,TABLE,public.hoge,select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: SESSION,8,1,READ,SELECT,TABLE,public.hoge,select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: OBJECT,8,1,READ,SELECT,TABLE,public.bar,select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: SESSION,8,1,READ,SELECT,TABLE,public.bar,select * from hoge, bar where hoge.col = bar.col;

The behaviour of SESSION log is similar to OBJECT log now, and SESSION log per session (i.e., pg_audit.log_relation = off) is also similar to 'log_statement = all'. (enhancing log_statement might be enough) So I couldn't understand the need for SESSION logging.

Regards,

---
Sawada Masahiko
Re: [HACKERS] Supporting TAP tests with MSVC and Windows
On Sun, Apr 19, 2015 at 10:01 PM, Michael Paquier michael.paqu...@gmail.com wrote: Note as well that this patch uses the following patches fixing independent issues: ...

Attached is v4. I added a switch in config.pl to be consistent with ./configure and --enable-tap-tests.
--
Michael

From f532bbd2522b3a1784038d9863325d055fc445bf Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 20 Apr 2015 04:57:37 -0700
Subject: [PATCH 1/2] Add support for TAP tests on Windows

Nodes initialized by the TAP tests use SSPI to securely perform the tests, and test scripts are patched in a couple of places to support Windows grammar. In the case of MSVC, tests can be run with this command: vcregress tapcheck
---
 .gitignore                                     |   3 +
 doc/src/sgml/install-windows.sgml              |   1 +
 src/Makefile.global.in                         |   2 +-
 src/bin/initdb/t/001_initdb.pl                 |  18 ++--
 src/bin/pg_basebackup/t/010_pg_basebackup.pl   |  79 +---
 src/bin/pg_controldata/t/001_pg_controldata.pl |   5 +-
 src/bin/pg_ctl/t/001_start_stop.pl             |  14 ++-
 src/bin/pg_ctl/t/002_status.pl                 |  12 ++-
 src/bin/pg_rewind/RewindTest.pm                | 119 -
 src/bin/scripts/t/020_createdb.pl              |   3 +
 src/test/perl/TestLib.pm                       |  25 --
 src/tools/msvc/Solution.pm                     |   1 +
 src/tools/msvc/config_default.pl               |   1 +
 src/tools/msvc/vcregress.pl                    |  60 -
 14 files changed, 271 insertions(+), 72 deletions(-)

diff --git a/.gitignore b/.gitignore
index 8d3af50..3cd37fe 100644
--- a/.gitignore
+++ b/.gitignore
@@ -36,3 +36,6 @@ lib*.pc
 /pgsql.sln.cache
 /Debug/
 /Release/
+
+# Generated by tests
+/tmp_check/
diff --git a/doc/src/sgml/install-windows.sgml b/doc/src/sgml/install-windows.sgml
index d154b44..2047790 100644
--- a/doc/src/sgml/install-windows.sgml
+++ b/doc/src/sgml/install-windows.sgml
@@ -439,6 +439,7 @@ $ENV{CONFIG}="Debug";
 <userinput>vcregress modulescheck</userinput>
 <userinput>vcregress ecpgcheck</userinput>
 <userinput>vcregress isolationcheck</userinput>
+<userinput>vcregress tapcheck</userinput>
 <userinput>vcregress upgradecheck</userinput>
 </screen>
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 4b06fc2..b27ed78 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -324,7 +324,7 @@ endef
 define prove_check
 	$(MKDIR_P) tmp_check/log
 	$(MAKE) -C $(top_builddir) DESTDIR='$(CURDIR)'/tmp_check/install install >'$(CURDIR)'/tmp_check/log/install.log 2>&1
-	cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(CURDIR)/tmp_check/install$(bindir):$$PATH" $(call add_to_path,$(ld_library_path_var),$(CURDIR)/tmp_check/install$(libdir)) top_builddir='$(CURDIR)/$(top_builddir)' PGPORT='6$(DEF_PGPORT)' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
+	cd $(srcdir) && TESTDIR='$(CURDIR)' TESTREGRESS='$(top_builddir)/src/test/regress/pg_regress' PATH="$(CURDIR)/tmp_check/install$(bindir):$$PATH" $(call add_to_path,$(ld_library_path_var),$(CURDIR)/tmp_check/install$(libdir)) top_builddir='$(CURDIR)/$(top_builddir)' PGPORT='6$(DEF_PGPORT)' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
 endef
 
 else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index d12be84..0865107 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -1,6 +1,8 @@
 use strict;
 use warnings;
+use Config;
 use TestLib;
+use File::Path qw(rmtree);
 use Test::More tests => 19;
 
 my $tempdir = TestLib::tempdir;
@@ -18,27 +20,31 @@ command_fails([ 'initdb', '-S', "$tempdir/data3" ],
 mkdir "$tempdir/data4" or BAIL_OUT($!);
 command_ok([ 'initdb', "$tempdir/data4" ], 'existing empty data directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
-
+rmtree($tempdir);
+mkdir $tempdir;
 command_ok([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'separate xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 command_fails(
 	[ 'initdb', "$tempdir/data", '-X', 'pgxlog' ],
 	'relative xlog directory not allowed');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 mkdir "$tempdir/pgxlog";
 command_ok([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'existing empty xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 mkdir "$tempdir/pgxlog";
 mkdir "$tempdir/pgxlog/lost+found";
 command_fails([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'existing nonempty xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 command_ok([ 'initdb', '-T', 'german', "$tempdir/data" ],
 	'select default dictionary');
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 7e9a776..4cb1b5a 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -1,8 +1,9 @@
 use
Re: [HACKERS] Replication identifiers, take 4
On 04/17/2015 11:45 PM, Petr Jelinek wrote: The argument to move to 4 bytes is a poor one. If it was reasonable in terms of code or cosmetic value then all values used in the backend would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do that. The change does nothing useful, since I doubt anyone will ever need 32768 nodes in their cluster. And if they did there would be other much bigger problems than the replication identifier being 16-bit (the limit is actually 65534, as it's unsigned, btw). Can you name some of the bigger problems you'd have? Obviously, if you have 10 high-volume OLTP nodes connected to a single server, feeding transactions as a continuous stream, you're going to choke the system. But you might have 10 tiny satellite databases that sync up with the master every few hours, and each of them do only a few updates per day. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-04-20 11:09:20 +0300, Heikki Linnakangas wrote: Can you name some of the bigger problems you'd have? Several parts of the system are O(#max_replication_slots). Having 100k outgoing logical replication slots is going to be expensive. Nothing unsolvable, but the 65k 16 bit limit surely isn't going to be the biggest problem. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW
On 2015/04/17 13:16, Amit Langote wrote: On 17-04-2015 PM 12:35, Etsuro Fujita wrote: (2) that might cause the problem of associating subplans' update information with subplans' scan information, pointed out by Tom [1]. Having realized how it really works now, my +1 to Foreign Modifying Scan for cases of pushed-down update as suggested by Albe Laurenz. I guess it would be signaled by the proposed ForeignScan.CmdType being CMD_UPDATE / CMD_DELETE (/ CMD_INSERT). Thanks for the opinion! I think that is one idea. However, I'd like to propose renaming Foreign Update (Foreign Delete) of ModifyTable simply to Update (Delete), not only because (1) that solves the duplication problem, but also because (2) ISTM that is consistent with the non-inherited updates in both the non-pushed-down-update case and the pushed-down-update case. Here are examples for (2).

* Inherited and non-inherited updates for the non-pushed-down case:

postgres=# explain verbose update parent set c1 = trunc(random() * 9 + 1)::int;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Update on public.parent  (cost=0.00..452.06 rows=5461 width=6)
   Update on public.parent
   Update on public.ft1
     Remote SQL: UPDATE public.t1 SET c1 = $2 WHERE ctid = $1
   Update on public.ft2
     Remote SQL: UPDATE public.t2 SET c1 = $2 WHERE ctid = $1
   ->  Seq Scan on public.parent  (cost=0.00..0.01 rows=1 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, parent.ctid
   ->  Foreign Scan on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, ft1.ctid
         Remote SQL: SELECT ctid FROM public.t1 FOR UPDATE
   ->  Foreign Scan on public.ft2  (cost=100.00..226.03 rows=2730 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, ft2.ctid
         Remote SQL: SELECT ctid FROM public.t2 FOR UPDATE
(14 rows)

postgres=# explain verbose update ft1 set c1 = trunc(random() * 9 + 1)::int;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Update on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
   Remote SQL: UPDATE public.t1 SET c1 = $2 WHERE ctid = $1
   ->  Foreign Scan on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, ctid
         Remote SQL: SELECT ctid FROM public.t1 FOR UPDATE
(5 rows)

* Inherited and non-inherited updates for the pushed-down case:

postgres=# explain verbose update parent set c1 = c1 + 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Update on public.parent  (cost=0.00..376.59 rows=4819 width=10)
   Update on public.parent
   Update on public.ft1
   Update on public.ft2
   ->  Seq Scan on public.parent  (cost=0.00..0.00 rows=1 width=10)
         Output: (parent.c1 + 1), parent.ctid
   ->  Foreign Update on public.ft1  (cost=100.00..188.29 rows=2409 width=10)
         Remote SQL: UPDATE public.t1 SET c1 = (c1 + 1)
   ->  Foreign Update on public.ft2  (cost=100.00..188.29 rows=2409 width=10)
         Remote SQL: UPDATE public.t2 SET c1 = (c1 + 1)
(10 rows)

postgres=# explain verbose update ft1 set c1 = c1 + 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Update on public.ft1  (cost=100.00..188.29 rows=2409 width=10)
   ->  Foreign Update on public.ft1  (cost=100.00..188.29 rows=2409 width=10)
         Remote SQL: UPDATE public.t1 SET c1 = (c1 + 1)
(3 rows)

Comments are welcome. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Sep 29, 2014 at 2:13 AM, Andres Freund and...@anarazel.de wrote: On 2014-09-28 19:51:36 +0100, Simon Riggs wrote: On 27 September 2014 09:29, Andres Freund and...@anarazel.de wrote: On 2014-09-27 10:23:33 +0300, Heikki Linnakangas wrote: This patch has gotten a fair amount of review, and has been rewritten once during the commitfest. I think it's pretty close to being committable, the only remaining question seems to be what to do with system catalogs. I'm marking this as Returned with feedback, I take it that Simon can proceed from here, outside the commitfest. FWIW, I don't think it is, even with that. As is it seems very likely that it's going to regress a fair share of workloads. At the very least it needs a fair amount of benchmarking beforehand. There is some doubt there. We've not seen a workload that does actually exhibit a negative behaviour. Neither is there much data about the magnitude of positive effect the patch has... I'm not saying one doesn't exist, but it does matter how common/likely it is. If anyone can present a performance test case that demonstrates a regression, I think it will make it easier to discuss how wide that case is and what we should do about it. Discussing whether to do various kinds of limited pruning is moot until that is clear. I doubt it'll be hard to construct a case where it'll show. My first try of using a pgbench scale 100, -M prepared, -cj8 with a custom file with 1 write and 5 read transactions yielded the following on my laptop:

Baseline:
 relname                | pgbench_tellers
 pg_total_relation_size | 458752
 relname                | pgbench_accounts
 pg_total_relation_size | 1590337536
 relname                | pgbench_branches
 pg_total_relation_size | 286720
 relname                | pgbench_history
 pg_total_relation_size | 49979392

Patched:
 relname                | pgbench_tellers
 pg_total_relation_size | 516096
 relname                | pgbench_accounts
 pg_total_relation_size | 1590337536
 relname                | pgbench_branches
 pg_total_relation_size | 360448
 relname                | pgbench_history
 pg_total_relation_size | 49528832

So, there's a noticeable increase in size, mostly on the smaller tables, so probably HOT cleanup was sometimes skipped during UPDATEs due to locks.

Baseline was: tps = 9655.486532 (excluding connections establishing)
Patched was: tps = 9466.158701 (including connections establishing)

Was this reproducible? I've run your custom sql file with 4 clients (that is how many CPUs I have) on a machine with a BBU. I had wal_level = hot_standby, but the archive_command just returned true without archiving anything. And using the latest patch. The size of the pgbench_tellers and pgbench_branches relations were surprisingly variable in both patched and unpatched, but there was no reliable difference between them, just within them. On the TPS front, there was a hint that the patched one was slightly slower, but the within-sample variation was also high, and the p-val for difference was only 0.214 on n of 66. Test case attached. That's not an unrealistic testcase. I'm pretty sure this could be made quite a bit more pronounced by not using a uniform distribution in the pgbench runs. And selecting a test that's more vulnerable to the change (e.g. using a wider distribution for the read-only statements than the modifying ones) would make the CPU overhead of the additional heap_hot_search_buffer() calls heavier. Sorry, I don't understand this description. Why would queries selecting data that is not changing have any extra overhead?
Is the idea that the hot part of the table for updates would move around over time, but the hot part for selects would be even throughout? I'm not sure how to put that to the test. Cheers, Jeff
Re: [HACKERS] Sequence Access Method WIP
On 03/15/2015 09:07 PM, Petr Jelinek wrote: Slightly updated version of the patch. Mainly rebased against current master (there were several conflicts) and fixed some typos; no real functional change. I also attached an initial version of the API sgml doc. Thanks! With the patch, the pg_class.relam column references the pg_seqam table for sequences, but pg_am for indexes. I believe it's the first instance where we reuse a foreign key column like that. It's not a real foreign key, of course - that wouldn't work with a real foreign key at all - but it's a bit strange. That makes me a bit uncomfortable. How do others feel about that? - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP Patch for GROUPING SETS phase 1
Oh, and I built it on top of f92fc4c95ddcc25978354a8248d3df22269201bc. On 20-04-2015 10:36, Svenne Krap wrote: The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: tested, passed

Hi, I have (finally) found time to review this. The syntax is as per spec as far as I can see, and the queries I have tested have all produced the correct output. The documentation looks good and is clear. I think it is spec compliant, but I am not used enough to the spec to be sure. Also I have not understood the function of the set quantifier (DISTINCT, ALL) part in the group by clause (and hence not tested it). Hence I haven't marked the spec compliant part. The installcheck-world fails, but in src/pl/tcl/results/pltcl_queries.out (a sorting problem when looking at the diff), which should be unrelated to GSP. I don't know enough of the check to know if it has already run the GSP tests. I have also been running a few tests on some real data. This is run on my laptop with 32 GB of memory and a fast SSD. The first dataset is a join between a data table of 472 MB (4.3 Mrows) and a tiny multi-column lookup table. I am returning a count(*). Here the data is hierarchical, so CUBE does not make sense. GROUPING SETS and ROLLUP both work fine, and if work_mem is large enough it slightly beats the handwritten union-all equivalent (runtimes of 7.6 vs. 7.7 seconds). If work_mem is the default 4MB, the union-all equivalent (UAE) beats the GS query almost 2:1 due to disk spill (14.3 (GS) vs. 8.2 (UAE) seconds). The other query is on the same data table as before, but with three columns (two calculated and one natural) for a cube. I am returning a count(*). The first column is extract year from a date column; the second column is a value divided by something and truncated (i.e. making buckets); the third column is a literal integer column. Here the GS version is slightly slower than the UAE version (17.5 vs. 14.2). Nothing obvious about why in the explain (analyze, buffers, costs, timing). I have the explains, but as the dataset is semi-private and I don't have any easy way to edit out names in it, I will send it on request (non-disclosure from the recipient is of course a must) and not post it on the list. I think the feature is ready to be committed, but am unsure whether I am qualified to gauge that :) /Svenne The new status of this patch is: Ready for Committer -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 04/17/2015 11:36 PM, Simon Riggs wrote: On 17 April 2015 at 19:18, Heikki Linnakangas hlinn...@iki.fi wrote: To be honest, I'm not entirely sure what we're arguing over. When arguing over something you consider small, it is customary to allow the author precedence. We can't do things our own way all the time. Sure, I'm not going to throw a tantrum if Andres commits this as it is. I said that IMO the difference in WAL size is so small that we should just use 4-byte OIDs for the replication identifiers, instead of trying to make do with 2 bytes. Not because I find it too likely that you'll run out of IDs (although it could happen), but more to make replication IDs more like all other system objects we have. Andres did some pgbench benchmarking to show that the difference in WAL size is about 10%. The WAL records generated by pgbench happen to have just the right sizes so that the 2-3 extra bytes bump them over to the next alignment boundary. That's why there is such a big difference - on average it'll be less. I think that's acceptable; Andres seems to think otherwise. But if the WAL size really is so precious, we could remove the two padding bytes from XLogRecord, instead of dedicating them for the replication ids. That would be an even better use for them. The argument to move to 4 bytes is a poor one. If it was reasonable in terms of code or cosmetic value then all values used in the backend would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do that. That's a straw man argument. I'm not saying we should never use 2 byte values anywhere. OID is usually used as the primary key in system tables. There are exceptions, but that is nevertheless the norm. I'm saying that saving in WAL size is not worth making an exception here, and we should go with the simplest option of using OIDs. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 04/17/2015 11:54 AM, Andres Freund wrote: I've attached a rebased patch that adds a decision about origin logging to the relevant XLogInsert() call sites for external 2-byte identifiers and removes the pad-reusing version, in the interest of moving forward. Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming: I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of "replication identifier" -> "replication origin" on the patch. Or even "transaction origin". - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP Patch for GROUPING SETS phase 1
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: tested, passed

Hi, I have (finally) found time to review this. The syntax is as per spec as far as I can see, and the queries I have tested have all produced the correct output. The documentation looks good and is clear. I think it is spec compliant, but I am not used enough to the spec to be sure. Also I have not understood the function of the set quantifier (DISTINCT, ALL) part in the group by clause (and hence not tested it). Hence I haven't marked the spec compliant part. The installcheck-world fails, but in src/pl/tcl/results/pltcl_queries.out (a sorting problem when looking at the diff), which should be unrelated to GSP. I don't know enough of the check to know if it has already run the GSP tests. I have also been running a few tests on some real data. This is run on my laptop with 32 GB of memory and a fast SSD. The first dataset is a join between a data table of 472 MB (4.3 Mrows) and a tiny multi-column lookup table. I am returning a count(*). Here the data is hierarchical, so CUBE does not make sense. GROUPING SETS and ROLLUP both work fine, and if work_mem is large enough it slightly beats the handwritten union-all equivalent (runtimes of 7.6 vs. 7.7 seconds). If work_mem is the default 4MB, the union-all equivalent (UAE) beats the GS query almost 2:1 due to disk spill (14.3 (GS) vs. 8.2 (UAE) seconds). The other query is on the same data table as before, but with three columns (two calculated and one natural) for a cube. I am returning a count(*). The first column is extract year from a date column; the second column is a value divided by something and truncated (i.e. making buckets); the third column is a literal integer column. Here the GS version is slightly slower than the UAE version (17.5 vs. 14.2). Nothing obvious about why in the explain (analyze, buffers, costs, timing). I have the explains, but as the dataset is semi-private and I don't have any easy way to edit out names in it, I will send it on request (non-disclosure from the recipient is of course a must) and not post it on the list. I think the feature is ready to be committed, but am unsure whether I am qualified to gauge that :) /Svenne The new status of this patch is: Ready for Committer -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 2015-04-20 01:04:18 -0700, Jeff Janes wrote: Was this reproducible? Yes, at least with an old version of the patch. I don't think you could see a difference using exactly that with the newer versions which have the 5 page limit. After all it'll pretty much never reach it. That's not a unrealistic testcase. I'm pretty sure this could be made quite a bit more pronounced by not using a uniform distribution in the pgbench runs. And selecting a test that's more vulnerable to the change (e.g. using a wider distribution for the read only statements than the modifying ones) would make the the CPU overhead of the additional heap_hot_search_buffer() overhead heavier. Sorry I don't understand this description. Why would queries selecting data that is not changing have any extra overhead? The idea, I think, was that by having a uniform (or just wider) distribution of the reads they'd be more likely to land on values that have been updated at some point, but not been pruned since (because at that point the patch IIRC didn't prune during reads at all). I.e. ones wer Is the idea that the hot part of the table for updates would move around over time, but the hot part for selects would be even throughout? Pretty much. I'm not sure how to put that to the test. That pretty much was what I'd tried to model, yea. I guess it'd be possible to model this by inserting NOW()/updating values NOW() - 5 and selecting values up to NOW() - 60. That'd roughly model some realistic insert/update/select patterns I've seen. To possibly see any difference with the new patch this would have to be done in a way that regularly a couple of pages would be touched, with not that many selected tuples on each. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] inherit support for foreign tables
On 2015/04/16 16:05, Etsuro Fujita wrote: On 2015/03/23 2:57, Tom Lane wrote: Etsuro Fujita fujita.ets...@lab.ntt.co.jp writes: [ fdw-inh-8.patch ] I've committed this with some substantial rearrangements, notably: * As I mentioned earlier, I got rid of a few unnecessary restrictions on foreign tables so as to avoid introducing warts into inheritance behavior. In particular, we now allow NOT VALID CHECK constraints (and hence ALTER ... VALIDATE CONSTRAINT), ALTER SET STORAGE, and ALTER SET WITH/WITHOUT OIDS. These are probably no-ops anyway for foreign tables, though conceivably an FDW might choose to implement some behavior for STORAGE or OIDs. I agree with you on this point. However, ISTM there is a bug in handling OIDs on foreign tables; while we now allow for ALTER SET WITH/WITHOUT OIDS, we still don't allow the default_with_oids parameter for foreign tables. I think that since CREATE FOREIGN TABLE should be consistent with ALTER FOREIGN TABLE, we should also allow the parameter for foreign tables. Attached is a patch for that. I also updated docs. Attached is an updated version of the patch. Best regards, Etsuro Fujita

*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 6745,6752 **** dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </term>
       <listitem>
        <para>
!        This controls whether <command>CREATE TABLE</command> and
!        <command>CREATE TABLE AS</command> include an OID column in
         newly-created tables, if neither <literal>WITH OIDS</literal>
         nor <literal>WITHOUT OIDS</literal> is specified. It also
         determines whether OIDs will be included in tables created by
--- 6745,6753 ----
       </term>
       <listitem>
        <para>
!        This controls whether <command>CREATE TABLE</command>,
!        <command>CREATE TABLE AS</command> and
!        <command>CREATE FOREIGN TABLE</command> include an OID column in
         newly-created tables, if neither <literal>WITH OIDS</literal>
         nor <literal>WITHOUT OIDS</literal> is specified. It also
         determines whether OIDs will be included in tables created by

*** a/doc/src/sgml/ref/create_foreign_table.sgml
--- b/doc/src/sgml/ref/create_foreign_table.sgml
***************
*** 293,298 **** CHECK ( <replaceable class="PARAMETER">expression</replaceable> )
--- 293,304 ----
     responsibility to ensure that the constraint definition matches
     reality.
    </para>
+ 
+   <para>
+    To add OIDs to the table created by <command>CREATE FOREIGN TABLE</command>,
+    enable the <xref linkend="guc-default-with-oids"> configuration
+    variable.
+   </para>
  </refsect1>
 
  <refsect1 id="SQL-CREATEFOREIGNTABLE-examples">

*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***************
*** 580,586 **** DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 
 	descriptor = BuildDescForRelation(schema);
 	localHasOids = interpretOidsOption(stmt->options,
! 									   (relkind == RELKIND_RELATION));
 	descriptor->tdhasoid = (localHasOids || parentOidCount > 0);
 
 	/*
--- 580,587 ----
 
 	descriptor = BuildDescForRelation(schema);
 	localHasOids = interpretOidsOption(stmt->options,
! 									   (relkind == RELKIND_RELATION ||
! 										relkind == RELKIND_FOREIGN_TABLE));
 	descriptor->tdhasoid = (localHasOids || parentOidCount > 0);
 
 	/*

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sequence Access Method WIP
On 2015-04-20 12:49:39 +0300, Heikki Linnakangas wrote: With the patch, the pg_class.relam column references the pg_seqam table for sequences, but pg_am for indexes. I believe it's the first instance where we reuse a foreign key column like that. It's not a real foreign key, of course - that wouldn't work with a real foreign key at all - but it's a bit strange. That makes me a bit uncomfortable. How do others feel about that? Hm. I'd modeled it more as an extension of the 'relkind' column mentally. I.e. it further specifies how exactly the relation is behaving. That the field was added to pg_class and not pg_index, combined with it not having "index" in its name, makes me think that it actually was intended to be used the way it's done in the patch. It's not the first column that behaves that way btw; at least pg_depend comes to mind. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sequence Access Method WIP
On 20/04/15 12:05, Andres Freund wrote: On 2015-04-20 12:49:39 +0300, Heikki Linnakangas wrote: With the patch, the pg_class.relam column references the pg_seqam table for sequences, but pg_am for indexes. I believe it's the first instance where we reuse a foreign key column like that. It's not a real foreign key, of course - that wouldn't work with a real foreign key at all - but it's a bit strange. That makes me a bit uncomfortable. How do others feel about that? Hm. I'd modeled it more as an extension of the 'relkind' column mentally. I.e. it further specifies how exactly the relation is behaving. That the field was added to pg_class and not pg_index, combined with it not having "index" in its name, makes me think that it actually was intended to be used the way it's done in the patch. That's how I think about it too. It's also short for "access method"; nothing really suggests to me that it should be index-specific by design. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Supporting src/test/modules in MSVC builds
Michael Paquier wrote: And currawong is satisfied with this patch and the new buildfarm code, test modules being run as testmodules-install-check-C: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=currawong&dt=2015-04-18%2014%3A51%3A14 So this closes the loop for this thread. Yay! Thanks, Michael and Andrew. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 20, 2015 at 9:56 AM, Robert Haas robertmh...@gmail.com wrote: On Wed, Apr 15, 2015 at 5:00 PM, Martijn van Oosterhout klep...@svana.org wrote: I've been following this thread from the side with interest and got twigged by the point about loss of information. If you'd like better information about relative ages, you can achieve this by raising the cap on the usage count and dividing (or right-shifting) each sweep. Yeah, I thought about that, too. It might be worth experimenting with. This would allow you to remember much more about the relative worth of often-used pages. With a cap of 32 you'd have the same effect as now where after 5 sweeps the buffer is evicted. Mathematically the count would converge to the number of times the block is used per sweep. Hmm, interesting point. It's possible that we'd still have problems with everything maxing out at 32 on some workloads, but at least it'd be a little harder to max out at 32 than at 5. Do we have any reproducible test cases to evaluate these assumptions? I haven't looked at this stuff for a while, but my main issue with the clock sweep was finding sweep-heavy cases that did not also have trouble with the buffer mapping locks (although the facts on the ground may have changed since then). merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 15, 2015 at 5:00 PM, Martijn van Oosterhout klep...@svana.org wrote: I've been following this thread from the side with interest and got twigged by the point about loss of information. If you'd like better information about relative ages, you can achieve this by raising the cap on the usage count and dividing (or right-shifting) each sweep. Yeah, I thought about that, too. It might be worth experimenting with. This would allow you to remember much more about the relative worth of often-used pages. With a cap of 32 you'd have the same effect as now where after 5 sweeps the buffer is evicted. Mathematically the count would converge to the number of times the block is used per sweep. Hmm, interesting point. It's possible that we'd still have problems with everything maxing out at 32 on some workloads, but at least it'd be a little harder to max out at 32 than at 5. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
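For illustration, a toy, self-contained C sketch of the raise-the-cap-and-right-shift idea (names invented; this is not PostgreSQL's bufmgr code, and it ignores pinning, locking, and the freelist):

/*
 * Sketch: higher usage-count cap, halved on each sweep pass instead of
 * decremented by one.
 */
#include <stdint.h>
#include <stdio.h>

#define NBUFFERS  8
#define USAGE_CAP 32                    /* instead of the current cap of 5 */

static uint8_t usage_count[NBUFFERS];
static int     clock_hand;

/* Called on every buffer access ("pin"). */
static void
touch_buffer(int buf)
{
    if (usage_count[buf] < USAGE_CAP)
        usage_count[buf]++;
}

/* Sweep until a buffer with usage count 0 is found, halving as we go. */
static int
evict_one(void)
{
    for (;;)
    {
        int buf = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (usage_count[buf] == 0)
            return buf;                 /* victim found */
        usage_count[buf] >>= 1;         /* decay toward uses-per-sweep */
    }
}

int
main(void)
{
    /* buffer 0 is hot, the rest are cold */
    for (int i = 0; i < 100; i++)
        touch_buffer(0);
    touch_buffer(3);

    printf("victim: %d\n", evict_one());        /* a cold buffer */
    printf("hot buffer count now: %d\n", usage_count[0]);
    return 0;
}

Because counts are halved on each pass, a buffer's count settles near its number of accesses per sweep, as Martijn notes, rather than sticking at the maximum; a once-hot page that goes cold decays away in a handful of sweeps.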
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 20 April 2015 at 18:33, Bruce Momjian br...@momjian.us wrote: Also, I am not sure we should be designing features at this stage in our release process. I see this more as a process of gaining approval. I don't think patches at the back of the queue should get the "it's too late" treatment just because they are at the back of the queue. They are logically all at the same stage. There should be a way to allow people that show patience and respect for the process to get the same timeshare as those that push their patches daily. Anyway, in this case, the patch conflicts with other things going in now, so changing things isn't really sensible for this release, in terms of my time. We clearly need to do a better job of piggybacking actions on a dirty block. The discussion has been about whether to do early cleanup, but we should also consider post-access cleanup if we set hint bits or dirtied the block in other ways. Since we have many votes in favour of change in this area, I'll post a new version and look for an early review/commit for next release. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Thu, Apr 16, 2015 at 03:41:54PM +0100, Simon Riggs wrote: That is how we arrive at the idea of a cleanup limit, further enhanced by a limit that applies only to dirtying clean blocks, which we have 4? recent votes in favour of. I would personally be in favour of a parameter to control the limit, since whatever we chose is right/wrong depending upon circumstances. I am however comfortable with not having a parameter if people think it is hard to tune that, which I agree it would be, hence no parameter in the patch. I think the limit has to be in terms of a percentage of the table size. For example, if we do one SELECT on a table with all non-dirty pages, it would be good to know that 5% of the pages were pruned --- that tells me that another 19 SELECTs will totally prune the table, assuming no future writes. If there are future writes, they would dirty the pages and cause even more pruning, but the 5% gives me the maximum pruning number of SELECTs. If there aren't another 19 SELECTs, do I care if the table is pruned or not? Probably not. Measuring in page count doesn't do that, and a large table could receive millions of selects before being fully cleaned. Also, I am not sure we should be designing features at this stage in our release process. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] optimizing vacuum truncation scans
On 4/20/15 1:50 AM, Jeff Janes wrote: Shouldn't completely empty pages be set as all-visible in the VM? If so, can't we just find the largest not-all-visible page and move forward from there, instead of moving backwards like we currently do? If the entire table is all-visible, we would be starting from the beginning, even though the beginning of the table still has read-only tuples present. Except we'd use it in conjunction with nonempty_pages. IIRC, that's set to the last page that vacuum saw data on. If any page after that got written to after vacuum visited it, the VM bit would be cleared. So after we acquire the exclusive lock, AFAICT it's safe to just scan the VM starting with nonempty_pages. For that matter, why do we scan backwards anyway? The comments don't explain it, and we have nonempty_pages as a starting point, so why don't we just scan forward? I suspect that eons ago we didn't have that and just blindly reverse-scanned until we finally hit a non-empty buffer... nonempty_pages is not concurrency-safe, as the pages could become used after vacuum passed them over but before the access exclusive lock was taken for the truncation scan. But maybe the combination of the two? If a page is above nonempty_pages, then anyone who wrote into it after vacuum passed it must have cleared the VM bit. And currently I think no one but vacuum ever sets the VM bit back on, so once cleared it would stay cleared. Right. In any event nonempty_pages could be used to set the guess as to how many pages (if any) might be worth prefetching, as that is not needed for correctness. Yeah, but I think we'd do a LOT better with the VM idea, because we could immediately truncate without scanning anything. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
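A sketch, in C, of what the forward VM scan described above could look like. vm_all_visible() and page_is_empty() are hypothetical stand-ins for the real visibility-map and page-inspection routines; the point is only the scan direction and the use of nonempty_pages:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

extern bool vm_all_visible(BlockNumber blkno);  /* assumed helper */
extern bool page_is_empty(BlockNumber blkno);   /* assumed helper */

/*
 * Returns the new relation length in blocks: everything at or beyond the
 * returned block is empty and can be truncated away.  Assumes the caller
 * already holds AccessExclusiveLock, that pages at or after nonempty_pages
 * were empty when vacuum passed them, and that only vacuum ever sets VM
 * bits while any write clears them.
 */
static BlockNumber
find_truncation_point(BlockNumber nonempty_pages, BlockNumber nblocks)
{
    BlockNumber keep_through = nonempty_pages;  /* blocks we must keep */

    for (BlockNumber blkno = nonempty_pages; blkno < nblocks; blkno++)
    {
        if (vm_all_visible(blkno))
            continue;           /* untouched since vacuum saw it empty */

        /* VM bit cleared: someone may have written here, so check it */
        if (!page_is_empty(blkno))
            keep_through = blkno + 1;   /* data found: keep through here */
    }
    return keep_through;
}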
Re: [HACKERS] Parallel Seq Scan
On Tue, Apr 7, 2015 at 11:58 PM, Amit Kapila amit.kapil...@gmail.com wrote: One disadvantage of retaining parallel paths could be that it can increase the number of combinations the planner might need to evaluate during planning (in particular during join path evaluation) unless we do some special handling to avoid evaluation of such combinations. Yes, that's true. But the overhead might not be very much. In the common case, many baserels and joinrels will have no parallel paths because the non-parallel path is known to be better anyway. Also, if parallelism does seem to be winning, we're probably planning a query that involves accessing a fair amount of data, so a little extra planner overhead may not be so bad. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?
On Thu, Apr 9, 2015 at 5:33 PM, Bruce Momjian br...@momjian.us wrote: On Thu, Apr 9, 2015 at 12:32:23PM -0300, Alvaro Herrera wrote: Bruce Momjian wrote: Should this be listed in the release notes as a backward-incompatibility? Isn't this a backpatchable bug fix? Uh, I don't think so. I think users are used to the existing behavior and changing it on them will cause more harm than good. Also, we have had zero field reports about this problem. I agree. This should not be back-patched, but fixing it in master seems fine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: Spinlock Documentation
On Sat, Apr 11, 2015 at 12:03 PM, Artem Luzyanin lisyono...@yahoo.ca wrote: Thank you again for your feedback. I have improved the patch with your suggestions. Please let me know what you think and if I can do anything else. Current CommitFest link for this patch is: https://commitfest.postgresql.org/5/208/ Some review comments: - The first hunk in s_lock.h touches only whitespace. Changing the space to a tab on the "Usually" line would make sense for consistency, but adding a trailing space to the "override them" line does not. - As Tom basically said before, I think the "File layout" block comment will just get out of date and be a maintenance annoyance to future updaters of this file. It's not really that hard to see the structure of the file just by going through it, so I don't think this is worthwhile. - Similarly, adding all of the "Currently implemented" lines looks useless to me. Why can't somebody see that from just reading the code itself? Overall, I'm not seeing much point to this patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 15, 2015 at 5:06 PM, Greg Stark st...@mit.edu wrote: This is my point though (you're right that "flushed" isn't always the same as eviction but that's not the important point here). Right now we only demote when we consider buffers for eviction. But we promote when we pin buffers. Those two things aren't necessarily happening at the same rate and in fact are often orders of magnitude different. I am absolutely, positively, violently in 100% agreement with this. I have made the same point before, but it sure is nice to hear someone else thinking about it the same way. So it makes sense that we end up with a lot of buffers promoted all the way to the most recently used ntile and then have to do n passes around the clock and have no good information about which buffer to evict. Right. What I'm saying is that we should demote a buffer every time we promote a buffer. So every time we pin a buffer we should advance the clock a corresponding amount. (I know I'm being intentionally vague about what the corresponding amount is.) The important thing is that the two should be tied together. Yes, absolutely. If you tilt your head the right way, my proposal of limiting the number of promotions per clock sweep has the effect of tying buffer demotion and buffer promotion together much more tightly than is the case right now. You are limited to 2 promotions per demotion; and practically speaking not all buffers eligible to be promoted will actually get accessed, so the number of promotions per demotion will in reality be somewhere between 0 and 2. Ideally it would be exactly 1, but 1 +/- 1 is still a tighter limit than we have at present. Which is not to say there isn't some other idea that is better still. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 20, 2015 at 11:00 AM, Merlin Moncure mmonc...@gmail.com wrote: Hmm, interesting point. It's possible that we'd still have problems with everything maxing out at 32 on some workloads, but at least it'd be a little harder to max out at 32 than at 5. Do we have any reproducible test cases to evaluate these assumptions? The particular point that you are responding to here is easy to reproduce. Just create any workload that fits in shared_buffers - a small pgbench database, for example - and access stuff. No usage counts will go down, but every access will drive usage counts up. Eventually everything will hit any maximum you care to install, and actually I don't think it takes very long. You can use pg_buffercache to see the results. I haven't looked at this stuff for a while, but my main issue with the clock sweep was finding sweep heavy cases that did not also have trouble with the buffer mapping locks (although the facts on the ground my have changed since then). We increased the number of buffer mapping locks to 128, so that problem should be considerably ameliorated now. But it's not totally gone, as demonstrated by Andres's experiments with my chash patch. There was also a patch that eliminated BufFreelistLock in favor of a spinlock held for much shorter time periods, and then Andres took that one step further and made it use atomics. That used to be a *terrible* bottleneck on eviction-heavy workloads and is now much improved. More work may remain to be done, of course. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sequence Access Method WIP
Heikki Linnakangas wrote: * The transformations of the arrays in the get_state() and set_state() functions are a bit complicated. The seqam_get_state() function returns two C arrays, and pg_sequence_get_state() turns them into a text[] array. Why not construct the text[] array directly in the AM? I guess it's a bit more convenient to the AM if pg_sequence_get_state() does that, but if that's an issue, you could create a generic helper function to turn two C string arrays into text[], and call that from the AM. Um, see strlist_to_textarray() in objectaddress.c if you do that. Maybe we need some generic place to store that kind of helper function. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
On Wed, Apr 8, 2015 at 3:34 AM, David Rowley dgrowle...@gmail.com wrote: On 8 April 2015 at 14:24, Robert Haas robertmh...@gmail.com wrote: I think one of the philosophical questions that has to be answered here is "what does it mean to talk about the cost of a parallel plan?". For a non-parallel plan, the cost of the plan means both the amount of effort we will spend executing the plan and also the amount of time we think the plan will take to complete, but those two things are different for parallel plans. I'm inclined to think it's right to view the cost of a parallel plan as a proxy for execution time, because the fundamental principle of the planner is that we pick the lowest-cost plan. But there also clearly needs to be some way to prevent the selection of a plan which runs slightly faster at the cost of using vastly more resources. I'd agree with that as far as CPU costs go, or maybe I'd just disagree with the alternative: if we costed in each individual worker's work * number of workers, then we'd never choose a parallel plan, as by the time we costed in tuple communication costs between the processes a parallel plan would always cost more than the serial equivalent. I/O costs are different; I'd imagine these shouldn't be divided by the estimated number of workers. It's hard to say. If the I/O is from the OS buffer cache, then there's no reason why several workers can't run in parallel. And even if it's from the actual storage, we don't know what degree of I/O parallelism will be possible. Maybe effective_io_concurrency should play into the costing formula somehow, but it's not very clear to me that it captures the information we care about. In general, I'm not sure how common it is for the execution speed of a sequential scan to be limited by I/O. For example, on a pgbench database, scale factor 300, on a POWERPC machine provided by IBM for performance testing (thanks, IBM!) a cached read of the pgbench_accounts files took 1.122 seconds. After dropping the caches, it took 10.427 seconds. select * from pgbench_accounts where abalance > 3 took 10.244 seconds with a cold cache and 5.029 seconds with a warm cache. So on this particular hardware, on this particular test, parallelism is useless if the cache is cold, but it could be right to use ~4-5 processes for the scan if the cache is warm. However, we have no way of knowing whether the cache will be cold or warm at execution time. This isn't a new problem. As it is, the user has to set seq_page_cost and random_page_cost based on either a cold-cache assumption or a warm-cache assumption, and if they guess wrong, their costing estimates will be off (on this platform, on this test case) by 4-5x. That's pretty bad, and it's totally unclear to me what to do about it. I'm guessing it's unclear to other people, too, or we would likely have done something about it by now. Some ideas for GUCs: max_parallel_degree = The largest number of processes we'll consider using for a single query. min_parallel_speedup = The minimum percentage by which a parallel path must be cheaper (in terms of execution time) than a non-parallel path in order to survive. I'm imagining the default here might be something like 15%. min_parallel_speedup_per_worker = Like the previous one, but per worker. e.g. if this is 5%, which might be a sensible default, then a plan with 4 workers must be at least 20% better to survive, but a plan using only 2 workers only needs to be 10% better.
max_parallel_degree feels awfully like it would have to be set conservatively, similar to how work_mem is today. Like with work_mem, during quiet periods it sure would be nice if it could magically increase. Absolutely. But, similar to work_mem, that's a really hard problem. We can't know at plan time how much work memory, or how many CPUs, will be available at execution time. And even if we did, it need not be constant throughout the whole of query execution. It could be that when execution starts, there's lots of memory available, so we do a quicksort rather than a tape-sort. But midway through the machine comes under intense memory pressure and there's no way for the system to switch strategies. Now, having said that, I absolutely believe that it's correct for the planner to make the initial decisions in this area. Parallelism changes the cost of execution nodes, and it's completely wrong to assume that this couldn't alter planner decisions at higher levels of the plan tree. At the same time, it's pretty clear that it would be a great thing for the executor to be able to adjust the strategy if the planner's assumptions don't pan out, or if conditions have changed. For example, if we choose a seq-scan-sort-and-filter over an index-scan-and-filter thinking that we'll be able to do a quicksort, and then it turns out that we're short on memory, it's too late to switch gears and adopt the index-scan-and-filter plan after all. That's long since
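For illustration only, the kind of cost arithmetic discussed above, as a C sketch (this is not the planner's actual cost model; all names are invented): CPU cost divided among workers, I/O cost left whole, plus a per-tuple charge for shipping rows from workers to the leader.

/* Illustrative parallel-scan costing sketch, not PostgreSQL planner code. */
typedef double Cost;

static Cost
parallel_scan_cost(Cost cpu_cost, Cost io_cost,
                   double ntuples, int nworkers, Cost transfer_per_tuple)
{
    Cost run_cost;

    run_cost = cpu_cost / nworkers;             /* CPU work is shared */
    run_cost += io_cost;                        /* disk bandwidth may not be */
    run_cost += ntuples * transfer_per_tuple;   /* worker-to-leader traffic */
    return run_cost;
}

Under such a model a parallel path only wins when the shared CPU savings exceed the added tuple-transfer cost, which is exactly the trade-off David describes.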
Re: [HACKERS] Parallel Seq Scan
On Wed, Apr 8, 2015 at 3:38 AM, Amit Langote langote_amit...@lab.ntt.co.jp wrote: On 08-04-2015 PM 12:46, Amit Kapila wrote: Going forward, I think we can improve the same if we decide not to shutdown parallel workers till postmaster shutdown once they are started and then just allocate them during executor-start phase. I wonder if it makes sense to invent the notion of a global pool of workers with a configurable number of workers that are created at postmaster start and destroyed at shutdown and requested for use when a query uses parallelizable nodes. Short answer: Yes, but not for the first version of this feature. Longer answer: We can't actually very reasonably have a global pool of workers so long as we retain the restriction that a backend connected to one database cannot subsequently disconnect from it and connect to some other database instead. However, it's certainly a good idea to reuse the same workers for subsequent operations on the same database, especially if they are also by the same user. At the very minimum, it would be good to reuse the same workers for subsequent operations within the same query, instead of destroying the old ones and creating new ones. Notwithstanding the obvious value of all of these ideas, I don't think we should do any of them for the first version of this feature. This is too big a thing to get perfect on the first try. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 4/20/15 11:11 AM, Robert Haas wrote: On Wed, Apr 15, 2015 at 5:06 PM, Greg Stark st...@mit.edu wrote: This is my point though (you're right that flushed isn't always the same as eviction but that's not the important point here). Right now we only demote when we consider buffers for eviction. But we promote when we pin buffers. Those two things aren't necessarily happening at the same rate and in fact are often orders of magnitude different. I am absolutely, positively, violently in 100% agreement with this. I have made the same point before, but it sure is nice to hear someone else thinking about it the same way. +1 What I'm saying is that we should demote a buffer every time we promote a buffer. So every time we pin a buffer we should advance the clock a corresponding amount. I know I'm being intentionally vague about what the corresponding amount is.) The important thing is that the two should be tied together. Yes, absolutely. If you tilt your head the right way, my proposal of limiting the number of promotions per clock sweep has the effect of tying buffer demotion and buffer promotion together much more tightly than is the case right now. You are limited to 2 promotions per demotion; and practically speaking not all buffers eligible to be promoted will actually get accessed, so the number of promotions per demotion will in reality be somewhere between 0 and 2. Ideally it would be exactly 1, but 1 +/- 1 is still a tighter limit than we have at present. Which is not to say there isn't some other idea that is better still. I think that would help, but it still leaves user backends trying to advance the clock, which is quite painful. Has anyone tested running the clock in the background? We need a wiki page with all the ideas that have been tested around buffer management... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
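On "running the clock in the background", one way that could look, as a single-threaded toy in C: a background task keeps a small ring of zero-count victims stocked so a backend evicts in O(1). All names are invented, and a real implementation would need locking and would have to re-check pins and usage counts when a victim is popped:

#include <stdint.h>

#define NBUFFERS  1024
#define RING_SIZE 64

static uint8_t usage_count[NBUFFERS];
static int     victim_ring[RING_SIZE];  /* FIFO of ready victims */
static int     ring_head, ring_tail;    /* produce at head, consume at tail */
static int     clock_hand;

/* Background task: advance the clock and keep the victim ring stocked. */
static void
background_sweep(void)
{
    while ((ring_head + 1) % RING_SIZE != ring_tail)    /* until ring full */
    {
        int buf = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (usage_count[buf] > 0)
            usage_count[buf]--;                 /* demote and keep sweeping */
        else
        {
            victim_ring[ring_head] = buf;       /* stash a ready victim */
            ring_head = (ring_head + 1) % RING_SIZE;
        }
    }
}

/* Foreground path: a backend pops a pre-found victim in O(1). */
static int
get_victim(void)
{
    int buf;

    if (ring_tail == ring_head)
        background_sweep();     /* ring ran dry: fall back to sweeping */

    buf = victim_ring[ring_tail];
    ring_tail = (ring_tail + 1) % RING_SIZE;
    return buf;
}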
Re: [HACKERS] Freeze avoidance of very large table.
On 4/20/15 1:48 PM, Bruce Momjian wrote: On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote: Attached WIP patch adds Frozen Map, which enables us to avoid whole-table vacuuming even when a full scan is required: preventing XID wraparound failures. Frozen Map is a bitmap with one bit per heap page, quite similar to the Visibility Map. A set bit means that all tuples on the heap page are completely frozen, therefore we don't need to vacuum freeze that page. A bit is set when vacuum (or autovacuum) figures out that all tuples on the corresponding heap page are completely frozen, and a bit is cleared when INSERT and UPDATE (only a new heap page) are executed. So, this patch avoids reading all-frozen pages if they have not been modified since the last VACUUM FREEZE? Since it is already frozen, the running VACUUM FREEZE will not modify the page or generate WAL, so is it really worth maintaining a new per-page bitmap just to avoid the sequential scan of tables every 200M transactions? I would like to see us reduce the need for VACUUM FREEZE, rather than go in this direction. How would you propose we do that? I also think there's better ways we could handle *all* our cleanup work. Tuples have a definite lifespan, and there's potentially a lot of efficiency to be gained if we could track tuples through their stages of life... but I don't see any easy ways to do that. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
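To make the map mechanics concrete, a toy C sketch of a one-bit-per-heap-page map as described above (function names and layout are invented for illustration, not taken from the posted patch):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* one bit per heap page, so the map is tiny relative to the heap */
#define FM_BYTES(nblocks)   (((nblocks) + 7) / 8)

static void
fm_set(uint8_t *map, BlockNumber blkno)     /* vacuum: page fully frozen */
{
    map[blkno / 8] |= (uint8_t) (1 << (blkno % 8));
}

static void
fm_clear(uint8_t *map, BlockNumber blkno)   /* INSERT/UPDATE touched page */
{
    map[blkno / 8] &= (uint8_t) ~(1 << (blkno % 8));
}

static bool
fm_test(const uint8_t *map, BlockNumber blkno)
{
    return (map[blkno / 8] >> (blkno % 8)) & 1;
}

/*
 * An anti-wraparound vacuum could then skip any page whose bit is set,
 * instead of reading the whole heap:
 *
 *     for (blkno = 0; blkno < nblocks; blkno++)
 *         if (!fm_test(fm, blkno))
 *             freeze_page(blkno);       // hypothetical
 */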
Re: [HACKERS] Freeze avoidance of very large table.
On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote: Attached WIP patch adds Frozen Map, which enables us to avoid whole-table vacuuming even when a full scan is required: preventing XID wraparound failures. Frozen Map is a bitmap with one bit per heap page, quite similar to the Visibility Map. A set bit means that all tuples on the heap page are completely frozen, therefore we don't need to vacuum freeze that page. A bit is set when vacuum (or autovacuum) figures out that all tuples on the corresponding heap page are completely frozen, and a bit is cleared when INSERT and UPDATE (only a new heap page) are executed. So, this patch avoids reading all-frozen pages if they have not been modified since the last VACUUM FREEZE? Since it is already frozen, the running VACUUM FREEZE will not modify the page or generate WAL, so is it really worth maintaining a new per-page bitmap just to avoid the sequential scan of tables every 200M transactions? I would like to see us reduce the need for VACUUM FREEZE, rather than go in this direction. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow SQL/plpgsql functions to accept record
On Sun, Apr 19, 2015 at 3:02 PM, Jim Nasby jim.na...@bluetreble.com wrote: Is there a fundamental reason SQL/plpgsql functions won't accept record as an input type? If not, can someone point me at a patch that might show how much work would be involved in adding support? My particular use case is a generic function that will count how many fields in a record are NULL. I can do it in pure SQL (below), but was hoping to wrap the entire thing in a function. Right now, I have to add a call to row_to_json() to the function call. SELECT count(*) FROM json_each_text( row_to_json($1) ) a WHERE value IS NULL See also: SELECT (src.v).* FROM ( VALUES (ROW(1,2,3)) ) src (v); ERROR: record type has not been registered While it may not be necessary to solve both problems, I suspect they have the same underlying root cause - specifically the separation of concerns between the planner and the executor. ISTM that the planner needs to be able to create arbitrarily named composite types and leave them registered in the session somewhere for the executor to find. Session because: PREPARE prep_rec AS SELECT record_input_func(v) FROM ( VALUES (ROW($1::integer,$2::boolean,$3::text)) ) src (v); EXECUTE prep_rec USING (1, true, 'hi!'); If it requires additional smarts in the executor to make this all work, I suspect the cost-benefit equations end up supporting the somewhat more verbose but workable status quo. I'm not sure how { row_to_json(record) } works, but SQL (including pl/pgsql) needs to have some source of definition for what the record type should be in reality - and that source currently is the catalogs, whose rows are locked by the planner and injected, I think, into a session cache. The source query in pl/pgsql defines the type for fully embedded use of the record placeholder, while the caller's function alias provides that information for RETURNS record. The calling query needs to provide the same information for CREATE FUNCTION func( arg1 record ), since the body of the pl/pgsql function needs to instantiate arg1 with a known type as soon as the function is entered. It is theoretically possible to impute the needed anonymous type from the query definition - the problem is how and where to register that information for execution. At least for pl/pgsql I could see possibly doing something like func( arg1 packed_record_bytes ) and having pl/pgsql understand how to unpack those bytes into an anonymous but structured record (like it would with SELECT ... INTO record_var). I would not expect pl/SQL to allow anything of the sort, as it doesn't seem compatible with the idea of inline-ability. Maybe the C code for row_to_json (or libpq in general) can provide inspiration (particularly for the pack/unpack bytes) but as I do not know C I'm going to have to leave that to others. David J.
Re: [HACKERS] Freeze avoidance of very large table.
On Mon, Apr 20, 2015 at 01:59:17PM -0500, Jim Nasby wrote: On 4/20/15 1:48 PM, Bruce Momjian wrote: On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote: Attached WIP patch adds a Frozen Map, which enables us to avoid whole-table vacuuming even when a full scan would otherwise be required to prevent XID wraparound failures. The Frozen Map is a bitmap with one bit per heap page, quite similar to the Visibility Map. A set bit means that all tuples on the heap page are completely frozen, so we don't need to vacuum-freeze that page. A bit is set when vacuum (or autovacuum) determines that all tuples on the corresponding heap page are completely frozen, and a bit is cleared when an INSERT or UPDATE (on the new heap page only) modifies it.

So, this patch avoids reading all-frozen pages if they have not been modified since the last VACUUM FREEZE? Since such a page is already frozen, a running VACUUM FREEZE will not modify it or generate WAL, so is it really worth maintaining a new per-page bitmap just to avoid a sequential scan of the tables every 200M transactions? I would like to see us reduce the need for VACUUM FREEZE rather than go in this direction.

How would you propose we do that? I also think there are better ways we could handle *all* our cleanup work. Tuples have a definite lifespan, and there's potentially a lot of efficiency to be gained if we could track tuples through their stages of life... but I don't see any easy ways to do that.

See the TODO list: https://wiki.postgresql.org/wiki/Todo

o Avoid the requirement of freezing pages that are infrequently modified

-- Bruce Momjian  br...@momjian.us  http://momjian.us  EnterpriseDB  http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] Turning off HOT/Cleanup sometimes
Bruce Momjian wrote: I think the limit has to be in terms of a percentage of the table size. For example, if we do one SELECT on a table with all non-dirty pages, it would be good to know that 5% of the pages were pruned --- that tells me that another 19 SELECTs will totally prune the table, assuming no future writes.

This seems simple to implement: keep two counters, where the second one counts the pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE, reset the first counter so that a further 5 pages will get HOT pruned. 5% seems a bit high, though. (In Simon's design, SOME_MAX_VALUE is essentially +infinity.)

-- Álvaro Herrera  http://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
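P.S. A minimal sketch of that two-counter scheme, in case it helps; PRUNE_BATCH, the counter names, and may_prune_page() are placeholders invented here, not from any posted patch:

    #define PRUNE_BATCH     5       /* pages we may prune per batch */
    #define SOME_MAX_VALUE  1000    /* skipped pages before re-arming */

    static int  pages_pruned;       /* first counter: pages HOT pruned */
    static int  pages_skipped;      /* second counter: cleanup skipped */

    /* Called once per page a scan visits; true means "prune this one". */
    static bool
    may_prune_page(void)
    {
        if (pages_pruned < PRUNE_BATCH)
        {
            pages_pruned++;
            return true;
        }
        if (++pages_skipped >= SOME_MAX_VALUE)
        {
            /* Reset the first counter so a further batch gets pruned. */
            pages_pruned = 0;
            pages_skipped = 0;
        }
        return false;
    }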
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote: Bruce Momjian wrote: I think the limit has to be in terms of a percentage of the table size. For example, if we do one SELECT on a table with all non-dirty pages, it would be good to know that 5% of the pages were pruned --- that tells me that another 19 SELECTs will totally prune the table, assuming no future writes.

This seems simple to implement: keep two counters, where the second one counts the pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE, reset the first counter so that a further 5 pages will get HOT pruned. 5% seems a bit high, though. (In Simon's design, SOME_MAX_VALUE is essentially +infinity.)

Oh, I pulled 5% out of the air. Thinking of a SELECT-only workload, which would be our worst case, I was considering how many SELECTs running through HOT update chains it would take to be slower than generating the WAL to prune the page. I see the percentage as something we could reasonably balance, while a fixed page count couldn't be analyzed in that way.

-- Bruce Momjian  br...@momjian.us  http://momjian.us  EnterpriseDB  http://enterprisedb.com + Everyone has their own god. +
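P.S. In the percentage-based version, the fixed batch in the earlier sketch would instead be derived from the table size; a sketch, where rel_npages is assumed to be the table's page count:

    /* Allow this scan to prune at most 5% of the table's pages. */
    int     prune_limit = (int) (rel_npages * 0.05);
    bool    allow_prune = (pages_pruned < prune_limit);

Each full pass can then prune at most 5% of the pages, which is where the "another 19 SELECTs" arithmetic comes from: twenty passes at 5% cover the whole table.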
Re: [HACKERS] [pgsql-packagers] Palle Girgensohn's ICU patch
On 4/19/15 7:46 AM, Palle Girgensohn wrote: Just poking this old thread again. What happened here? Is anyone putting work into this area at the moment?

I plan to look at it for 9.6.
Re: [HACKERS] optimizing vacuum truncation scans
On Sun, Apr 19, 2015 at 7:09 PM, Jeff Janes jeff.ja...@gmail.com wrote: I did literally the simplest thing I could think of as a proof-of-concept patch, to see if it would actually fix things. I just jumped back a certain number of blocks occasionally and prefetched them forward, then resumed the regular backward scan.

IIRC, this patch introduces optional prefetching for the backward scan, where file-system readahead may not work well. The result is promising!

Also, what would be the best way to drill a hole through bufmgr.c into md.c so that the prefetch could specify an entire range, rather than looping over each individual block?

Unlike mdread(), mdprefetch() does not take a buffer argument, so extending it looks natural and safe to me. This would also lay a foundation for other scan-style optimizations.

What would have to be done to detect people running on SSDs and disable the feature, if anything?

There are other call sites of prefetch, so that question applies to more than just this optimization.

Regards, Qingqing
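P.S. To make the mdprefetch() idea concrete, a hedged sketch modeled on the 9.4-era mdprefetch() in src/backend/storage/smgr/md.c; the name mdprefetch_range and the single-call approach are assumptions, and a real version would clamp the range at 1 GB segment boundaries:

    void
    mdprefetch_range(SMgrRelation reln, ForkNumber forknum,
                     BlockNumber blocknum, BlockNumber nblocks)
    {
    #ifdef USE_PREFETCH
        off_t       seekpos;
        MdfdVec    *v;

        v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
        seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));

        /* One posix_fadvise(WILLNEED) for the whole range, instead of
         * one FilePrefetch() call per block. Assumes the range does not
         * cross the end of this 1 GB segment. */
        (void) FilePrefetch(v->mdfd_vfd, seekpos, (int) (nblocks * BLCKSZ));
    #endif
    }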
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Apr 20, 2015 at 10:33 AM, Bruce Momjian br...@momjian.us wrote: On Thu, Apr 16, 2015 at 03:41:54PM +0100, Simon Riggs wrote: That is how we arrive at the idea of a cleanup limit, further enhanced by a limit that applies only to dirtying clean blocks, which we have 4(?) recent votes in favour of. I would personally be in favour of a parameter to control the limit, since whatever we choose is right/wrong depending upon circumstances. I am, however, comfortable with not having a parameter if people think it is hard to tune, which I agree it would be; hence no parameter in the patch.

I think the limit has to be in terms of a percentage of the table size. For example, if we do one SELECT on a table with all non-dirty pages, it would be good to know that 5% of the pages were pruned --- that tells me that another 19 SELECTs will totally prune the table, assuming no future writes.

But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while the user waits, that is fundamentally VACUUM's duty to do in the background? If there are a handful of very hot pages, then it makes sense not to wait for vacuum to get to them, and that is what a block-count limit does. But if the entire table is very hot, I think that is just another way of saying that autovacuum is horribly misconfigured. I think the purpose of this patch is to fix something that can't be fixed through configuration alone.

If there are future writes, they would dirty the pages and cause even more pruning, but the 5% gives me the maximum pruning number of SELECTs. If there aren't another 19 SELECTs, do I care if the table is pruned or not?

The use case I see for this is a mixed workload: there is one SELECT which reads the entire table, hundreds of thousands of selects/updates/inserts that don't, and of course vacuum comes along every now and then and does its thing. Why should the one massive SELECT have horrible performance just because it was run right before autovacuum would have kicked in instead of right after it finished?

Cheers, Jeff
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Apr 20, 2015 at 12:28:11PM -0700, Jeff Janes wrote: But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while the user waits, that is fundamentally VACUUM's duty to do in the background? If there are a handful of very hot pages, then it makes sense not to wait for vacuum to get to them, and that is what a block-count limit does. But if the entire table is very hot, I think that is just another way of saying that autovacuum is horribly misconfigured. I think the purpose of this patch is to fix something that can't be fixed through configuration alone.

Well, we have to assume there are many misconfigured installations --- autovacuum isn't super-easy to configure, so we can't just blame the user if this makes things worse. In fact, page pruning was designed specifically for cases where autovacuum wasn't running or couldn't keep up.

If there are future writes, they would dirty the pages and cause even more pruning, but the 5% gives me the maximum pruning number of SELECTs. If there aren't another 19 SELECTs, do I care if the table is pruned or not? The use case I see for this is a mixed workload: there is one SELECT which reads the entire table, hundreds of thousands of selects/updates/inserts that don't, and of course vacuum comes along every now and then and does its thing. Why should the one massive SELECT have horrible performance just because it was run right before autovacuum would have kicked in instead of right after it finished?

I see your point, but what about a read-only workload after a big update? Do we leave large tables non-pruned for a long time? Also, consider cases where you did a big update but the autovacuum threshold was not met, so autovacuum doesn't run on that table --- again, do we keep those non-pruned rows around for millions of scans?

-- Bruce Momjian  br...@momjian.us  http://momjian.us  EnterpriseDB  http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] alternative compression algorithms?
On 04/19/2015 11:51 PM, Tomas Vondra wrote: Hi there, in the past we've repeatedly discussed the option of using a different compression algorithm (e.g. lz4), but every time the discussion died off because of fear of possible patent issues [1] [2] and many other threads. Have we decided it's not worth the risks, making patches in this area futile? ... I'm a bit confused though, because I've noticed various other FOSS projects adopting lz4 over the past few years and I'm yet to find a project voicing the same concerns about patents. So either they're reckless or we're excessively paranoid.

IMHO we should switch to a different algorithm. Wrt patents, if we choose an algorithm that's already used widely in several other open source projects, that's safe enough. It's as safe as we're going to get. There is always the chance that you infringe on a patent when you write any code. Compression is a particularly heavily-patented field, but it's nevertheless a bit strange to be super-paranoid there and not worry about patents elsewhere.

(The safest option would be to pick an algorithm that was patented but whose patent has expired. I don't think any of the compression algorithms discussed recently are old enough for that, though.)

- Heikki
Re: [HACKERS] optimizing vacuum truncation scans
On Sun, Apr 19, 2015 at 10:38 PM, Jim Nasby jim.na...@bluetreble.com wrote: On 4/19/15 9:09 PM, Jeff Janes wrote: I did literally the simplest thing I could think of as a proof-of-concept patch, to see if it would actually fix things. I just jumped back a certain number of blocks occasionally and prefetched them forward, then resumed the regular backward scan. The patch and driving script are attached.

Shouldn't completely empty pages be set as all-visible in the VM? If so, can't we just find the largest not-all-visible page and move forward from there, instead of moving backwards like we currently do?

If the entire table is all-visible, we would be starting from the beginning, even though the beginning of the table still has read-only tuples present.

For that matter, why do we scan backwards anyway? The comments don't explain it, and we have nonempty_pages as a starting point, so why don't we just scan forward? I suspect that eons ago we didn't have that and just blindly reverse-scanned until we finally hit a non-empty buffer...

nonempty_pages is not concurrency-safe: pages could be written after vacuum passed over them but before the access-exclusive lock was taken for the truncation scan. But maybe the combination of the two works? If a page is above nonempty_pages, then anyone who wrote into it after vacuum passed must have cleared its VM bit, and currently nothing but vacuum ever sets a VM bit back on, so once cleared it stays cleared.

In any event, nonempty_pages could be used to set the guess as to how many pages (if any) might be worth prefetching, since that use is not needed for correctness.

Cheers, Jeff
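P.S. A sketch of that combined test; nonempty_pages and visibilitymap_test() are the real 9.4-era names, but this helper and its use in the truncation scan are assumptions, not a patch:

    /* A block at or above nonempty_pages that is still all-visible
     * cannot have been written since vacuum's pass: any writer would
     * have cleared the VM bit, and only vacuum ever sets it again. */
    static bool
    page_safely_truncatable(Relation onerel, LVRelStats *vacrelstats,
                            BlockNumber blkno, Buffer *vmbuffer)
    {
        if (blkno < vacrelstats->nonempty_pages)
            return false;       /* vacuum saw data this far in */

        return visibilitymap_test(onerel, blkno, vmbuffer);
    }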