Re: [HACKERS] Open issues for HOT patch

2007-09-21 Thread Alvaro Herrera
Gregory Stark wrote:
 
 Heikki Linnakangas [EMAIL PROTECTED] writes:
 
  There is one wacky idea I haven't dared to propose yet:
 
  We could lift the limitation that you can't defragment a page that's
  pinned, if we play some smoke and mirrors in the buffer manager. When
  you prune a page, make a *copy* of the page you're pruning, and keep
  both versions in the buffer cache. Old pointers keep pointing to the old
  version. Any new calls to ReadBuffer will return the new copy, and the
  old copy can be dropped when its pin count drops to zero.
 
 Fwiw when Heikki first mentioned this idea I thought it was the craziest thing
 I ever heard. But the more I thought about it the more I liked it. I've come
 to the conclusion that while it's a wart, it's not much worse than the wart of
 the super-exclusive lock which it replaces. In fact it's arguably cleaner in
 some ways.

I read this as very similar to RCU.  In some scenarios it makes a lot of
sense.  I don't think it's a thing to be attacked in 8.3 though, since
it is a big change to the bufmgr.

-- 
Alvaro Herrera                http://www.advogato.org/person/alvherre
The Postgresql hackers have what I call a NASA space shot mentality.
 Quite refreshing in a world of weekend drag racer developers.
(Scott Marlowe)



Re: [HACKERS] Open issues for HOT patch

2007-09-19 Thread Simon Riggs
On Tue, 2007-09-18 at 12:10 -0400, Tom Lane wrote:
 I wrote:
  * The patch makes undocumented changes that cause autovacuum's decisions
  to be driven by total estimated dead space rather than total number of
  dead tuples.  Do we like this?
 
 No one seems to have picked up on this point, but after reflection
 I think there's actually a pretty big problem here.  Per-page pruning
 is perfectly capable of keeping dead space in check.  In a system with
 HOT running well, the reasons to vacuum a table will be:
 
 1. Remove dead index entries.
 2. Remove LP_DEAD line pointers.
 3. Truncate off no-longer-used end pages.
 4. Transfer knowledge about free space into FSM.
 
 Pruning cannot accomplish #1, #2, or #3, and without significant changes
 in the FSM infrastructure it has no hope about #4 either.  What I'm
 afraid of is that steady page-level pruning will keep the amount of dead
 space low, causing autovacuum never to fire, causing the indexes to
 bloat indefinitely because of #1 and the table itself to bloat
 indefinitely because of #2 and #4.  Thus, the proposed change in
 autovacuum seems badly misguided: instead of making autovacuum trigger
 on things that only it can fix, it makes autovacuum trigger on something
 that per-page pruning can deal with perfectly well.
 
 I'm inclined to think that we should continue to drive autovac off a
 count of dead rows, as this is directly related to points #1 and #2,
 and doesn't seem any worse for #3 and #4 than an estimate based on space
 would be.  Possibly it would be sensible for per-page pruning to report
 a reduction in number of dead rows when it removes heap-only tuples,
 but I'm not entirely sure --- any thoughts?

Some behavioural comments only: I was part of the earlier discussion
about when-to-VACUUM and don't have any fixed view of how to do this.

If HOT is running well, then there will be less need for #1, #3 and #4,
as I understand it. Deletes will still cause the need for #1, #3, #4 as
well as dead-space removal. Many tables have only Inserts and Deletes,
so we need to take that into account.

On large tables, VACUUM hurts very badly, so I would like to see it run
significantly less often.

In your last post you mentioned multiple UPDATEs. Pruning multiple times
for successive UPDATEs isn't going to release more space, so why do it?

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com




Re: [HACKERS] Open issues for HOT patch

2007-09-19 Thread Heikki Linnakangas
Decibel! wrote:
 On Tue, Sep 18, 2007 at 11:32:52AM -0400, Tom Lane wrote:
  Another option would be to prune whenever the free space goes
  below table fillfactor and hope that users would set fillfactor so that
  at least one updated tuple can fit in the block. I know it's not best to
  rely on the users though. But it can be a good hint.
 If default fillfactor weren't 100% then this might be good ;-).  But
 
 Erik Jones and I were just talking about FILLFACTOR...
 
 Is the plan to keep it at 100% with HOT? ISTM that's not such a great
 idea, since it forces at least the first update (if not many more) to be
 COLD.

I think we should still keep it at 100%. Most tables are not updated,
and a non-100% fillfactor would be a waste of space when the extra space
is not needed. Even a table that is updated should reach a steady state
after a few cold updates. Those cold updates will make room on the pages
for future updates, now that we can prune them and leave only dead line
pointers behind.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Open issues for HOT patch

2007-09-19 Thread Heikki Linnakangas
Tom Lane wrote:
 Heikki Linnakangas [EMAIL PROTECTED] writes:
 We could lift the limitation that you can't defragment a page that's
 pinned, if we play some smoke and mirrors in the buffer manager. When
 you prune a page, make a *copy* of the page you're pruning, and keep
 both versions in the buffer cache. Old pointers keep pointing to the old
 version. Any new calls to ReadBuffer will return the new copy, and the
 old copy can be dropped when its pin count drops to zero.
 
 No, that's way too wacky.  How do you prevent people from making further
 changes to the old version?  For instance, marking a tuple deleted?

To make any changes to the old version, you need to lock the page with
LockBuffer. LockBuffer needs to return a buffer with the latest version
of the page, and the caller has to use that version for any changes.
Changing all callers of LockBuffer (that lock heap pages) to do that is
the biggest change involved, AFAICS.
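A sketch of what the revised idiom might look like (schematic only:
LockBufferLatest(), BM_ORPHANED and the newer_version link are invented
names for illustration, not existing bufmgr APIs):

/*
 * Schematic: a LockBuffer variant that follows the chain from an orphaned
 * (pre-prune) page version to the current one.
 */
Buffer
LockBufferLatest(Buffer buffer, int mode)
{
    for (;;)
    {
        BufferDescSketch *desc = GetBufferDescSketch(buffer);
        Buffer            newer;

        LockBuffer(buffer, mode);
        if (!(desc->flags & BM_ORPHANED))
            return buffer;              /* already the latest version */

        /* The page was copied while we held only a pin: chase the copy. */
        newer = desc->newer_version;
        IncrBufferRefCount(newer);      /* pin the new version first */
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        ReleaseBuffer(buffer);          /* then drop pin on the stale copy */
        buffer = newer;
    }
}

Callers that lock heap pages would then write
buf = LockBufferLatest(buf, BUFFER_LOCK_EXCLUSIVE) and use the returned
buffer from that point on.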

Hint bit updates to the old version we could just forget about.

 The actual practical application we have, I think, would only require
 being able to defrag a page that our own backend has pins on, which is
 something that might be more workable --- but it still seems awfully
 fragile.  It could maybe be made to work in the simplest case of a
 plain UPDATE, because in practice I think the executor will never
 reference the old tuple's contents after heap_update() returns.  But
 this falls down in more complex situations involving joins --- we might
 continue to try to join the same old tuple to other rows, and then any
 pass-by-reference Datums we are using are corrupt if the tuple got
 moved.

Ugh, yeah that's too fragile.

Another wacky idea:

Within our own backend, we could keep track of which tuples we've
accessed, and defrag could move all other tuples as long as the ones
that we might still have pointers to are not touched. The bookkeeping
wouldn't have to be exact, as long as it's conservative.
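A sketch of that conservative bookkeeping (all names invented; one small
bitmap per pinned page, where over-marking is always safe):

/* Schematic per-backend tracking of which line pointers we may still hold
 * direct tuple pointers into.  Extra bits set are harmless; a missing bit
 * is not. */
typedef struct PinnedPageRefs
{
    Buffer  buffer;                                 /* which pinned page */
    uint8   touched[MaxHeapTuplesPerPage / 8 + 1];  /* one bit per offset */
} PinnedPageRefs;

static void
remember_tuple_access(PinnedPageRefs *refs, OffsetNumber off)
{
    refs->touched[off / 8] |= (uint8) (1 << (off % 8));
}

/* Defrag would consult this: a tuple may be moved only if its bit is clear. */
static bool
may_move_tuple(const PinnedPageRefs *refs, OffsetNumber off)
{
    return (refs->touched[off / 8] & (1 << (off % 8))) == 0;
}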

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Open issues for HOT patch

2007-09-19 Thread Gregory Stark

Heikki Linnakangas [EMAIL PROTECTED] writes:

 There is one wacky idea I haven't dared to propose yet:

 We could lift the limitation that you can't defragment a page that's
 pinned, if we play some smoke and mirrors in the buffer manager. When
 you prune a page, make a *copy* of the page you're pruning, and keep
 both versions in the buffer cache. Old pointers keep pointing to the old
 version. Any new calls to ReadBuffer will return the new copy, and the
 old copy can be dropped when its pin count drops to zero.

Fwiw when Heikki first mentioned this idea I thought it was the craziest thing
I ever heard. But the more I thought about it the more I liked it. I've come
to the conclusion that while it's a wart, it's not much worse than the wart of
the super-exclusive lock which it replaces. In fact it's arguably cleaner in
some ways.

As a result vacuum would never have to wait for arbitrarily long pins and
there wouldn't be the concept of a vacuum waiting for a vacuum lock with
strange lock queueing semantics. It also means we could move tuples around on
the page more freely.

The only places which would have to deal with a possible new buffer would be
precisely those places that lock the page. If you aren't locking the page then
you definitely aren't about to fiddle with any bits that matter since your
writes could be lost. Certainly you're not about to set xmin or xmax or
anything like that. 

You might set hint bits which would be lost, but probably not often, since
you would have already checked the visibility of the tuples with the page
locked. There may be one or two places where we fiddle bits for a tuple
we've just inserted ourselves, thinking nobody else can see it yet, but the
current philosophy seems to be leaning towards treating such coding
practices as unacceptably fragile anyway.

The buffer manager doesn't really need to track multiple versions of pages.
It would just mark the old version as an orphaned buffer which is
automatically a victim for the clock sweep as soon as the pin count drops to
0. It will never need to return such a buffer. What we would need is enough
information to reread the buffer if someone tries to lock it and to unpin it
when someone unpins a newer version.
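A sketch of what that bookkeeping might look like in the unpin path (the
struct, BM_ORPHANED flag and field names are simplified stand-ins for the
real bufmgr internals, invented for illustration):

typedef struct BufferDescSketch
{
    int     refcount;       /* pin count */
    int     usage_count;    /* clock-sweep recency counter */
    int     flags;          /* BM_ORPHANED etc. */
} BufferDescSketch;

#define BM_ORPHANED 0x0100  /* invented flag: superseded page version */

static void
unpin_buffer(BufferDescSketch *buf)
{
    buf->refcount--;

    if (buf->refcount == 0 && (buf->flags & BM_ORPHANED))
    {
        /*
         * Last pin on a superseded version is gone.  The lookup table
         * already points at the new copy, so nobody can find this buffer
         * again; zero its usage count so the clock sweep reclaims it at
         * the first opportunity.
         */
        buf->usage_count = 0;
    }
}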

At first I thought the cost of copying the page would be a downside, but
Heikki pointed out that defragmentation already copies the page. In fact,
copying it to new memory, rather than to memory that is almost certainly
in processor caches and would need to be invalidated, could actually be
faster, and avoiding the use of memmove could be faster too.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Open issues for HOT patch

2007-09-19 Thread Tom Lane
Decibel! [EMAIL PROTECTED] writes:
 3 isn't that important to me, but 4 is:
 4. Doesn't hammer the database to measure

 And pgstattuple fails #4 miserably. Want to know the average dead space
 in a 500GB database? Yeah, right

So we could put a vacuum_cost_delay() in it ...

regards, tom lane



Re: [HACKERS] Open issues for HOT patch

2007-09-19 Thread Decibel!

On Sep 19, 2007, at 8:08 AM, Tom Lane wrote:
 Decibel! [EMAIL PROTECTED] writes:
  3 isn't that important to me, but 4 is:
  4. Doesn't hammer the database to measure
 
  And pgstattuple fails #4 miserably. Want to know the average dead space
  in a 500GB database? Yeah, right
 
 So we could put a vacuum_cost_delay() in it ...

pg_stat_delay()? ;)

That's better than what we have now, without a doubt. But I'd still
prefer to have a table I can just read on the fly.

We do have a TODO to replace the current stats infrastructure with
something that has less overhead, right? :)

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828





Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Heikki Linnakangas
Bruce Momjian wrote:
 If we only prune on an update (or insert), why not just prune every
 time?  I figure the prune/defrag has to be lighter than the
 update/insert itself.

Pruning is quite a costly operation. You need to check the visibility of
each tuple on the page, following tuple chains as you go, mark line
pointers as unused or redirected, and finally memmove the tuples to close
the gaps between them. And it all needs to be WAL-logged.
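Schematically, a prune pass does all of that in one critical section
(simplified; tuple_chain_is_dead() and log_prune() are invented stand-ins
for the real visibility and WAL machinery):

static void
prune_page(Page page, TransactionId oldest_xmin)
{
    OffsetNumber off,
                 maxoff = PageGetMaxOffsetNumber(page);

    START_CRIT_SECTION();

    for (off = FirstOffsetNumber; off <= maxoff; off++)
    {
        ItemId  itemid = PageGetItemId(page, off);

        if (!ItemIdIsUsed(itemid))
            continue;

        /* visibility check, following any update chain rooted here */
        if (tuple_chain_is_dead(page, off, oldest_xmin))
            ItemIdSetUnused(itemid);    /* or mark dead / redirected */
    }

    /* memmove the surviving tuples together, closing the gaps */
    PageRepairFragmentation(page);

    /* and the whole change must be WAL-logged */
    log_prune(page);

    END_CRIT_SECTION();
}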

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Heikki Linnakangas
Tom Lane wrote:
 * We also need to think harder about when to invoke the page pruning
 code.  As the patch stands, if you set a breakpoint at
 heap_page_prune_opt it'll seem to be hit constantly (eg, once for every
 system catalog probe), which seems uselessly often.  And yet it also
 seems not often enough, because one thing I found out real fast is that
 the "prune if free space < 1.2 * average tuple size" heuristic fails badly
 when you look at queries that execute multiple updates within the same
 heap page.  We only prune when we first pin a particular target page,
 and so the additional updates don't afford another chance to see if it's
 time to prune.
 
 I'd like to see if we can arrange to only do pruning when reading a page
 that is known to be an update target (ie, never during plain SELECTs);
 I suspect this would be relatively easy with some executor and perhaps
 planner changes.  But that only fixes the first half of the gripe above;
 I'm not at all sure what to do about the multiple-updates-per-page
 issue.

There is one wacky idea I haven't dared to propose yet:

We could lift the limitation that you can't defragment a page that's
pinned, if we play some smoke and mirrors in the buffer manager. When
you prune a page, make a *copy* of the page you're pruning, and keep
both versions in the buffer cache. Old pointers keep pointing to the old
version. Any new calls to ReadBuffer will return the new copy, and the
old copy can be dropped when its pin count drops to zero.

Tracking multiple copies of a page requires some changes to the buffer
manager. LockBuffer would need to return the latest version of the page,
because anything that checks visibility or does updates would need to
use the latest copy, and callers of LockBuffer would need to be changed
accordingly.
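In code form, the prune path might do something like the following (purely
a sketch: every helper named here is invented for illustration, not an
existing bufmgr function):

Buffer
prune_with_copy(Buffer oldbuf)
{
    Buffer  newbuf = AllocFreeBufferSketch();   /* grab a free buffer slot */

    /* clone the page; existing pins keep seeing the untouched original */
    memcpy(BufferGetPage(newbuf), BufferGetPage(oldbuf), BLCKSZ);

    /* prune and defragment the copy only */
    heap_page_prune_sketch(BufferGetPage(newbuf));

    /* repoint the buffer lookup table at the copy, so that any later
     * ReadBuffer for this block returns the new version */
    BufTableRepointSketch(oldbuf, newbuf);

    /* the old version lives on only until its pin count reaches zero */
    MarkBufferOrphanedSketch(oldbuf, newbuf);

    return newbuf;
}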

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 But then what happens when you want to update a second tuple on the same
 page?  None of our existing plan types release and reacquire pin if they
 don't have to, and I really doubt that we want to give up that
 optimization.

 You will prune when you lock the page, and at that point, unless you got
 enough room for both tuples, I doubt trying again just before the second
 tuple is going to help.

No, you're missing the point completely.  If the free space on the page
is, say, 1.5x the average tuple size, the code *won't* prune, and then
it will be stuck when it goes to do the second tuple update, because
there is no chance to reconsider the prune/no-prune decision after some
space is eaten by the first update.

regards, tom lane



Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 There is one wacky idea I haven't dared to propose yet:

 We could lift the limitation that you can't defragment a page that's
 pinned, if we play some smoke and mirrors in the buffer manager. When
 you prune a page, make a *copy* of the page you're pruning, and keep
 both versions in the buffer cache. Old pointers keep pointing to the old
 version. Any new calls to ReadBuffer will return the new copy, and the
 old copy can be dropped when its pin count drops to zero.

No, that's way too wacky.  How do you prevent people from making further
changes to the old version?  For instance, marking a tuple deleted?

The actual practical application we have, I think, would only require
being able to defrag a page that our own backend has pins on, which is
something that might be more workable --- but it still seems awfully
fragile.  It could maybe be made to work in the simplest case of a
plain UPDATE, because in practice I think the executor will never
reference the old tuple's contents after heap_update() returns.  But
this falls down in more complex situations involving joins --- we might
continue to try to join the same old tuple to other rows, and then any
pass-by-reference Datums we are using are corrupt if the tuple got
moved.

regards, tom lane



Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  But then what happens when you want to update a second tuple on the same
  page?  None of our existing plan types release and reacquire pin if they
  don't have to, and I really doubt that we want to give up that
  optimization.
 
  You will prune when you lock the page, and at that point, unless you got
  enough room for both tuples, I doubt trying again just before the second
  tuple is going to help.
 
 No, you're missing the point completely.  If the free space on the page
 is, say, 1.5x the average tuple size, the code *won't* prune, and then
 it will be stuck when it goes to do the second tuple update, because
 there is no chance to reconsider the prune/no-prune decision after some
 space is eaten by the first update.

My point is that if you only do this for INSERT/UPDATE, you can prune
when you have less than enough room for 3-4 tuples, and if you also add
the xmin of the earliest prunable xact to the page, you can prune even
more aggressively.
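Storing that xmin could be as cheap as one spare field in the page header,
along these lines (schematic; the struct and field name are invented for
illustration):

/* Schematic: remember the oldest xmax that will become prunable; zero
 * means nothing on the page awaits pruning. */
typedef struct PageHeaderSketch
{
    /* ... existing page header fields ... */
    TransactionId   pd_prune_xmin;      /* invented field name */
} PageHeaderSketch;

static bool
page_worth_pruning(PageHeaderSketch *ph, TransactionId oldest_xmin)
{
    /* prune only once some dead tuple is actually removable */
    return TransactionIdIsValid(ph->pd_prune_xmin) &&
           TransactionIdPrecedes(ph->pd_prune_xmin, oldest_xmin);
}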

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Tom Lane
Pavan Deolasee [EMAIL PROTECTED] writes:
 On 9/18/07, Tom Lane [EMAIL PROTECTED] wrote:
 Perhaps we could
 replace that heuristic with something that is page-local; seems like
 dividing the total used space by the number of item pointers would give
 at least a rough approximation of the page's average tuple size.
 
 We might get it completely wrong unless we know the number of
 normal line pointers (redirected, dead and unused line pointers
 do not take any real storage).

Sure, but it's only a heuristic anyway.  Probably a more serious
objection is that it fails in the wrong direction: if you start to get
line pointer bloat then the estimated average tuple size goes down,
making it less likely to prune instead of more.  But maybe do something
that looks at free space and number of pointers independently, rather
than operating in terms of average tuple size?

 Another option would be to prune whenever the free space goes
 below table fillfactor and hope that users would set fillfactor so that
 at least one updated tuple can fit in the block. I know it's not best to
 rely on the users though. But it can be a good hint.

If default fillfactor weren't 100% then this might be good ;-).  But
we could use max(1-fillfactor, BLCKSZ/8) or some such.

 Yet another option would be to set a hint on the page whenever we
 fail to do a HOT update because of not-enough-free-space in the
 block. Next time we shall prune, and so the subsequent updates
 would be HOT updates.

This would be a good idea independent of anything else, I think.
There's plenty of room for page hint bits, and the write will be
free since we must set the old tuple XMAX anyway.
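A sketch of that hint (PD_PRUNE_WANTED and both helpers are invented
names, assuming a spare flags word in the page header):

#define PD_PRUNE_WANTED 0x0008          /* invented page-header flag bit */

/* Called from heap_update() when the new tuple version didn't fit.  The
 * write really is free: we just set the old tuple's xmax, so the page is
 * already dirty and will be WAL-logged anyway. */
static void
note_failed_hot_update(uint16 *pd_flags)
{
    *pd_flags |= PD_PRUNE_WANTED;
}

/* heap_page_prune_opt() would then prune on the hint, or on low space. */
static bool
should_prune(uint16 pd_flags, Size freespace, Size threshold)
{
    return (pd_flags & PD_PRUNE_WANTED) != 0 || freespace < threshold;
}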

regards, tom lane



Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Pavan Deolasee
On 9/18/07, Tom Lane [EMAIL PROTECTED] wrote:


  Another option would be to prune whenever the free space goes
  below table fillfactor

 If default fillfactor weren't 100% then this might be good ;-).  But
 we could use max(1-fillfactor, BLCKSZ/8) or some such.

  Yet another option would be to set a hint on the page whenever we
  fail to do HOT update

 This would be a good idea independent of anything else, I think.



Or maybe a combination of the above two would work well. If the user
has not specified any fillfactor (and so the table is using the default of
100), the first update is most likely a COLD update and we set the hint
bit. The retired tuple is then pruned before the next update comes, and we
shall do a HOT update. OTOH, if the user has specified a fillfactor less
than 100 to avoid any COLD updates, the first update will be HOT. At that
point, the used space would go above the fillfactor, and hence we shall
prune before the next update comes, making room for the next HOT update.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Tom Lane
I wrote:
 * The patch makes undocumented changes that cause autovacuum's decisions
 to be driven by total estimated dead space rather than total number of
 dead tuples.  Do we like this?

No one seems to have picked up on this point, but after reflection
I think there's actually a pretty big problem here.  Per-page pruning
is perfectly capable of keeping dead space in check.  In a system with
HOT running well, the reasons to vacuum a table will be:

1. Remove dead index entries.
2. Remove LP_DEAD line pointers.
3. Truncate off no-longer-used end pages.
4. Transfer knowledge about free space into FSM.

Pruning cannot accomplish #1, #2, or #3, and without significant changes
in the FSM infrastructure it has no hope about #4 either.  What I'm
afraid of is that steady page-level pruning will keep the amount of dead
space low, causing autovacuum never to fire, causing the indexes to
bloat indefinitely because of #1 and the table itself to bloat
indefinitely because of #2 and #4.  Thus, the proposed change in
autovacuum seems badly misguided: instead of making autovacuum trigger
on things that only it can fix, it makes autovacuum trigger on something
that per-page pruning can deal with perfectly well.

I'm inclined to think that we should continue to drive autovac off a
count of dead rows, as this is directly related to points #1 and #2,
and doesn't seem any worse for #3 and #4 than an estimate based on space
would be.  Possibly it would be sensible for per-page pruning to report
a reduction in number of dead rows when it removes heap-only tuples,
but I'm not entirely sure --- any thoughts?
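For reference, the existing dead-row trigger is roughly the following (a
simplified sketch of autovacuum's relation_needs_vacanalyze() logic, with
abbreviated names):

/* Vacuum when the dead-row count from pgstats exceeds a flat base
 * threshold plus a fraction of the table's size. */
static bool
needs_vacuum(float4 reltuples, float4 n_dead_tuples,
             int base_thresh, float4 scale_factor)
{
    float4 vacthresh = (float4) base_thresh + scale_factor * reltuples;

    return n_dead_tuples > vacthresh;
}

The open question above is only what per-page pruning should feed into
n_dead_tuples.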

If we do this, then it's not clear that having pgstats track dead space
is worth the trouble at all.  It might possibly be of value for testing
purposes to see how well pruning is doing, but I'm unconvinced that it's
worth bloating stats messages and files to have this number in a
production system.  An alternative that would serve as well for testing
would be to teach contrib/pgstattuple to measure dead space.

Comments?

regards, tom lane



Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Pavan Deolasee
On 9/18/07, Tom Lane [EMAIL PROTECTED] wrote:

 I wrote:
  * The patch makes undocumented changes that cause autovacuum's decisions
  to be driven by total estimated dead space rather than total number of
  dead tuples.  Do we like this?

 No one seems to have picked up on this point, but after reflection
 I think there's actually a pretty big problem here.  Per-page pruning
 is perfectly capable of keeping dead space in check.  In a system with
 HOT running well, the reasons to vacuum a table will be:

 1. Remove dead index entries.
 2. Remove LP_DEAD line pointers.
 3. Truncate off no-longer-used end pages.
 4. Transfer knowledge about free space into FSM.

 Pruning cannot accomplish #1, #2, or #3, and without significant changes
 in the FSM infrastructure it has no hope about #4 either.



I guess we already have a mechanism to remove dead index entries
outside vacuum. So my take would be to handle vacuum based on dead
index entries separately. Maybe we can track the number of dead index
entries and trigger vacuum on the base relation if it goes beyond a
threshold.

For LP_DEAD line pointers, with some adjustments to the patch, we
can make it track dead space in a page by accounting for the
LP_DEAD pointers. So if there is bloat because of LP_DEAD
pointers, that will be reflected in the dead space and help us trigger
vacuum on the table.



 I'm inclined to think that we should continue to drive autovac off a
 count of dead rows


If we do that, I guess it would make sense to count only non-HOT
dead tuples, because HOT tuples neither create LP_DEAD line
pointers nor cause index bloat.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Joshua D. Drake

Tom Lane wrote:
 I wrote:
 * The patch makes undocumented changes that cause autovacuum's decisions
 to be driven by total estimated dead space rather than total number of
 dead tuples.  Do we like this?

 If we do this, then it's not clear that having pgstats track dead space
 is worth the trouble at all.  It might possibly be of value for testing
 purposes to see how well pruning is doing, but I'm unconvinced that it's
 worth bloating stats messages and files to have this number in a
 production system.  An alternative that would serve as well for testing
 would be to teach contrib/pgstattuple to measure dead space.

As a DBA, I can say it doesn't really matter to me *how we track* the
dead space, as long as tracking it is:

1. Clear
2. Simple
3. Available by default (thus pgstattuple needs to be pushed into core)


Sincerely,

Joshua D. Drake



--

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/
UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/




Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Tom Lane
Pavan Deolasee [EMAIL PROTECTED] writes:
 On 9/18/07, Tom Lane [EMAIL PROTECTED] wrote:
 In a system with
 HOT running well, the reasons to vacuum a table will be:
 
 1. Remove dead index entries.
 2. Remove LP_DEAD line pointers.
 3. Truncate off no-longer-used end pages.
 4. Transfer knowledge about free space into FSM.
 
 Pruning cannot accomplish #1, #2, or #3, and without significant changes
 in the FSM infrastructure it has no hope about #4 either.

 I guess we already have mechanism to remove dead index entries
 outside vacuum.

Not a trustworthy one --- unless you have a solid proposal for making it
work with bitmap indexscans, it would be foolish to design autovacuum
behavior on the assumption that dead index entries aren't a problem.

(Also, IIRC only btree has been taught to recover dead entries at all.)

regards, tom lane



Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Pavan Deolasee
On 9/18/07, Tom Lane [EMAIL PROTECTED] wrote:


 * I'm still pretty unhappy about the patch's use of a relcache copy of
 GetAvgFSMRequestSize()'s result.  The fact that there's no provision for
 ever updating the value while the relcache entry lives is part of it,
 but the bigger part is that I'd rather not have anything at all
 depending on that number.



We could fix the first part by adding relcache invalidation whenever
the average FSM request size changes by a certain margin. But
I am not insisting on using the avgFSM mechanism to decide when
to prune.


Perhaps we could
 replace that heuristic with something that is page-local; seems like
 dividing the total used space by the number of item pointers would give
 at least a rough approximation of the page's average tuple size.


We might get it completely wrong unless we know the number of
normal line pointers (redirected, dead and unused line pointers
do not take any real storage).

Another option would be to prune whenever the free space goes
below table fillfactor and hope that users would set fillfactor so that
at least one updated tuple can fit in the block. I know it's not best to
rely on the users though. But it can be a good hint.

Yet another option would be to set a hint on the page whenever we
fail to do a HOT update because of not-enough-free-space in the
block. Next time we shall prune, and so the subsequent updates
would be HOT updates.

None of these are perfect though.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Pavan Deolasee
On 9/18/07, Tom Lane [EMAIL PROTECTED] wrote:

 Pavan Deolasee [EMAIL PROTECTED] writes:
  On 9/18/07, Tom Lane [EMAIL PROTECTED] wrote:
  In a system with
  HOT running well, the reasons to vacuum a table will be:
 
  1. Remove dead index entries.
  2. Remove LP_DEAD line pointers.
  3. Truncate off no-longer-used end pages.
  4. Transfer knowledge about free space into FSM.
 
  Pruning cannot accomplish #1, #2, or #3, and without significant
 changes
  in the FSM infrastructure it has no hope about #4 either.

  I guess we already have mechanism to remove dead index entries
  outside vacuum.

 Not a trustworthy one --- unless you have a solid proposal for making it
 work with bitmap indexscans, it would be foolish to design autovacuum
 behavior on the assumption that dead index entries aren't a problem.



Hmm.. I think we need to drop this for now, because I am sure any
such proposal would need a lot more discussion. Maybe something
we can pick up for 8.4.

So we go back to tracking dead tuples. I would still be inclined
towards tracking non-HOT dead tuples, or subtracting the count of
pruned HOT tuples, because we don't want to trigger autovacuum
too often; rather, let pruning clean as much dead space as possible.
This also means that the tuple storage reclaimed by pruning
non-HOT dead tuples does not impact the autovacuum behavior,
positively or negatively. And ISTM that this does not address #4?

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Decibel!
On Tue, Sep 18, 2007 at 11:32:52AM -0400, Tom Lane wrote:
  Another option would be to prune whenever the free space goes
  below table fillfactor and hope that users would set fillfactor so that
  at least one updated tuple can fit in the block. I know it's not best to
  rely on the users though. But it can be a good hint.
 
 If default fillfactor weren't 100% then this might be good ;-).  But

Erik Jones and I were just talking about FILLFACTOR...

Is the plan to keep it at 100% with HOT? ISTM that's not such a great
idea, since it forces at least the first update (if not many more) to be
COLD.

I realize that ideally we'd probably want FILLFACTOR to take things like
average tuple size and average number of updates per page into account,
but for a first pass 90% would likely be a good compromise...
-- 
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828




Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Decibel!
On Tue, Sep 18, 2007 at 09:31:03AM -0700, Joshua D. Drake wrote:
 
 Tom Lane wrote:
  I wrote:
  * The patch makes undocumented changes that cause autovacuum's decisions
  to be driven by total estimated dead space rather than total number of
  dead tuples.  Do we like this?
 
  If we do this, then it's not clear that having pgstats track dead space
  is worth the trouble at all.  It might possibly be of value for testing
  purposes to see how well pruning is doing, but I'm unconvinced that it's
  worth bloating stats messages and files to have this number in a
  production system.  An alternative that would serve as well for testing
  would be to teach contrib/pgstattuple to measure dead space.
 
 As a DBA, I can say it doesn't really matter to me *how we track* the
 dead space, as long as tracking it is:
 
 1. Clear
 2. Simple
  3. Available by default (thus pgstattuple needs to be pushed into core)
 
3 isn't that important to me, but 4 is:

4. Doesn't hammer the database to measure

And pgstattuple fails #4 miserably. Want to know the average dead space
in a 500GB database? Yeah, right
-- 
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828




Re: [HACKERS] Open issues for HOT patch

2007-09-18 Thread Joshua D. Drake

Decibel! wrote:
 On Tue, Sep 18, 2007 at 09:31:03AM -0700, Joshua D. Drake wrote:

 If we do this, then it's not clear that having pgstats track dead space
 is worth the trouble at all.  It might possibly be of value for testing
 purposes to see how well pruning is doing, but I'm unconvinced that it's
 worth bloating stats messages and files to have this number in a
 production system.  An alternative that would serve as well for testing
 would be to teach contrib/pgstattuple to measure dead space.
 As a DBA, I can say it doesn't really matter to me *how we track* the
 dead space, as long as tracking it is:

 1. Clear
 2. Simple
  3. Available by default (thus pgstattuple needs to be pushed into core)
  
 3 isn't that important to me, but 4 is:
 
 4. Doesn't hammer the database to measure
 
 And pgstattuple fails #4 miserably. Want to know the average dead space
 in a 500GB database? Yeah, right

Point taken, and agreed (although I still think it needs to be in core).

Joshua D. Drake

--

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/
UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/




[HACKERS] Open issues for HOT patch

2007-09-17 Thread Tom Lane
I have finished a first review pass over all of the HOT patch
(updated code is posted on -patches).  I haven't found any showstoppers,
but there seem still several areas that need discussion:

* The patch makes undocumented changes that cause autovacuum's decisions
to be driven by total estimated dead space rather than total number of
dead tuples.  Do we like this?  What should happen to the default
threshold parameters (they are not even in the same units as before...)?
Is there any value in even continuing to track dead tuple counts, per
se, in the pgstats machinery?  It seems redundant/expensive to track
both tuple counts and byte counts, and it's not like the size of the
stats file is not already known to be a performance issue ...

* I'm still pretty unhappy about the patch's use of a relcache copy of
GetAvgFSMRequestSize()'s result.  The fact that there's no provision for
ever updating the value while the relcache entry lives is part of it,
but the bigger part is that I'd rather not have anything at all
depending on that number.  FSM in its current form desperately needs to
die; and once it's replaced by some form of distributed on-disk storage,
it's unlikely that we will have any simple means of getting an
equivalent number.  The average request size was never meant for
external use anyway, but only as a filter to help reject useless entries
from getting into the limited shared-memory FSM space.  Perhaps we could
replace that heuristic with something that is page-local; seems like
dividing the total used space by the number of item pointers would give
at least a rough approximation of the page's average tuple size.
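In code form, the page-local heuristic might be (schematic):

/* Schematic page-local test: approximate the average tuple size as total
 * used space divided by the number of item pointers, and prune when free
 * space falls below 1.2x that.  Note the divisor counts every item
 * pointer, including dead/redirected ones that hold no tuple storage, so
 * this is only a rough approximation. */
static bool
page_wants_prune(Page page)
{
    PageHeader  ph     = (PageHeader) page;
    int         nitems = PageGetMaxOffsetNumber(page);
    Size        freesp = ph->pd_upper - ph->pd_lower;
    Size        used   = BLCKSZ - SizeOfPageHeaderData - freesp;

    if (nitems == 0)
        return false;

    return freesp < (Size) (1.2 * used / nitems);
}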

* We also need to think harder about when to invoke the page pruning
code.  As the patch stands, if you set a breakpoint at
heap_page_prune_opt it'll seem to be hit constantly (eg, once for every
system catalog probe), which seems uselessly often.  And yet it also
seems not often enough, because one thing I found out real fast is that
the "prune if free space < 1.2 * average tuple size" heuristic fails badly
when you look at queries that execute multiple updates within the same
heap page.  We only prune when we first pin a particular target page,
and so the additional updates don't afford another chance to see if it's
time to prune.

I'd like to see if we can arrange to only do pruning when reading a page
that is known to be an update target (ie, never during plain SELECTs);
I suspect this would be relatively easy with some executor and perhaps
planner changes.  But that only fixes the first half of the gripe above;
I'm not at all sure what to do about the multiple-updates-per-page
issue.

Comments?

regards, tom lane



Re: [HACKERS] Open issues for HOT patch

2007-09-17 Thread Bruce Momjian
Tom Lane wrote:
 * We also need to think harder about when to invoke the page pruning
 code.  As the patch stands, if you set a breakpoint at
 heap_page_prune_opt it'll seem to be hit constantly (eg, once for every
 system catalog probe), which seems uselessly often.  And yet it also
 seems not often enough, because one thing I found out real fast is that
 the "prune if free space < 1.2 * average tuple size" heuristic fails badly
 when you look at queries that execute multiple updates within the same
 heap page.  We only prune when we first pin a particular target page,
 and so the additional updates don't afford another chance to see if it's
 time to prune.
 
 I'd like to see if we can arrange to only do pruning when reading a page
 that is known to be an update target (ie, never during plain SELECTs);
 I suspect this would be relatively easy with some executor and perhaps
 planner changes.  But that only fixes the first half of the gripe above;
 I'm not at all sure what to do about the multiple-updates-per-page
 issue.

If we only prune on an update (or insert), why not just prune every
time?  I figure the prune/defrag has to be lighter than the
update/insert itself.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Open issues for HOT patch

2007-09-17 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 If we only prune on an update (or insert) why not just do prune every
 time?

The problem is you can't prune anymore once you have existing pin on the
target page.  I'd really like to get around that, but so far it seems
unacceptably fragile --- the executor really doesn't expect tuples to
get moved around underneath it.

regards, tom lane



Re: [HACKERS] Open issues for HOT patch

2007-09-17 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  If we only prune on an update (or insert) why not just do prune every
  time?
 
 The problem is you can't prune anymore once you have existing pin on the
 target page.  I'd really like to get around that, but so far it seems
 unacceptably fragile --- the executor really doesn't expect tuples to
 get moved around underneath it.

I thought you could do the pruning before you pin the page only in
update/insert cases.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Open issues for HOT patch

2007-09-17 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 The problem is you can't prune anymore once you have existing pin on the
 target page.  I'd really like to get around that, but so far it seems
 unacceptably fragile --- the executor really doesn't expect tuples to
 get moved around underneath it.

 I thought you could do the pruning before you pin the page only in
 update/insert cases.

But then what happens when you want to update a second tuple on the same
page?  None of our existing plan types release and reacquire pin if they
don't have to, and I really doubt that we want to give up that
optimization.

regards, tom lane



Re: [HACKERS] Open issues for HOT patch

2007-09-17 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  The problem is you can't prune anymore once you have existing pin on the
  target page.  I'd really like to get around that, but so far it seems
  unacceptably fragile --- the executor really doesn't expect tuples to
  get moved around underneath it.
 
  I thought you could do the pruning before you pin the page only in
  update/insert cases.
 
 But then what happens when you want to update a second tuple on the same
 page?  None of our existing plan types release and reacquire pin if they
 don't have to, and I really doubt that we want to give up that
 optimization.

You will prune when you lock the page, and at that point, unless you got
enough room for both tuples, I doubt trying again just before the second
tuple is going to help.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
