subject:"Re\: \[HACKERS\] vacuum, performance, and MVCC"

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-29 Thread Hannu Krosing

Ühel kenal päeval, T, 2006-06-27 kell 12:16, kirjutas Bruce Momjian:
 Hannu Krosing wrote:
  ?hel kenal p?eval, T, 2006-06-27 kell 10:38, kirjutas Hannu Krosing:
   ?hel kenal p?eval, E, 2006-06-26 kell 23:08, kirjutas Bruce Momjian:
Jim C. Nasby wrote:
 On Mon, Jun 26, 2006 at 02:32:59PM -0400, Bruce Momjian wrote:
  
  It is certainly possible to do what you are suggesting, that is 
  have two
  index entries point to same chain head, and have the index access
  routines figure out if the index qualifications still hold, but that
  seems like a lot of overhead.
   
   I think Jim meant not 2 pointing to the same head, but 2 pointing into
   the same chain. Say we have table with (id serial, ts timestamp) where
   ts changes at each update and id does not.
   
   So after 3 updates on one page we have one CITC/ITPC head with pointers
   from both indexes and two follow-up tuples with pointers from only ts
   index.
   
   The problem with this setup is, that we can't reuse any of those
   follow-up tuples without index cleanup.
  
  But we still have to think about similar cases (index entries pointing
  inside CITC chains), unless we plan to disallow adding indexes to
  tables.
 
 CREATE INDEX has to undo any chains where the new indexed columns change
 in the chain, and add index entries to remove the chain.

Yes, that would be the most straightforward solution.

It could be better in some cases, if we could avoid adding entries to
other indexes. Maybe we can just reset some flags, so that some SITC ops
like finding tuples by the CITC index pointer still work while adding
new entries wont. 

But it will be tricky to make this work for bitmap index scans. 

So yes, index build is a slop operation anyway, so making it even a
little slower is probably not a big problem. And most CITC chains will
have only one visible row at a time, this will probably not be a big
issue. 

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-29 Thread Bruce Momjian

Hannu Krosing wrote:
   But we still have to think about similar cases (index entries pointing
   inside CITC chains), unless we plan to disallow adding indexes to
   tables.
  
  CREATE INDEX has to undo any chains where the new indexed columns change
  in the chain, and add index entries to remove the chain.
 
 Yes, that would be the most straightforward solution.
 
 It could be better in some cases, if we could avoid adding entries to
 other indexes. Maybe we can just reset some flags, so that some SITC ops
 like finding tuples by the CITC index pointer still work while adding
 new entries wont. 
 
 But it will be tricky to make this work for bitmap index scans. 
 
 So yes, index build is a slop operation anyway, so making it even a
 little slower is probably not a big problem. And most CITC chains will
 have only one visible row at a time, this will probably not be a big
 issue. 

Consider that index scans coming from different indexes have to span the
same SITC.  I see no clean way to avoid making all the indexes
consistent, and as you said, CREATE INDEX isn't a frequent operation.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread Hannu Krosing

Ühel kenal päeval, E, 2006-06-26 kell 23:08, kirjutas Bruce Momjian:
 Jim C. Nasby wrote:
  On Mon, Jun 26, 2006 at 02:32:59PM -0400, Bruce Momjian wrote:
   
   It is certainly possible to do what you are suggesting, that is have two
   index entries point to same chain head, and have the index access
   routines figure out if the index qualifications still hold, but that
   seems like a lot of overhead.

I think Jim meant not 2 pointing to the same head, but 2 pointing into
the same chain. Say we have table with (id serial, ts timestamp) where
ts changes at each update and id does not.

So after 3 updates on one page we have one CITC/ITPC head with pointers
from both indexes and two follow-up tuples with pointers from only ts
index.

The problem with this setup is, that we can't reuse any of those
follow-up tuples without index cleanup.

   Also, once there is only one visible row in the chain, removing old
   index entries seems quite complex because you have to have vacuum keep
   the qualifications of each row to figure out which index tuple is the
   valid one (seems messy).
   
  Perhaps my point got lost... in the case where no index keys change
  during an update, SITC seems superior in every way to my proposal. My
  idea (let's call it Index Tuple Page Consolidation, ITPC) would be
  beneficial to UPDATEs that modify one or more index keys but still put
  the tuple on the same page. Where SITC would be most useful for tables
  that have a very heavy update rate and very few indexes, ITPC would
  benefit tables that have more indexes on them; where presumably it's
  much more likely for UPDATEs to change at least one index key (which
  means SITC goes out the window, if I understand it correctly). If I'm
  missing something and SITC can in fact deal with some index keys
  changing during an UPDATE, then I see no reason for ITPC.
 
 I understood what you had said.  The question is whether we want to get
 that complex with this feature, and if there are enough use cases
 (UPDATE with index keys changing) to warrant it.

I'd like to think that most heavily-updated tables avoid that, but there
may be still cases where this is needed.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread PFC




My idea is that if an UPDATE places the new tuple on the same page as
the old tuple, it will not create new index entries for any indexes
where the key doesn't change.


	Basically the idea behind preventing index bloat by updates is to have  
one index tuple point to several actual tuples having the same value.


So : Index entry - list of tuples having the same value - actual 
tuples
(- represents an indirection)

	I proposed to put the list of tuples in the index ; you propose to put it  
in data pages.


I think both solutions have pros and cons :

* List of tuples in the index :
+ reduces index size, makes cacheability in RAM more likely
+ speeds up index scans
- complexity
- slows down modifications to the index (a bit)

* List of tuples in the page
+ simpler to implement
+ reduces index size, but less so than previous solution
- useless if UPDATE puts the new tuple on a different page

I guess the best solution would be a mix of both.

	Also, I insist (again) that there is a lot to gain by using a bit of  
compression on the data pages, even if it's very simple compression like  
storing the new version of a row as a difference from the previous version  
(ie. only store the columns that changed).
	I think DB2 stores the latest version entirely, and stores the previous  
versions as a delta. This is more efficient.


	In the case of tables containing TEXT values, these could also get  
TOASTed. When an update does not modify the TOASTed columns, it would be  
nice to simply be able to keep the reference to the TOASTed data instead  
of decompressing it and recompressing it. Or is it already the case ?



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread Hannu Krosing

Ühel kenal päeval, T, 2006-06-27 kell 10:38, kirjutas Hannu Krosing:
 Ühel kenal päeval, E, 2006-06-26 kell 23:08, kirjutas Bruce Momjian:
  Jim C. Nasby wrote:
   On Mon, Jun 26, 2006 at 02:32:59PM -0400, Bruce Momjian wrote:

It is certainly possible to do what you are suggesting, that is have two
index entries point to same chain head, and have the index access
routines figure out if the index qualifications still hold, but that
seems like a lot of overhead.
 
 I think Jim meant not 2 pointing to the same head, but 2 pointing into
 the same chain. Say we have table with (id serial, ts timestamp) where
 ts changes at each update and id does not.
 
 So after 3 updates on one page we have one CITC/ITPC head with pointers
 from both indexes and two follow-up tuples with pointers from only ts
 index.
 
 The problem with this setup is, that we can't reuse any of those
 follow-up tuples without index cleanup.

But we still have to think about similar cases (index entries pointing
inside CITC chains), unless we plan to disallow adding indexes to
tables.

Perhaps that case has to simply disable heap tuple reuse until some
event. what would that event be?

Or maybe we should have some bitmap of dirty tuple ids inside each page,
that is tuple ids that have index pointers to them. and then avoid using
these ?

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread Hannu Krosing

Ühel kenal päeval, E, 2006-06-26 kell 11:31, kirjutas Bruce Momjian:
 Hannu Krosing wrote:
pass 3: clean heap based on ctid from pass 1

If yo do it this way, you dont need to invent new data structures to
pass extra info about CITC internals to passes 2 and 3

On more thing - when should free space map be notified about free space
in pages with CITC chains ?
   
   Uh, well, I am thinking we only free CITC space when we are going to use
   it for an UPDATE, rather than free things while doing an operation.  It
   is good to keep the cleanup overhead out of the main path as much as
   possible.
  
  So vacuum should only remove dead CITC chains and leave the ones with
  live tuples to CITC internal use ?
 
 Yes, it has to.  What else would it do?  Add index entries?

No, clean out the dead part. 

But this would probably add the page to FSM - do we want that.

Also, this cleaning should probably be done at pass1, so we dont have to
carry the ctids of tuples which have no index entries around to passes 2
and 3 . This has the downside of possibly writing the heap page twice,
so maybe we dont want it.

  That would also suggest that pages having live CITC chains and less than
  N% of free space should mot be reported to FSM.
 
 Parts of the CITC that are not visible can be used for free space by
 vacuum, but the visible part is left alone.
 
-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread Mark Woodward

 Ãhel kenal pÃ¤eval, E, 2006-06-26 kell 09:10, kirjutas Mark Woodward:
  ÃÅhel kenal pÃÂ¤eval, R, 2006-06-23 kell 17:27, kirjutas Bruce
 Momjian:
  Jonah H. Harris wrote:
   On 6/23/06, Tom Lane [EMAIL PROTECTED] wrote:
What I see in this discussion is a huge amount of the grass must
 be
greener on the other side syndrome, and hardly any recognition
 that
every technique has its downsides and complications.
  
   I'm being totally objective.  I don't think we should abandon
   PostgreSQL's overall design at all, because we do perform INSERTs
 and
   DELETEs much better than most systems.  However, I've looked at
 many
   systems and how they implement UPDATE so that it is a scalable
   operation.  Sure, there are costs and benefits to each
 implementation,
   but I think we have some pretty brilliant people in this community
 and
   can come up with an elegant design for scalable UPDATEs.
 
  I think the UPDATE case is similar to the bitmap index scan or
 perhaps
  bitmap indexes on disk --- there are cases we know can not be handled
  well by our existing code, so we have added (or might add) these
  features to try to address those difficult cases.
 
  Not really. Bitmap index scan and bitmap index are both new additions
  working well with existing framework.
 
  While the problem of slowdown on frequent updates is real, the
 suggested
  fix is just plain wrong, as it is based on someones faulty assumption
 on
  how index lookup works, and very much simplified view of how different
  parts of the system work to implement MVCC.

 Yes, the suggestion was based on MVCC concepts, not a particular
 implementation.

 On the contrary - afaik, it was loosely based on how Oracle does it with
 its rollback segments, only assuming that rollback segments are kept in
 heap and that indexes point only to the oldest row version :p

Well, give me a little more credit than that. Yes, Oracle did play small
part in my thinking, but only in as much as they can't do it, why can't
we? The problem was how to get the most recent tuple to be more efficient
and not have tuples that will never be used impact performance without
excessive locking or moving data around.

It was a just a quick idea. Bruce's solution, you have to admit, is
somewhat similar.


  The original fix he suggests was to that imagined behaviour and thus
  ignored all the real problems of such change.

 The original suggestion, was nothing more than a hypothetical for the
 purpose of discussion.

 The problem was the steady degradation of performance on frequent
 updates.
 That was the point of discussion.  I brought up one possible way to
 start a brain storm. The discussion then morphed into critisizing the
 example and not addressing the problem.

 The problem is heatedly discussed every 3-4 months.

And yet, here we are again.


 Anyway, I think some decent discussion about the problem did happen, and
 that is good.

 Agreed.

 Maybe this _was_ the best way to bring up the discussion again.

I have a way, for better or worse, I guess, of stirring up the pot. :-)

Cry as we may about MySQL, but I have a sneaking suspicion that this is
one of the issues that puts PostgreSQL at a serious disadvantage.

While heavily updated rows are a single type of problem, these days I
think *most* database deployments are as back-ends for web sites. This
problem is *very* critical to that type of application, consequently
probably why PostgreSQL has difficulty in that space.

If PostgreSQL can be made *not* to suffer performance degradation on
heavily updated rows, then that is realy the last issue in the way of it
being a completely creadible medium to large enterprise back end. This
combined with its amazing pragramability, should make it unstoppable.


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread Mark Woodward

 On Fri, Jun 23, 2006 at 06:37:01AM -0400, Mark Woodward wrote:
 While we all know session data is, at best, ephemeral, people still want
 some sort of persistence, thus, you need a database. For mcache I have a
 couple plugins that have a wide range of opitions, from read/write at
 startup and shut down, to full write through cache to a database.

 In general, my clients don't want this, they want the database to store
 their data. When you try to explain to them that a database may not be
 the
 right place to store this data, they ask why, sadly they have little
 hope
 of understanding the nuances and remain unconvinced.

 Have you done any benchmarking between a site using mcache and one not?
 I'll bet there's a huge difference, which translates into hardware $$.
 That's something managers can understand.


Last benchmark I did was on a pure data level, a couple years ago,
PostgreSQL could handle about 800 session transactions a second, but
degraded over time, MCache was up about 7500 session transactions a second
and held steady. I should dig up that code and make it available on my
site.

I have a couple users that tell me that their sites couldn't work without
it, not even with MySQL.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread Martijn van Oosterhout

On Mon, Jun 26, 2006 at 11:29:27AM -0400, Bruce Momjian wrote:
  Yes, and for index_getmulti (which doesn't visit the heap at all) we'll
  have to change all the users of that (which aren't many, I suppose).
  It's probably worth making a utility function to expand them.
  
  I'm still confused where bitmap index scan fit into all of this. Is
  preserving the sequential scan aspect of these a goal with this new
  setup?
 
 No.  I was just pointing out that if you get to the tuple via an index,
 you get handed the head of the SITC via the index tuple, but if you are
 doing a sequential scan, you don't get it, so you have to find it, or
 any other non-visible SITC header.

Ok, but it remains true that you can only have one SITC per tuple. So
if you have 5 indexes on a table, any SITC will only join tuples that
didn't change any values in any of the indexed columns. That's probably
not a big deal though; indexes columns arn't likely to be the ones
changing much.

So, for the bitmap scan you have to make sure that within a single
transaction, scanning multiple indexes will have to provide the same
SITC for each set of tuples, even in the face of concurrent updates.
Otherwise the BitmapAnd will incorrectly throw them out.

That should be doable, if you only change the head of the SITC on
VACUUM. I'm not sure if that's what's being suggested right now.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread Bruce Momjian

Hannu Krosing wrote:
 ?hel kenal p?eval, T, 2006-06-27 kell 10:38, kirjutas Hannu Krosing:
  ?hel kenal p?eval, E, 2006-06-26 kell 23:08, kirjutas Bruce Momjian:
   Jim C. Nasby wrote:
On Mon, Jun 26, 2006 at 02:32:59PM -0400, Bruce Momjian wrote:
 
 It is certainly possible to do what you are suggesting, that is have 
 two
 index entries point to same chain head, and have the index access
 routines figure out if the index qualifications still hold, but that
 seems like a lot of overhead.
  
  I think Jim meant not 2 pointing to the same head, but 2 pointing into
  the same chain. Say we have table with (id serial, ts timestamp) where
  ts changes at each update and id does not.
  
  So after 3 updates on one page we have one CITC/ITPC head with pointers
  from both indexes and two follow-up tuples with pointers from only ts
  index.
  
  The problem with this setup is, that we can't reuse any of those
  follow-up tuples without index cleanup.
 
 But we still have to think about similar cases (index entries pointing
 inside CITC chains), unless we plan to disallow adding indexes to
 tables.

CREATE INDEX has to undo any chains where the new indexed columns change
in the chain, and add index entries to remove the chain.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread Bruce Momjian

PFC wrote:
 
  My idea is that if an UPDATE places the new tuple on the same page as
  the old tuple, it will not create new index entries for any indexes
  where the key doesn't change.
 
   Basically the idea behind preventing index bloat by updates is to have  
 one index tuple point to several actual tuples having the same value.
   

The idea is not to avoid index bloat, but to allow heap reuse, and having
one index entry for multiple versions of an UPDATEd row is merely an
implementation detail.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread Jim C. Nasby

On Mon, Jun 26, 2006 at 11:08:24PM -0400, Bruce Momjian wrote:
 Jim C. Nasby wrote:
  On Mon, Jun 26, 2006 at 02:32:59PM -0400, Bruce Momjian wrote:
   
   It is certainly possible to do what you are suggesting, that is have two
   index entries point to same chain head, and have the index access
   routines figure out if the index qualifications still hold, but that
   seems like a lot of overhead.
   
   Also, once there is only one visible row in the chain, removing old
   index entries seems quite complex because you have to have vacuum keep
   the qualifications of each row to figure out which index tuple is the
   valid one (seems messy).
   
  Perhaps my point got lost... in the case where no index keys change
  during an update, SITC seems superior in every way to my proposal. My
  idea (let's call it Index Tuple Page Consolidation, ITPC) would be
  beneficial to UPDATEs that modify one or more index keys but still put
  the tuple on the same page. Where SITC would be most useful for tables
  that have a very heavy update rate and very few indexes, ITPC would
  benefit tables that have more indexes on them; where presumably it's
  much more likely for UPDATEs to change at least one index key (which
  means SITC goes out the window, if I understand it correctly). If I'm
  missing something and SITC can in fact deal with some index keys
  changing during an UPDATE, then I see no reason for ITPC.
 
 I understood what you had said.  The question is whether we want to get
 that complex with this feature, and if there are enough use cases
 (UPDATE with index keys changing) to warrant it.

Ideas on how to test a table to see how many tuples would fit this
criteria?

Or we could just shelve ITPC as a possibility in the future, after we
see how much the limitations of SITC hurt.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread Bruce Momjian

Martijn van Oosterhout wrote:
-- Start of PGP signed section.
 On Mon, Jun 26, 2006 at 11:29:27AM -0400, Bruce Momjian wrote:
   Yes, and for index_getmulti (which doesn't visit the heap at all) we'll
   have to change all the users of that (which aren't many, I suppose).
   It's probably worth making a utility function to expand them.
   
   I'm still confused where bitmap index scan fit into all of this. Is
   preserving the sequential scan aspect of these a goal with this new
   setup?
  
  No.  I was just pointing out that if you get to the tuple via an index,
  you get handed the head of the SITC via the index tuple, but if you are
  doing a sequential scan, you don't get it, so you have to find it, or
  any other non-visible SITC header.
 
 Ok, but it remains true that you can only have one SITC per tuple. So
 if you have 5 indexes on a table, any SITC will only join tuples that
 didn't change any values in any of the indexed columns. That's probably
 not a big deal though; indexes columns arn't likely to be the ones
 changing much.

Right.

 So, for the bitmap scan you have to make sure that within a single
 transaction, scanning multiple indexes will have to provide the same
 SITC for each set of tuples, even in the face of concurrent updates.
 Otherwise the BitmapAnd will incorrectly throw them out.

The index points to the item id on the page, and that never changes,
even if the head tuple changes later.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread Bruce Momjian

Jim C. Nasby wrote:
   Perhaps my point got lost... in the case where no index keys change
   during an update, SITC seems superior in every way to my proposal. My
   idea (let's call it Index Tuple Page Consolidation, ITPC) would be
   beneficial to UPDATEs that modify one or more index keys but still put
   the tuple on the same page. Where SITC would be most useful for tables
   that have a very heavy update rate and very few indexes, ITPC would
   benefit tables that have more indexes on them; where presumably it's
   much more likely for UPDATEs to change at least one index key (which
   means SITC goes out the window, if I understand it correctly). If I'm
   missing something and SITC can in fact deal with some index keys
   changing during an UPDATE, then I see no reason for ITPC.
  
  I understood what you had said.  The question is whether we want to get
  that complex with this feature, and if there are enough use cases
  (UPDATE with index keys changing) to warrant it.
 
 Ideas on how to test a table to see how many tuples would fit this
 criteria?
 
 Or we could just shelve ITPC as a possibility in the future, after we
 see how much the limitations of SITC hurt.

Probably.  I am not sure even SITC is a win given the complexity it will
add, but I think it is worth trying.  Getting into more complex cases
where chains change indexed values seems like something we could try
later if we have to.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread Jim C. Nasby

On Tue, Jun 27, 2006 at 10:42:54AM +0200, PFC wrote:
   Also, I insist (again) that there is a lot to gain by using a bit of 
 compression on the data pages, even if it's very simple compression like  
 storing the new version of a row as a difference from the previous version  
 (ie. only store the columns that changed).
   I think DB2 stores the latest version entirely, and stores the 
   previous  versions as a delta. This is more efficient.
 
This would only help on tables that:

have many columns[1]
are frequently updated
the updates normally touch few columns

[1] I'm assuming that un-changed toasted fields keep the same pointer

I'm doubtful that that case is common enough to warrant the amount of
work that would be involved in doing this. I think it might be useful to
consider ways to make vertical partitioning easier, since that's the
common means to reduce the impact of these scenarios.

   In the case of tables containing TEXT values, these could also get  
 TOASTed. When an update does not modify the TOASTed columns, it would be  
 nice to simply be able to keep the reference to the TOASTed data instead  
 of decompressing it and recompressing it. Or is it already the case ?

Hopefully it is, but I'm not sure... something that would be good is a
means to force fields to be toasted sooner than when the tuple is bigger
than 2k, because that'd be a very easy way to get gains from vertical
partitioning.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread Greg Stark


Bruce Momjian [EMAIL PROTECTED] writes:

 PFC wrote:
  
   My idea is that if an UPDATE places the new tuple on the same page as
   the old tuple, it will not create new index entries for any indexes
   where the key doesn't change.
  
  Basically the idea behind preventing index bloat by updates is to have  
  one index tuple point to several actual tuples having the same value.
  
 
 The idea is not to avoid index bloat, but to allow heap reuse, and having
 one index entry for multiple versions of an UPDATEd row is merely an
 implementation detail.

It sort of sounds like you're describing a whole new index type that stores
only the page, not the precise record of any tuple it indexes. If your table
has only such indexes then you never need to worry about updating indexes if
your new tuple version goes on the same page as the old one.

It's an interesting thought experiment. It might trade off a lot of work in
index maintenance as well as saving space in the index for a lot of additional
work performing index scans. There can easily be enough tuples on a page to
make scanning the entire page pretty costly.


-- 
greg


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-27 Thread Bruce Momjian

Greg Stark wrote:
 
 Bruce Momjian [EMAIL PROTECTED] writes:
 
  PFC wrote:
   
My idea is that if an UPDATE places the new tuple on the same page as
the old tuple, it will not create new index entries for any indexes
where the key doesn't change.
   
 Basically the idea behind preventing index bloat by updates is to have  
   one index tuple point to several actual tuples having the same value.
 
  
  The idea is not to avoid index bloat, but to allow heap reuse, and having
  one index entry for multiple versions of an UPDATEd row is merely an
  implementation detail.
 
 It sort of sounds like you're describing a whole new index type that stores
 only the page, not the precise record of any tuple it indexes. If your table

Background, indexes point to page item pointers, not to actual offsets
in the page.  This is how vacuum can move around tuples without modifying the
indexes.  The index points to a page item pointer that is a chain of
tuples with the same indexed columns.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Heikki Linnakangas


On Mon, 26 Jun 2006, Jan Wieck wrote:


On 6/25/2006 10:12 PM, Bruce Momjian wrote:

When you are using the update chaining, you can't mark that index row as
dead because it actually points to more than one row on the page, some
are non-visible, some are visible.


Back up the truck ... you mean in the current code base we have heap tuples 
that are visible in index scans because of heap tuple chaining but without 
index tuples pointing directly at them?


In current code, no. Every heap tuple has corresponding index tuples.

In Bruce's proposal, yes. You would have heap tuples without index tuples 
pointing directly at them. An index scan could only find them by following 
t_ctid chains.


Correct me if I understood you incorrectly, Bruce.

- Heikki

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian

Jan Wieck wrote:
 On 6/25/2006 10:12 PM, Bruce Momjian wrote:
  When you are using the update chaining, you can't mark that index row as
  dead because it actually points to more than one row on the page, some
  are non-visible, some are visible.
 
 Back up the truck ... you mean in the current code base we have heap 
 tuples that are visible in index scans because of heap tuple chaining 
 but without index tuples pointing directly at them?

No, this would be new code added.  The basic idea is with the new
same-page update chaining, a single index points to the head of a chain,
not to a single tuple, so you can't mark a tuple as pointing to dead
rows if any of the tuples in the chain are visible.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian

Alvaro Herrera wrote:
 Jan Wieck wrote:
  On 6/25/2006 10:12 PM, Bruce Momjian wrote:
  When you are using the update chaining, you can't mark that index row as
  dead because it actually points to more than one row on the page, some
  are non-visible, some are visible.
  
  Back up the truck ... you mean in the current code base we have heap 
  tuples that are visible in index scans because of heap tuple chaining 
  but without index tuples pointing directly at them?
 
 I don't know where this idea came from, but it's not true.  All heap
 tuples, dead or otherwise, have index entries.  Unless the idea is to
 extend update chaining to mean something different from the current
 meaning.

It does mean something different.  Single-Index-Tuple Chain (CITC) is a
special type of update chaining where the updates are all on the same
row, and a single index entry points to the entire chain.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Zeugswetter Andreas DCP SD


  On 6/25/2006 10:12 PM, Bruce Momjian wrote:
  When you are using the update chaining, you can't mark that index
row 
  as dead because it actually points to more than one row on the
page, 
  some are non-visible, some are visible.
  
  Back up the truck ... you mean in the current code base we have heap

  tuples that are visible in index scans because of heap tuple
chaining 
  but without index tuples pointing directly at them?
 
 I don't know where this idea came from, but it's not true.  
 All heap tuples, dead or otherwise, have index entries.  

When using CITC you would be reusing the index tuples from the current
heap tuple, so you can only reuse free space or a dead member of a CITC
chain.
You cannot reuse a dead tuple not member of a CITC chain because that
has separate
(invalid) index tuples pointing at it.

Part of the trick was moving slots (==ctid) around, so I still do not
really see how
you can represent the CITC chain as part of the update chain. 
Unless you intend to break dead parts of the update chain ? Maybe that
is ok ?

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian

Zeugswetter Andreas DCP SD wrote:
 
   On 6/25/2006 10:12 PM, Bruce Momjian wrote:
   When you are using the update chaining, you can't mark that index
 row 
   as dead because it actually points to more than one row on the
 page, 
   some are non-visible, some are visible.
   
   Back up the truck ... you mean in the current code base we have heap
 
   tuples that are visible in index scans because of heap tuple
 chaining 
   but without index tuples pointing directly at them?
  
  I don't know where this idea came from, but it's not true.  
  All heap tuples, dead or otherwise, have index entries.  
 
 When using CITC you would be reusing the index tuples from the current
 heap tuple, so you can only reuse free space or a dead member of a CITC
 chain.
 You cannot reuse a dead tuple not member of a CITC chain because that
 has separate
 (invalid) index tuples pointing at it.
 
 Part of the trick was moving slots (==ctid) around, so I still do not
 really see how
 you can represent the CITC chain as part of the update chain. 
 Unless you intend to break dead parts of the update chain ? Maybe that
 is ok ?

Yes, you have to remove dead (non-visible) parts of the update chain.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian

Heikki Linnakangas wrote:
 On Mon, 26 Jun 2006, Jan Wieck wrote:
 
  On 6/25/2006 10:12 PM, Bruce Momjian wrote:
  When you are using the update chaining, you can't mark that index row as
  dead because it actually points to more than one row on the page, some
  are non-visible, some are visible.
 
  Back up the truck ... you mean in the current code base we have heap tuples 
  that are visible in index scans because of heap tuple chaining but without 
  index tuples pointing directly at them?
 
 In current code, no. Every heap tuple has corresponding index tuples.
 
 In Bruce's proposal, yes. You would have heap tuples without index tuples 
 pointing directly at them. An index scan could only find them by following 
 t_ctid chains.
 
 Correct me if I understood you incorrectly, Bruce.

Correct!  We use the same pointers used by normal UPDATEs, except we set
a bit on the old tuple indicating it is a single-index tuple, and we
don't create index entries for the new tuple.  Index scan routines will
need to be taught about the new chains, but because only one tuple in
the chain is visible to a single backend, the callers should not need to
be modified.

(All tuples in the chain have page item ids.  It is just that when they
are freed, the pointers are adjusted so the index points to the chain
head.)

One problem is that once you find the row you want to update, it is
difficult to see if it is part of a single-index chain because there are
only forward pointers, so I think we have to scan the entire page to
find the chains.  To reduce that overhead, I am thinking we free the
non-visible tuples only when the page has no more free space.  This
allows us to free not just our own non-visible tuples, but perhaps
others as well.

We have never been able to free non-visible tuples before because of
index cleanup overhead, but with single-index chains, we can, and reduce
the requirements of vacuum for many workloads.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Mark Woodward

 Ãhel kenal pÃ¤eval, R, 2006-06-23 kell 17:27, kirjutas Bruce Momjian:
 Jonah H. Harris wrote:
  On 6/23/06, Tom Lane [EMAIL PROTECTED] wrote:
   What I see in this discussion is a huge amount of the grass must be
   greener on the other side syndrome, and hardly any recognition that
   every technique has its downsides and complications.
 
  I'm being totally objective.  I don't think we should abandon
  PostgreSQL's overall design at all, because we do perform INSERTs and
  DELETEs much better than most systems.  However, I've looked at many
  systems and how they implement UPDATE so that it is a scalable
  operation.  Sure, there are costs and benefits to each implementation,
  but I think we have some pretty brilliant people in this community and
  can come up with an elegant design for scalable UPDATEs.

 I think the UPDATE case is similar to the bitmap index scan or perhaps
 bitmap indexes on disk --- there are cases we know can not be handled
 well by our existing code, so we have added (or might add) these
 features to try to address those difficult cases.

 Not really. Bitmap index scan and bitmap index are both new additions
 working well with existing framework.

 While the problem of slowdown on frequent updates is real, the suggested
 fix is just plain wrong, as it is based on someones faulty assumption on
 how index lookup works, and very much simplified view of how different
 parts of the system work to implement MVCC.

Yes, the suggestion was based on MVCC concepts, not a particular
implementation.

 The original fix he suggests was to that imagined behaviour and thus
 ignored all the real problems of such change.

The original suggestion, was nothing more than a hypothetical for the
purpose of discussion.

The problem was the steady degradation of performance on frequent updates.
That was the point of discussion.  I brought up one possible way to
start a brain storm. The discussion then morphed into critisizing the
example and not addressing the problem.

Anyway, I think some decent discussion about the problem did happen, and
that is good.



---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Martijn van Oosterhout

On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
 Correct!  We use the same pointers used by normal UPDATEs, except we set
 a bit on the old tuple indicating it is a single-index tuple, and we
 don't create index entries for the new tuple.  Index scan routines will
 need to be taught about the new chains, but because only one tuple in
 the chain is visible to a single backend, the callers should not need to
 be modified.

I suppose we would also change the index_getmulti() function to return
a set of ctids plus flags so the caller knows to follow the chains,
right? And for bitmap index scans you would only remember the page in
the case of such a tuple, since you can't be sure the exact ctid you've
got is the one you want.

Seems doable.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Mark Woodward

 Heikki Linnakangas wrote:
 On Mon, 26 Jun 2006, Jan Wieck wrote:

  On 6/25/2006 10:12 PM, Bruce Momjian wrote:
  When you are using the update chaining, you can't mark that index row
 as
  dead because it actually points to more than one row on the page,
 some
  are non-visible, some are visible.
 
  Back up the truck ... you mean in the current code base we have heap
 tuples
  that are visible in index scans because of heap tuple chaining but
 without
  index tuples pointing directly at them?

 In current code, no. Every heap tuple has corresponding index tuples.

 In Bruce's proposal, yes. You would have heap tuples without index
 tuples
 pointing directly at them. An index scan could only find them by
 following
 t_ctid chains.

 Correct me if I understood you incorrectly, Bruce.

 Correct!  We use the same pointers used by normal UPDATEs, except we set
 a bit on the old tuple indicating it is a single-index tuple, and we
 don't create index entries for the new tuple.  Index scan routines will
 need to be taught about the new chains, but because only one tuple in
 the chain is visible to a single backend, the callers should not need to
 be modified.

 (All tuples in the chain have page item ids.  It is just that when they
 are freed, the pointers are adjusted so the index points to the chain
 head.)

 One problem is that once you find the row you want to update, it is
 difficult to see if it is part of a single-index chain because there are
 only forward pointers, so I think we have to scan the entire page to
 find the chains.  To reduce that overhead, I am thinking we free the
 non-visible tuples only when the page has no more free space.  This
 allows us to free not just our own non-visible tuples, but perhaps
 others as well.

This sort of incorporates the vacuum row I suggested.


 We have never been able to free non-visible tuples before because of
 index cleanup overhead, but with single-index chains, we can, and reduce
 the requirements of vacuum for many workloads.


This is great!

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian

Martijn van Oosterhout wrote:
-- Start of PGP signed section.
 On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
  Correct!  We use the same pointers used by normal UPDATEs, except we set
  a bit on the old tuple indicating it is a single-index tuple, and we
  don't create index entries for the new tuple.  Index scan routines will
  need to be taught about the new chains, but because only one tuple in
  the chain is visible to a single backend, the callers should not need to
  be modified.
 
 I suppose we would also change the index_getmulti() function to return
 a set of ctids plus flags so the caller knows to follow the chains,
 right? And for bitmap index scans you would only remember the page in
 the case of such a tuple, since you can't be sure the exact ctid you've
 got is the one you want.
 
 Seems doable.

Yes, it just is an issue of where you want to add the complexity ---
scan entire page when no free space, or only an UPDATE.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian

Bruce Momjian wrote:
 Martijn van Oosterhout wrote:
 -- Start of PGP signed section.
  On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
   Correct!  We use the same pointers used by normal UPDATEs, except we set
   a bit on the old tuple indicating it is a single-index tuple, and we
   don't create index entries for the new tuple.  Index scan routines will
   need to be taught about the new chains, but because only one tuple in
   the chain is visible to a single backend, the callers should not need to
   be modified.
  
  I suppose we would also change the index_getmulti() function to return
  a set of ctids plus flags so the caller knows to follow the chains,
  right? And for bitmap index scans you would only remember the page in
  the case of such a tuple, since you can't be sure the exact ctid you've
  got is the one you want.
  
  Seems doable.
 
 Yes, it just is an issue of where you want to add the complexity ---
 scan entire page when no free space, or only an UPDATE.

Oh, and because you want to do this when doing an update via sequential
scan as well as an index scan, I am thinking you might need to do the
per-page method because you might not have even seen the head of the
chain yet.  With an index scan, finding the head is easy, but for a
sequential scan, it seems more difficult, and we don't have any free
space in the tail of the chain to maintain a pointer to the head.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian

Bruce Momjian wrote:
 Bruce Momjian wrote:
  Martijn van Oosterhout wrote:
  -- Start of PGP signed section.
   On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
Correct!  We use the same pointers used by normal UPDATEs, except we set
a bit on the old tuple indicating it is a single-index tuple, and we
don't create index entries for the new tuple.  Index scan routines will
need to be taught about the new chains, but because only one tuple in
the chain is visible to a single backend, the callers should not need to
be modified.
   
   I suppose we would also change the index_getmulti() function to return
   a set of ctids plus flags so the caller knows to follow the chains,
   right? And for bitmap index scans you would only remember the page in
   the case of such a tuple, since you can't be sure the exact ctid you've
   got is the one you want.
   
   Seems doable.
  
  Yes, it just is an issue of where you want to add the complexity ---
  scan entire page when no free space, or only an UPDATE.
 
 Oh, and because you want to do this when doing an update via sequential
 scan as well as an index scan, I am thinking you might need to do the
 per-page method because you might not have even seen the head of the
 chain yet.  With an index scan, finding the head is easy, but for a
 sequential scan, it seems more difficult, and we don't have any free
 space in the tail of the chain to maintain a pointer to the head.

Thinking some more, there will need to be a bit to uniquely identify the
head of a CITC.  With that, you could just scan the page tuples looking
for CITC heads, and checking those to see if they are not visible, and
re-using them, rather than doing a full page reorganization where all
free spaces is collected in the middle of the page.  That should limit
the overhead of reuse.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Hannu Krosing

Ühel kenal päeval, E, 2006-06-26 kell 14:56, kirjutas Martijn van
Oosterhout:
 On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
  Correct!  We use the same pointers used by normal UPDATEs, except we set
  a bit on the old tuple indicating it is a single-index tuple, and we
  don't create index entries for the new tuple.  Index scan routines will
  need to be taught about the new chains, but because only one tuple in
  the chain is visible to a single backend, the callers should not need to
  be modified.
 
 I suppose we would also change the index_getmulti() function to return
 a set of ctids plus flags so the caller knows to follow the chains,
 right? 

It is probably better to always return the pointer to the head of CITC
chain (the one an index points to) and do extra visibility checks and
chain-following on each access. This would keep the change internal to
tuple fetching functions.

 And for bitmap index scans you would only remember the page in
 the case of such a tuple, since you can't be sure the exact ctid you've
 got is the one you want.

no, you should only use the pointer to CITC head outside tuple access
funtions. And this pointer to CITC head is what is always passed to
those access functions/macros.

The VACUUM would run its passes thus:

pass 1: run over heap, collect pointers to single dead tuples, and fully
dead CITC chains (fully dead = no live tuples on this page). Clean up
old tuples from CITC chains and move live tuples around so that CITC
points to oldest possibly visible (not vacuumed) tuple. Doing this there
frees us from need to collect a separate set of pointers for those. Or
have you planned that old tuples from CITC chains are collected on the
go/as needed ? Of course we could do both.

pass 2: clean indexes based on ctid from pass 1

pass 3: clean heap based on ctid from pass 1

If yo do it this way, you dont need to invent new data structures to
pass extra info about CITC internals to passes 2 and 3

On more thing - when should free space map be notified about free space
in pages with CITC chains ?

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian

Hannu Krosing wrote:
 ?hel kenal p?eval, E, 2006-06-26 kell 14:56, kirjutas Martijn van
 Oosterhout:
  On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
   Correct!  We use the same pointers used by normal UPDATEs, except we set
   a bit on the old tuple indicating it is a single-index tuple, and we
   don't create index entries for the new tuple.  Index scan routines will
   need to be taught about the new chains, but because only one tuple in
   the chain is visible to a single backend, the callers should not need to
   be modified.
  
  I suppose we would also change the index_getmulti() function to return
  a set of ctids plus flags so the caller knows to follow the chains,
  right? 
 
 It is probably better to always return the pointer to the head of CITC
 chain (the one an index points to) and do extra visibility checks and
 chain-following on each access. This would keep the change internal to
 tuple fetching functions.

So index_getnext() traverses the chain and returns one member per call. 
Makes sense.  Just realize you are in a single index entry returning
multiple tuples.  We will need some record keeping to track that.

  And for bitmap index scans you would only remember the page in
  the case of such a tuple, since you can't be sure the exact ctid you've
  got is the one you want.
 
 no, you should only use the pointer to CITC head outside tuple access
 funtions. And this pointer to CITC head is what is always passed to
 those access functions/macros.
 
 The VACUUM would run its passes thus:
 
 pass 1: run over heap, collect pointers to single dead tuples, and fully
 dead CITC chains (fully dead = no live tuples on this page). Clean up
 old tuples from CITC chains and move live tuples around so that CITC
 points to oldest possibly visible (not vacuumed) tuple. Doing this there
 frees us from need to collect a separate set of pointers for those. Or
 have you planned that old tuples from CITC chains are collected on the
 go/as needed ? Of course we could do both.

Non-visible CITC members should be freed during an UPDATE on the same
page, so vacuum doesn't have to be involved.

 pass 2: clean indexes based on ctid from pass 1
 
 pass 3: clean heap based on ctid from pass 1
 
 If yo do it this way, you dont need to invent new data structures to
 pass extra info about CITC internals to passes 2 and 3
 
 On more thing - when should free space map be notified about free space
 in pages with CITC chains ?

Uh, well, I am thinking we only free CITC space when we are going to use
it for an UPDATE, rather than free things while doing an operation.  It
is good to keep the cleanup overhead out of the main path as much as
possible.

Also, seems I can't spell algorithms very well:

  Definition:  Single-Index-Tuple Chain (SITC)
 -
Thinking of vacuum, right now it does these cleanups:

o  non-visible UPDATEs on the same page with no key changes
o  non-visible UPDATEs on the same page with key changes
o  non-visible UPDATEs on different pages
o  DELETEs
o  aborted transactions

The big question is what percentage of dead space is the first one?  My
guess is 65%.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Hannu Krosing

Ühel kenal päeval, E, 2006-06-26 kell 09:10, kirjutas Mark Woodward:
  Ãœhel kenal pÃ¤eval, R, 2006-06-23 kell 17:27, kirjutas Bruce Momjian:
  Jonah H. Harris wrote:
   On 6/23/06, Tom Lane [EMAIL PROTECTED] wrote:
What I see in this discussion is a huge amount of the grass must be
greener on the other side syndrome, and hardly any recognition that
every technique has its downsides and complications.
  
   I'm being totally objective.  I don't think we should abandon
   PostgreSQL's overall design at all, because we do perform INSERTs and
   DELETEs much better than most systems.  However, I've looked at many
   systems and how they implement UPDATE so that it is a scalable
   operation.  Sure, there are costs and benefits to each implementation,
   but I think we have some pretty brilliant people in this community and
   can come up with an elegant design for scalable UPDATEs.
 
  I think the UPDATE case is similar to the bitmap index scan or perhaps
  bitmap indexes on disk --- there are cases we know can not be handled
  well by our existing code, so we have added (or might add) these
  features to try to address those difficult cases.
 
  Not really. Bitmap index scan and bitmap index are both new additions
  working well with existing framework.
 
  While the problem of slowdown on frequent updates is real, the suggested
  fix is just plain wrong, as it is based on someones faulty assumption on
  how index lookup works, and very much simplified view of how different
  parts of the system work to implement MVCC.
 
 Yes, the suggestion was based on MVCC concepts, not a particular
 implementation.

On the contrary - afaik, it was loosely based on how Oracle does it with
its rollback segments, only assuming that rollback segments are kept in
heap and that indexes point only to the oldest row version :p

  The original fix he suggests was to that imagined behaviour and thus
  ignored all the real problems of such change.
 
 The original suggestion, was nothing more than a hypothetical for the
 purpose of discussion.
 
 The problem was the steady degradation of performance on frequent updates.
 That was the point of discussion.  I brought up one possible way to
 start a brain storm. The discussion then morphed into critisizing the
 example and not addressing the problem.

The problem is heatedly discussed every 3-4 months.

 Anyway, I think some decent discussion about the problem did happen, and
 that is good.

Agreed. 

Maybe this _was_ the best way to bring up the discussion again.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Martijn van Oosterhout

On Mon, Jun 26, 2006 at 10:50:26AM -0400, Bruce Momjian wrote:
   I suppose we would also change the index_getmulti() function to return
   a set of ctids plus flags so the caller knows to follow the chains,
   right? 
  
  It is probably better to always return the pointer to the head of CITC
  chain (the one an index points to) and do extra visibility checks and
  chain-following on each access. This would keep the change internal to
  tuple fetching functions.
 
 So index_getnext() traverses the chain and returns one member per call. 
 Makes sense.  Just realize you are in a single index entry returning
 multiple tuples.  We will need some record keeping to track that.

Yes, and for index_getmulti (which doesn't visit the heap at all) we'll
have to change all the users of that (which aren't many, I suppose).
It's probably worth making a utility function to expand them.

I'm still confused where bitmap index scan fit into all of this. Is
preserving the sequential scan aspect of these a goal with this new
setup?

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Hannu Krosing

Ühel kenal päeval, E, 2006-06-26 kell 10:50, kirjutas Bruce Momjian:
 Hannu Krosing wrote:
  ?hel kenal p?eval, E, 2006-06-26 kell 14:56, kirjutas Martijn van
  Oosterhout:
   On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
Correct!  We use the same pointers used by normal UPDATEs, except we set
a bit on the old tuple indicating it is a single-index tuple, and we
don't create index entries for the new tuple.  Index scan routines will
need to be taught about the new chains, but because only one tuple in
the chain is visible to a single backend, the callers should not need to
be modified.
   
   I suppose we would also change the index_getmulti() function to return
   a set of ctids plus flags so the caller knows to follow the chains,
   right? 
  
  It is probably better to always return the pointer to the head of CITC
  chain (the one an index points to) and do extra visibility checks and
  chain-following on each access. This would keep the change internal to
  tuple fetching functions.
 
 So index_getnext() traverses the chain and returns one member per call. 
 Makes sense.  Just realize you are in a single index entry returning
 multiple tuples.  We will need some record keeping to track that.

Maybe we need to push visibility checks further down, so that
index_getnext() returns only the one heap row that is visible.

   And for bitmap index scans you would only remember the page in
   the case of such a tuple, since you can't be sure the exact ctid you've
   got is the one you want.
  
  no, you should only use the pointer to CITC head outside tuple access
  funtions. And this pointer to CITC head is what is always passed to
  those access functions/macros.
  
  The VACUUM would run its passes thus:
  
  pass 1: run over heap, collect pointers to single dead tuples, and fully
  dead CITC chains (fully dead = no live tuples on this page). Clean up
  old tuples from CITC chains and move live tuples around so that CITC
  points to oldest possibly visible (not vacuumed) tuple. Doing this there
  frees us from need to collect a separate set of pointers for those. Or
  have you planned that old tuples from CITC chains are collected on the
  go/as needed ? Of course we could do both.
 
 Non-visible CITC members should be freed during an UPDATE on the same
 page, so vacuum doesn't have to be involved.

Ok.

  pass 2: clean indexes based on ctid from pass 1
  
  pass 3: clean heap based on ctid from pass 1
  
  If yo do it this way, you dont need to invent new data structures to
  pass extra info about CITC internals to passes 2 and 3
  
  On more thing - when should free space map be notified about free space
  in pages with CITC chains ?
 
 Uh, well, I am thinking we only free CITC space when we are going to use
 it for an UPDATE, rather than free things while doing an operation.  It
 is good to keep the cleanup overhead out of the main path as much as
 possible.

So vacuum should only remove dead CITC chains and leave the ones with
live tuples to CITC internal use ?

That would also suggest that pages having live CITC chains and less than
N% of free space should mot be reported to FSM.

 Also, seems I can't spell algorithms very well:
 
 Definition:  Single-Index-Tuple Chain (SITC)
  -
 Thinking of vacuum, right now it does these cleanups:
 
   o  non-visible UPDATEs on the same page with no key changes
   o  non-visible UPDATEs on the same page with key changes
   o  non-visible UPDATEs on different pages
   o  DELETEs
   o  aborted transactions
 
 The big question is what percentage of dead space is the first one?  My
 guess is 65%.

Can be from 0% to 99.9%, very much dependent on application.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com

NOTICE: This communication contains privileged or other confidential
information. If you have received it in error, please advise the sender
by reply email and immediately delete the message and any attachments
without copying or disclosing the contents.


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Hannu Krosing

Ühel kenal päeval, E, 2006-06-26 kell 16:58, kirjutas Martijn van
Oosterhout:
 On Mon, Jun 26, 2006 at 10:50:26AM -0400, Bruce Momjian wrote:
I suppose we would also change the index_getmulti() function to return
a set of ctids plus flags so the caller knows to follow the chains,
right? 
   
   It is probably better to always return the pointer to the head of CITC
   chain (the one an index points to) and do extra visibility checks and
   chain-following on each access. This would keep the change internal to
   tuple fetching functions.
  
  So index_getnext() traverses the chain and returns one member per call. 
  Makes sense.  Just realize you are in a single index entry returning
  multiple tuples.  We will need some record keeping to track that.
 
 Yes, and for index_getmulti (which doesn't visit the heap at all) we'll
 have to change all the users of that (which aren't many, I suppose).
 It's probably worth making a utility function to expand them.
 
 I'm still confused where bitmap index scan fit into all of this. Is
 preserving the sequential scan aspect of these a goal with this new
 setup?

Bitmap index scan does not have to change much - only the function that
gets tuple by its ctid must be able to trace forward chains within the
page.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian

Martijn van Oosterhout wrote:
-- Start of PGP signed section.
 On Mon, Jun 26, 2006 at 10:50:26AM -0400, Bruce Momjian wrote:
I suppose we would also change the index_getmulti() function to return
a set of ctids plus flags so the caller knows to follow the chains,
right? 
   
   It is probably better to always return the pointer to the head of CITC
   chain (the one an index points to) and do extra visibility checks and
   chain-following on each access. This would keep the change internal to
   tuple fetching functions.
  
  So index_getnext() traverses the chain and returns one member per call. 
  Makes sense.  Just realize you are in a single index entry returning
  multiple tuples.  We will need some record keeping to track that.
 
 Yes, and for index_getmulti (which doesn't visit the heap at all) we'll
 have to change all the users of that (which aren't many, I suppose).
 It's probably worth making a utility function to expand them.
 
 I'm still confused where bitmap index scan fit into all of this. Is
 preserving the sequential scan aspect of these a goal with this new
 setup?

No.  I was just pointing out that if you get to the tuple via an index,
you get handed the head of the SITC via the index tuple, but if you are
doing a sequential scan, you don't get it, so you have to find it, or
any other non-visible SITC header.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian

Hannu Krosing wrote:
   pass 3: clean heap based on ctid from pass 1
   
   If yo do it this way, you dont need to invent new data structures to
   pass extra info about CITC internals to passes 2 and 3
   
   On more thing - when should free space map be notified about free space
   in pages with CITC chains ?
  
  Uh, well, I am thinking we only free CITC space when we are going to use
  it for an UPDATE, rather than free things while doing an operation.  It
  is good to keep the cleanup overhead out of the main path as much as
  possible.
 
 So vacuum should only remove dead CITC chains and leave the ones with
 live tuples to CITC internal use ?

Yes, it has to.  What else would it do?  Add index entries?

 That would also suggest that pages having live CITC chains and less than
 N% of free space should mot be reported to FSM.

Parts of the CITC that are not visible can be used for free space by
vacuum, but the visible part is left alone.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] vacuum, performance, and MVCC, and compression

2006-06-26 Thread PFC



There were some talks lately about compression.
	With a bit of lateral thinking I guess this can be used to contain the  
bloat induced by updates.

Of course this is just my hypothesis.

Compression in indexes :

	Instead of storing (value, tuple identifier) keys in the indexes, store  
(value, [tuple identifier list]) ; ie. all tuples which have the same  
indexed value are referenced by the same index tuple, instead of having  
one index tuple per actual tuple.
	The length of the list would of course be limited to the space actually  
available on an index page ; if many rows have the same indexed value,  
several index tuples would be generated so that index tuples fit on index  
pages.
	This would make the index smaller (more likely to fit in RAM) at the cost  
of a little CPU overhead for index modifications, but would make the index  
scans actually use less CPU (no need to compare the indexed value on each  
table tuple).


Compression in data pages :

	The article that circulated on the list suggested several types of  
compression, offset, dictionary, etc. The point is that several row  
versions on the same page can be compressed well because these versions  
probably have similar column values.


Just a thought...

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC, and compression

2006-06-26 Thread Bruce Momjian

PFC wrote:
 
   There were some talks lately about compression.
   With a bit of lateral thinking I guess this can be used to contain the  
 bloat induced by updates.
   Of course this is just my hypothesis.
 
   Compression in indexes :
 
   Instead of storing (value, tuple identifier) keys in the indexes, store 
  
 (value, [tuple identifier list]) ; ie. all tuples which have the same  
 indexed value are referenced by the same index tuple, instead of having  
 one index tuple per actual tuple.
   The length of the list would of course be limited to the space actually 
  
 available on an index page ; if many rows have the same indexed value,  
 several index tuples would be generated so that index tuples fit on index  
 pages.
   This would make the index smaller (more likely to fit in RAM) at the 
 cost  
 of a little CPU overhead for index modifications, but would make the index  
 scans actually use less CPU (no need to compare the indexed value on each  
 table tuple).

What about increasing the size of an existing index entry?  Can that be
done easily when a new row is added?

   Compression in data pages :
 
   The article that circulated on the list suggested several types of  
 compression, offset, dictionary, etc. The point is that several row  
 versions on the same page can be compressed well because these versions  
 probably have similar column values.
 
   Just a thought...

I would be worried about the overhead of doing that on compression and
decompression.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] vacuum, performance, and MVCC, and compression

2006-06-26 Thread PFC




What about increasing the size of an existing index entry?  Can that be
done easily when a new row is added?


I'd say it looks pretty much like inserting a new index tuple...
Say value is the indexed column.

Find first page in the index featuring value.
1   If there is space on the page,
		add the tuple id to the list of the corresponding index entry (just like  
creating a new index tuple, but uses less space).

else
look at next page.
If next page has an index tuple with the same indexed value,
goto 1
else
insert new page and create an index tuple on it


I would be worried about the overhead of doing that on compression and
decompression.


	The compression methods mentioned in the article which was passed on the  
list seemed pretty fast. From IO-limited, the test database became  
CPU-limited (and a lot faster).



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Zeugswetter Andreas DCP SD


  head of the chain yet.  With an index scan, finding the head is
easy, 
  but for a sequential scan, it seems more difficult, and we don't
have 
  any free space in the tail of the chain to maintain a pointer to the
head.
 
 Thinking some more, there will need to be a bit to uniquely 
 identify the head of a CITC.

I don't think so. It would probably be sufficient to impose an order on
the CITC.
e.g. the oldest tuple version in the CITC is the head. 
(An idea just in case we can't spare a bit :-) 

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Jim C. Nasby

On Sun, Jun 25, 2006 at 09:13:48PM +0300, Heikki Linnakangas wrote:
 If you can't expire the old row because one of the indexed columns was
 modified, I see no reason to try to reduce the additional index entries.
 
 It won't enable early expiration, but it means less work to do on update. 
 If there's a lot of indexes, not having to add so many index tuples can be 
 a significant saving.

While catching up on this thread, the following idea came to me that I
think would allow for not updating an index on an UPDATE if it's key
doesn't change. If I understand Bruce's SITC proposal correctly, this
would differ in that SITC requires that no index keys change.

My idea is that if an UPDATE places the new tuple on the same page as
the old tuple, it will not create new index entries for any indexes
where the key doesn't change. This means that when fetching tuples from
the index, ctid would have to be followed until you found the version
you wanted OR you found the first ctid that pointed to a different page
(because that tuple will have it's own index entry) OR you found a tuple
with a different value for the key of the index you're using (because
it'd be invalid, and there'd be a different index entry for it). I
believe that the behavior of the index hint bits would also have to
change somewhat, as each index entry would now essentially be pointing
at all the tuples in the ctid chain that exist on a page, not just
single tuple.

In the case of an UPDATE that needs to put the new tuple on a different
page, our current behavior would be used. This means that the hint bits
would still be useful in limiting the number of heap pages you hit. I
also believe this means that we wouldn't suffer any additional overhead
from our current code when there isn't much free space on pages.

Since SITC allows for in-page space reuse without vacuuming only when no
index keys change, it's most useful for very heavily updated tables such
as session handlers or queue tables, because those tables typically have
very few indexes, so it's pretty unlikely that an index key will change.
For more general-purpose tables that have more indexes but still see a
fair number of updates to a subset of rows, not having to update every
index would likely be a win. I also don't see any reason why both
options couldn't be used together.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian


It is certainly possible to do what you are suggesting, that is have two
index entries point to same chain head, and have the index access
routines figure out if the index qualifications still hold, but that
seems like a lot of overhead.

Also, once there is only one visible row in the chain, removing old
index entries seems quite complex because you have to have vacuum keep
the qualifications of each row to figure out which index tuple is the
valid one (seems messy).

---

Jim C. Nasby wrote:
 On Sun, Jun 25, 2006 at 09:13:48PM +0300, Heikki Linnakangas wrote:
  If you can't expire the old row because one of the indexed columns was
  modified, I see no reason to try to reduce the additional index entries.
  
  It won't enable early expiration, but it means less work to do on update. 
  If there's a lot of indexes, not having to add so many index tuples can be 
  a significant saving.
 
 While catching up on this thread, the following idea came to me that I
 think would allow for not updating an index on an UPDATE if it's key
 doesn't change. If I understand Bruce's SITC proposal correctly, this
 would differ in that SITC requires that no index keys change.
 
 My idea is that if an UPDATE places the new tuple on the same page as
 the old tuple, it will not create new index entries for any indexes
 where the key doesn't change. This means that when fetching tuples from
 the index, ctid would have to be followed until you found the version
 you wanted OR you found the first ctid that pointed to a different page
 (because that tuple will have it's own index entry) OR you found a tuple
 with a different value for the key of the index you're using (because
 it'd be invalid, and there'd be a different index entry for it). I
 believe that the behavior of the index hint bits would also have to
 change somewhat, as each index entry would now essentially be pointing
 at all the tuples in the ctid chain that exist on a page, not just
 single tuple.
 
 In the case of an UPDATE that needs to put the new tuple on a different
 page, our current behavior would be used. This means that the hint bits
 would still be useful in limiting the number of heap pages you hit. I
 also believe this means that we wouldn't suffer any additional overhead
 from our current code when there isn't much free space on pages.
 
 Since SITC allows for in-page space reuse without vacuuming only when no
 index keys change, it's most useful for very heavily updated tables such
 as session handlers or queue tables, because those tables typically have
 very few indexes, so it's pretty unlikely that an index key will change.
 For more general-purpose tables that have more indexes but still see a
 fair number of updates to a subset of rows, not having to update every
 index would likely be a win. I also don't see any reason why both
 options couldn't be used together.
 -- 
 Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
 Pervasive Software  http://pervasive.comwork: 512-231-6117
 vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
 

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Jim C. Nasby

On Fri, Jun 23, 2006 at 06:37:01AM -0400, Mark Woodward wrote:
 While we all know session data is, at best, ephemeral, people still want
 some sort of persistence, thus, you need a database. For mcache I have a
 couple plugins that have a wide range of opitions, from read/write at
 startup and shut down, to full write through cache to a database.
 
 In general, my clients don't want this, they want the database to store
 their data. When you try to explain to them that a database may not be the
 right place to store this data, they ask why, sadly they have little hope
 of understanding the nuances and remain unconvinced.

Have you done any benchmarking between a site using mcache and one not?
I'll bet there's a huge difference, which translates into hardware $$.
That's something managers can understand.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian

Zeugswetter Andreas DCP SD wrote:
 
   head of the chain yet.  With an index scan, finding the head is
 easy, 
   but for a sequential scan, it seems more difficult, and we don't
 have 
   any free space in the tail of the chain to maintain a pointer to the
 head.
  
  Thinking some more, there will need to be a bit to uniquely 
  identify the head of a CITC.
 
 I don't think so. It would probably be sufficient to impose an order on
 the CITC.
 e.g. the oldest tuple version in the CITC is the head. 
 (An idea just in case we can't spare a bit :-) 

Well, if we need to scan the page quickly, having the bit, or a bit
combination that can only be the head, is helpful.  What we usually do
is to combine a SITC bit with another bit that would never be set for
SITC, and that is the head, and you use macros to properly do tests.  We
do this already in a number of cases.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Jim C. Nasby

On Mon, Jun 26, 2006 at 02:32:59PM -0400, Bruce Momjian wrote:
 
 It is certainly possible to do what you are suggesting, that is have two
 index entries point to same chain head, and have the index access
 routines figure out if the index qualifications still hold, but that
 seems like a lot of overhead.
 
 Also, once there is only one visible row in the chain, removing old
 index entries seems quite complex because you have to have vacuum keep
 the qualifications of each row to figure out which index tuple is the
 valid one (seems messy).
 
Perhaps my point got lost... in the case where no index keys change
during an update, SITC seems superior in every way to my proposal. My
idea (let's call it Index Tuple Page Consolidation, ITPC) would be
beneficial to UPDATEs that modify one or more index keys but still put
the tuple on the same page. Where SITC would be most useful for tables
that have a very heavy update rate and very few indexes, ITPC would
benefit tables that have more indexes on them; where presumably it's
much more likely for UPDATEs to change at least one index key (which
means SITC goes out the window, if I understand it correctly). If I'm
missing something and SITC can in fact deal with some index keys
changing during an UPDATE, then I see no reason for ITPC.

 ---
 
 Jim C. Nasby wrote:
  On Sun, Jun 25, 2006 at 09:13:48PM +0300, Heikki Linnakangas wrote:
   If you can't expire the old row because one of the indexed columns was
   modified, I see no reason to try to reduce the additional index entries.
   
   It won't enable early expiration, but it means less work to do on update. 
   If there's a lot of indexes, not having to add so many index tuples can 
   be 
   a significant saving.
  
  While catching up on this thread, the following idea came to me that I
  think would allow for not updating an index on an UPDATE if it's key
  doesn't change. If I understand Bruce's SITC proposal correctly, this
  would differ in that SITC requires that no index keys change.
  
  My idea is that if an UPDATE places the new tuple on the same page as
  the old tuple, it will not create new index entries for any indexes
  where the key doesn't change. This means that when fetching tuples from
  the index, ctid would have to be followed until you found the version
  you wanted OR you found the first ctid that pointed to a different page
  (because that tuple will have it's own index entry) OR you found a tuple
  with a different value for the key of the index you're using (because
  it'd be invalid, and there'd be a different index entry for it). I
  believe that the behavior of the index hint bits would also have to
  change somewhat, as each index entry would now essentially be pointing
  at all the tuples in the ctid chain that exist on a page, not just
  single tuple.
  
  In the case of an UPDATE that needs to put the new tuple on a different
  page, our current behavior would be used. This means that the hint bits
  would still be useful in limiting the number of heap pages you hit. I
  also believe this means that we wouldn't suffer any additional overhead
  from our current code when there isn't much free space on pages.
  
  Since SITC allows for in-page space reuse without vacuuming only when no
  index keys change, it's most useful for very heavily updated tables such
  as session handlers or queue tables, because those tables typically have
  very few indexes, so it's pretty unlikely that an index key will change.
  For more general-purpose tables that have more indexes but still see a
  fair number of updates to a subset of rows, not having to update every
  index would likely be a win. I also don't see any reason why both
  options couldn't be used together.
  -- 
  Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
  Pervasive Software  http://pervasive.comwork: 512-231-6117
  vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
  
 
 -- 
   Bruce Momjian   [EMAIL PROTECTED]
   EnterpriseDBhttp://www.enterprisedb.com
 
   + If your life is a hard drive, Christ can be your backup. +
 
 ---(end of broadcast)---
 TIP 5: don't forget to increase your free space map settings
 

-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-26 Thread Bruce Momjian

Jim C. Nasby wrote:
 On Mon, Jun 26, 2006 at 02:32:59PM -0400, Bruce Momjian wrote:
  
  It is certainly possible to do what you are suggesting, that is have two
  index entries point to same chain head, and have the index access
  routines figure out if the index qualifications still hold, but that
  seems like a lot of overhead.
  
  Also, once there is only one visible row in the chain, removing old
  index entries seems quite complex because you have to have vacuum keep
  the qualifications of each row to figure out which index tuple is the
  valid one (seems messy).
  
 Perhaps my point got lost... in the case where no index keys change
 during an update, SITC seems superior in every way to my proposal. My
 idea (let's call it Index Tuple Page Consolidation, ITPC) would be
 beneficial to UPDATEs that modify one or more index keys but still put
 the tuple on the same page. Where SITC would be most useful for tables
 that have a very heavy update rate and very few indexes, ITPC would
 benefit tables that have more indexes on them; where presumably it's
 much more likely for UPDATEs to change at least one index key (which
 means SITC goes out the window, if I understand it correctly). If I'm
 missing something and SITC can in fact deal with some index keys
 changing during an UPDATE, then I see no reason for ITPC.

I understood what you had said.  The question is whether we want to get
that complex with this feature, and if there are enough use cases
(UPDATE with index keys changing) to warrant it.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Hannu Krosing

Ühel kenal päeval, L, 2006-06-24 kell 19:36, kirjutas Bruce Momjian:
 Hannu Krosing wrote:
  ?hel kenal p?eval, R, 2006-06-23 kell 13:08, kirjutas Tom Lane:

   
   Bottom line: there's still lots of low-hanging fruit.  Why are people
   feeling that we need to abandon or massively complicate our basic
   architecture to make progress?
  
  Maybe we could start from reusing the index tuples which point to
  invisible tuples ? The index is not MVCC anyway, so maybe it is easier
  to do in-place replacement there ?
  
  This probably has the same obstacles which have prevented us from
  removing those in the first place (removing instead of marking as
  invisible). Does it cause some locking issues ? Or does it go against
  some other constraints of our index lookups ?
  
  I think that just setting the invisible bit in an index leaf node causes
  nearly as much disk io as removing the node.
  
  If we could delete/reuse old index tuples, it would solve a sizable
  chunk of index-growth problem, especially for cases where referenced key
  value does not change.
 
 I think heap _and_ index reuse is the only useful direction.  Index or
 heap reuse alone seems too marginal for the added complexity.

Sure, but index reuse seems a lot easier, as there is nothing additional
to remember or clean out when doing it.

When reusing a heap tuple you have to clean out all index entries
pointing to it.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Mark Woodward

 On 6/24/2006 9:23 AM, Mark Woodward wrote:

 On Sat, 24 Jun 2006, Mark Woodward wrote:

 I'm probably mistaken, but aren't there already forward references in
 tuples to later versions? If so, I'm only sugesting reversing the
 order
 and referencing the latest version.

 I thought I understood your idea, but now you lost me again. I thought
 what you want is that the older heap tuple has a pointer to the
 newer one. Which it already has, it's called t_ctid.

 Perfect!

 Can you try to explain more carefully how the whole thing would work?
 What would an index tuple point to? What pointers would a heap tuple
 have? What would an index scan do to find the row version it's
 interested
 in? What exactly would an update do?


 Since we already allocate space for some notion of linked list, then all
 I'm suggesting is we reverse the order, sort of. Currently it looks like
 this:

 ver001-ver002-ver003-...-verN

 That's what t_ctid does now, right? Well, that's sort of stupid. Why not
 have it do this:

 ver001-verN-...-ver003-ver002-|
  ^-/

 This will speed up almost *all* queries when there are more than two
 version of rows.

 OK, here is the behavior of an update:
 (1) Find the latest version of the row
 (2) Duplicate row and modify as per plan
 (3) Set the t_ctid of the new row to the last latest
 (4) Set the t_ctid of the first row to that of the new row
 (5) Attempt to index the row
 (6) If the first version of the row is in the index already (ver001)
 Don't
 modify the index, otherwise, add the new version (just as before)

 When you vacuum, simply make the latest version (verN) the key row
 (ver001).

 This isn't done simply. Currently, vacuum collects a trivial array of
 ctid's it is removing and every now and then does a bulk remove of the
 index tuples pointing to them. Now lets consider a table with two
 indexed columns with the following row versions resulting from an insert
 and 3 updates to that same row:

v1:  a,b
v2:  a,c
v3:  a,d
v4:  b,d

 In your new scheme, there would be two index tuples for column 1 (one
 pointing to v1, one pointing to v4) and 3 index tuples for column 2 (one
 for each different value pointing to v1, v2 and v3). Did I get that
 right so far?

 If vacuum now can remove v1, it has to update index 1 to point to v2 and
 remove the pointer to v1 from index 2. If it can remove v1 and v2, it
 has to update index 1 to point to v3 and remove v1 and v2 from index 2.
 If it can remove v1, v2 and v3 it must delete the index 1 tuple pointing
 to v1, delete the index 2 entries pointing to v1 and v2 and update the
 index 2 entry for v3 to point to v4. Figuring out which index tuples to
 remove and which ones to update can only be done by comparing each and
 every indexed columns old and new values. To do so, vacuum will have to
 fetch all the row versions, which can be scattered all over the place,
 with all possible locking issues including but not limited to deadlocks.

I'm not sure why vacuum can't run similarly to the way it does now.



---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Bruce Momjian

Hannu Krosing wrote:
   Maybe we could start from reusing the index tuples which point to
   invisible tuples ? The index is not MVCC anyway, so maybe it is easier
   to do in-place replacement there ?
   
   This probably has the same obstacles which have prevented us from
   removing those in the first place (removing instead of marking as
   invisible). Does it cause some locking issues ? Or does it go against
   some other constraints of our index lookups ?
   
   I think that just setting the invisible bit in an index leaf node causes
   nearly as much disk io as removing the node.
   
   If we could delete/reuse old index tuples, it would solve a sizable
   chunk of index-growth problem, especially for cases where referenced key
   value does not change.
  
  I think heap _and_ index reuse is the only useful direction.  Index or
  heap reuse alone seems too marginal for the added complexity.
 
 Sure, but index reuse seems a lot easier, as there is nothing additional
 to remember or clean out when doing it.

Yes, seems so.  TODO added:

* Reuse index tuples that point to heap tuples that are not visible to
  anyone?

 When reusing a heap tuple you have to clean out all index entries
 pointing to it.

Well, not for UPDATE for no key changes on the same page, if we do that.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Jan Wieck


On 6/25/2006 6:52 AM, Mark Woodward wrote:

On 6/24/2006 9:23 AM, Mark Woodward wrote:


On Sat, 24 Jun 2006, Mark Woodward wrote:


I'm probably mistaken, but aren't there already forward references in
tuples to later versions? If so, I'm only sugesting reversing the
order
and referencing the latest version.


I thought I understood your idea, but now you lost me again. I thought
what you want is that the older heap tuple has a pointer to the
newer one. Which it already has, it's called t_ctid.


Perfect!


Can you try to explain more carefully how the whole thing would work?
What would an index tuple point to? What pointers would a heap tuple
have? What would an index scan do to find the row version it's
interested
in? What exactly would an update do?



Since we already allocate space for some notion of linked list, then all
I'm suggesting is we reverse the order, sort of. Currently it looks like
this:

ver001-ver002-ver003-...-verN

That's what t_ctid does now, right? Well, that's sort of stupid. Why not
have it do this:

ver001-verN-...-ver003-ver002-|
 ^-/

This will speed up almost *all* queries when there are more than two
version of rows.

OK, here is the behavior of an update:
(1) Find the latest version of the row
(2) Duplicate row and modify as per plan
(3) Set the t_ctid of the new row to the last latest
(4) Set the t_ctid of the first row to that of the new row
(5) Attempt to index the row
(6) If the first version of the row is in the index already (ver001)
Don't
modify the index, otherwise, add the new version (just as before)

When you vacuum, simply make the latest version (verN) the key row
(ver001).


This isn't done simply. Currently, vacuum collects a trivial array of
ctid's it is removing and every now and then does a bulk remove of the
index tuples pointing to them. Now lets consider a table with two
indexed columns with the following row versions resulting from an insert
and 3 updates to that same row:

   v1:  a,b
   v2:  a,c
   v3:  a,d
   v4:  b,d

In your new scheme, there would be two index tuples for column 1 (one
pointing to v1, one pointing to v4) and 3 index tuples for column 2 (one
for each different value pointing to v1, v2 and v3). Did I get that
right so far?

If vacuum now can remove v1, it has to update index 1 to point to v2 and
remove the pointer to v1 from index 2. If it can remove v1 and v2, it
has to update index 1 to point to v3 and remove v1 and v2 from index 2.
If it can remove v1, v2 and v3 it must delete the index 1 tuple pointing
to v1, delete the index 2 entries pointing to v1 and v2 and update the
index 2 entry for v3 to point to v4. Figuring out which index tuples to
remove and which ones to update can only be done by comparing each and
every indexed columns old and new values. To do so, vacuum will have to
fetch all the row versions, which can be scattered all over the place,
with all possible locking issues including but not limited to deadlocks.


I'm not sure why vacuum can't run similarly to the way it does now.


What's that supposed to mean?


Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Jan Wieck


On 6/25/2006 12:27 PM, Bruce Momjian wrote:


Hannu Krosing wrote:

  Maybe we could start from reusing the index tuples which point to
  invisible tuples ? The index is not MVCC anyway, so maybe it is easier
  to do in-place replacement there ?
  
  This probably has the same obstacles which have prevented us from

  removing those in the first place (removing instead of marking as
  invisible). Does it cause some locking issues ? Or does it go against
  some other constraints of our index lookups ?
  
  I think that just setting the invisible bit in an index leaf node causes

  nearly as much disk io as removing the node.
  
  If we could delete/reuse old index tuples, it would solve a sizable

  chunk of index-growth problem, especially for cases where referenced key
  value does not change.
 
 I think heap _and_ index reuse is the only useful direction.  Index or

 heap reuse alone seems too marginal for the added complexity.

Sure, but index reuse seems a lot easier, as there is nothing additional
to remember or clean out when doing it.


Yes, seems so.  TODO added:

* Reuse index tuples that point to heap tuples that are not visible to
  anyone?


When reusing a heap tuple you have to clean out all index entries
pointing to it.


Well, not for UPDATE for no key changes on the same page, if we do that.



An update that results in all the same values of every indexed column of 
a known deleted invisible tuple. This reused tuple can by definition not 
be the one currently updated. So unless it is a table without a primary 
key, this assumes that at least 3 versions of the same row exist within 
the same block. How likely is that to happen?



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Jan Wieck


On 6/24/2006 4:10 PM, Hannu Krosing wrote:


Ühel kenal päeval, L, 2006-06-24 kell 15:44, kirjutas Jan Wieck:


 That fixes the symptom, not the problem. The problem is performance
 steadily degrades over time.
 
 No, you got it backwards.  The performance degradation is the symptom.

 The problem is that there are too many dead tuples in the table.  There
 is one way to solve that problem -- remove them, which is done by
 running vacuum.

Precisely.

 There are some problems with vacuum itself, that I agree with.  For
 example it would be good if a long-running vacuum wouldn't affect a
 vacuum running in another table because of the long-running transaction
 effect it has.  It would be good if vacuum could be run partially over a
 table.  It would be good if there was a way to speed up vacuum by using
 a dead space map or something.

It would be good if vacuum wouldn't waste time on blocks that don't have 
any possible work in them. Vacuum has two main purposes. A) remove dead 
rows and B) freeze xids. Once a block has zero deleted rows and all xids 
are frozen, there is nothing to do with this block and vacuum should 
skip it until a transaction updates that block.


This requires 2 bits per block, which is 32K per 1G segment of a heap. 
Clearing the bits is done when the block is marked dirty. This way 
vacuum would not waste any time and IO on huge slow changing tables. 
That part, sequentially scanning huge tables that didn't change much is 
what keeps us from running vacuum every couple of seconds.


Seems like a plan. 


Still, there is another problem which is not solved by map approach
only, at least with current implementation of vacuum.

This is the fact that we need to do full scan over index(es) to clean up
pointers to removed tuples. And huge tables tend to have huge indexes.


Right, now that you say it I remember why this wasn't so easy as it 
sounded at the beginning.


Obviously there is no other way to find an index tuple without a 
sequential scan other than doing an index scan. So vacuum would have to 
estimate based on the bitmaps if it could be beneficial (huge table, 
little vacuumable pages) to actually remove/flag single index tuples 
before removing the heap tuple. This can be done in advance to removing 
the heap tuple because index tuples might not be there to begin with.


However, that is a very costly thing to do and not trivial to implement.


Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Heikki Linnakangas


On Sat, 24 Jun 2006, Bruce Momjian wrote:


Because having them be on the same page is the only way you can update
the page item pointer so when you recycle the row, you the indexes are
now pointing to the new version.  Pages look like:

[marker][item1][item2][item3]...[tuple1][tuple2][tuple3]

and indexes only point to items, not to tuples.  This allows tuples to
be compacted on the page without affecting the indexes.

If tuple1 is updated to tuple2, once tuple1 is no longer visible to any
backends, you can modify item1 to point to tuple2, and you can mark the
space used by tuple1 as reusable:

[marker][item1(tuple2)][item2][item3]...[free][tuple2][tuple3]


Ok, now I think I get it. So the limitation of old and new tuple being on 
the same page is required to make it possible to remove the old tuple 
without touching the indexes?


If you move the new tuple (logically, by modifying item pointers) on 
vacuum, isn't there a risk that a concurrent seqscan misses it?



If you can't expire the old row because one of the indexed columns was
modified, I see no reason to try to reduce the additional index entries.


It won't enable early expiration, but it means less work to do on update. 
If there's a lot of indexes, not having to add so many index tuples can be 
a significant saving.


To summarise, we have two issues related to frequent updates:
1. Index and heap bloat, requiring frequent vacuum.
2. Updates are more expensive than on other DBMSs, because we have to add 
a new index tuple in every index, even if none of the indexed columns are 
modified.


Tom suggested that we just improve vacuum and autovacuum, and someone 
brought up the dead space map idea again. Those are all worthwhile things 
to do and help with vacuuming after deletes as well as updates, but they 
only address issue 1. Mark's suggestion (assuming that it would've 
worked) as well as yours address both, but only for updates.


- Heikki

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Bruce Momjian

Jan Wieck wrote:
  Sure, but index reuse seems a lot easier, as there is nothing additional
  to remember or clean out when doing it.
  
  Yes, seems so.  TODO added:
  
  * Reuse index tuples that point to heap tuples that are not visible to
anyone?
  
  When reusing a heap tuple you have to clean out all index entries
  pointing to it.
  
  Well, not for UPDATE for no key changes on the same page, if we do that.
  
 
 An update that results in all the same values of every indexed column of 
 a known deleted invisible tuple. This reused tuple can by definition not 
 be the one currently updated. So unless it is a table without a primary 
 key, this assumes that at least 3 versions of the same row exist within 
 the same block. How likely is that to happen?

Good question.  You take the current tuple, and make another one on the
same page.  Later, an update can reuse the original tuple if it is no
longer visible to anyone (by changing the item id), so you only need two
tuples, not three.  My hope is that a repeated update would eventually
move to a page that enough free space for two (or more) versions.

Does that help explain it?

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Bruce Momjian

Heikki Linnakangas wrote:
 On Sat, 24 Jun 2006, Bruce Momjian wrote:
 
  Because having them be on the same page is the only way you can update
  the page item pointer so when you recycle the row, you the indexes are
  now pointing to the new version.  Pages look like:
 
  [marker][item1][item2][item3]...[tuple1][tuple2][tuple3]
 
  and indexes only point to items, not to tuples.  This allows tuples to
  be compacted on the page without affecting the indexes.
 
  If tuple1 is updated to tuple2, once tuple1 is no longer visible to any
  backends, you can modify item1 to point to tuple2, and you can mark the
  space used by tuple1 as reusable:
 
  [marker][item1(tuple2)][item2][item3]...[free][tuple2][tuple3]
 
 Ok, now I think I get it. So the limitation of old and new tuple being on 
 the same page is required to make it possible to remove the old tuple 
 without touching the indexes?

Yes, modifying the heap page item pointer is required for reuse.

 If you move the new tuple (logically, by modifying item pointers) on 
 vacuum, isn't there a risk that a concurrent seqscan misses it?

Well, you lock the page while updating the item pointer.  Because the
versions are on the same page, a single page lock should be fine.

  If you can't expire the old row because one of the indexed columns was
  modified, I see no reason to try to reduce the additional index entries.
 
 It won't enable early expiration, but it means less work to do on update. 
 If there's a lot of indexes, not having to add so many index tuples can be 
 a significant saving.

Already added to TODO.

* Reuse index tuples that point to heap tuples that are not visible to
  anyone?

 To summarise, we have two issues related to frequent updates:
 1. Index and heap bloat, requiring frequent vacuum.
 2. Updates are more expensive than on other DBMSs, because we have to add 
 a new index tuple in every index, even if none of the indexed columns are 
 modified.
 
 Tom suggested that we just improve vacuum and autovacuum, and someone 
 brought up the dead space map idea again. Those are all worthwhile things 
 to do and help with vacuuming after deletes as well as updates, but they 
 only address issue 1. Mark's suggestion (assuming that it would've 
 worked) as well as yours address both, but only for updates.

Agreed.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Jan Wieck


On 6/25/2006 2:24 PM, Bruce Momjian wrote:


Jan Wieck wrote:

 Sure, but index reuse seems a lot easier, as there is nothing additional
 to remember or clean out when doing it.
 
 Yes, seems so.  TODO added:
 
 	* Reuse index tuples that point to heap tuples that are not visible to

  anyone?
 
 When reusing a heap tuple you have to clean out all index entries

 pointing to it.
 
 Well, not for UPDATE for no key changes on the same page, if we do that.
 

An update that results in all the same values of every indexed column of 
a known deleted invisible tuple. This reused tuple can by definition not 
be the one currently updated. So unless it is a table without a primary 
key, this assumes that at least 3 versions of the same row exist within 
the same block. How likely is that to happen?


Good question.  You take the current tuple, and make another one on the
same page.  Later, an update can reuse the original tuple if it is no
longer visible to anyone (by changing the item id), so you only need two
tuples, not three.  My hope is that a repeated update would eventually
move to a page that enough free space for two (or more) versions.

Does that help explain it?



That's exactly what I meant. You need space for 3 or more tuple versions 
within one page and the luck that one of them is invisible at the time 
of the update. I don't know how likely or unlikely this is in reality, 
but it doesn't sound very promising to me so far.


Another problem with this is that even if you find such row, it doesn't 
spare you the index traversal. The dead row whos item id you're reusing 
might have resulted from an insert that aborted or crashed before it 
finished creating all index entries. Or some of its index entries might 
already be flagged known dead, and you better reset those flags.



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Hannu Krosing

Ühel kenal päeval, P, 2006-06-25 kell 14:24, kirjutas Bruce Momjian:
 Jan Wieck wrote:
   Sure, but index reuse seems a lot easier, as there is nothing additional
   to remember or clean out when doing it.
   
   Yes, seems so.  TODO added:
   
 * Reuse index tuples that point to heap tuples that are not visible to
   anyone?
   
   When reusing a heap tuple you have to clean out all index entries
   pointing to it.
   
   Well, not for UPDATE for no key changes on the same page, if we do that.
   
  
  An update that results in all the same values of every indexed column of 
  a known deleted invisible tuple. This reused tuple can by definition not 
  be the one currently updated. So unless it is a table without a primary 
  key, this assumes that at least 3 versions of the same row exist within 
  the same block. How likely is that to happen?
 
 Good question.  You take the current tuple, and make another one on the
 same page.  Later, an update can reuse the original tuple if it is no
 longer visible to anyone (by changing the item id), so you only need two
 tuples, not three.  My hope is that a repeated update would eventually
 move to a page that enough free space for two (or more) versions.

I can confirm that this is exactly what happens when running an
update-heavy load with frequent vacuums. Eventually most rows get their
own db pages or share the same page with 2-3 rows. And there will be
lots of unused (filed up, or cleaned and not yet reused) pages.

The overall performance could be made a little better by tuning the
system to not put more than N new rows on the same page at initial
insert or when the row move to a new page during update. Currently
several new rows are initially put on the same page and then move around
during repeated updates until they slow(ish)ly claim their own page.

 Does that help explain it?
 
-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Hannu Krosing

Ühel kenal päeval, P, 2006-06-25 kell 06:52, kirjutas Mark Woodward:

 I'm not sure why vacuum can't run similarly to the way it does now.

What do you mean ?

Currently vacuum runs a three-step process

1) runs a full scan over heap and collects all dead tuple ctids from
heap

2) run full scans over all indexes of the relation and removes any
pointers pointing to dead tuples.

3) runs another full scan over heap and removes the tuples in the list
collected at step 1.

There is no modifications done to live tuples (ok, they *may* get frozen
if they are above certain age, but this is not relevant to current
discussion).

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Bruce Momjian

Jan Wieck wrote:
  An update that results in all the same values of every indexed column of 
  a known deleted invisible tuple. This reused tuple can by definition not 
  be the one currently updated. So unless it is a table without a primary 
  key, this assumes that at least 3 versions of the same row exist within 
  the same block. How likely is that to happen?
  
  Good question.  You take the current tuple, and make another one on the
  same page.  Later, an update can reuse the original tuple if it is no
  longer visible to anyone (by changing the item id), so you only need two
  tuples, not three.  My hope is that a repeated update would eventually
  move to a page that enough free space for two (or more) versions.
  
  Does that help explain it?
  
 
 That's exactly what I meant. You need space for 3 or more tuple versions 
 within one page and the luck that one of them is invisible at the time 
 of the update. I don't know how likely or unlikely this is in reality, 
 but it doesn't sound very promising to me so far.

Why three?  I explained using only two heap tuples:

[item1]...[tuple1]

becomes on UPDATE:
   --
[item1]...[tuple1][tuple2]
  -

on another UPDATE, if tuple1 is no longer visible:

   --
[item1]...[tuple1][tuple2]
  --

 Another problem with this is that even if you find such row, it doesn't 
 spare you the index traversal. The dead row whos item id you're reusing 
 might have resulted from an insert that aborted or crashed before it 
 finished creating all index entries. Or some of its index entries might 
 already be flagged known dead, and you better reset those flags.

You can only reuse heap rows that were created and expired by committed
transactions.  In fact, you can only UPDATE a row that was created by a
committed transaction.  You cannot _reuse_ any row, but only a row that
is being UPDATEd.  Also, it cannot be known dead because it are are in
the process of updating it.

I am thinking my idea was not fully understood.  Hopefully this email
helps.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Bruce Momjian

Hannu Krosing wrote:
 ?hel kenal p?eval, P, 2006-06-25 kell 14:24, kirjutas Bruce Momjian:
  Jan Wieck wrote:
Sure, but index reuse seems a lot easier, as there is nothing 
additional
to remember or clean out when doing it.

Yes, seems so.  TODO added:

* Reuse index tuples that point to heap tuples that are not 
visible to
  anyone?

When reusing a heap tuple you have to clean out all index entries
pointing to it.

Well, not for UPDATE for no key changes on the same page, if we do that.

   
   An update that results in all the same values of every indexed column of 
   a known deleted invisible tuple. This reused tuple can by definition not 
   be the one currently updated. So unless it is a table without a primary 
   key, this assumes that at least 3 versions of the same row exist within 
   the same block. How likely is that to happen?
  
  Good question.  You take the current tuple, and make another one on the
  same page.  Later, an update can reuse the original tuple if it is no
  longer visible to anyone (by changing the item id), so you only need two
  tuples, not three.  My hope is that a repeated update would eventually
  move to a page that enough free space for two (or more) versions.
 
 I can confirm that this is exactly what happens when running an
 update-heavy load with frequent vacuums. Eventually most rows get their
 own db pages or share the same page with 2-3 rows. And there will be
 lots of unused (filed up, or cleaned and not yet reused) pages.

Right, that was my guess because heavily updated rows start to move
around in the table, and because UPDATE tries to stay on the same page,
once it the row hits a mostly-empty page, it stays there.

 The overall performance could be made a little better by tuning the
 system to not put more than N new rows on the same page at initial
 insert or when the row move to a new page during update. Currently
 several new rows are initially put on the same page and then move around
 during repeated updates until they slow(ish)ly claim their own page.

We have a fillfactor patch that will be in 8.2.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Bruce Momjian

bruce wrote:
 Why three?  I explained using only two heap tuples:
 
   [item1]...[tuple1]
 
 becomes on UPDATE:
--
   [item1]...[tuple1][tuple2]
   -
 
 on another UPDATE, if tuple1 is no longer visible:
 
--
   [item1]...[tuple1][tuple2]
   --
 

Here is some pseudo-code that implements this:

  Definition:  Single-Index-Tuple Chain (CITC)

  Do old and new UPDATE rows have unchanged indexed columns?
Is old row member of CITC?
Point item id at first CITC visible tuple in chain
Mark previous invisible CITC tuples as freespace

Does page have free space?
Add new tuple on the same page as old
Mark old tuple as CITC
Do not create index entries

VACUUM would have to be taught about CITC, and CREATE INDEX would have
to create entries in other indexes for cases where its new indexed
columns change inside a CITC.

Conceptually, CITC allows a single index entry to point to multiple
UPDATEd rows, allowing non-visible tuples to be recycled (and reused for
future UPDATEs) without affecting the index.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Jan Wieck


On 6/25/2006 5:18 PM, Bruce Momjian wrote:


Jan Wieck wrote:
 An update that results in all the same values of every indexed column of 
 a known deleted invisible tuple. This reused tuple can by definition not 
 be the one currently updated. So unless it is a table without a primary 
 key, this assumes that at least 3 versions of the same row exist within 
 the same block. How likely is that to happen?
 
 Good question.  You take the current tuple, and make another one on the

 same page.  Later, an update can reuse the original tuple if it is no
 longer visible to anyone (by changing the item id), so you only need two
 tuples, not three.  My hope is that a repeated update would eventually
 move to a page that enough free space for two (or more) versions.
 
 Does that help explain it?
 

That's exactly what I meant. You need space for 3 or more tuple versions 
within one page and the luck that one of them is invisible at the time 
of the update. I don't know how likely or unlikely this is in reality, 
but it doesn't sound very promising to me so far.


Why three?  I explained using only two heap tuples:


For some reason I counted in the new tuple ... sorry that. Yes, it can 
work with two tuples.




[item1]...[tuple1]

becomes on UPDATE:
   --
[item1]...[tuple1][tuple2]
  -

on another UPDATE, if tuple1 is no longer visible:

   --
[item1]...[tuple1][tuple2]
  --

Another problem with this is that even if you find such row, it doesn't 
spare you the index traversal. The dead row whos item id you're reusing 
might have resulted from an insert that aborted or crashed before it 
finished creating all index entries. Or some of its index entries might 
already be flagged known dead, and you better reset those flags.


You can only reuse heap rows that were created and expired by committed
transactions.  In fact, you can only UPDATE a row that was created by a
committed transaction.  You cannot _reuse_ any row, but only a row that
is being UPDATEd.  Also, it cannot be known dead because it are are in
the process of updating it.


Now you lost me. What do you mean a row that is being UPDATEd? The row 
(version) being UPDATEd right now cannot be expired, or why would you 
update that one? And if your transaction rolls back later, the row you 
update right now must be the one surviving.


Any row that was created by a committed transaction does indeed have all 
the index entries created. But if it is deleted and expired, that means 
that the transaction that stamped xmax has committed and is outside of 
every existing snapshot. You can only reuse a slot that is used by a 
tuple that satisfies the vacuum snapshot. And a tuple that satisfies 
that snapshot has potentially index entries flagged known dead.



I am thinking my idea was not fully understood.  Hopefully this email
helps.


I must be missing something because I still don't see how it can work.


Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Bruce Momjian

Jan Wieck wrote:
  [item1]...[tuple1]
  
  becomes on UPDATE:
 --
  [item1]...[tuple1][tuple2]
-
  
  on another UPDATE, if tuple1 is no longer visible:
  
 --
  [item1]...[tuple1][tuple2]
--
  
  Another problem with this is that even if you find such row, it doesn't 
  spare you the index traversal. The dead row whos item id you're reusing 
  might have resulted from an insert that aborted or crashed before it 
  finished creating all index entries. Or some of its index entries might 
  already be flagged known dead, and you better reset those flags.
  
  You can only reuse heap rows that were created and expired by committed
  transactions.  In fact, you can only UPDATE a row that was created by a
  committed transaction.  You cannot _reuse_ any row, but only a row that
  is being UPDATEd.  Also, it cannot be known dead because it are are in
  the process of updating it.
 
 Now you lost me. What do you mean a row that is being UPDATEd? The row 
 (version) being UPDATEd right now cannot be expired, or why would you 
 update that one? And if your transaction rolls back later, the row you 
 update right now must be the one surviving.

It can only be a non-visible version of the row earlier in the UPDATE
chain, not the actual one being updated.

 Any row that was created by a committed transaction does indeed have all 
 the index entries created. But if it is deleted and expired, that means 
 that the transaction that stamped xmax has committed and is outside of 
 every existing snapshot. You can only reuse a slot that is used by a 
 tuple that satisfies the vacuum snapshot. And a tuple that satisfies 
 that snapshot has potentially index entries flagged known dead.

When you are using the update chaining, you can't mark that index row as
dead because it actually points to more than one row on the page, some
are non-visible, some are visible.

  I am thinking my idea was not fully understood.  Hopefully this email
  helps.
 
 I must be missing something because I still don't see how it can work.

I just posted pseudo-code.  Hope that helps.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Jan Wieck


On 6/25/2006 10:12 PM, Bruce Momjian wrote:

When you are using the update chaining, you can't mark that index row as
dead because it actually points to more than one row on the page, some
are non-visible, some are visible.


Back up the truck ... you mean in the current code base we have heap 
tuples that are visible in index scans because of heap tuple chaining 
but without index tuples pointing directly at them?



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-25 Thread Alvaro Herrera

Jan Wieck wrote:
 On 6/25/2006 10:12 PM, Bruce Momjian wrote:
 When you are using the update chaining, you can't mark that index row as
 dead because it actually points to more than one row on the page, some
 are non-visible, some are visible.
 
 Back up the truck ... you mean in the current code base we have heap 
 tuples that are visible in index scans because of heap tuple chaining 
 but without index tuples pointing directly at them?

I don't know where this idea came from, but it's not true.  All heap
tuples, dead or otherwise, have index entries.  Unless the idea is to
extend update chaining to mean something different from the current
meaning.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jan Wieck


On 6/23/2006 3:10 PM, Mark Woodward wrote:


This is NOT an in-place update. The whole MVCC strategy of keeping old
versions around doesn't change. The only thing that does change is one
level of indirection. Rather than keep references to all versions of all
rows in indexes, keep only a reference to the first or key row of each
row, and have the first version of a row form the head of a linked list to
subsequent versions of each row. The list will be in decending order.


Where exactly do you intend to keep all those links (for a table with N 
indexes)?



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jan Wieck


On 6/23/2006 9:56 PM, [EMAIL PROTECTED] wrote:


On Fri, Jun 23, 2006 at 03:08:34PM -0400, Bruce Momjian wrote:

Tom Lane wrote:
 ...
 suggesting.  We're having a hard enough time debugging and optimizing
 *one* storage model.  I think the correct path forward is to stick with
 the same basic storage model and vacuuming concept, and address the
 known performance issues with better-optimized vacuuming.  No, it will
 never be perfect for every scenario, but we can certainly make it much
 better than it is now, without risking killing the project by
 introducing undebuggable, unmaintainable complexity.
Well, are you suggesting we just stop improving the database?  I am sure
not.  But, your suggestion is that we can't do better without incurring
more complexity (true), and that complexity will not be worth it.  I
don't agree with that until I see some proposals, and shutting down
discussion because they will add complexity or are fruitless seems
unwise.


It sounds like everybody agrees that things need to be fixed, and genuinely
motivated people are trying to offer what they have to the table.


One singe core team member responds vaguely in a way, you feel being 
supportive of your case, and you conclude that everybody agrees? 
Sorry, x'use me?


There are a couple of side effects on this update in place issue that 
aren't even mentioned yet. Nobody with any significant in depth 
knowledge of the Postgres non-overwriting storage engine concept seems 
to suggest any move towards a storage system, that does updates in place 
that require undo operations in case of a process/system failure. 
You're ready to fix all those places to support the undo you need? You 
must have a big table.



Jan



Tom already has enough on his plate, as do most others here - so unless
a competent champion can take up the challenge, discussion is all we have.

I'm not liking the we should do it this way, no, we should do it that.
My read of the situation is that both may be useful, and that both should
be pursued. But one set of people can't pursue both.

Is any who is able, able to take up this challenge? Perhaps more than one,
from both major directions? (vacuum on one side, and improved storage on
the other) Somebody with the time and skill, who can work through the
design discussions on one of the aspects?

I want to contribute soon, and this is the sort of thing that interests me -
but I still don't have time yet, and there would be no guarantee that I
succeeded. Somebody else? :-)

Cheers,
mark




--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread mark

On Sat, Jun 24, 2006 at 03:29:47AM -0400, Jan Wieck wrote:
 It sounds like everybody agrees that things need to be fixed, and genuinely
 motivated people are trying to offer what they have to the table.
 One singe core team member responds vaguely in a way, you feel being 
 supportive of your case, and you conclude that everybody agrees? 
 Sorry, x'use me?

 There are a couple of side effects on this update in place issue that 
 aren't even mentioned yet. Nobody with any significant in depth 
 knowledge of the Postgres non-overwriting storage engine concept seems 
 to suggest any move towards a storage system, that does updates in place 
 that require undo operations in case of a process/system failure. 
 You're ready to fix all those places to support the undo you need? You 
 must have a big table.

Jan: Who on the list has claimed that nothing is broken and nothing needs
to be improved? Are you making this claim?

Shutting down ideas is not constructive. Encouraging those with ideas to
step up and do more than talk could be.

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Heikki Linnakangas


On Fri, 23 Jun 2006, Jonah H. Harris wrote:


On 6/23/06, Mark Woodward [EMAIL PROTECTED] wrote:

Rather than keep references to all versions of all
rows in indexes, keep only a reference to the first or key row of each
row, and have the first version of a row form the head of a linked list to
subsequent versions of each row. The list will be in decending order.


By all means, please go ahead and try it because it's not quite that
easy.  You're going to run into serious locking and contention issues
this way.  In the end, it's not much better than running a sequential
scan to query a row that's been updated several thousand times on a
table that hasn't been vacuumed... follow that pointer :)


Can you elaborate what kind of locking and contention issues you're 
thinking of?


You could update the index tuple to point to a newer version of the row, 
when an index scan determines that the heap tuple it points to is not 
visible to anyone. We already check that to update the XMAX_COMMITTED hint 
bit. Updating the index tuple comes with a cost, of course, but 
alleviates the follow that pointer issue.


The biggest challenge that I see is that an index scan would somehow 
need to know when to follow the t_ctid chain and when not. If you follow 
the pointer and there's another index tuple for the row, the scan could 
see the same tuple twice. Some kind of bookkeeping would be needed to 
solve that.


Also, vacuuming would become a bit more complex, since it would need to 
update the index tuples to point to newer row versions instead of just 
removing them.


All in all, I think this solution to the an update needs to update all 
indexes, even when none of the indexed columns changed issue requires 
less changes than implementing Oracle style rollback segments and/or an 
undo log.


- Heikki

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Mark Woodward

 On 6/23/2006 3:10 PM, Mark Woodward wrote:

 This is NOT an in-place update. The whole MVCC strategy of keeping old
 versions around doesn't change. The only thing that does change is one
 level of indirection. Rather than keep references to all versions of all
 rows in indexes, keep only a reference to the first or key row of each
 row, and have the first version of a row form the head of a linked list
 to
 subsequent versions of each row. The list will be in decending order.

 Where exactly do you intend to keep all those links (for a table with N
 indexes)?


I'm probably mistaken, but aren't there already forward references in
tuples to later versions? If so, I'm only sugesting reversing the order
and referencing the latest version.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Martijn van Oosterhout

On Sat, Jun 24, 2006 at 08:14:10AM -0400, Mark Woodward wrote:
  On 6/23/2006 3:10 PM, Mark Woodward wrote:
 
  This is NOT an in-place update. The whole MVCC strategy of keeping old
  versions around doesn't change. The only thing that does change is one
  level of indirection. Rather than keep references to all versions of all
  rows in indexes, keep only a reference to the first or key row of each
  row, and have the first version of a row form the head of a linked list
  to
  subsequent versions of each row. The list will be in decending order.
 
  Where exactly do you intend to keep all those links (for a table with N
  indexes)?
 
 
 I'm probably mistaken, but aren't there already forward references in
 tuples to later versions? If so, I'm only sugesting reversing the order
 and referencing the latest version.

You can't do that. The links exist so that in READ COMMITTED mode you
can always find the newest version. You would need to add additional
links to go backwards.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Martijn van Oosterhout

On Fri, Jun 23, 2006 at 03:10:39PM -0400, Mark Woodward wrote:
 This is NOT an in-place update. The whole MVCC strategy of keeping old
 versions around doesn't change. The only thing that does change is one
 level of indirection. Rather than keep references to all versions of all
 rows in indexes, keep only a reference to the first or key row of each
 row, and have the first version of a row form the head of a linked list to
 subsequent versions of each row. The list will be in decending order.

I thought of another issue with this. If you move away from storing
each row in the indexes, you can pretty much forget bitmap index scans.
They pretty much rely on every row being represented, not just a
subset. Until you go to the heap you don't know if a tuple will match,
which is precisely what the bitmap scan is trying to avoid. You could
follow the links, but that destroys the nice sequential access
properties.

 In the vast majority of cases, the overhead of this action will be
 trivial. In an unmodified row, you're there. In a modified row, you have
 one extra lookup. In extream cases, you may have to go back a few
 versions, but I don't see that as a common behavior.

I wonder if looking at old versions is really all that uncommon. A
large reporting query which runs for hours will probably be looking at
a lot of old versions. These are the queries that will be hit the
hardest.

If you're trying to avoid index bloat, I wonder if it wouldn't be
better to tackle this from the other end. In indexes, allow a row to
carry multiple CTIDs. That way a new version only requires adding six
more bytes, rather than a whole new tuple.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Heikki Linnakangas


On Sat, 24 Jun 2006, Mark Woodward wrote:


I'm probably mistaken, but aren't there already forward references in
tuples to later versions? If so, I'm only sugesting reversing the order
and referencing the latest version.


I thought I understood your idea, but now you lost me again. I thought 
what you want is that the older heap tuple has a pointer to the 
newer one. Which it already has, it's called t_ctid.


Can you try to explain more carefully how the whole thing would work? 
What would an index tuple point to? What pointers would a heap tuple 
have? What would an index scan do to find the row version it's interested 
in? What exactly would an update do?


- Heikki

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Mark Woodward

 On Sat, 24 Jun 2006, Mark Woodward wrote:

 I'm probably mistaken, but aren't there already forward references in
 tuples to later versions? If so, I'm only sugesting reversing the order
 and referencing the latest version.

 I thought I understood your idea, but now you lost me again. I thought
 what you want is that the older heap tuple has a pointer to the
 newer one. Which it already has, it's called t_ctid.

Perfect!

 Can you try to explain more carefully how the whole thing would work?
 What would an index tuple point to? What pointers would a heap tuple
 have? What would an index scan do to find the row version it's interested
 in? What exactly would an update do?


Since we already allocate space for some notion of linked list, then all
I'm suggesting is we reverse the order, sort of. Currently it looks like
this:

ver001-ver002-ver003-...-verN

That's what t_ctid does now, right? Well, that's sort of stupid. Why not
have it do this:

ver001-verN-...-ver003-ver002-|
 ^-/

This will speed up almost *all* queries when there are more than two
version of rows.

OK, here is the behavior of an update:
(1) Find the latest version of the row
(2) Duplicate row and modify as per plan
(3) Set the t_ctid of the new row to the last latest
(4) Set the t_ctid of the first row to that of the new row
(5) Attempt to index the row
(6) If the first version of the row is in the index already (ver001) Don't
modify the index, otherwise, add the new version (just as before)

When you vacuum, simply make the latest version (verN) the key row (ver001).

There are, no doubt, issues that need to be resolved (I can think of a
coouple off the top of my head), but overall I think it is workable and
don't think this will affect performance in the simple case and improve
performance in the cases where there are more than one or two version of a
row.

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jonah H. Harris


On 6/24/06, Mark Woodward [EMAIL PROTECTED] wrote:

Currently it looks like this:

ver001-ver002-ver003-...-verN

That's what t_ctid does now, right? Well, that's sort of stupid. Why not
have it do this:

ver001-verN-...-ver003-ver002-|


Heh, because that's crazy.  The first time you insert a key into the
index it will point to v1 of a tuple... say after 5 updates you have
v2,v3,v4,v5... your c_tid pointer chain looks like v1
(original)-v2-v3-v4-v5 (newest).  However, your whole idea is based
on not having to do another index insert for unchanged keys, so the
index still points to v1... which means you have to follow the c_tid
chain to get to the newest version just like a sequential scan.  I
don't see how you think you can reverse pointer it.


This will speed up almost *all* queries when there are more than two
version of rows.


Nope.


When you vacuum, simply make the latest version (verN) the key row (ver001).


How are you going to do this without a ton of locking... remember, the
index is pointing to v1 with a tid... so you'll have to physically
move the newest version v5 to v1's tid from wherever it was... like a
vacuum full on steroids.  Unless of course, you rebuild the index...
but that's not a solution either.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation| fax: 732.331.1301
33 Wood Ave S, 2nd Floor| [EMAIL PROTECTED]
Iselin, New Jersey 08830| http://www.enterprisedb.com/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jonah H. Harris


On 6/24/06, Jonah H. Harris [EMAIL PROTECTED] wrote:

Grr... need coffee... s/c_tid/ctid/g

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation| fax: 732.331.1301
33 Wood Ave S, 2nd Floor| [EMAIL PROTECTED]
Iselin, New Jersey 08830| http://www.enterprisedb.com/

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Mark Woodward

 On 6/24/06, Mark Woodward [EMAIL PROTECTED] wrote:
 Currently it looks like this:

 ver001-ver002-ver003-...-verN

 That's what t_ctid does now, right? Well, that's sort of stupid. Why not
 have it do this:

 ver001-verN-...-ver003-ver002-|

 Heh, because that's crazy.  The first time you insert a key into the
 index it will point to v1 of a tuple... say after 5 updates you have
 v2,v3,v4,v5... your c_tid pointer chain looks like v1
 (original)-v2-v3-v4-v5 (newest).  However, your whole idea is based
 on not having to do another index insert for unchanged keys, so the
 index still points to v1... which means you have to follow the c_tid
 chain to get to the newest version just like a sequential scan.  I
 don't see how you think you can reverse pointer it.

In the scenario, as previously outlined:

ver001-verN-...-ver003-ver2-|
  ^-/

The index points to version 1 (ver001) which points to the latest version
(verN).




 This will speed up almost *all* queries when there are more than two
 version of rows.

 Nope.

Of course it will.


 When you vacuum, simply make the latest version (verN) the key row
 (ver001).

 How are you going to do this without a ton of locking... remember, the
 index is pointing to v1 with a tid... so you'll have to physically
 move the newest version v5 to v1's tid from wherever it was... like a
 vacuum full on steroids.  Unless of course, you rebuild the index...
 but that's not a solution either.

I don't understand how you can assume this. In fact, it wil proably reduce
locking and disk IO by not having to modify indexes.
\

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jochem van Dieten


On 6/24/06, Mark Woodward wrote:


ver001-verN-...-ver003-ver002-|
 ^-/

This will speed up almost *all* queries when there are more than two
version of rows.

OK, here is the behavior of an update:
(1) Find the latest version of the row
(2) Duplicate row and modify as per plan
(3) Set the t_ctid of the new row to the last latest
(4) Set the t_ctid of the first row to that of the new row
(5) Attempt to index the row
(6) If the first version of the row is in the index already (ver001) Don't
modify the index, otherwise, add the new version (just as before)


I am not sure I understand your algorithm. If we take as a starting
point the following situation of a fresh tuple, in very schematic form
it looks like:

Heap:
TIDT_CTID   XMIN  XMAX  Col1   Col2
xxx1xxx1ttt1  null1  1

Index on Col1:
1xxx1

Index on Col2:
1xxx1



Now, after an update to this tuple changing the Value2 field, in what
state should the heap, index 1 and index 2 be? If I understand you
correctly, you want it to be:

Heap:
TIDT_CTID   XMIN  XMAX  Col1   Col2
xxx1xxx2ttt1  ttt21  1
xxx2xxx1ttt2  null1  2

Index on Col1:
1xxx2

Index on Col2:
1xxx1
2xxx2


Three questions:
1. Do I understand your intention correctly?
2. Could you extend this for an update to increment value2 (because
the T_CTID logic gets a bit unclear for me there).
3. The net benefit of this would be 1 less entry in the index on Col1?

Jochem

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Martijn van Oosterhout

On Sat, Jun 24, 2006 at 09:23:28AM -0400, Mark Woodward wrote:
  Can you try to explain more carefully how the whole thing would work?
  What would an index tuple point to? What pointers would a heap tuple
  have? What would an index scan do to find the row version it's interested
  in? What exactly would an update do?
 
 
 Since we already allocate space for some notion of linked list, then all
 I'm suggesting is we reverse the order, sort of. Currently it looks like
 this:
 
 ver001-ver002-ver003-...-verN
 
 That's what t_ctid does now, right? Well, that's sort of stupid. Why not
 have it do this:
 
 ver001-verN-...-ver003-ver002-|
  ^-/

You don't say where the index points or the order, but I'm assuming
from your diagram that ver1 is the oldest, verN is the newest.
Currently there is an index entry for each version, but in your version
there is only an index entry for ver1, right?

 This will speed up almost *all* queries when there are more than two
 version of rows.
 
 OK, here is the behavior of an update:
 (1) Find the latest version of the row

You mean, find the version of the row which satisfies your snapshot. If
the version pointed to by the index is it, you're done. Otherwise you
follow the chain. The most common option being one step, because ver01
is likely to be invisible.

 (2) Duplicate row and modify as per plan
 (3) Set the t_ctid of the new row to the last latest
 (4) Set the t_ctid of the first row to that of the new row
 (5) Attempt to index the row
 (6) If the first version of the row is in the index already (ver001) Don't
 modify the index, otherwise, add the new version (just as before)

This looks OK, I guess. I wouldn't know about locking...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jonah H. Harris


On 6/24/06, Mark Woodward [EMAIL PROTECTED] wrote:

In the scenario, as previously outlined:

ver001-verN-...-ver003-ver2-|
  ^-/


So you want to always keep an old version around?

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation| fax: 732.331.1301
33 Wood Ave S, 2nd Floor| [EMAIL PROTECTED]
Iselin, New Jersey 08830| http://www.enterprisedb.com/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Mark Woodward

 On 6/24/06, Mark Woodward [EMAIL PROTECTED] wrote:
 In the scenario, as previously outlined:

 ver001-verN-...-ver003-ver2-|
   ^-/

 So you want to always keep an old version around?

Prior to vacuum, it will be there anyway, and after vacuum, the new
version will become ver001.

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jonah H. Harris


On 6/24/06, Mark Woodward [EMAIL PROTECTED] wrote:

 On 6/24/06, Mark Woodward [EMAIL PROTECTED] wrote:
 In the scenario, as previously outlined:

 ver001-verN-...-ver003-ver2-|
   ^-/

 So you want to always keep an old version around?

Prior to vacuum, it will be there anyway, and after vacuum, the new
version will become ver001.


So you do intend to move verN into ver001's slot?  What about the
other conditions you had mentioned where you have to follow
PostgreSQL's current behavior?  How are you going to have a pointer
chain in that case?


--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation| fax: 732.331.1301
33 Wood Ave S, 2nd Floor| [EMAIL PROTECTED]
Iselin, New Jersey 08830| http://www.enterprisedb.com/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread PFC




What I see in this discussion is a huge amount of the grass must be
greener on the other side syndrome, and hardly any recognition that
every technique has its downsides and complications.


Sure ;)

	MVCC generates dead rows, by its very nature ; however I see two trends  
in this :


1* A large transaction that updates/deletes many rows.
	For instance suppose you UPDATE an entire table whose size is larger than  
memory.


	Old row versions have to be kept somewhere until commit, be it in the  
table itself or in some accessory undo-log.
	So, there will be a lot of harddisk grinding anyway, be it MVCC or  
Oracle-style, or whatever. MVCC will bloat the table and indexes, then  
VACUUM will shrink them. Update-in-place systems will bloat an undo log.


	It seems to me the current MVCC+VACUUM is the right tool for this job,  
requiring about the same amount of IO that the others.
	Vacuum scans sequentially, so it's the best way to process large volumes  
of data.


2* Many short transactions update rows in a table
Like the sessions problem, for instance.

Current VACUUM sucks for this case, I guess that's known.

---

	So, we have two different problems, and one tool which is adapted to one  
problem only. Should the tool (Vacuum) be fixed to handle both problems,  
making it more complex and difficult to maintain, or should another tool  
be created specifically for the second problem ?
	Problem 2 is very different from problem 1. The only case when they meet  
is when there is a continuous stream of small updates running concurrently  
with a long transaction.

So, what is the ideal tool for case 2 ?

	We'd want a vacuuming machine that can be run very often, or even better,  
continuously.
	The table can be large, larger than the disk cache, so scanning it is not  
an option.
	The updates are probably randomly distributed into the table. Therefore,  
VACUUMing a long time after these transactions are commited and the pages  
are no longer in cache would require a lot of random seeks, which is also  
bad.

Besides, we want this vacuum to be continuous and transparent.

	The best time to vacuum pages is, thus, when they are still in the  
background writer's memory, or the disk cache, waiting to be flushed to  
disk. There, they can be re-read, vacuumed and re-written with no seek  
penalty, only additional WAL traffic. However the amount of WAL traffic in  
bytes/s is less important that the frequency of WAL syncs. Emitting more  
WAL data shouldn't be a problem if those sync writes are coalesced with  
the sync writes of current reansactions.


	So, I guess the best solution for case 2 is to have the background writer  
perform on-the-fly VACUUM :


	An UPDATE or DELETE transaction hands over dirty pages to be written to  
the bgwriter. It also tells the bgwriter the ID of the current transaction  
and flags specifying if they contain candidate dead rows.
	The bgwriter should have a sufficiently large buffer in order to be able  
to keep these pages in memory until all the transactions that can see the  
dead rows in these pages are finished.

Then, the pages are vacuumed and written.

	The key is the size of the buffer. It should be large enough to contain  
enough pages so that it is actually possible to vacuum something out of  
them before writing them. However if the buffer capacity is exceeded (for  
instance, because there is a long running transaction), this is not a  
problem : the pages are simply written to disk normally, they will contain  
dead rows, which will need to be handled lated by the standard VACUUM.


	I think this would maximize code reuse by using the current bgwriter's  
architecture... did I miss something ?















---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Bruce Momjian

bruce wrote:
 Tom Lane wrote:
  Bruce Momjian [EMAIL PROTECTED] writes:
   I think at some point we have to admit that _polling_ the tables, which
   is what autovacuum does, just isn't going to work well, no matter how
   much it is tweeked, and another approach should be considered for
   certain workload cases.
  
  Autovacuum polls in its current, first-generation implementation;
  what I said upthread was it needs to be smarter than that.  I am not
  sure how you get from that to the conclusion that the very next step
  is to abandon the vacuuming approach altogether.
 
 I am not ready to abandon autovacuum, but as I stated later the UPDATE
 with no key change case is common enought that it could be handled
 better without involving autovacuum and its limitations.
 
 As I remember, most databases have problem with DELETE/INSERT cycles,
 but we seem to be hit by UPDATE performance more than most, and more
 than is wise.

In an attempt to get some concrete on these ideas...  ;-)

I think the major complexity in doing an in-place UPDATE when no key
columns change is allowing rollback on transaction abort (or backend
crash), and properly handling visibility for transactions in progress.

If the old and new rows are on the same heap page (perhaps a necessary
limitation), you can easily update the heap item id to point to the new
heap row.  All indexes will then point to the new row, and sequential
scans will still see both rows (which is good).  That leave the rollback
issue (undoing the item id change), and having index scans for current
backends still see the old row.

OK, I have an idea.  Right now, an UPDATE where the old and new rows are
on the same page have two index entries.  What if we made only one index
entry for both?  We already have UPDATE chaining, where the old row
points to the new one.  If no key columns change, we set a bit in the
heap that the chaining points to the old and new row (both on the same
page), so an index scan uses one index entry to see the old and new row,
and once the old row is no longer visible, the page index id can be
updated to point to the new row and the old row can be marked free and
perhaps used for a future UPDATE.  (UPDATE already tries to do keep
updates on the same heap page.)

FYI, the reason heap cleanup is possible once you go with a single index
entry for old and new rows is because there is no index cleanup (and
single-row index cleanup is very expensive).

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Mark Woodward

 On 6/24/06, Mark Woodward [EMAIL PROTECTED] wrote:
  On 6/24/06, Mark Woodward [EMAIL PROTECTED] wrote:
  In the scenario, as previously outlined:
 
  ver001-verN-...-ver003-ver2-|
^-/
 
  So you want to always keep an old version around?

 Prior to vacuum, it will be there anyway, and after vacuum, the new
 version will become ver001.

 So you do intend to move verN into ver001's slot?  What about the
 other conditions you had mentioned where you have to follow
 PostgreSQL's current behavior?  How are you going to have a pointer
 chain in that case?

Who said anything about moving anything. When vacuum comes along, it
cleans out previous versions of rows. Very little will change.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Heikki Linnakangas


On Sat, 24 Jun 2006, Bruce Momjian wrote:


OK, I have an idea.  Right now, an UPDATE where the old and new rows are
on the same page have two index entries.  What if we made only one index
entry for both?  We already have UPDATE chaining, where the old row
points to the new one.  If no key columns change, we set a bit in the
heap that the chaining points to the old and new row (both on the same
page), so an index scan uses one index entry to see the old and new row,
and once the old row is no longer visible, the page index id can be
updated to point to the new row and the old row can be marked free and
perhaps used for a future UPDATE.  (UPDATE already tries to do keep
updates on the same heap page.)


In fact, that's what I originally thought Mark was suggesting. A couple of 
points:


Why the limitation of old and new row being on the same page?

This only works if none of the updated columns are indexed. That's a bit 
annoying. It would be nice to be able to create new index tuples in those 
indexes that contain one of the changed columns, but not in others.


What happens if you create a new index that contains one of 
the changed columns?


- Heikki

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jan Wieck


On 6/24/2006 9:23 AM, Mark Woodward wrote:


On Sat, 24 Jun 2006, Mark Woodward wrote:


I'm probably mistaken, but aren't there already forward references in
tuples to later versions? If so, I'm only sugesting reversing the order
and referencing the latest version.


I thought I understood your idea, but now you lost me again. I thought
what you want is that the older heap tuple has a pointer to the
newer one. Which it already has, it's called t_ctid.


Perfect!


Can you try to explain more carefully how the whole thing would work?
What would an index tuple point to? What pointers would a heap tuple
have? What would an index scan do to find the row version it's interested
in? What exactly would an update do?



Since we already allocate space for some notion of linked list, then all
I'm suggesting is we reverse the order, sort of. Currently it looks like
this:

ver001-ver002-ver003-...-verN

That's what t_ctid does now, right? Well, that's sort of stupid. Why not
have it do this:

ver001-verN-...-ver003-ver002-|
 ^-/

This will speed up almost *all* queries when there are more than two
version of rows.

OK, here is the behavior of an update:
(1) Find the latest version of the row
(2) Duplicate row and modify as per plan
(3) Set the t_ctid of the new row to the last latest
(4) Set the t_ctid of the first row to that of the new row
(5) Attempt to index the row
(6) If the first version of the row is in the index already (ver001) Don't
modify the index, otherwise, add the new version (just as before)

When you vacuum, simply make the latest version (verN) the key row (ver001).


This isn't done simply. Currently, vacuum collects a trivial array of 
ctid's it is removing and every now and then does a bulk remove of the 
index tuples pointing to them. Now lets consider a table with two 
indexed columns with the following row versions resulting from an insert 
and 3 updates to that same row:


  v1:  a,b
  v2:  a,c
  v3:  a,d
  v4:  b,d

In your new scheme, there would be two index tuples for column 1 (one 
pointing to v1, one pointing to v4) and 3 index tuples for column 2 (one 
for each different value pointing to v1, v2 and v3). Did I get that 
right so far?


If vacuum now can remove v1, it has to update index 1 to point to v2 and 
remove the pointer to v1 from index 2. If it can remove v1 and v2, it 
has to update index 1 to point to v3 and remove v1 and v2 from index 2. 
If it can remove v1, v2 and v3 it must delete the index 1 tuple pointing 
to v1, delete the index 2 entries pointing to v1 and v2 and update the 
index 2 entry for v3 to point to v4. Figuring out which index tuples to 
remove and which ones to update can only be done by comparing each and 
every indexed columns old and new values. To do so, vacuum will have to 
fetch all the row versions, which can be scattered all over the place, 
with all possible locking issues including but not limited to deadlocks.



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Hannu Krosing

Ühel kenal päeval, R, 2006-06-23 kell 13:08, kirjutas Tom Lane:
 Csaba Nagy [EMAIL PROTECTED] writes:
  Surprisingly its mostly WAL traffic, the heap/index pages themselves are
  often not yet synced to disk by time of vacuum, so no additional traffic
  there. If you had made 5 updates per page and then vacuum it, then you
  make effectively 1 extra WAL write meaning 20% increase in WAL traffic. 
 
  Is this also holding about read traffic ? I thought vacuum will make a
  full table scan... for big tables a full table scan is always badly
  influencing the performance of the box. If the full table scan would be
  avoided, then I wouldn't mind running vacuum in a loop... 
 
 If you're doing heavy updates of a big table then it's likely to end up
 visiting most of the table anyway, no?  There is talk of keeping a map
 of dirty pages, but I think it'd be a win for infrequently-updated
 tables, not ones that need constant vacuuming.
 
 I think a lot of our problems in this area could be solved with fairly
 straightforward tuning efforts on the existing autovacuum
 infrastructure.  In particular, someone should be looking into
 recommendable default vacuum-cost-delay settings so that a background
 vacuum doesn't affect performance too much.

One thing that would help updates quite a lot in some scenarios is
keeping the pages only partially-filled, so that most updates could keep
the new version in the same page. I think that has also been discussed
as an option to vacuum and maybe as part of initial inserts. Maybe some
of it even ended up as a todo item.

 Another problem with the
 current autovac infrastructure is that it doesn't respond very well to
 the case where there are individual tables that need constant attention
 as well as many that don't.  If you have N databases then you can visit
 a particular table at most once every N*autovacuum_naptime seconds, and
 *every* table in the entire cluster gets reconsidered at that same rate.
 I'm not sure if we need the ability to have multiple autovac daemons
 running at the same time, 

My patch enabling effective continuous vacuum of fast-update tables,
while still being able to vacuum huge slowly changing ones is still not
applied. Without that patch there is no reason to vacuum the small and
fast changingg tables while vacuum on bigger tables is running, as it
won't clean out dead tuples anyway.

 but we definitely could use something with a
 more flexible table-visiting pattern.  Perhaps it would be enough to
 look through the per-table stats for each database before selecting the
 database to autovacuum in each cycle, instead of going by least
 recently autovacuumed.
 
 Bottom line: there's still lots of low-hanging fruit.  Why are people
 feeling that we need to abandon or massively complicate our basic
 architecture to make progress?

Maybe we could start from reusing the index tuples which point to
invisible tuples ? The index is not MVCC anyway, so maybe it is easier
to do in-place replacement there ?

This probably has the same obstacles which have prevented us from
removing those in the first place (removing instead of marking as
invisible). Does it cause some locking issues ? Or does it go against
some other constraints of our index lookups ?

I think that just setting the invisible bit in an index leaf node causes
nearly as much disk io as removing the node.

If we could delete/reuse old index tuples, it would solve a sizable
chunk of index-growth problem, especially for cases where referenced key
value does not change.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Hannu Krosing

Ühel kenal päeval, R, 2006-06-23 kell 17:27, kirjutas Bruce Momjian:
 Jonah H. Harris wrote:
  On 6/23/06, Tom Lane [EMAIL PROTECTED] wrote:
   What I see in this discussion is a huge amount of the grass must be
   greener on the other side syndrome, and hardly any recognition that
   every technique has its downsides and complications.
  
  I'm being totally objective.  I don't think we should abandon
  PostgreSQL's overall design at all, because we do perform INSERTs and
  DELETEs much better than most systems.  However, I've looked at many
  systems and how they implement UPDATE so that it is a scalable
  operation.  Sure, there are costs and benefits to each implementation,
  but I think we have some pretty brilliant people in this community and
  can come up with an elegant design for scalable UPDATEs.
 
 I think the UPDATE case is similar to the bitmap index scan or perhaps
 bitmap indexes on disk --- there are cases we know can not be handled
 well by our existing code, so we have added (or might add) these
 features to try to address those difficult cases.

Not really. Bitmap index scan and bitmap index are both new additions
working well with existing framework. 

While the problem of slowdown on frequent updates is real, the suggested
fix is just plain wrong, as it is based on someones faulty assumption on
how index lookup works, and very much simplified view of how different
parts of the system work to implement MVCC.

The original fix he suggests was to that imagined behaviour and thus
ignored all the real problems of such change.

All the next suggestions were variations of the first ones, and failed
to address or even research any problems brought up.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Jan Wieck


On 6/22/2006 2:37 PM, Alvaro Herrera wrote:


Adding back pgsql-hackers.

Mark Woodward wrote:

 Mark Woodward wrote:

 Hmm, OK, then the problem is more serious than I suspected.
 This means that every index on a row has to be updated on every
 transaction that modifies that row. Is that correct?

 Add an index entry, yes.

 I am attaching some code that shows the problem with regard to
 applications such as web server session management, when run, each
 second
 the system can handle fewer and fewer connections. Here is a brief
 output:
 [...]
 There has to be a more linear way of handling this scenario.

 So vacuum the table often.

That fixes the symptom, not the problem. The problem is performance
steadily degrades over time.


No, you got it backwards.  The performance degradation is the symptom.
The problem is that there are too many dead tuples in the table.  There
is one way to solve that problem -- remove them, which is done by
running vacuum.


Precisely.


There are some problems with vacuum itself, that I agree with.  For
example it would be good if a long-running vacuum wouldn't affect a
vacuum running in another table because of the long-running transaction
effect it has.  It would be good if vacuum could be run partially over a
table.  It would be good if there was a way to speed up vacuum by using
a dead space map or something.


It would be good if vacuum wouldn't waste time on blocks that don't have 
any possible work in them. Vacuum has two main purposes. A) remove dead 
rows and B) freeze xids. Once a block has zero deleted rows and all xids 
are frozen, there is nothing to do with this block and vacuum should 
skip it until a transaction updates that block.


This requires 2 bits per block, which is 32K per 1G segment of a heap. 
Clearing the bits is done when the block is marked dirty. This way 
vacuum would not waste any time and IO on huge slow changing tables. 
That part, sequentially scanning huge tables that didn't change much is 
what keeps us from running vacuum every couple of seconds.



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Hannu Krosing

Ühel kenal päeval, L, 2006-06-24 kell 15:44, kirjutas Jan Wieck:

  That fixes the symptom, not the problem. The problem is performance
  steadily degrades over time.
  
  No, you got it backwards.  The performance degradation is the symptom.
  The problem is that there are too many dead tuples in the table.  There
  is one way to solve that problem -- remove them, which is done by
  running vacuum.
 
 Precisely.
 
  There are some problems with vacuum itself, that I agree with.  For
  example it would be good if a long-running vacuum wouldn't affect a
  vacuum running in another table because of the long-running transaction
  effect it has.  It would be good if vacuum could be run partially over a
  table.  It would be good if there was a way to speed up vacuum by using
  a dead space map or something.
 
 It would be good if vacuum wouldn't waste time on blocks that don't have 
 any possible work in them. Vacuum has two main purposes. A) remove dead 
 rows and B) freeze xids. Once a block has zero deleted rows and all xids 
 are frozen, there is nothing to do with this block and vacuum should 
 skip it until a transaction updates that block.
 
 This requires 2 bits per block, which is 32K per 1G segment of a heap. 
 Clearing the bits is done when the block is marked dirty. This way 
 vacuum would not waste any time and IO on huge slow changing tables. 
 That part, sequentially scanning huge tables that didn't change much is 
 what keeps us from running vacuum every couple of seconds.

Seems like a plan. 

Still, there is another problem which is not solved by map approach
only, at least with current implementation of vacuum.

This is the fact that we need to do full scan over index(es) to clean up
pointers to removed tuples. And huge tables tend to have huge indexes.

As indexes have no MVCC info inside them, it may be possible to start
reusing index entries pointing to rows that are invisible to all running
transactions. Currently we just mark these index entries as dead, but
maybe there is a way to reuse them. This could solve the index bloat
problem for may cases.

Another possible solution for indexes with mostly dead pointers is doing
a reindex, but this will become possible only after we have implemented
a concurrent, non-blocking CREATE INDEX.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Martijn van Oosterhout

On Sat, Jun 24, 2006 at 10:04:43PM +0300, Hannu Krosing wrote:
 Maybe we could start from reusing the index tuples which point to
 invisible tuples ? The index is not MVCC anyway, so maybe it is easier
 to do in-place replacement there ?
 
 This probably has the same obstacles which have prevented us from
 removing those in the first place (removing instead of marking as
 invisible). Does it cause some locking issues ? Or does it go against
 some other constraints of our index lookups ?

The problem with updating an index is that you have to do it in a way
that concurrent scans (forwards and backwards) don't get confused
because the tuple they stopped on vanished.

AIUI, the current approach is two step. The first time round you mark
it deleted but don't actually delete it. Thus, any scan currently
stopped on that tuple won't have a problem. Sometime later you remove
the actual tuple, once you know there's no scan stopped on it (because
no scan will deliberatly stop on a deleted tuple).

I forget the actual locking steps that ensure this though.

 If we could delete/reuse old index tuples, it would solve a sizable
 chunk of index-growth problem, especially for cases where referenced key
 value does not change.

I think we recently changed the code to always scan an index a page at
a time so maybe scans no longer stop in the middle of a page anymore...
Or perhaps that was VACUUM only.

Have a noce day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Christopher Browne

Martha Stewart called it a Good Thing when [EMAIL PROTECTED] (Jan Wieck) wrote:
 On 6/22/2006 2:37 PM, Alvaro Herrera wrote:

 Adding back pgsql-hackers.
 Mark Woodward wrote:
  Mark Woodward wrote:
 
  Hmm, OK, then the problem is more serious than I suspected.
  This means that every index on a row has to be updated on every
  transaction that modifies that row. Is that correct?
 
  Add an index entry, yes.
 
  I am attaching some code that shows the problem with regard to
  applications such as web server session management, when run, each
  second
  the system can handle fewer and fewer connections. Here is a brief
  output:
  [...]
  There has to be a more linear way of handling this scenario.
 
  So vacuum the table often.
 That fixes the symptom, not the problem. The problem is performance
 steadily degrades over time.
 No, you got it backwards.  The performance degradation is the
 symptom.
 The problem is that there are too many dead tuples in the table.  There
 is one way to solve that problem -- remove them, which is done by
 running vacuum.

 Precisely.

 There are some problems with vacuum itself, that I agree with.  For
 example it would be good if a long-running vacuum wouldn't affect a
 vacuum running in another table because of the long-running transaction
 effect it has.  It would be good if vacuum could be run partially over a
 table.  It would be good if there was a way to speed up vacuum by using
 a dead space map or something.

 It would be good if vacuum wouldn't waste time on blocks that don't
 have any possible work in them. Vacuum has two main purposes. A)
 remove dead rows and B) freeze xids. Once a block has zero deleted
 rows and all xids are frozen, there is nothing to do with this block
 and vacuum should skip it until a transaction updates that block.

 This requires 2 bits per block, which is 32K per 1G segment of a
 heap. Clearing the bits is done when the block is marked dirty. This
 way vacuum would not waste any time and IO on huge slow changing
 tables. That part, sequentially scanning huge tables that didn't
 change much is what keeps us from running vacuum every couple of
 seconds.

This is, in effect, the VACUUM Space Map.

I see one unfortunate thing about that representation of it, namely
that it would in effect require that non-frozen pages be kept on the
VSM for potentially a long time.

Based on *present* VACUUM strategy, at least.

Would it not be the case, here, that any time a page could be
frozen, it would have to be?  In effect, we are always trying to run
VACUUM FREEZE?
-- 
output = (cbbrowne @ gmail.com)
http://cbbrowne.com/info/finances.html
Rules  of the  Evil  Overlord #72.  If  all the  heroes are  standing
together around  a strange device and  begin to taunt me,  I will pull
out a conventional weapon  instead of using my unstoppable superweapon
on them. http://www.eviloverlord.com/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Bruce Momjian

Heikki Linnakangas wrote:
 On Sat, 24 Jun 2006, Bruce Momjian wrote:
 
  OK, I have an idea.  Right now, an UPDATE where the old and new rows are
  on the same page have two index entries.  What if we made only one index
  entry for both?  We already have UPDATE chaining, where the old row
  points to the new one.  If no key columns change, we set a bit in the
  heap that the chaining points to the old and new row (both on the same
  page), so an index scan uses one index entry to see the old and new row,
  and once the old row is no longer visible, the page index id can be
  updated to point to the new row and the old row can be marked free and
  perhaps used for a future UPDATE.  (UPDATE already tries to do keep
  updates on the same heap page.)
 
 In fact, that's what I originally thought Mark was suggesting. A couple of 
 points:
 
 Why the limitation of old and new row being on the same page?

Because having them be on the same page is the only way you can update
the page item pointer so when you recycle the row, you the indexes are
now pointing to the new version.  Pages look like:

[marker][item1][item2][item3]...[tuple1][tuple2][tuple3]

and indexes only point to items, not to tuples.  This allows tuples to
be compacted on the page without affecting the indexes.

If tuple1 is updated to tuple2, once tuple1 is no longer visible to any
backends, you can modify item1 to point to tuple2, and you can mark the
space used by tuple1 as reusable:

[marker][item1(tuple2)][item2][item3]...[free][tuple2][tuple3]

 This only works if none of the updated columns are indexed. That's a bit 
 annoying. It would be nice to be able to create new index tuples in those 

The hope is that a commonly updated tuple will eventually be on a page
where there is sufficient free space for updated version to stay on
there.  For an active server, there might be several updated versions of
rows on the same page.

 indexes that contain one of the changed columns, but not in others.

If you can't expire the old row because one of the indexed columns was
modified, I see no reason to try to reduce the additional index entries.

 What happens if you create a new index that contains one of 
 the changed columns?

Uh, I had not thought of that.  You could easily create two index
entries for the old and new rows, but then the heap bit saying there is
only one index row would be inaccurate for the new index.  I suppose you
could create new rows in all indexes and clear the heap bit.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Alvaro Herrera

Mark Woodward wrote:

 The update behavior of PostgreSQL is probably the *last* serious issue.
 Debate all you want, vacuum mitigates the problem to varying levels,
 fixing the problem will be a huge win. If the update behavior gets fixed,
 I can't think of a single issue with postgresql that would be a show
 stopper.

Nah, it's just *your* pet peeve.  Everyone has theirs.  Some people may
share yours, of course.  I agree it's a problem, but from there to
saying it's _the last_ issue there's a lot of distance.


Your idea of reusing a tuple's self pointer (t_ctid) does not work BTW,
because the self pointer must point to self.  The case where the pointer
does not point to exactly the same tuple, it must point to a newer
version.  If you change that invariant, a lot of things break; see for
example heap_get_lastest_tid.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-24 Thread Bruce Momjian

Hannu Krosing wrote:
 ?hel kenal p?eval, R, 2006-06-23 kell 13:08, kirjutas Tom Lane:
  Csaba Nagy [EMAIL PROTECTED] writes:
   Surprisingly its mostly WAL traffic, the heap/index pages themselves are
   often not yet synced to disk by time of vacuum, so no additional traffic
   there. If you had made 5 updates per page and then vacuum it, then you
   make effectively 1 extra WAL write meaning 20% increase in WAL traffic. 
  
   Is this also holding about read traffic ? I thought vacuum will make a
   full table scan... for big tables a full table scan is always badly
   influencing the performance of the box. If the full table scan would be
   avoided, then I wouldn't mind running vacuum in a loop... 
  
  If you're doing heavy updates of a big table then it's likely to end up
  visiting most of the table anyway, no?  There is talk of keeping a map
  of dirty pages, but I think it'd be a win for infrequently-updated
  tables, not ones that need constant vacuuming.
  
  I think a lot of our problems in this area could be solved with fairly
  straightforward tuning efforts on the existing autovacuum
  infrastructure.  In particular, someone should be looking into
  recommendable default vacuum-cost-delay settings so that a background
  vacuum doesn't affect performance too much.
 
 One thing that would help updates quite a lot in some scenarios is
 keeping the pages only partially-filled, so that most updates could keep
 the new version in the same page. I think that has also been discussed
 as an option to vacuum and maybe as part of initial inserts. Maybe some
 of it even ended up as a todo item.

We have a patch in the queue for index fillfactor which will be in 8.2.
I am also hoping the frequently updated rows will migrate out to the
empty pages.

  Another problem with the
  current autovac infrastructure is that it doesn't respond very well to
  the case where there are individual tables that need constant attention
  as well as many that don't.  If you have N databases then you can visit
  a particular table at most once every N*autovacuum_naptime seconds, and
  *every* table in the entire cluster gets reconsidered at that same rate.
  I'm not sure if we need the ability to have multiple autovac daemons
  running at the same time, 
 
 My patch enabling effective continuous vacuum of fast-update tables,
 while still being able to vacuum huge slowly changing ones is still not
 applied. Without that patch there is no reason to vacuum the small and
 fast changingg tables while vacuum on bigger tables is running, as it
 won't clean out dead tuples anyway.

I think it will be applied, but I am looking for someone else to eyeball
it since Tom has come concerns.

  but we definitely could use something with a
  more flexible table-visiting pattern.  Perhaps it would be enough to
  look through the per-table stats for each database before selecting the
  database to autovacuum in each cycle, instead of going by least
  recently autovacuumed.
  
  Bottom line: there's still lots of low-hanging fruit.  Why are people
  feeling that we need to abandon or massively complicate our basic
  architecture to make progress?
 
 Maybe we could start from reusing the index tuples which point to
 invisible tuples ? The index is not MVCC anyway, so maybe it is easier
 to do in-place replacement there ?
 
 This probably has the same obstacles which have prevented us from
 removing those in the first place (removing instead of marking as
 invisible). Does it cause some locking issues ? Or does it go against
 some other constraints of our index lookups ?
 
 I think that just setting the invisible bit in an index leaf node causes
 nearly as much disk io as removing the node.
 
 If we could delete/reuse old index tuples, it would solve a sizable
 chunk of index-growth problem, especially for cases where referenced key
 value does not change.

I think heap _and_ index reuse is the only useful direction.  Index or
heap reuse alone seems too marginal for the added complexity.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-23 Thread Csaba Nagy

  Good advice, except if the table is huge :-)
 
 ... Then the table shouldn't be designed to be huge.  That represents
 a design error.
[snip]
 This demonstrates that archival material and active data should be
 kept separately.
 
 They have different access patterns; kludging them into the same table
 turns out badly.

Well, then please help me find a better design cause I can't see one...
what we have here is a big membership table of email lists. When
there's a sendout then the memberships of the affected group are heavily
read/updated, otherwise they are idle. None of the memberships is
archive data, they are all active data... the only problem is that they
are so many. Is it so hard to believe that 100 million rows is all
active data, but only used in bursts once per week (that's an example,
some groups are more active, others less) ?

Cheers,
Csaba.



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-23 Thread PFC




Well, then please help me find a better design cause I can't see one...
what we have here is a big membership table of email lists. When
there's a sendout then the memberships of the affected group are heavily
read/updated, otherwise they are idle. None of the memberships is
archive data, they are all active data... the only problem is that they
are so many. Is it so hard to believe that 100 million rows is all
active data, but only used in bursts once per week (that's an example,
some groups are more active, others less) ?


	I suppose you have a table memberships (user_id, group_id) or something  
like it ; it should have as few columns as possible ; then try regularly  
clustering on group_id (maybe once a week) so that all the records for a  
particular group are close together. Getting the members of a group to  
send them an email should be faster (less random seeks).


	For tables with very few small fields (like a few integers) the  
26-something bytes row overhead is significant ; MySQL can be faster  
because MyISAM tables have no transaction support and thus have very  
little things to store besides actual row data, and the table can then fit  
in RAM...


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq

Re: [HACKERS] vacuum, performance, and MVCC

2006-06-23 Thread Csaba Nagy

   I suppose you have a table memberships (user_id, group_id) or something 
  
 like it ; it should have as few columns as possible ; then try regularly  
 clustering on group_id (maybe once a week) so that all the records for a  
 particular group are close together. Getting the members of a group to  
 send them an email should be faster (less random seeks).

It is like this, and some more bookkeeping data which must be there...
we could split the table for smaller records or for updatable/stable
fields, but at the end of the day it doesn't make much sense, usually
all the data is needed and I wonder if more big/shallow tables instead
of one big/wider makes sense...

Regularly clustering is out of question as it would render the system
unusable for hours. There's no 0 activity hour we could use for such
stuff. There's always something happening, only the overall load is
smaller at night...

Thanks,
Csaba.



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

1 2 >

1 - 100 of 199 matches

Mail list logo