Re: [HACKERS] CLUSTER and MVCC

2007-03-22 Thread Bruce Momjian
Bruce Momjian wrote:
 Tom Lane wrote:
  Heikki Linnakangas [EMAIL PROTECTED] writes:
   We wouldn't clean up tuples that are visible to a transaction, but if 
   you have one long-running transaction like pg_dump in a database with 
   otherwise short transactions, you'll have a lot of tuples that are not 
   vacuumable because of the long-running process, but are not in fact 
   visible to any transaction.
  
  It sounds to me like you are proposing to remove the middles of update
  chains, which would break READ-COMMITTED updates initiated by the older
  transactions.  Now admittedly pg_dump isn't going to issue any such
  updates, but VACUUM doesn't know that.
 
 Since a multi-statement transaction can't change its transaction
 isolation level after its first statement, would adding a boolean to
 PGPROC help VACUUM be more aggressive about removing rows?  I am
 thinking something like PGPROC.cannot_be_serializable.

In researching, I found we already do this by updating PGPROC.xid for
every command in non-serializable transactions:

 * GetTransactionSnapshot
 *  Get the appropriate snapshot for a new query in a transaction.
 *
 * The SerializableSnapshot is the first one taken in a transaction.
 * In serializable mode we just use that one throughout the transaction.
 * In read-committed mode, we take a new snapshot each time we are called.


-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] CLUSTER and MVCC

2007-03-22 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 In researching, I found we already do this by updating PGPROC.xid for
 every command in non-serializable transactions:

Dunno how you arrived at that conclusion, but it's quite wrong.

regards, tom lane



Re: [HACKERS] CLUSTER and MVCC

2007-03-22 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  In researching, I found we already do this by updating PGPROC.xid for
  every command in non-serializable transactions:
 
 Dunno how you arrived at that conclusion, but it's quite wrong.

Looking in the function I now see you are right:

if (serializable)
    MyProc->xmin = TransactionXmin = xmin;
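
To restate that as a toy model (this is just an illustration of the behavior,
not the real snapshot code; the Toy* names are made up): the xmin a backend
advertises in PGPROC comes from the first snapshot it takes and is never
advanced afterwards, even when later read-committed statements take fresh
snapshots:

    #include <stdint.h>

    typedef uint32_t TransactionId;
    #define InvalidTransactionId 0

    /* toy stand-ins, not the real PGPROC / Snapshot structs */
    typedef struct { TransactionId xmin; } ToyProc;
    typedef struct { TransactionId xmin; } ToySnapshot;

    static ToyProc MyToyProc = { InvalidTransactionId };

    /* first snapshot of the transaction: advertise its xmin for VACUUM to see */
    static ToySnapshot take_first_snapshot(TransactionId oldest_running_xid)
    {
        ToySnapshot snap = { oldest_running_xid };
        MyToyProc.xmin = snap.xmin;     /* pinned for the whole transaction */
        return snap;
    }

    /* later statement in read committed: fresh snapshot, advertised xmin stays put */
    static ToySnapshot take_statement_snapshot(TransactionId oldest_running_xid)
    {
        ToySnapshot snap = { oldest_running_xid };
        /* MyToyProc.xmin is deliberately NOT advanced here */
        return snap;
    }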

So, can't this be improved to allow more aggressive vacuuming?

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] CLUSTER and MVCC

2007-03-22 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 So, can't this be improved to allow more aggressive vacuuming?

Not at that level.  We do not keep track of the oldest still-used
snapshot in a transaction.  I'm dubious that it'd be worth the
bookkeeping trouble to try --- often as not, the problem with a
long running transaction is that it's a long running statement,
anyway.

regards, tom lane



Re: [HACKERS] CLUSTER and MVCC

2007-03-19 Thread Bruce Momjian
Tom Lane wrote:
 Heikki Linnakangas [EMAIL PROTECTED] writes:
  We wouldn't clean up tuples that are visible to a transaction, but if 
  you have one long-running transaction like pg_dump in a database with 
  otherwise short transactions, you'll have a lot of tuples that are not 
  vacuumable because of the long-running process, but are not in fact 
  visible to any transaction.
 
 It sounds to me like you are proposing to remove the middles of update
 chains, which would break READ-COMMITTED updates initiated by the older
 transactions.  Now admittedly pg_dump isn't going to issue any such
 updates, but VACUUM doesn't know that.

Since a multi-statement transaction can't change its transaction
isolation level after its first statement, would adding a boolean to
PGPROC help VACUUM be more aggressive about removing rows?  I am
thinking something like PGPROC.cannot_be_serializable.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] CLUSTER and MVCC

2007-03-15 Thread Heikki Linnakangas

Heikki Linnakangas wrote:
One potential issue I'm seeing is that if we rely on the unbroken chain 
starting from a tuple with xmin < OldestXmin, and that tuple isn't there 
because of a bug, for example, the later version of the tuple is skipped 
and the row is lost.


After further thought, I don't feel comfortable with the idea because of 
 the above loss of robustness.


I'm thinking of keeping an in-memory mapping of old and new tids of 
updated tuples while clustering, instead. That means that cluster 
requires a little bit of memory for each RECENTLY_DEAD updated tuple. In 
the worst case that means that you run out of memory if there's too many 
of those in the table, but I doubt that's going to be a problem in practice.
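
To give an idea of the bookkeeping involved, here's a simplified sketch of
such a mapping (made-up types and a fixed-size table just for illustration;
a real patch would presumably use a dynamically grown hash table):

    #include <stdint.h>

    /* simplified stand-in for ItemPointerData: block number + line pointer */
    typedef struct { uint32_t blkno; uint16_t offnum; } TidKey;

    typedef struct { TidKey old_tid; TidKey new_tid; int used; } TidMapEntry;

    #define TIDMAP_SIZE 8192                /* fixed here only for illustration */
    static TidMapEntry tidmap[TIDMAP_SIZE];

    static uint32_t tid_hash(TidKey t)
    {
        return (t.blkno * 2654435761u) ^ t.offnum;
    }

    /* remember that the RECENTLY_DEAD tuple at old_tid was copied to new_tid */
    static int tidmap_put(TidKey old_tid, TidKey new_tid)
    {
        uint32_t i = tid_hash(old_tid) % TIDMAP_SIZE;

        for (int probes = 0; probes < TIDMAP_SIZE; probes++)
        {
            if (!tidmap[i].used)
            {
                tidmap[i].old_tid = old_tid;
                tidmap[i].new_tid = new_tid;
                tidmap[i].used = 1;
                return 1;
            }
            i = (i + 1) % TIDMAP_SIZE;      /* linear probing */
        }
        return 0;                           /* full: a real patch would grow it */
    }

    /* look up where an old chain member went, to fix the previous t_ctid link */
    static int tidmap_get(TidKey old_tid, TidKey *new_tid)
    {
        uint32_t i = tid_hash(old_tid) % TIDMAP_SIZE;

        for (int probes = 0; probes < TIDMAP_SIZE && tidmap[i].used; probes++)
        {
            if (tidmap[i].old_tid.blkno == old_tid.blkno &&
                tidmap[i].old_tid.offnum == old_tid.offnum)
            {
                *new_tid = tidmap[i].new_tid;
                return 1;
            }
            i = (i + 1) % TIDMAP_SIZE;
        }
        return 0;
    }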


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-15 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 I'm thinking of keeping an in-memory mapping of old and new tids of 
 updated tuples while clustering, instead. That means that cluster 
 requires a little bit of memory for each RECENTLY_DEAD updated tuple. In 
 the worst case that means that you run out of memory if there's too many 
 of those in the table, but I doubt that's going to be a problem in practice.

That is more or less isomorphic to what VACUUM FULL does.  While people
have complained about VACUUM FULL's memory usage on occasion, just at
the moment I feel that the main problem with it is complexity.  If we
still haven't gotten all the bugs out of VACUUM FULL after more than
eight years of work on it, what are the odds that we can make CLUSTER
do it right the first time?

regards, tom lane



Re: [HACKERS] CLUSTER and MVCC

2007-03-15 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
I'm thinking of keeping an in-memory mapping of old and new tids of 
updated tuples while clustering, instead. That means that cluster 
requires a little bit of memory for each RECENTLY_DEAD updated tuple. In 
the worst case that means that you run out of memory if there's too many 
of those in the table, but I doubt that's going to be a problem in practice.


That is more or less isomorphic to what VACUUM FULL does.  While people
have complained about VACUUM FULL's memory usage on occasion, just at
the moment I feel that the main problem with it is complexity.  If we
still haven't gotten all the bugs out of VACUUM FULL after more than
eight years of work on it, what are the odds that we can make CLUSTER
do it right the first time?


Well, I can't guarantee that there are no bugs.

To copy a chain correctly, we need to correctly detect tuples that have 
a t_ctid pointing to a non-dead tuple (non-dead meaning 
HeapTupleSatisfiesVacuum(tuple) != DEAD), and tuples that are being 
pointed to by a non-dead tuple. If we incorrectly detect that a tuple 
belongs to either of those categories, when in fact it doesn't, we don't 
corrupt anything, but we waste a little bit of memory memorizing the 
tuple unnecessarily.


To detect tuples in the first category, we need to check that xmax of 
the tuple isn't invalid, and t_ctid doesn't point to itself.


To detect tuples in the second category, we need to check that xmin 
isn't invalid, and is greater than OldestXmin.


With both categories correctly identified, it's just a matter of mapping 
old ctids to corresponding tids in the new heap.
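
As a sketch, with made-up stand-ins for the tuple header fields (not the real
HeapTupleHeader macros, and ignoring that real xid comparison is circular),
the two checks would look something like this; false positives are harmless,
we just memorize a tuple unnecessarily:

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t TransactionId;
    #define InvalidTransactionId 0

    /* made-up, simplified stand-in for a heap tuple header */
    typedef struct
    {
        TransactionId xmin;
        TransactionId xmax;
        uint32_t      ctid_blkno;       /* where t_ctid points */
        uint16_t      ctid_offnum;
        uint32_t      self_blkno;       /* the tuple's own location */
        uint16_t      self_offnum;
    } ToyTuple;

    /* category 1: t_ctid may lead to a newer, non-dead version elsewhere */
    static bool points_to_newer_version(const ToyTuple *tup)
    {
        bool ctid_is_self = (tup->ctid_blkno == tup->self_blkno &&
                             tup->ctid_offnum == tup->self_offnum);

        return tup->xmax != InvalidTransactionId && !ctid_is_self;
    }

    /* category 2: may be pointed to by an older, non-dead version */
    static bool pointed_to_by_older_version(const ToyTuple *tup,
                                            TransactionId oldest_xmin)
    {
        /* false positives are fine: we only waste a little memory */
        return tup->xmin != InvalidTransactionId && tup->xmin > oldest_xmin;
    }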


Unlike in my first proposal, if something nevertheless goes wrong in 
detecting the chains, we only lose the chaining between the tuples, but 
we don't otherwise lose any data. The latest version of each row is fine 
anyway. I think this approach is pretty robust, and it fails in a good way.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-13 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
In each chain, there must be at least one non-dead tuple with xmin < 
OldestXmin.


Huh?  Typically *all* the tuples but the last are dead, for varying
values of dead.  Please be more specific what you mean.


I meant dead as in HeapTupleSatisfiesVacuum(...) == HEAPTUPLE_DEAD.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-12 Thread Heikki Linnakangas

Heikki Linnakangas wrote:

Tom Lane wrote:

The reason it's not trivial is that you also have to preserve the t_ctid
links of update chains.  If you look into VACUUM FULL, a very large part
of its complexity is that it moves update chains as a unit to make that
possible.  (BTW, I believe the problem Pavan Deolasee reported yesterday
is a bug somewhere in there --- it looks to me like sometimes the same
update chain is getting copied multiple times.)


Ah, that's it. Thanks.

The easiest solution I can think of is to skip newer versions of updated 
rows when scanning the old relation, and to fetch and copy all tuples in 
the update chain to the new relation whenever you encounter the first 
tuple in the chain.


To get a stable view of what's the first tuple in chain, you need to get 
the oldest xmin once at the beginning, and use that throughout the 
operation. Since we take an exclusive lock on the table, no-one can 
insert new updated tuples during the operation, and all updaters are 
finished before the lock is granted.


I've been thinking about this some more, and I think the above would 
work. The tricky part is to recognize the first tuple in a chain, given 
the recent bug in vacuum full.


In each chain, there must be at least one non-dead tuple with xmin < 
OldestXmin. Otherwise we're already missing tuples; there would be no 
version visible to a transaction with xid between OldestXmin and the min 
xmin present in the chain. That tuple is the root of the chain.


There might be a dead tuple in the middle of the chain. Dead tuples have 
xmax < OldestXmin, which means there must be another non-dead tuple in 
the chain with xmax >= OldestXmin. For cluster's purposes, that other 
non-dead tuple is also considered as a root. Dead tuples and chains 
ending in a dead tuple don't need to be stored in the new table.


This picture helped me a lot: 
http://community.enterprisedb.com/updatechain.gif


Arrows represent tuples, beginning at xmin and ending at xmax. 
OldestXmin is represented by the vertical bar, everything to the left is 
smaller than OldestXmin and everything to the right is larger than 
OldestXmin. The chain must begin with a tuple with xmin on the left side 
of OldestXmin, and the last tuple in the chain must end on the right side.
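
In pseudo-C the rule looks roughly like this (ToyChainTuple and the status
enum are made-up stand-ins for illustration, and real xid comparison is
circular, so take it as a sketch only):

    #include <stdint.h>

    typedef enum { TOY_DEAD, TOY_RECENTLY_DEAD, TOY_LIVE } ToyVacStatus;

    typedef struct ToyChainTuple
    {
        uint32_t      xmin;
        ToyVacStatus  status;           /* what HeapTupleSatisfiesVacuum would say */
        struct ToyChainTuple *newer;    /* next version via t_ctid, NULL at the end */
    } ToyChainTuple;

    /*
     * A tuple is a root for CLUSTER's purposes if it is not DEAD and was
     * inserted before OldestXmin; the whole chain is copied starting here.
     */
    static int is_chain_root(const ToyChainTuple *tup, uint32_t oldest_xmin)
    {
        return tup->status != TOY_DEAD && tup->xmin < oldest_xmin;
    }

    /* walk the chain from its root; DEAD members need not go to the new heap */
    static int count_copyable(const ToyChainTuple *root)
    {
        int n = 0;

        for (const ToyChainTuple *tup = root; tup != NULL; tup = tup->newer)
            if (tup->status != TOY_DEAD)
                n++;                    /* a real CLUSTER would copy it here */
        return n;
    }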


Does anyone see a problem with this? If not, I'm going to write a patch.

One potential issue I'm seeing is that if we rely on the unbroken chain 
starting from a tuple with xmin < OldestXmin, and that tuple isn't there 
because of a bug, for example, the later version of the tuple is skipped 
and the row is lost.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-12 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 In each chain, there must be at least one non-dead tuple with xmin < 
 OldestXmin.

Huh?  Typically *all* the tuples but the last are dead, for varying
values of dead.  Please be more specific what you mean.

regards, tom lane



[HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas
Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me 
that it would be trivial to fix, by using SnapshotAny instead of 
SnapshotNow, and not overwriting the xmin/xmax with the xid of the 
cluster command.


I feel that I must be missing something, or someone would've already fixed 
it a long time ago...


Csaba, you mentioned recently 
(http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that 
you're actually using the MVCC-violation to clean up tables during a 
backup. Can you tell us a bit more about that? Would you be upset if we 
shut that backdoor?


In any case, the MVCC-violation needs to be documented. I'll send a doc 
patch to pgsql-patches shortly.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Csaba Nagy
On Fri, 2007-03-09 at 12:29, Heikki Linnakangas wrote:
 Csaba, you mentioned recently 
 (http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that 
 you're actually using the MVCC-violation to clean up tables during a 
 backup. Can you tell us a bit more about that? Would you be upset if we 
 shut that backdoor?

My use case: a queue-like table (in fact a 'task' table) which is very
frequently inserted/updated/deleted. This table tends to be bloated in
the presence of any long running transaction... the only transactional
behavior we need from this table is to make sure that when we insert
something in this table in a transaction (possibly together with other
actions) and then commit/rollback, it commits/rolls back the insert.
CLUSTER's violation of MVCC does not affect this, as CLUSTER will not be
able to lock the table if another transaction inserted something in it
(the inserting transaction will have a lock on the table). Selections on
this table are not critical for us, it just doesn't matter which job
processor is getting which task and in what order... (actually it does
matter, but CLUSTER won't affect that either).

So what I do is execute CLUSTER on this table once every 5 minutes. This
works just fine, and keeps the table size small even if I have long
running transactions in progress. The DB backup is one of such
unavoidable long running transactions, and I use the table exclusion
switch to exclude this task table from the backup so it won't get locked
by it and let CLUSTER still do its job (I had a rudimentary patch to do
this even before the feature was introduced to pg_dump). The table can
be dumped separately, which is a brief operation, but I would have to
clear it anyway on a crash...

Now I could try disabling the CLUSTER cron job and see if I get
problems; the last time it was disabled we were on postgres 7.4, so maybe
something has changed in between... but I can tell for sure that the last
time I enabled it, it really fixed our load on the DB server...

Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
parameter to enable/disable the current behavior, using the MVCC
behavior as the default?

Cheers,
Csaba.





Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Hannu Krosing
On Fri, 2007-03-09 at 11:29, Heikki Linnakangas wrote:
 Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me 
 that it would be trivial to fix, by using SnapshotAny instead of 
 SnapshotNow, and not overwriting the xmin/xmax with the xid of the 
 cluster command.
 
 I feel that I must be missing something, or someone would've already fixed 
 it a long time ago...

Probably it is not MVCC-safe because the relation is swapped out from
under pg_class.

That is, it is possible that older and newer transactions read
different data files, and so simple MVCC does not work.


 Csaba, you mentioned recently 
 (http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that 
 you're actually using the MVCC-violation to clean up tables during a 
 backup. Can you tell us a bit more about that? Would you be upset if we 
 shut that backdoor?
 
 In any case, the MVCC-violation needs to be documented. I'll send a doc 
 patch to pgsql-patches shortly.
 
-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Gregory Stark
Csaba Nagy [EMAIL PROTECTED] writes:

 Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
 parameter to enable/disable the current behavior, using the MVCC
 behavior as the default?

Doing it in CLUSTER would be weird. However perhaps it would be useful to have
some sort of stand-alone tool that just bumped all the xmin/xmax's. It would
have to be super-user-only and carry big warning labels saying it breaks MVCC.

But it would be useful any time you want to exempt a particular table
from serializable snapshots. Basically a per-table way to force a
read-committed snapshot on. Though, actually, it's not quite a
read-committed snapshot, is it? Anyone using an old serializable snapshot
will see what, no tuples at all?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Csaba Nagy
On Fri, 2007-03-09 at 13:42, Gregory Stark wrote:
 Csaba Nagy [EMAIL PROTECTED] writes:
 
  Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
  parameter to enable/disable the current behavior, using the MVCC
  behavior as the default?
 
 Doing it in CLUSTER would be weird. However perhaps it would be useful to have
 some sort of stand-alone tool that just bumped all the xmin/xmax's. It would
 have to be super-user-only and carry big warning labels saying it breaks MVCC.

Well, the current behavior of CLUSTER is just perfect for what I'm using
it for. If anything else would do the job, I would be happy to use it
instead...

 But it would be useful any time you want to exempt a particular table
 from serializable snapshots. Basically a per-table way to force a
 read-committed snapshot on. Though, actually, it's not quite a
 read-committed snapshot, is it? Anyone using an old serializable snapshot
 will see what, no tuples at all?

I'm afraid what I need has nothing to do with serializable snapshots...
I still want the table to be completely transactional, except that if
somebody can get an exclusive lock on it, it can be compacted regardless
of other running transactions. I'm not sure how to express this in
another way... it means something like: no transaction cares about the
content of the table until it gets some kind of lock on it. In other
words, the table's state is not connected with the state of other tables
until I actually do something on it...

Cheers,
Csaba.





Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas

Csaba Nagy wrote:

On Fri, 2007-03-09 at 12:29, Heikki Linnakangas wrote:
Csaba, you mentioned recently 
(http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that 
you're actually using the MVCC-violation to clean up tables during a 
backup. Can you tell us a bit more about that? Would you be upset if we 
shut that backdoor?


My use case: a queue-like table (in fact a 'task' table) which is very
frequently inserted/updated/deleted. This table tends to be bloated in
the presence of any long running transaction... the only transactional
behavior we need from this table is to make sure that when we insert
something in this table in a transaction (possibly together with other
actions) and then commit/rollback, it commits/rolls back the insert.
CLUSTER's violation of MVCC does not affect this, as CLUSTER will not be
able to lock the table if another transaction inserted something in it
(the inserting transaction will have a lock on the table). Selections on
this table are not critical for us, it just doesn't matter which job
processor is getting which task and in what order... (actually it does
matter, but CLUSTER won't affect that either).


Hmm. You could use something along these lines instead:

0. LOCK TABLE queue_table
1. SELECT * INTO queue_table_new FROM queue_table
2. DROP TABLE queue_table
3. ALTER TABLE queue_table_new RENAME TO queue_table

After all, it's not that you care about the clustering of the table, you 
just want to remove old tuples.


As a long term solution, it would be nice if we had more fine-grained 
bookkeeping of snapshots that are in use in the system. In your case, 
there's a lot of tuples that are not visible to pg_dump because xmin is 
too new, and also not visible to any other transaction because xmax is 
too old. If we had a way to recognize situations like that, and vacuum 
those tuples, much of the problem with long-running transactions would 
go away.



Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
parameter to enable/disable the current behavior, using the MVCC
behavior as the default?


I guess we could, but I don't see why we should encourage using CLUSTER for 
that. A more aggressive, MVCC-breaking version of VACUUM would make more 
sense to me, but I don't like the idea of adding break-MVCC flags to 
any commands.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Alvaro Herrera
Gregory Stark wrote:
 Csaba Nagy [EMAIL PROTECTED] writes:
 
  Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
  parameter to enable/disable the current behavior, using the MVCC
  behavior as the default?
 
 Doing it in CLUSTER would be weird. However perhaps it would be useful to have
 some sort of stand-alone tool that just bumped all the xmin/xmax's. It would
 have to be super-user-only and carry big warning labels saying it breaks MVCC.
 
 But it would be useful any time you want to exempt a particular table
 from serializable snapshots. Basically a per-table way to force a
 read-committed snapshot on. Though, actually, it's not quite a
 read-committed snapshot, is it? Anyone using an old serializable snapshot
 will see what, no tuples at all?

Unless you used FrozenTransactionId ...

But I'm not really seeing the problem here.  Why isn't Csaba's problem
fixed by the fact that HOT reduces the number of dead tuples in the
first place?  If it does, then he no longer needs the CLUSTER
workaround, or at least, he needs it to a much lesser extent.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Zeugswetter Andreas ADI SD

  Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to 
  me that it would be trivial to fix, by using SnapshotAny instead of 
  SnapshotNow, and not overwriting the xmin/xmax with the xid of the 
  cluster command.
 
 It's trivial to fix now in this way, but it would break HOT, 
 since an indexscan only returns one row per index entry.

Well, with SnapshotAny HOT should probably return all possibly visible
tuples with an indexscan. (Btw, does CLUSTER really do an index scan?
Seems for reading a whole table a seq scan and sort is usually cheaper,
at least when the clustering is so bad that a CLUSTER is needed.)

Andreas



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Csaba Nagy
On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote:
 But I'm not really seeing the problem here.  Why isn't Csaba's problem
 fixed by the fact that HOT reduces the number of dead tuples in the
 first place?  If it does, then he no longer needs the CLUSTER
 workaround, or at least, he needs it to a much lesser extent.

Is this actually true in the case of HOT + long running transactions? I
was supposing HOT has the same problems in the presence of long running
transactions...

Cheers,
Csaba.





Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Csaba Nagy
 Hmm. You could use something along these lines instead:
 
 0. LOCK TABLE queue_table
 1. SELECT * INTO queue_table_new FROM queue_table
 2. DROP TABLE queue_table
 3. ALTER TABLE queue_table_new RENAME TO queue_table
 
 After all, it's not that you care about the clustering of the table, you 
 just want to remove old tuples.

... and then restart the app so all my pooled connections drop their
cached plans ;-)

Seriously, that won't work. If a session tries to insert a new row after
I lock the table to clean it up, I still want it to be able to insert
after the cleanup is finished... if I drop the table it tries to insert
to, it will fail.

 As a long term solution, it would be nice if we had more fine-grained 
 bookkeeping of snapshots that are in use in the system. In your case, 
 there's a lot of tuples that are not visible to pg_dump because xmin is 
 too new, and also not visible to any other transaction because xmax is 
 too old. If we had a way to recognize situations like that, and vacuum 
 those tuples, much of the problem with long-running transactions would 
 go away.

In the general case that won't work either in a strict MVCC sense... if
you have an old transaction, you should never clean up a dead tuple
which could be still visible to it.

  Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
  parameter to enable/disable the current behavior, using the MVCC
  behavior as the default?
 
 I guess we could, but I don't see why we should encourage using CLUSTER for 
 that. A more aggressive, MVCC-breaking version of VACUUM would make more 
 sense to me, but I don't like the idea of adding break-MVCC flags to 
 any commands.

Well, if there were any other way to avoid the table bloat, I would
agree.

Cheers,
Csaba.





Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas

Csaba Nagy wrote:

On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote:

But I'm not really seeing the problem here.  Why isn't Csaba's problem
fixed by the fact that HOT reduces the number of dead tuples in the
first place?  If it does, then he no longer needs the CLUSTER
workaround, or at least, he needs it to a much lesser extent.


Is this actually true in the case of HOT + long running transactions ? I
was supposing HOT has the same problems in the presence of long running
transactions...


It does, HOT won't help you here. A long-running transaction is just as 
much of a problem with HOT as without. Besides, I don't recall that 
you're doing updates in the first place.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Florian G. Pflug

Heikki Linnakangas wrote:

Csaba Nagy wrote:

On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote:

But I'm not really seeing the problem here.  Why isn't Csaba's problem
fixed by the fact that HOT reduces the number of dead tuples in the
first place?  If it does, then he no longer needs the CLUSTER
workaround, or at least, he needs it to a much lesser extent.


Is this actually true in the case of HOT + long running transactions ? I
was supposing HOT has the same problems in the presence of long running
transactions...


It does, HOT won't help you here. A long-running transaction is just as 
much of a problem with HOT as without. Besides, I don't recall that 
you're doing updates in the first place.


Couldn't HOT in principle deal with this? Let's say you have two long-running
transactions, which see row versions A and D. While those transactions
are running, the row is constantly updated, leading to row versions B, C (before
the second long-running transaction started), D, E, F, ... Z.
Now, the versions B,C,E,F,...Z could be removed by HOT or vacuum, because they
are not currently visible, nor will they ever become visible because they are
already deleted.

greetings, Florian Pflug



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas

Csaba Nagy wrote:

Hmm. You could use something along these lines instead:

0. LOCK TABLE queue_table
1. SELECT * INTO queue_table_new FROM queue_table
2. DROP TABLE queue_table
3. ALTER TABLE queue_table_new RENAME TO queue_table

After all, it's not that you care about the clustering of the table, you 
just want to remove old tuples.


... and then restart the app so all my pooled connections drop their
cached plans ;-)


Yeah, though Tom's working on plan invalidation for 8.3, so that 
wouldn't be an issue.



Seriously, that won't work. If a session tries to insert a new row after
I lock the table to clean it up, I still want it to be able to insert
after the cleanup is finished... if I drop the table it tries to insert
to, it will fail.


Hmm. How about:

1. LOCK TABLE queue_table
2. SELECT * INTO temp_table FROM queue_table
3. TRUNCATE queue_table
4. INSERT INTO queue_table SELECT * FROM temp_table

That way you're copying the rows twice, but if there aren't many live 
tuples it shouldn't matter too much.


As a long term solution, it would be nice if we had more fine-grained 
bookkeeping of snapshots that are in use in the system. In your case, 
there's a lot of tuples that are not visible to pg_dump because xmin is 
too new, and also not visible to any other transaction because xmax is 
too old. If we had a way to recognize situations like that, and vacuum 
those tuples, much of the problem with long-running transactions would 
go away.


In the general case that won't work either in a strict MVCC sense... if
you have an old transaction, you should never clean up a dead tuple
which could be still visible to it.


We wouldn't clean up tuples that are visible to a transaction, but if 
you have one long-running transaction like pg_dump in a database with 
otherwise short transactions, you'll have a lot of tuples that are not 
vacuumable because of the long-running process, but are not in fact 
visible to any transaction. Those are tuples that were inserted too 
late to be seen by the old transaction, and deleted too long ago to 
be seen by any other transaction. Let me illustrate this with a timeline:


 xmin1    xmax1
 |        |
-+--X-X+X-+ooXoXoXoXXo+--now
       |              |
       xmin2          xmax2

xmin1 and xmax1 are the xmin and xmax of an old, long-running 
serializable transaction, like pg_dump. The Xs between them are xids of 
transactions that the old transaction sees as in-progress, IOW the 
SnapshotData.xip-array.


xmin2 and xmax2 are the xmin and xmax of a newer transaction. Because of 
the old, long-running transaction, xmin2 is far behind xmax2, but there's a 
wide gap between that and the next transaction that the newer 
transaction sees as in-progress.


The current rule to determine if a tuple is dead or not is to check that 
the tuple's xmax < OldestXmin. OldestXmin is in this case xmin1. But in 
addition to that, any tuple with an xmin > xmax1 and an xmax that's not in 
the xip-array of any snapshot in use (marked with o above) isn't 
visible to any current or future transaction and can therefore be safely 
vacuumed.


The implementation problem is that we don't have a global view of all 
snapshots in the system. If we solve that, we can be more aggressive 
with vacuuming in presence of long-running transactions. It's not an 
easy problem, we don't want to add a lot of accounting overhead, but 
maybe we could have some kind of an approximation of the global state 
with little overhead, that would give most of the benefit.
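
Just to illustrate the rule, here is a toy version of the check (made-up
types; it assumes both xids are already known to be committed, ignores
subtransactions, and pretends xid comparison isn't circular):

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t TransactionId;

    typedef struct
    {
        TransactionId xmin;             /* everything before this is finished */
        TransactionId xmax;             /* everything from here on is future */
        const TransactionId *xip;       /* xids seen as in progress */
        int           xcnt;
    } ToySnapshot;

    /* true if this snapshot does NOT yet see the effects of xid */
    static bool xid_in_snapshot(TransactionId xid, const ToySnapshot *snap)
    {
        if (xid >= snap->xmax)
            return true;
        if (xid < snap->xmin)
            return false;
        for (int i = 0; i < snap->xcnt; i++)
            if (snap->xip[i] == xid)
                return true;
        return false;
    }

    /*
     * A tuple whose inserting and deleting xids are both committed can be
     * vacuumed if no snapshot in the system can still see that version,
     * i.e. no snapshot sees the insert as done but the delete as not done.
     */
    static bool dead_to_everyone(TransactionId tup_xmin, TransactionId tup_xmax,
                                 const ToySnapshot *snaps, int nsnaps)
    {
        for (int i = 0; i < nsnaps; i++)
            if (!xid_in_snapshot(tup_xmin, &snaps[i]) &&
                xid_in_snapshot(tup_xmax, &snaps[i]))
                return false;           /* snapshot i can still see this version */
        return true;
    }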


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Alvaro Herrera
Zeugswetter Andreas ADI SD wrote:
 
   Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to 
   me that it would be trivial to fix, by using SnapshotAny instead of 
   SnapshotNow, and not overwriting the xmin/xmax with the xid of the 
   cluster command.
  
  It's trivial to fix now in this way, but it would break HOT, 
  since an indexscan only returns one row per index entry.
 
 Well, with SnapshotAny HOT should probably return all possibly visible
 tuples with an indexscan. (Btw, does CLUSTER really do an index scan?
 Seems for reading a whole table a seq scan and sort is usually cheaper,
 at least when the clustering is so bad that a CLUSTER is needed.)

Yes, it does an indexscan (last time I checked, at least).  I think if a
performance improvement is demonstrated, we would accept a patch ...

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas

Florian G. Pflug wrote:
Couldn't HOT in principle deal with this? Let's say you have two 
long-running transactions, which see row versions A and D. While those 
transactions are running, the row is constantly updated, leading to row 
versions B, C (before the second long-running transaction started), D, E, 
F, ... Z. Now, the versions B, C, E, F, ... Z could be removed by HOT or 
vacuum, because they are not currently visible, nor will they ever become 
visible because they are already deleted.


Yes, you could detect that but you'd need a global view of all snapshots 
in the system. I just posted a reply in this thread with more details..


It's not just with HOT, it's the way we determine that a tuple is 
vacuumable in general.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Mike Rylander

On 3/9/07, Florian G. Pflug [EMAIL PROTECTED] wrote:

Heikki Linnakangas wrote:
 Csaba Nagy wrote:
 On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote:
 But I'm not really seeing the problem here.  Why isn't Csaba's problem
 fixed by the fact that HOT reduces the number of dead tuples in the
 first place?  If it does, then he no longer needs the CLUSTER
 workaround, or at least, he needs it to a much lesser extent.

 Is this actually true in the case of HOT + long running transactions ? I
 was supposing HOT has the same problems in the presence of long running
 transactions...

 It does, HOT won't help you here. A long-running transaction is just as
 much of a problem with HOT as without. Besides, I don't recall that
 you're doing updates in the first place.

Couldn't HOT in principle deal with this? Let's say you have two long-running
transactions, which see row versions A and D. While those transactions
are running, the row is constantly updated, leading to row versions B, C (before
the second long-running transaction started), D, E, F, ... Z.
Now, the versions B,C,E,F,...Z could be removed by HOT or vacuum, because they
are not currently visible, nor will they ever become visible because they are
already deleted.


Couldn't they (or at least one of them) become visible due to
SAVEPOINT rollback?



greetings, Florian Pflug





--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas

Mike Rylander wrote:

On 3/9/07, Florian G. Pflug [EMAIL PROTECTED] wrote:
Couldn't HOT in principle deal with this? Let's say you have two 
long-running transactions, which see row versions A and D. While those 
transactions are running, the row is constantly updated, leading to row 
versions B, C (before the second long-running transaction started), D, E, 
F, ... Z. Now, the versions B, C, E, F, ... Z could be removed by HOT or 
vacuum, because they are not currently visible, nor will they ever become 
visible because they are already deleted.


Couldn't they (or at least one of them) become visible due to
SAVEPOINT rollback?


You wouldn't remove tuples with an uncommitted xmax, of course.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me 
 that it would be trivial to fix, by using SnapshotAny instead of 
 SnapshotNow, and not overwriting the xmin/xmax with the xid of the 
 cluster command.

The reason it's not trivial is that you also have to preserve the t_ctid
links of update chains.  If you look into VACUUM FULL, a very large part
of its complexity is that it moves update chains as a unit to make that
possible.  (BTW, I believe the problem Pavan Deolasee reported yesterday
is a bug somewhere in there --- it looks to me like sometimes the same
update chain is getting copied multiple times.)

regards, tom lane



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Simon Riggs
On Fri, 2007-03-09 at 12:48 -0300, Alvaro Herrera wrote:
 Zeugswetter Andreas ADI SD wrote:
  
Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to 
me that it would be trivial to fix, by using SnapshotAny instead of 
SnapshotNow, and not overwriting the xmin/xmax with the xid of the 
cluster command.
   
   It's trivial to fix now in this way, but it would break HOT, 
   since an indexscan only returns one row per index entry.
  
  Well, with SnapshotAny HOT should probably return all possibly visible
  tuples with an indexscan. (Btw, does CLUSTER really do an index scan?
  Seems for reading a whole table a seq scan and sort is usually cheaper,
  at least when the clustering is so bad that a CLUSTER is needed.)
 
 Yes, it does an indexscan (last time I checked, at least).  I think if a
 performance improvement is demonstrated, we would accept a patch ...

Again, right now, most things people do here will break HOT. At this
late stage before freeze, please everybody be careful to look and plan
for patch conflicts. (That isn't "stay away", just "be careful".) Thanks.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me 
that it would be trivial to fix, by using SnapshotAny instead of 
SnapshotNow, and not overwriting the xmin/xmax with the xid of the 
cluster command.


The reason it's not trivial is that you also have to preserve the t_ctid
links of update chains.  If you look into VACUUM FULL, a very large part
of its complexity is that it moves update chains as a unit to make that
possible.  (BTW, I believe the problem Pavan Deolasee reported yesterday
is a bug somewhere in there --- it looks to me like sometimes the same
update chain is getting copied multiple times.)


Ah, that's it. Thanks.

The easiest solution I can think of is to skip newer versions of updated 
rows when scanning the old relation, and to fetch and copy all tuples in 
the update chain to the new relation whenever you encounter the first 
tuple in the chain.


To get a stable view of what's the first tuple in chain, you need to get 
the oldest xmin once at the beginning, and use that throughout the 
operation. Since we take an exclusive lock on the table, no-one can 
insert new updated tuples during the operation, and all updaters are 
finished before the lock is granted.


Those tuples wouldn't be in the cluster order, though, but that's not a 
big deal.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 We wouldn't clean up tuples that are visible to a transaction, but if 
 you have one long-running transaction like pg_dump in a database with 
 otherwise short transactions, you'll have a lot of tuples that are not 
 vacuumable because of the long-running process, but are not in fact 
 visible to any transaction.

It sounds to me like you are proposing to remove the middles of update
chains, which would break READ-COMMITTED updates initiated by the older
transactions.  Now admittedly pg_dump isn't going to issue any such
updates, but VACUUM doesn't know that.

regards, tom lane



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
We wouldn't clean up tuples that are visible to a transaction, but if 
you have one long-running transaction like pg_dump in a database with 
otherwise short transactions, you'll have a lot of tuples that are not 
vacuumable because of the long-running process, but are not in fact 
visible to any transaction.


It sounds to me like you are proposing to remove the middles of update
chains, which would break READ-COMMITTED updates initiated by the older
transactions.  Now admittedly pg_dump isn't going to issue any such
updates, but VACUUM doesn't know that.


I was thinking of inserts+deletes. Updates are harder, you'd need to 
change the ctid of the old version to skip the middle part of the chain, 
atomically, but I suppose they could be handled as well.


Isolation level doesn't really matter. We just need a global view of 
in-use *snapshots* in the system, serializable or not. Not that that's 
an easy thing to do...


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Florian G. Pflug

Heikki Linnakangas wrote:
The implementation problem is that we don't have a global view of all 
snapshots in the system. If we solve that, we can be more aggressive 
with vacuuming in presence of long-running transactions. It's not an 
easy problem, we don't want to add a lot of accounting overhead, but 
maybe we could have some kind of an approximation of the global state 
with little overhead, that would give most of the benefit.


Hm.. Maybe there could be a fixed-size list of xids together with
a usecount in shared memory.
If a transaction puts an xid into its snapshot, it increments the
usecount of that xid in the global list (inserting it if it's not
already in the list). If there is no free space in the list, it
first removes all xids with xid < oldestxmin. If there is still no
free space, it does nothing.
When the transaction is done with the snapshot, it decrements all
the usecounts of xids it incremented before.

You then know that an xid is *not* viewed as in-progress by any
transaction if the xid is in that list and has a usecount of zero.
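
A rough sketch in C of what I mean (names are made up, no locking is shown,
it prunes as it scans rather than only when the list is full, and in reality
the array would of course live in shared memory):

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t TransactionId;

    /* one slot of the hypothetical shared-memory list */
    typedef struct { TransactionId xid; int usecount; bool used; } XidUseEntry;

    #define XIDLIST_SIZE 1024
    static XidUseEntry xidlist[XIDLIST_SIZE];

    /* a backend is putting xid into a snapshot it is taking */
    static void xidlist_pin(TransactionId xid, TransactionId oldest_xmin)
    {
        int free_slot = -1;

        for (int i = 0; i < XIDLIST_SIZE; i++)
        {
            if (xidlist[i].used && xidlist[i].xid == xid)
            {
                xidlist[i].usecount++;
                return;
            }
            /* reclaim entries no snapshot can care about any more */
            if (xidlist[i].used && xidlist[i].usecount == 0 &&
                xidlist[i].xid < oldest_xmin)
                xidlist[i].used = false;
            if (!xidlist[i].used && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return;                     /* list is full: simply give up */
        xidlist[free_slot].xid = xid;
        xidlist[free_slot].usecount = 1;
        xidlist[free_slot].used = true;
    }

    /* the snapshot is being released */
    static void xidlist_unpin(TransactionId xid)
    {
        for (int i = 0; i < XIDLIST_SIZE; i++)
            if (xidlist[i].used && xidlist[i].xid == xid)
            {
                xidlist[i].usecount--;
                return;
            }
    }

    /* true only if we positively know no snapshot treats xid as in progress */
    static bool xid_not_in_progress_anywhere(TransactionId xid)
    {
        for (int i = 0; i < XIDLIST_SIZE; i++)
            if (xidlist[i].used && xidlist[i].xid == xid)
                return xidlist[i].usecount == 0;
        return false;                   /* not listed, so we cannot tell */
    }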

greetings, Florian Pflug



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Florian G. Pflug

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
We wouldn't clean up tuples that are visible to a transaction, but if 
you have one long-running transaction like pg_dump in a database with 
 otherwise short transactions, you'll have a lot of tuples that are not 
vacuumable because of the long-running process, but are not in fact 
visible to any transaction.


It sounds to me like you are proposing to remove the middles of update
chains, which would break READ-COMMITTED updates initiated by the older
transactions.  Now admittedly pg_dump isn't going to issue any such
updates, but VACUUM doesn't know that.


You could restrict this to serializable transactions, or even to
read-only transactions. Or maybe the tuple could be reduced to
just its header - doesn't HOT do something similar?

greetings, Florian Pflug

