Re: [HACKERS] Piggybacking vacuum I/O

2007-01-26 Thread Heikki Linnakangas
I'd like to see still more evidence that it's a problem before we start 
changing that piece of code. It has served us well for years.


Bruce Momjian wrote:

Is there a TODO here?

---

Heikki Linnakangas wrote:

Pavan Deolasee wrote:

Another simpler solution for VACUUM would be to read the entire CLOG file
in local memory. Most of the transaction status queries can be satisfied
from this local copy and the normal CLOG is consulted only when the status
is unknown (TRANSACTION_STATUS_IN_PROGRESS).

The clog is only for finished (committed/aborted/crashed) transactions.
If a transaction is in progress, the clog is never consulted. Anyway,
that'd only be reasonable for vacuums, and I'm actually more worried if
we had normal backends thrashing the clog buffers.



--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-26 Thread Alvaro Herrera
Heikki Linnakangas wrote:
 I'd like to see still more evidence that it's a problem before we start 
 changing that piece of code. It has served us well for years.

So the TODO could be "investigate whether caching pg_clog and/or
pg_subtrans in local memory can be useful for vacuum performance".

 Bruce Momjian wrote:
 Is there a TODO here?
 
 ---
 
 Heikki Linnakangas wrote:
 Pavan Deolasee wrote:
 Another simpler solution for VACUUM would be to read the entire CLOG file
 in local memory. Most of the transaction status queries can be satisfied
 from this local copy and the normal CLOG is consulted only when the status
 is unknown (TRANSACTION_STATUS_IN_PROGRESS).
 The clog is only for finished (committed/aborted/crashed) transactions.
 If a transaction is in progress, the clog is never consulted. Anyway,
 that'd only be reasonable for vacuums, and I'm actually more worried if
 we had normal backends thrashing the clog buffers.


-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-26 Thread Pavan Deolasee

On 1/26/07, Alvaro Herrera [EMAIL PROTECTED] wrote:


Heikki Linnakangas wrote:
 I'd like to see still more evidence that it's a problem before we start
 changing that piece of code. It has served us well for years.

So the TODO could be "investigate whether caching pg_clog and/or
pg_subtrans in local memory can be useful for vacuum performance".



As Heikki suggested, we should investigate the same for normal
backends as well.

It would also be interesting to investigate whether early setting of hint
bits can reduce subsequent writes of blocks. A typical case would be a large
table being updated heavily for a while, followed by SELECT queries. The
SELECT queries would set hint bits for the previously UPDATEd tuples (old and
new versions) and thus cause subsequent writes of those blocks for what could
have been read-only queries.

Thanks,
Pavan

EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Piggybacking vacuum I/O

2007-01-26 Thread Alvaro Herrera
Pavan Deolasee wrote:
 On 1/26/07, Alvaro Herrera [EMAIL PROTECTED] wrote:
 
 Heikki Linnakangas wrote:
  I'd like to see still more evidence that it's a problem before we start
  changing that piece of code. It has served us well for years.
 
 So the TODO could be "investigate whether caching pg_clog and/or
 pg_subtrans in local memory can be useful for vacuum performance".
 
 As Heikki suggested, we should investigate the same for normal
 backends as well.

Maybe.  An idea that comes to mind is to never cache the latest page,
since it'll most likely result in extra reads anyway because there'll be
a lot of IN_PROGRESS transactions.

Problem to solve: how much memory to dedicate to this?  Could we mmap()
portions of the pg_clog segment, so that the page could be shared across
backends instead of allocating them for each?

 It would also be interesting to investigate whether early setting of
 hint bits can reduce subsequent writes of blocks. A typical case would
 be a large table being updated heavily for a while, followed by SELECT
 queries. The SELECT queries would set hint bits for the previously
 UPDATEd  tuples (old and new versions) and thus cause subsequent
 writes of those blocks for what could have been read-only queries.

This has been suggested before, but I don't see how this could work.
How does the UPDATE transaction go back to the pages it wrote to update
the hint bits, _after_ it committed?

Maybe have the bgwriter update hint bits as it evicts pages out of the
cache?  It could result in pg_clog read traffic for each page that needs
eviction; not such a hot idea.

I don't see how this is related to the above proposal though.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-26 Thread Pavan Deolasee

On 1/26/07, Alvaro Herrera [EMAIL PROTECTED] wrote:



Maybe have the bgwriter update hint bits as it evicts pages out of the
cache?  It could result in pg_clog read traffic for each page that needs
eviction; not such a hot idea.



I thought that once we enhance the clog so that there are no clog reads, the
bgwriter would be able to update hint bits without getting into any deadlock
on pg_clog reads.

Maybe we can have this as a separate TODO.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Piggybacking vacuum I/O

2007-01-26 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Pavan Deolasee wrote:
 It would also be interesting to investigate whether early setting of
 hint bits can reduce subsequent writes of blocks. A typical case would
 be a large table being updated heavily for a while, followed by SELECT
 queries. The SELECT queries would set hint bits for the previously
 UPDATEd  tuples (old and new versions) and thus cause subsequent
 writes of those blocks for what could have been read-only queries.

 This has been suggested before, but I don't see how this could work.
 How does the UPDATE transaction go back to the pages it wrote to update
 the hint bits, _after_ it committed?

I think what he's suggesting is deliberately not updating the hint bits
during a SELECT ... but that's surely misguided.  If you don't set the
hint bit after discovering the transaction commit state, then the next
visitor of the tuple will have to repeat the clog lookup, meaning that
any such policy greatly increases clog read traffic and contention.
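A minimal sketch of the pattern being described -- check the hint bit first,
fall back to a clog lookup, and cache the verdict in the tuple header. This
is illustrative only, not the actual tqual.c code; real code must also mark
the buffer dirty (SetBufferCommitInfoNeedsSave()) after setting a hint bit.

/* Illustrative only; the macros and functions referenced are PostgreSQL's. */
#include "postgres.h"
#include "access/htup.h"
#include "access/transam.h"

static bool
xmin_committed(HeapTupleHeader tup)
{
    if (tup->t_infomask & HEAP_XMIN_COMMITTED)
        return true;                    /* hint bit set: no clog lookup */

    if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tup)))
    {
        /* cache the verdict so the next visitor skips pg_clog */
        tup->t_infomask |= HEAP_XMIN_COMMITTED;
        return true;
    }
    return false;
}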

regards, tom lane



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-26 Thread Pavan Deolasee

On 1/26/07, Tom Lane [EMAIL PROTECTED] wrote:



I think what he's suggesting is deliberately not updating the hint bits
during a SELECT ...



No, I was suggesting doing it in the bgwriter so that we may not need to do
that during a SELECT. Of course, we need to investigate more and have numbers
to prove the need. Also, you have already expressed concerns that doing so in
the bgwriter is deadlock prone. So there is certainly more work needed for
any such scheme to work.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Piggybacking vacuum I/O

2007-01-26 Thread Bruce Momjian
Tom Lane wrote:
 Heikki Linnakangas [EMAIL PROTECTED] writes:
  I'd like to see still more evidence that it's a problem before we start 
  changing that piece of code. It has served us well for years.
 
 What I see here is mostly evidence suggesting that we should consider
 raising NUM_CLOG_BUFFERS, rather than anything more invasive.

Added to TODO:

* Consider increasing NUM_CLOG_BUFFERS
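(For context: NUM_CLOG_BUFFERS is a compile-time constant, so "increasing" it
is a one-line change. The snippet below is from memory of the 8.2-era header
and should be verified against the actual source before changing anything.)

/* src/include/access/clog.h -- shown from memory, verify before changing */
#define NUM_CLOG_BUFFERS    8   /* default: 8 pages = 262144 transactions */

/* e.g. raising it to 32 would quadruple the shared clog cache */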

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-26 Thread Bruce Momjian
Pavan Deolasee wrote:
 On 1/26/07, Tom Lane [EMAIL PROTECTED] wrote:
 
 
  I think what he's suggesting is deliberately not updating the hint bits
  during a SELECT ...
 
 
 No, I was suggesting doing it in the bgwriter so that we may not need to do
 that during a SELECT. Of course, we need to investigate more and have numbers
 to prove the need. Also, you have already expressed concerns that doing so in
 the bgwriter is deadlock prone. So there is certainly more work needed for
 any such scheme to work.

Added to TODO:

* Consider having the background writer update the transaction status
  hint bits before writing out the page

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-25 Thread Pavan Deolasee

On 1/24/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:


Pavan Deolasee wrote:
 I have just counted the number of read/write calls on the CLOG blocks. As
 you can see, the total number of CLOG reads jumped from 545323 to 1181851,
 i.e. 1181851 - 545323 = 636528 CLOG block reads for 1554697 pages of the
 stock table.

Hmm. So there is some activity there. Could you modify the patch to
count how many of those reads came from OS cache? I'm thinking of doing
a gettimeofday() call before and after read, and counting how many
calls finished in less than, say, 1 ms. Also, summing up the total time
spent in reads would be interesting.



Here are some more numbers. I ran two tests of 4 hours each with the CLOG
cache size set to 8 blocks (default) and 16 blocks. I counted the number of
read() calls, and specifically those read() calls which took more than 0.5 ms
to complete. As you guessed, almost 99% of the reads complete in less than
0.5 ms, but the total read() time is still more than 1% of the duration of
the test. Is it worth optimizing?

CLOG (16 blocks)
reads(743317), writes(84), reads > 0.5 ms (5171), time reads (186s), time
reads > 0.5 ms (175s)

CLOG (8 blocks)
reads(1155917), writes(119), reads > 0.5 ms (4040), time reads (146s), time
reads > 0.5 ms (130s)

(amused to see increase in the total read time with 16 blocks)

Also, is it worth optimizing the total read() system calls, which might not
cause physical I/O but still consume CPU?

Thanks,
Pavan

EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Piggybacking vacuum I/O

2007-01-25 Thread Heikki Linnakangas

Pavan Deolasee wrote:

On 1/24/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:

Hmm. So there is some activity there. Could you modify the patch to
count how many of those reads came from OS cache? I'm thinking of doing
a gettimeofday() call before and after read, and counting how many
calls finished in less than, say, 1 ms. Also, summing up the total time
spent in reads would be interesting.


Here are some more numbers. I ran two tests of 4 hours each with the CLOG
cache size set to 8 blocks (default) and 16 blocks. I counted the number of
read() calls, and specifically those read() calls which took more than 0.5 ms
to complete. As you guessed, almost 99% of the reads complete in less than
0.5 ms, but the total read() time is still more than 1% of the duration of
the test. Is it worth optimizing?


Probably not. I wouldn't trust that 1% of test duration figure too much, 
gettimeofday() has some overhead of its own...



CLOG (16 blocks)
reads(743317), writes(84), reads > 0.5 ms (5171), time reads (186s), time
reads > 0.5 ms (175s)

CLOG (8 blocks)
reads(1155917), writes(119), reads > 0.5 ms (4040), time reads (146s), time
reads > 0.5 ms (130s)

(amused to see increase in the total read time with 16 blocks)


Hmm. That's surprising.

Also, is it worth optimizing the total read() system calls, which might not
cause physical I/O but still consume CPU?


I don't think it's worth it, but now that we're talking about it: What 
I'd like to do to all the slru files is to replace the custom buffer 
management with mmapping the whole file, and letting the OS take care of 
it. We would get rid of some guc variables, the OS would tune the amount 
of memory used for clog/subtrans dynamically, and we would avoid the 
memory copying. And I'd like to do the same for WAL.
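As a rough illustration, mapping one pg_clog segment could look like the
hypothetical sketch below (32 pages per segment matches
SLRU_PAGES_PER_SEGMENT; everything else, including the function name, is an
assumption, and error handling is minimal):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define CLOG_SEGMENT_BYTES  (32 * 8192)     /* 32 pages of 8 kB per segment */

static char *
map_clog_segment(const char *path)
{
    int   fd = open(path, O_RDWR);
    char *base;

    if (fd < 0)
        return NULL;
    base = mmap(NULL, CLOG_SEGMENT_BYTES, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close() */
    return (base == MAP_FAILED) ? NULL : base;
}

A transaction's two status bits could then be read straight out of the
mapping; the segment-extension problem Pavan raises below would still need
remapping or a pre-sized file.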


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-25 Thread Pavan Deolasee

On 1/25/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:


Pavan Deolasee wrote:

 Also, is it worth optimizing the total read() system calls, which might not
 cause physical I/O but still consume CPU?

I don't think it's worth it, but now that we're talking about it: What
I'd like to do to all the slru files is to replace the custom buffer
management with mmapping the whole file, and letting the OS take care of
it. We would get rid of some guc variables, the OS would tune the amount
of memory used for clog/subtrans dynamically, and we would avoid the
memory copying. And I'd like to do the same for WAL.



Yes, we can do that. One problem, though, is that mmapping wouldn't work when
the CLOG file is extended, since some of the backends may not see the extended
portion. But maybe we can start with a sufficiently large initialized file and
mmap the whole file.

Another simpler solution for VACUUM would be to read the entire CLOG file
in local memory. Most of the transaction status queries can be satisfied
from this local copy and the normal CLOG is consulted only when the status
is unknown (TRANSACTION_STATUS_IN_PROGRESS).

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Piggybacking vacuum I/O

2007-01-25 Thread Heikki Linnakangas

Pavan Deolasee wrote:

Another simpler solution for VACUUM would be to read the entire CLOG file
in local memory. Most of the transaction status queries can be satisfied
from this local copy and the normal CLOG is consulted only when the status
is unknown (TRANSACTION_STATUS_IN_PROGRESS).


The clog is only for finished (committed/aborted/crashed) transactions.
If a transaction is in progress, the clog is never consulted. Anyway,
that'd only be reasonable for vacuums, and I'm actually more worried if
we had normal backends thrashing the clog buffers.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com




Re: [HACKERS] Piggybacking vacuum I/O

2007-01-25 Thread Bruce Momjian

Is there a TODO here?

---

Heikki Linnakangas wrote:
 Pavan Deolasee wrote:
  Another simpler solution for VACUUM would be to read the entire CLOG file
  in local memory. Most of the transaction status queries can be satisfied
  from this local copy and the normal CLOG is consulted only when the status
  is unknown (TRANSACTION_STATUS_IN_PROGRESS).
 
 The clog is only for finished (committed/aborted/crashed) transactions.
 If a transaction is in progress, the clog is never consulted. Anyway,
 that'd only be reasonable for vacuums, and I'm actually more worried if
 we had normal backends thrashing the clog buffers.
 
 -- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
 
 

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-24 Thread Heikki Linnakangas

Pavan Deolasee wrote:

I have just counted the number of read/write calls on the CLOG blocks. As
you can see, the total number of CLOG reads jumped from 545323 to 1181851,
i.e. 1181851 - 545323 = 636528 CLOG block reads for 1554697 pages of the
stock table.


Hmm. So there is some activity there. Could you modify the patch to 
count how many of those reads came from OS cache? I'm thinking of doing 
a gettimeofday() call before and after read, and counting how many
calls finished in less than, say, 1 ms. Also, summing up the total time
spent in reads would be interesting.
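A made-up sketch of that instrumentation (counter names and the exact
threshold are just for illustration; this is not the patch Pavan used):

#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

static long   clog_reads = 0;
static long   clog_slow_reads = 0;      /* reads slower than the threshold */
static double clog_read_secs = 0.0;

static ssize_t
timed_read(int fd, void *buf, size_t len)
{
    struct timeval t0, t1;
    ssize_t        n;
    double         elapsed;

    gettimeofday(&t0, NULL);
    n = read(fd, buf, len);
    gettimeofday(&t1, NULL);

    elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    clog_reads++;
    clog_read_secs += elapsed;
    if (elapsed > 0.0005)               /* 0.5 ms: likely physical I/O */
        clog_slow_reads++;
    return n;
}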


Or, would it be possible to put the clog on a different drive, and use
iostat to get the numbers?



This figure is only indicative since it also includes the CLOG block reads
which would have happened as part of other backend operations (VACUUM took
almost 55 minutes to complete). Still, in the first 210 minutes of the run,
the total reads were only 545323. So most of the 636528 reads in the next 55
minutes can be attributed to VACUUM.


Actually, clog reads during normal activity are even worse.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-24 Thread Simon Riggs
On Wed, 2007-01-24 at 09:32 +0530, Pavan Deolasee wrote:

 On a typical desktop class 2 CPU Dell machine, we have seen pgbench
 clocking more than 1500 tps. That implies CLOG would get filled up in
 less than 262144/1500=174 seconds. VACUUM on accounts table takes much
 longer to trigger.

You assume that all of the top level transactions have no
subtransactions. On that test, subtransactions are in use because of the
EXCEPTION clause in the PL/pgSQL used. That should at least double the
number of Xids.

  So
 most of the 636528 reads in the next 55 minutes can be attributed to
 VACUUM.

A similar argument might also be applied to subtrans, so a similar
investigation seems worthwhile. Subtrans has space for fewer Xids than
clog, BTW.

OTOH, I do think that 99% of that will not cause I/O.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Piggybacking vacuum I/O

2007-01-23 Thread Pavan Deolasee

On 1/22/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:


I've been looking at the way we do vacuums.

The fundamental performance issue is that a vacuum generates
nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to
spread the cost like part payment, but the total is the same. In an I/O
bound system, the extra I/O directly leads to less throughput.



Another source of I/O is perhaps the CLOG reads/writes for checking
transaction status. If we are talking about large tables like accounts in
pgbench or customer/stock in DBT2, the tables are vacuumed much later than
the actual UPDATEs. I don't have any numbers to prove it yet, but my sense is
that the CLOG pages holding the status of many of those transactions might
already have been flushed out of the cache and require an I/O. Since the
default number of CLOG SLRU buffers is 8, there could be severe CLOG SLRU
thrashing during VACUUM, as the transaction ids in a heap page will be
essentially random.

Would it help to set the status of the XMIN/XMAX of tuples early enough such
that the heap page is still in the buffer cache, but late enough such that
the XMIN/XMAX transactions are finished? How about doing it when the bgwriter
is about to write the page to disk? Assuming a few seconds of life of a heap
page in the buffer cache, hopefully most of the XMIN/XMAX transactions would
have completed, and the bgwriter could set XMIN(XMAX)_COMMITTED or
XMIN(XMAX)_INVALID for most of the tuples in the page. This would save us
CLOG I/Os later, either during subsequent access to the tuple and/or during
vacuum.
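To make the proposal concrete, here is a hypothetical sketch of such a hook
(no such function exists; the page and tuple accessors are PostgreSQL's, the
rest is assumption, and header locations are approximate). Note that the
TransactionIdDidCommit() call is itself a clog lookup, which is exactly the
deadlock concern raised downthread:

/* Hypothetical sketch only -- xmax handling is analogous and omitted. */
#include "postgres.h"
#include "access/htup.h"
#include "access/transam.h"
#include "storage/bufpage.h"
#include "storage/procarray.h"

static void
set_hint_bits_before_write(Page page)
{
    OffsetNumber off;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    for (off = FirstOffsetNumber; off <= maxoff; off++)
    {
        ItemId          lp = PageGetItemId(page, off);
        HeapTupleHeader tup;
        TransactionId   xmin;

        if (!ItemIdIsUsed(lp))
            continue;
        tup = (HeapTupleHeader) PageGetItem(page, lp);
        xmin = HeapTupleHeaderGetXmin(tup);

        if (tup->t_infomask & (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID))
            continue;                   /* already hinted */
        if (TransactionIdIsInProgress(xmin))
            continue;                   /* not resolved yet */

        if (TransactionIdDidCommit(xmin))
            tup->t_infomask |= HEAP_XMIN_COMMITTED;
        else
            tup->t_infomask |= HEAP_XMIN_INVALID;
    }
}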

Any thoughts ?

Thanks,
Pavan

EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Piggybacking vacuum I/O

2007-01-23 Thread Heikki Linnakangas

Pavan Deolasee wrote:

Another source of I/O is perhaps the CLOG read/writes for checking
transaction status. If we are talking about large tables like accounts in
pgbench or customer/stock in DBT2, the tables are vacuumed much later than
the actual UPDATEs. I don't have any numbers to prove yet, but my sense is
that CLOG pages holding the status of many of the transactions might have
been already flushed out of the cache and require an I/O. Since the default
CLOG SLRU buffers is set to 8, there could be severe CLOG SLRU thrashing
during VACUUM as the transaction ids will be all random in a heap page.


8 log pages hold 8*8192*4=262144 transactions. If the active set of 
transactions is larger than that, the OS cache will probably hold more 
clog pages. I guess you could end up doing some I/O on clog on a vacuum 
of a big table, if you have a high transaction rate and vacuum 
infrequently...
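(The arithmetic, spelled out: clog stores two status bits per transaction.
The first three names below mirror the constants in clog.c, from memory; the
last is just for this calculation and assumes the default 8 kB block size.)

#define CLOG_BITS_PER_XACT   2
#define CLOG_XACTS_PER_BYTE  4                              /* 8 / 2 */
#define CLOG_XACTS_PER_PAGE  (8192 * CLOG_XACTS_PER_BYTE)   /* 32768 */

/* 8 SLRU buffers therefore cover 8 * 32768 = 262144 transactions */
#define XACTS_IN_DEFAULT_CLOG_BUFFERS (8 * CLOG_XACTS_PER_PAGE)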


Would it help to set the status of the XMIN/XMAX of tuples early enough such
that the heap page is still in the buffer cache, but late enough such that
the XMIN/XMAX transactions are finished ? How about doing it when the
bgwriter is about to write the page to disk ? Assuming few seconds of life
of a heap page in the buffer cache, hopefully most of the XMIN/XMAX
transactions should have completed and bgwriter can set XMIN(XMAX)_COMMITTED
or XMIN(XMAX)_INVALID for most of the tuples in the page. This would save us
CLOG I/Os later, either during subsequent access to the tuple and/or
vacuum.


Yeah, we could do that. First I'd like to see some more evidence that 
clog thrashing is a problem, though.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-23 Thread Tom Lane
Pavan Deolasee [EMAIL PROTECTED] writes:
 Would it help to set the status of the XMIN/XMAX of tuples early enough such
 that the heap page is still in the buffer cache, but late enough such that
 the XMIN/XMAX transactions are finished ? How about doing it when the
 bgwriter is about to write the page to disk ?

No.  The bgwriter would then become subject to deadlocks because it
would be needing to read in clog pages before it could flush out
dirty pages.  In any case, if the table is in active use then some
passing backend has probably updated the bits already ...

regards, tom lane



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-23 Thread Pavan Deolasee

On 1/23/07, Tom Lane [EMAIL PROTECTED] wrote:


Pavan Deolasee [EMAIL PROTECTED] writes:
 Would it help to set the status of the XMIN/XMAX of tuples early enough such
 that the heap page is still in the buffer cache, but late enough such that
 the XMIN/XMAX transactions are finished ? How about doing it when the
 bgwriter is about to write the page to disk ?

No.  The bgwriter would then become subject to deadlocks because it
would be needing to read in clog pages before it could flush out
dirty pages.  In any case, if the table is in active use then some
passing backend has probably updated the bits already ...



Well, let me collect some evidence. If we figure out that there is indeed a
CLOG buffer thrash at VACUUM time, I am sure we would be able to solve
the problem one way or the other.

IMHO this case would be more applicable to the very large tables where the
UPDATEd rows are not accessed again for a long time, and hence the hint bits
might not have been updated.

Thanks,
Pavan




--

EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Piggybacking vacuum I/O

2007-01-23 Thread Pavan Deolasee

On 1/23/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:


Pavan Deolasee wrote:
 Another source of I/O is perhaps the CLOG read/writes for checking
 transaction status. If we are talking about large tables like accounts in
 pgbench or customer/stock in DBT2, the tables are vacuumed much later than
 the actual UPDATEs. I don't have any numbers to prove yet, but my sense is
 that CLOG pages holding the status of many of the transactions might have
 been already flushed out of the cache and require an I/O. Since the default
 CLOG SLRU buffers is set to 8, there could be severe CLOG SLRU thrashing
 during VACUUM as the transaction ids will be all random in a heap page.

8 log pages hold 8*8192*4=262144 transactions. If the active set of
transactions is larger than that, the OS cache will probably hold more
clog pages. I guess you could end up doing some I/O on clog on a vacuum
of a big table, if you have a high transaction rate and vacuum
infrequently...



On a typical desktop class 2 CPU Dell machine, we have seen pgbench
clocking more than 1500 tps. That implies CLOG would get filled up in less
than 262144/1500=174 seconds. VACUUM on the accounts table takes much
longer to trigger.


 Would it help to set the status of the XMIN/XMAX of tuples early enough
 such that the heap page is still in the buffer cache, but late enough such
 that the XMIN/XMAX transactions are finished ?

Yeah, we could do that. First I'd like to see some more evidence that
clog thrashing is a problem, though.



Here are some numbers from a 4 hour DBT2 run with 270 warehouses and 50
connections.

2007-01-23 07:40:30 PST_17428 LOG:  vacuuming public.warehouse
2007-01-23 07:40:30 PST_17428 LOG:  CLOG r(1), w(0)- vacuum start
2007-01-23 07:40:30 PST_17428 LOG:  CLOG r(1), w(0)- vacuum end
2007-01-23 07:40:30 PST_17428 LOG:  warehouse: found 1214 removable,
2275 nonremovable row versions in 111 pages

snip

2007-01-23 11:11:43 PST_30356 LOG:  vacuuming public.stock
2007-01-23 11:11:43 PST_30356 LOG:  CLOG r(545323), w(91)- vacuum start
2007-01-23 12:03:14 PST_30356 LOG:  CLOG r(1181851), w(133) - vacuum end
2007-01-23 12:03:14 PST_30356 LOG:  stock: found 5645264 removable,
27003788 nonremovable row versions in 1554697 pages

I have just counted the number of read/write calls on the CLOG blocks. As
you can see, the total number of CLOG reads jumped from 545323 to 1181851,
i.e. 1181851 - 545323 = 636528 CLOG block reads for 1554697 pages of the
stock table. This figure is only indicative since it also includes the CLOG
block reads which would have happened as part of other backend operations
(VACUUM took almost 55 minutes to complete). Still, in the first 210 minutes
of the run, the total reads were only 545323. So most of the 636528 reads in
the next 55 minutes can be attributed to VACUUM.

The writes are very few though, maybe because most of the CLOG pages are
accessed read-only. A simple patch that I used to get these numbers is
attached.

Thanks,
Pavan

EnterpriseDB http://www.enterprisedb.com


[Attachment: CLOG-stats.patch.gz -- GNU Zip compressed data]



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-23 Thread Tom Lane
Pavan Deolasee [EMAIL PROTECTED] writes:
 On a typical desktop class 2 CPU Dell machine, we have seen pgbench
 clocking more than 1500 tps.

Only if you had fsync off, or equivalently a disk drive that lies about
write-complete.  You could possibly achieve such rates in a non-broken
configuration with a battery-backed write cache, but that's not typical
desktop kit.

In any case, you ignored Heikki's point that the PG shared memory pages
holding CLOG are unlikely to be the sole level of caching, if the update
rate is that high.  The kernel will have some pages too.  And even if we
thought not, wouldn't bumping the size of the clog cache be a far
simpler solution offering benefit for more things than just this?

regards, tom lane



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-23 Thread Pavan Deolasee

On 1/24/07, Tom Lane [EMAIL PROTECTED] wrote:


Pavan Deolasee [EMAIL PROTECTED] writes:
 On a typical desktop class 2 CPU Dell machine, we have seen pgbench
 clocking more than 1500 tps.

Only if you had fsync off, or equivalently a disk drive that lies about
write-complete.  You could possibly achieve such rates in a non-broken
configuration with a battery-backed write cache, but that's not typical
desktop kit.



Maybe I was too vague about the machine/test. It's probably not a
typical desktop machine since it has better storage: a two-disk
RAID 0 configuration for data, and a dedicated disk for xlog. I remember
running with 50 clients and a scaling factor of 50, 1 GB of shared buffers,
autovacuum turned on with default parameters, and the rest with the default
configuration. I don't think I had explicitly turned fsync off.



In any case, you ignored Heikki's point that the PG shared memory pages
holding CLOG are unlikely to be the sole level of caching, if the update
rate is that high.  The kernel will have some pages too.  And even if we
thought not, wouldn't bumping the size of the clog cache be a far
simpler solution offering benefit for more things than just this?



Yes, maybe what Heikki said is true, but we don't know for sure.
Wouldn't bumping the cache size just delay the problem a bit,
especially with an even larger table and a very high-end machine/storage
which can clock very high transactions per minute?

Anyway, if we agree that there is a problem, the solution could be
as simple as increasing the cache size, as you suggested.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


[HACKERS] Piggybacking vacuum I/O

2007-01-22 Thread Heikki Linnakangas

I've been looking at the way we do vacuums.

The fundamental performance issue is that a vacuum generates 
nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to 
spread the cost like part payment, but the total is the same. In an I/O 
bound system, the extra I/O directly leads to less throughput.


Therefore, we need to do less I/O. Dead space map helps by allowing us 
to skip blocks that don't need vacuuming, reducing the # of I/Os to 
2*ndirtyblocks+nindexblocks. That's great, but it doesn't help us if the 
dead tuples are spread uniformly.


If we could piggyback the vacuum I/Os to the I/Os that we're doing 
anyway, vacuum wouldn't ideally have to issue any I/O of its own. I've 
tried to figure out a way to do that.


Vacuum is done in 3 phases:

1. Scan heap
2. Vacuum index
3. Vacuum heap

Instead of doing a sequential scan, we could perform the 1st phase by 
watching the buffer pool, scanning blocks for dead tuples when they're 
in memory and keeping track of which pages we've seen. When all pages 
have been seen, the tid list is sorted and 1st phase is done.


In theory, the index vacuum could also be done that way, but let's 
assume for now that indexes would be scanned like they are currently.


The 3rd phase can be performed similarly to the 1st phase. Whenever a 
page enters the buffer pool, we check the tid list and remove any 
matching tuples from the page. When the list is empty, vacuum is complete.
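A very rough sketch of that 3rd phase, with made-up names and structures;
real code would also need locking, WAL logging, and page defragmentation --
this only shows the shape of the idea:

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/itemptr.h"

typedef struct PiggybackVacuum
{
    ItemPointerData *dead_tids;     /* phase-1 result, sorted by (block, offset) */
    int              num_dead;
    int              num_remaining; /* phase 3 is done when this reaches zero */
} PiggybackVacuum;

/* Called whenever a heap page of the target relation enters the buffer pool. */
static void
piggyback_on_page_read(PiggybackVacuum *pv, BlockNumber blkno, Page page)
{
    int i;

    for (i = 0; i < pv->num_dead; i++)      /* a binary search in real code */
    {
        ItemPointer tid = &pv->dead_tids[i];

        if (ItemPointerGetBlockNumber(tid) != blkno)
            continue;

        /* drop the dead tuple's line pointer */
        ItemIdSetUnused(PageGetItemId(page, ItemPointerGetOffsetNumber(tid)));
        pv->num_remaining--;
    }
}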


Of course, there are some issues in the design as described above. For
example, the vacuum might take a long time if there are cold spots in the
table. In fact, a block full of dead tuples might never be visited again.


A variation of the scheme would be to keep scanning pages that are in 
cache, until the tid list reaches a predefined size, instead of keeping 
track of which pages have already been seen. That would deal better with 
tables with hot and cold spots, but it couldn't advance the relfrozenxid
because there would be no guarantee that all pages are visited. Also, we
could start the 1st phase of the next vacuum while we're still in the 3rd
phase of the previous one.


Also, after we've seen 95% of the pages or a timeout expires, we could 
fetch the rest of them with random I/O to let the vacuum finish.


I'm not sure how exactly this would be implemented. Perhaps bgwriter or 
autovacuum would do it, or a new background process. Presumably the 
process would need access to relcache.


One issue is that if we're trying to vacuum every table simultaneously 
this way, we'll need more overall memory for the tid lists. I'm hoping 
there's a way to implement this without requiring shared memory for the 
tid lists, which would make the memory management a nightmare. Also, we'd
need changes to the bufmgr API to support this.


This would work nicely with the DSM. The list of pages that need to be 
visited in phase 1 could be initialized from the DSM, largely avoiding 
the problem with cold spots.


Any thoughts before I start experimenting?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-22 Thread Simon Riggs
On Mon, 2007-01-22 at 13:41 +, Heikki Linnakangas wrote:
 Any thoughts before I start experimenting?

Probably only to detail the various use cases we are discussing.

My thoughts on various use cases are:

- small table with frequent update/delete, heap and indexes all/mostly
cached
e.g. Counter tables, DBT2: District/Warehouse TPC-C, pgbench:
Branches/Tellers
Current VACUUM works well in this situation, since the only I/O incurred
is the WAL written for the VACUUM. VACUUM is very cheap even if not in
cache because of the sequential I/O. Keeping track of whether there are hot
spots in these tables seems like a waste of cycles and could potentially
introduce contention and hence reduce performance. These need to be very
frequently VACUUMed, even when other VACUUMs are required.
My current view: just need multiple concurrent autovacuum processes.

- large table with severe hotspots
e.g. DBT2: NewOrder, larger queue-style tables
The hotspots are likely to be in cache and the not-so-hotspots might or
might not be in cache, but we don't care either way. DSM concept works
well for this case, since we are able to avoid lots of I/O by
appropriate book-keeping. Works well for removing rows after a file-scan
DELETE, as well as for DELETE or UPDATE hot spots. 
My current view: DSM would be great for this

- large table with few hotspots
e.g. DBT2: Stock, pgbench: Accounts, most Customer tables
Current VACUUM works very badly in this case, since updates are sparsely
distributed across the table. DSM wouldn't help either unless we
differentiate between few/many updates to a block. 
My current view: Piggyback concept seems on the right track, but clearly
needs further thought.

Currently we have only one technique for garbage collection, plus one
process to perform it. We need multiple techniques executed by multiple
processes, when required, plus some way of automatically selecting which
is appropriate depending upon the use case. Yes, automatic :-)

DSM and this piggyback idea need not be thought of as competing
techniques.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Piggybacking vacuum I/O

2007-01-22 Thread Jim C. Nasby
On Mon, Jan 22, 2007 at 02:51:47PM +, Heikki Linnakangas wrote:
 I've been looking at the way we do vacuums.
 
 The fundamental performance issue is that a vacuum generates 
 nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to 
 spread the cost like part payment, but the total is the same. In an I/O 
 bound system, the extra I/O directly leads to less throughput.
 
 Therefore, we need to do less I/O. Dead space map helps by allowing us 
 to skip blocks that don't need vacuuming, reducing the # of I/Os to 
 2*ndirtyblocks+nindexblocks. That's great, but it doesn't help us if the 
 dead tuples are spread uniformly.
 
 If we could piggyback the vacuum I/Os to the I/Os that we're doing 
 anyway, vacuum wouldn't ideally have to issue any I/O of its own. I've 
 tried to figure out a way to do that.
 
 Vacuum is done in 3 phases:
 
 1. Scan heap
 2. Vacuum index
 3. Vacuum heap
 


 Instead of doing a sequential scan, we could perform the 1st phase by 
 watching the buffer pool, scanning blocks for dead tuples when they're 
 in memory and keeping track of which pages we've seen. When all pages 
 have been seen, the tid list is sorted and 1st phase is done.
 
 In theory, the index vacuum could also be done that way, but let's 
 assume for now that indexes would be scanned like they are currently.
 
 The 3rd phase can be performed similarly to the 1st phase. Whenever a 
 page enters the buffer pool, we check the tid list and remove any 
 matching tuples from the page. When the list is empty, vacuum is complete.

Is there any real reason to demarcate the start and end of a vacuum? Why
not just go to a continuous process? One possibility is to keep a list
of TIDs for each phase, though that could prove tricky with multiple
indexes.

 A variation of the scheme would be to keep scanning pages that are in 
 cache, until the tid list reaches a predefined size, instead of keeping 
 track of which pages have already been seen. That would deal better with 
 tables with hot and cold spots, but it couldn't advance the relfrozenxid
 because there would be no guarantee that all pages are visited. Also, we 
 could start 1st phase of the next vacuum, while we're still in the 3rd 
 phase of previous one.
 
What if we tracked freeze status on a per-page basis? Perhaps track the
minimum XID that's on each page. That would allow us to ensure that we
freeze pages that are approaching XID wrap.
-- 
Jim Nasby                [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] Piggybacking vacuum I/O

2007-01-22 Thread ITAGAKI Takahiro

Heikki Linnakangas [EMAIL PROTECTED] wrote:

 Vacuum is done in 3 phases:
 1. Scan heap
 2. Vacuum index
 3. Vacuum heap

 A variation of the scheme would be to keep scanning pages that are in 
 cache, until the tid list reaches a predefined size, instead of keeping 
 track of which pages have already been seen. That would deal better with 
 tables with hot and cold spots, but it couldn't advance the relfrozenxid
 because there would be no guarantee that all pages are visited. Also, we 
 could start 1st phase of the next vacuum, while we're still in the 3rd 
 phase of previous one.

ISTM it is another DSM, one with tuple-level accuracy rather than page-level.
One of the benefits is that we can skip the 1st phase of vacuum; we will
have a TID list of dead tuples at the start of vacuum, so we can start
from the 2nd phase.

I have another idea for the use of TID lists -- store the TIDs after the 1st
or 2nd phase, and exit the vacuum. At the next vacuum, we will do both
the previous 3rd phase and the new 1st phase at once, so that I/Os are
reduced from (2*ndirtyblocks + nindexblocks) to (ndirtyblocks + nindexblocks)
on average. We already use a similar method in vacuuming btree indexes
to collect recyclable empty pages.

I think piggybacking of I/Os is very useful. The buffer manager helps us
fold up some of the I/Os, but explicit requests are more effective.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


