[PERFORM] Join runs for > 10 hours and then fills up >1.3TB of disk space

2008-05-15 Thread kevin kempter

Hi List;

I have a table with 9,961,914 rows in it (see the describe of  
bigtab_stats_fact_tmp14 below)


I also have a table with 7,785 rows in it (see the describe of  
xsegment_dim below)


I'm running the join shown below and it takes > 10 hours and  
eventually runs out of disk space on a 1.4TB file system


I've included below a describe of both tables, the join and an explain  
plan, any help / suggestions would be much appreciated !


I need to get this beast to run as quickly as possible (without  
filling up my file system)



Thanks in advance...

select
f14.xpublisher_dim_id,
f14.xtime_dim_id,
f14.xlocation_dim_id,
f14.xreferrer_dim_id,
f14.xsite_dim_id,
f14.xsystem_cfg_dim_id,
f14.xaffiliate_dim_id,
f14.customer_id,
pf_dts_id,
episode_id,
sessionid,
bytes_received,
bytes_transmitted,
total_played_time_sec,
segdim.xsegment_dim_id as episode_level_segid
from
bigtab_stats_fact_tmp14 f14,
xsegment_dim segdim
where
f14.customer_id = segdim.customer_srcid
and f14.show_id = segdim.show_srcid
and f14.season_id = segdim.season_srcid
and f14.episode_id = segdim.episode_srcid
and segdim.segment_srcid is NULL;

QUERY PLAN
------------------------------------------------------------------------------
Merge Join  (cost=1757001.74..73569676.49 rows=3191677219 width=118)
  Merge Cond: ((segdim.episode_srcid = f14.episode_id) AND (segdim.customer_srcid = f14.customer_id) AND (segdim.show_srcid = f14.show_id) AND (segdim.season_srcid = f14.season_id))
  ->  Sort  (cost=1570.35..1579.46 rows=3643 width=40)
        Sort Key: segdim.episode_srcid, segdim.customer_srcid, segdim.show_srcid, segdim.season_srcid
        ->  Seq Scan on xsegment_dim segdim  (cost=0.00..1354.85 rows=3643 width=40)
              Filter: (segment_srcid IS NULL)
  ->  Sort  (cost=1755323.26..1780227.95 rows=9961874 width=126)
        Sort Key: f14.episode_id, f14.customer_id, f14.show_id, f14.season_id
        ->  Seq Scan on bigtab_stats_fact_tmp14 f14  (cost=0.00..597355.74 rows=9961874 width=126)
(9 rows)
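A plan like this is usually attacked by first verifying the 3.2-billion-row estimate and then giving the two big sorts more memory. A minimal sketch of those first steps (the work_mem value and index name are illustrative, not from the thread):

```sql
-- Refresh planner statistics so the rows=3191677219 estimate can be re-checked;
-- a join that really produces ~3.2 billion rows will spill to disk no matter
-- what, so the estimate itself is worth verifying before anything else.
ANALYZE bigtab_stats_fact_tmp14;
ANALYZE xsegment_dim;

-- Give the sorts more memory for this session only (value illustrative).
SET work_mem = '512MB';

-- An index matching the merge keys could let the planner avoid sorting the
-- 10M-row fact table entirely (index name hypothetical; whether the planner
-- uses it depends on the refreshed statistics):
CREATE INDEX f14_join_idx
    ON bigtab_stats_fact_tmp14 (episode_id, customer_id, show_id, season_id);
```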

# \d bigtab_stats_fact_tmp14
Table "public.bigtab_stats_fact_tmp14"
          Column           |            Type             | Modifiers
---------------------------+-----------------------------+-----------
pf_dts_id | bigint  |
pf_device_id   | bigint  |
segment_id   | bigint  |
cdn_id   | bigint  |
collector_id | bigint  |
digital_envoy_id | bigint  |
maxmind_id   | bigint  |
quova_id | bigint  |
website_id   | bigint  |
referrer_id  | bigint  |
affiliate_id | bigint  |
custom_info_id   | bigint  |
start_dt | timestamp without time zone |
total_played_time_sec| numeric(18,5)   |
bytes_received   | bigint  |
bytes_transmitted| bigint  |
stall_count  | integer |
stall_duration_sec   | numeric(18,5)   |
hiccup_count | integer |
hiccup_duration_sec  | numeric(18,5)   |
watched_duration_sec | numeric(18,5)   |
rewatched_duration_sec   | numeric(18,5)   |
requested_start_position | numeric(18,5)   |
requested_stop_position  | numeric(18,5)   |
post_position| numeric(18,5)   |
is_vod   | numeric(1,0)|
sessionid| bigint  |
create_dt| timestamp without time zone |
segment_type_id  | bigint  |
customer_id  | bigint  |
content_publisher_id | bigint  |
content_owner_id | bigint  |
episode_id   | bigint  |
duration_sec | numeric(18,5)   |
device_id| bigint  |
os_id| bigint  |
browser_id   | bigint  |
cpu_id   | bigint  |
xsystem_cfg_dim_id  | bigint  |
xreferrer_dim_id| bigint  |
xaffiliate_dim_id   | bigint  |
xsite_dim_id| bigint  |
xpublisher_dim_id   | bigint  |
season_id| bigint  |
show_id  | bigint  |
xsegment_dim_id | bigint  |
location_id  | bigint  |
zipcode  | character varying(20)   |
xlocation_dim

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Kevin Grittner
>>> On Thu, May 15, 2008 at  5:11 PM, in message
<[EMAIL PROTECTED]>, James Mansion
<[EMAIL PROTECTED]> wrote: 
> Alvaro Herrera wrote:
>> Hint bits are used to mark tuples as created and/or deleted by
>> transactions that are known committed or aborted.  To determine the
>> visibility of a tuple without such bits set, you need to consult
>> pg_clog and possibly pg_subtrans, so it is an expensive check.  On the
>> other
>
> So, how come there is this outstanding work to do, which will
> inevitably be done, and it hasn't been done until it is 'just too late'
> to avoid getting in the way of the query?
 
There has been discussion from time to time about setting the hint
bits for tuple inserts which occur within the same database
transaction as the creation of the table into which they're being
inserted.  That would allow people to cover many of the bulk load
situations.  I don't see it on the task list.  (I would also argue
that there is little information lost, even from a forensic
perspective, to writing such rows as "frozen".)  Is this idea done,
dead, or is someone working on it?
 
If we could set hint bits on dirty buffer pages after the commit, we'd
cover the OLTP situation.  In many situations, there is a bigger OS
cache than PostgreSQL shared memory, and an attempt to set the bits
soon after the commit would coalesce the two writes into one physical
write using RAM-based access, which would be almost as good.  I don't
know if it's feasible to try to do that after the pages have moved
from the PostgreSQL cache to the OS cache, but it would likely be a
performance win.
 
If we are going to burden any requester process with the job of
setting the hint bits, it would typically be better to burden the one
doing the data modification rather than some random thread later
trying to read data from the table.  Of course, getting work off the
requester processes onto some background worker process is generally
even better.
 
-Kevin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Gregory Stark
"Luke Lonergan" <[EMAIL PROTECTED]> writes:

> BTW - we've removed HINT bit checking in Greenplum DB and improved the
> visibility caching which was enough to provide performance at the same level
> as with the HINT bit optimization, but avoids this whole "write the data,
> write it to the log also, then write it again just for good measure"
> behavior.
>
> For people doing data warehousing work like the poster, this Postgres
> behavior is miserable.  It should be fixed for 8.4 for sure (volunteers?)

For people doing data warehousing I would think the trick would be to do
something like what we do to avoid WAL logging for tables created in the same
transaction. 

That is, if you're loading a lot of data at the same time then all of that
data is going to be aborted or committed and that will happen at the same
time. Ideally we would find a way to insert the data with the hint bits
already set to committed and mark the section of the table as being only
provisionally extended so other transactions wouldn't even look at those pages
until the transaction commits.
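The same-transaction trick referred to here is already usable from SQL: if the table is created (or truncated) in the transaction that loads it, and WAL archiving is not enabled, the load can skip WAL logging. A hedged sketch (table and file names hypothetical; the path is server-side, so the COPY needs superuser rights):

```sql
BEGIN;
CREATE TABLE accounts_load (LIKE accounts);       -- table created in this txn
COPY accounts_load FROM '/tmp/accounts.csv' CSV;  -- bulk load, minimal WAL
COMMIT;  -- a crash before this point simply discards the new table's pages
```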

This is similar to the abortive attempt to have the abovementioned WAL logging
trick insert the records pre-frozen. I recall there were problems with that
idea though but I don't recall if they were insurmountable or just required
more work.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread James Mansion

Alvaro Herrera wrote:
> Hint bits are used to mark tuples as created and/or deleted by
> transactions that are known committed or aborted.  To determine the
> visibility of a tuple without such bits set, you need to consult pg_clog
> and possibly pg_subtrans, so it is an expensive check.  On the other

So, how come there is this outstanding work to do, which will inevitably
be done, and it hasn't been done until it is 'just too late' to avoid
getting in the way of the query?

The OP didn't suggest that he had just loaded the data.

Also - is it the case that this only affects the case where updated
pages were spilled during the transaction that changed them?  That is,
if we commit a transaction and there are changed rows still in the cache
because their pages have not yet been evicted, are the hint bits set
immediately so that each page is written just once?  That seems like it
would be the common case in most OLTP systems.

Heikki points out that the list might get big and need to be abandoned,
but then you fall back to scheduling a clog scan that can apply the
bits, which does what you have now, though hopefully in a way that fills
slack disk I/O rather than waiting for the read.

Matthew says: 'it would be a list of changes since the last checkpoint',
but I don't see why you can't start writing hints to in-memory pages as
soon as the transaction ends.  You might fall behind, but I doubt it
with modern CPU speeds.

I can't see why Pavan's suggestion - to try to update as many of the
bits as possible when a dirty page is evicted - would be contentious.

I do think this is something of interest to users, not just developers,
since it may influence the way updates are processed where it is
reasonable to do so in 'bite-sized chunks' as a multipart workflow.





Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Robert Lor

Tom Lane wrote:
> Hmm, the problem would be trying to figure out what percentage of writes
> could be blamed solely on hint-bit updates and not any other change to
> the page.  I don't think that the bufmgr currently keeps enough state to
> know that, but you could probably modify it easily enough, since callers
> distinguish MarkBufferDirty from SetBufferCommitInfoNeedsSave.  Define
> another flag bit that's set only by the first, and test it during
> write-out.

Ok, I made a few changes to bufmgr per my understanding of your
description above and with my limited understanding of the code. Patch
is attached.


Assuming the patch is correct, the effect of writes due to hint bits is
quite significant. I collected the data below by running pgbench in one
terminal and psql in another to run the query.

Is the data plausible?

-Robert

--


Backend PID: 16189
SQL Statement  : select count(*) from accounts;
Execution time : 17.33 sec

 Buffer Read Counts 
Tablespace   Database  Table  Count
 1663  16384   2600  1
 1663  16384   2601  1
 1663  16384   2615  1
 1663  16384   1255  2
 1663  16384   2602  2
 1663  16384   2603  2
 1663  16384   2616  2
 1663  16384   2650  2
 1663  16384   2678  2
 1663  16384   1247  3
 1663  16384   1249  3
 1663  16384   2610  3
 1663  16384   2655  3
 1663  16384   2679  3
 1663  16384   2684  3
 1663  16384   2687  3
 1663  16384   2690  3
 1663  16384   2691  3
 1663  16384   2703  4
 1663  16384   1259  5
 1663  16384   2653  5
 1663  16384   2662  5
 1663  16384   2663  5
 1663  16384   2659  7
 1663  16384  16397   8390

 Dirty Buffer Write Counts =
Tablespace   Database  Table  Count
 1663  16384  16402  2
 1663  16384  16394 11
 1663  16384  16397   4771

== Hint Bits Write Counts ==
Tablespace   Database  Table  Count
 1663  16384  16397   4508

Total buffer cache hits  :   732
Total buffer cache misses:  7731
Average read time from cache :  9136 (ns)
Average read time from disk  :384201 (ns)
Average write time to disk   :210709 (ns)


Backend PID: 16189
SQL Statement  : select count(*) from accounts;
Execution time : 12.72 sec

 Buffer Read Counts 
Tablespace   Database  Table  Count
 1663  16384  16397   8392

 Dirty Buffer Write Counts =
Tablespace   Database  Table  Count
 1663  16384  16394  6
 1663  16384  16402  7
 1663  16384  16397   2870

== Hint Bits Write Counts ==
Tablespace   Database  Table  Count
 1663  16384  16402  2
 1663  16384  16397   2010

Total buffer cache hits  :   606
Total buffer cache misses:  7786
Average read time from cache :  6949 (ns)
Average read time from disk  :706288 (ns)
Average write time to disk   : 90426 (ns)

Index: bufmgr.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.228
diff -u -3 -p -r1.228 bufmgr.c
--- bufmgr.c	1 Jan 2008 19:45:51 -	1.228
+++ bufmgr.c	15 May 2008 20:56:38 -
@@ -42,6 +42,7 @@
 #include "storage/smgr.h"
 #include "utils/resowner.h"
 #include "pgstat.h"
+#include "pg_trace.h"
 
 
 /* Note: these two macros only work on shared buffers, not local ones! */
@@ -171,6 +172,7 @@ ReadBuffer_common(Relation reln, BlockNu
 	if (isExtend)
 		blockNum = smgrnblocks(reln->rd_smgr);
 
+	TRACE_POSTGRESQL_BUFFER_READ_START(blockNum, reln->rd_node.spcNode, reln->rd_node.dbNode, reln->rd_node.relNode, isLocalBuf);
 	pgstat_count_buffer_read(reln);
 
 	if (isLocalBuf)
@@ -200,12 +202,16 @@ ReadBuffer_common(Relation reln, BlockNu
 	{
 		if (!isExtend)
 		{
+			TRACE_POSTGRESQL_BUFFER_HIT();
 			/* Just need to update stats before we exit */
 			pgstat_count_buffer_hit(reln);
 
 			if (VacuumCostActive)
 VacuumCostBalance += VacuumCostPageHit;
 
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(blockNum,
+ reln->rd_node.spcNode, reln->rd_node.dbNode,
+ reln->rd_node.relNode, isLocalBuf, found);
 			return BufferDescriptorGetBuffer(bufHdr);
 		}
 
@@ -257,6 +263,7 @@ ReadBuffer_common(Relation reln, BlockNu
 			} while (!StartBufferIO(bufHdr, true));
 		}
 	}
+	TRACE_POSTGRESQL_BUFFER_MISS();
 
 	/*
 	 * if we have go

Re: [PERFORM] Update performance degrades over time

2008-05-15 Thread Subbiah Stalin-XCGF84
Are there any system catalog views I can check for wait events causing
slower response times?

Thanks in advance.

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Subbiah
Stalin
Sent: Thursday, May 15, 2008 9:28 AM
To: Jeffrey Baker; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Update performance degrades over time

Yes, we are updating one of the indexed timestamp columns, which gets a
unique value on every update. We tried setting
autovacuum_vacuum_scale_factor = 0.1 (down from the default) to make
autovacuum a bit more aggressive; we still see bloating on both the
table and its indexes, though it's creeping up slowly.

Anyway, even with the slower bloating, I still see update performance
degrade, with 15-second response times captured by setting
log_min_duration_statement. It looks like bloating isn't what's causing
the slower updates. Any help/ideas to tune this would be appreciated.

Explain plan seems reasonable for the update statement.

update tablexy set col2=$1,col9=$2, col10=$3,col11=$4,col3=$5 WHERE
ID=$6;

                                QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using ct_tablexy_id_u1 on tablexy  (cost=0.00..8.51 rows=1 width=194) (actual time=0.162..0.166 rows=1 loops=1)
   Index Cond: ((id)::text = '32xka8axki8'::text)

Thanks in advance.

Stalin

-Original Message-
From: Jeffrey Baker [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 15, 2008 6:56 AM
To: Subbiah Stalin-XCGF84; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Update performance degrades over time

On Wed, May 14, 2008 at 6:31 PM, Subbiah Stalin-XCGF84
<[EMAIL PROTECTED]> wrote:
> Hi All,
>
> We are doing some load tests with our application running postgres
> 8.2.4. At times we see updates on a table taking longer (around
> 11-16 secs) than the expected sub-second response time. The table in
> question is getting updated constantly through the load tests. In
> checking the table size including indexes, they seem to be bloated; we
> got it confirmed after recreating the table (stats below). We have
> autovacuum enabled with default parameters. I thought autovacuum would
> avoid bloating issues, but it looks like it's not aggressive enough.
> Wondering if table/index bloating is causing the update slowness over a
> period of time. Any ideas how to troubleshoot this further?

Sometimes it is necessary to not only VACUUM, but also REINDEX.  If your
update changes an indexed column to a new, distinct value, you can
easily get index bloat.

Also, you should check to see if you have any old, open transactions on
the same instance.  If you do, it's possible that VACUUM will have no
beneficial effect.

-jwb





Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Tom Lane
Robert Lor <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> It's certainly true that hint-bit updates cost something, but
>> quantifying how much isn't easy. 

> Maybe we can instrument the code with DTrace probes to quantify the 
> actual costs.

Hmm, the problem would be trying to figure out what percentage of writes
could be blamed solely on hint-bit updates and not any other change to
the page.  I don't think that the bufmgr currently keeps enough state to
know that, but you could probably modify it easily enough, since callers
distinguish MarkBufferDirty from SetBufferCommitInfoNeedsSave.  Define
another flag bit that's set only by the first, and test it during
write-out.

regards, tom lane



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Robert Lor

Tom Lane wrote:
> It's certainly true that hint-bit updates cost something, but
> quantifying how much isn't easy.

Maybe we can instrument the code with DTrace probes to quantify the
actual costs.  I'm not familiar with the code, but if I know where to
place the probes, I can easily do a quick test and provide the data.

> The off-the-cuff answer is to do the
> select count(*) twice and see how much cheaper the second one is.

It doesn't seem that the second run is cheaper, as shown in the results
below.  The data came from the probes I've added recently.


*** Run #1 **
SQL Statement  : select count(*) from accounts;
Execution time : 1086.58 (ms)

 Buffer Read Counts 
Tablespace   Database  Table  Count
 1663  16384   1247  1
 1663  16384   2600  1
 1663  16384   2703  1
 1663  16384   1255  2
 1663  16384   2650  2
 1663  16384   2690  3
 1663  16384   2691  3
 1663  16384  16397   8390

 Dirty Buffer Write Counts =
Tablespace   Database  Table  Count
 1663  16384  16397   2865

Total buffer cache hits  :  1932
Total buffer cache misses:  6471
Average read time from cache :  5638 (ns)
Average read time from disk  :143371 (ns)
Average write time to disk   : 20368 (ns)


*** Run #2 **
SQL Statement  : select count(*) from accounts;
Execution time : 1115.94 (ms)

 Buffer Read Counts 
Tablespace   Database  Table  Count
 1663  16384  16397   8390

 Dirty Buffer Write Counts =
Tablespace   Database  Table  Count
 1663  16384  16397   2865

Total buffer cache hits  :  1931
Total buffer cache misses:  6459
Average read time from cache :  4357 (ns)
Average read time from disk  :154127 (ns)
Average write time to disk   : 20368 (ns)


-Robert
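Tom's two-run comparison is easy to reproduce from psql (a sketch; the table name is the one from this thread, and the hint-bit effect only shows on the first scan after a bulk change):

```sql
\timing
SELECT count(*) FROM accounts;  -- first scan may also write hint-bit-dirtied pages
SELECT count(*) FROM accounts;  -- with hint bits already set, this run is read-only
```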



Re: [PERFORM] Update performance degrades over time

2008-05-15 Thread Subbiah Stalin-XCGF84
Yes, we are updating one of the indexed timestamp columns, which gets a
unique value on every update. We tried setting
autovacuum_vacuum_scale_factor = 0.1 (down from the default) to make
autovacuum a bit more aggressive; we still see bloating on both the
table and its indexes, though it's creeping up slowly.

Anyway, even with the slower bloating, I still see update performance
degrade, with 15-second response times captured by setting
log_min_duration_statement. It looks like bloating isn't what's causing
the slower updates. Any help/ideas to tune this would be appreciated.
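Bloat like this can be tracked over time with the built-in size functions (a sketch; the table name is the one from the plan below):

```sql
-- If total size grows much faster than the heap alone, the indexes are the
-- part that is bloating.
SELECT pg_size_pretty(pg_relation_size('tablexy'))       AS heap_size,
       pg_size_pretty(pg_total_relation_size('tablexy')) AS heap_plus_indexes;
```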

Explain plan seems reasonable for the update statement.

update tablexy set col2=$1,col9=$2, col10=$3,col11=$4,col3=$5 WHERE
ID=$6;

                                QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using ct_tablexy_id_u1 on tablexy  (cost=0.00..8.51 rows=1 width=194) (actual time=0.162..0.166 rows=1 loops=1)
   Index Cond: ((id)::text = '32xka8axki8'::text)

Thanks in advance.

Stalin

-Original Message-
From: Jeffrey Baker [mailto:[EMAIL PROTECTED] 
Sent: Thursday, May 15, 2008 6:56 AM
To: Subbiah Stalin-XCGF84; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Update performance degrades over time

On Wed, May 14, 2008 at 6:31 PM, Subbiah Stalin-XCGF84
<[EMAIL PROTECTED]> wrote:
> Hi All,
>
> We are doing some load tests with our application running postgres
> 8.2.4. At times we see updates on a table taking longer (around
> 11-16 secs) than the expected sub-second response time. The table in
> question is getting updated constantly through the load tests. In
> checking the table size including indexes, they seem to be bloated; we
> got it confirmed after recreating the table (stats below). We have
> autovacuum enabled with default parameters. I thought autovacuum would
> avoid bloating issues, but it looks like it's not aggressive enough.
> Wondering if table/index bloating is causing the update slowness over a
> period of time. Any ideas how to troubleshoot this further?

Sometimes it is necessary to not only VACUUM, but also REINDEX.  If your
update changes an indexed column to a new, distinct value, you can
easily get index bloat.

Also, you should check to see if you have any old, open transactions on
the same instance.  If you do, it's possible that VACUUM will have no
beneficial effect.

-jwb
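Both of these checks can be done from psql (a sketch for the 8.2-era catalogs in this thread; the pg_stat_activity column names changed in later releases, and REINDEX takes an exclusive lock on the table):

```sql
-- Rebuild the bloated indexes; blocks writers for the duration.
REINDEX TABLE tablexy;

-- Look for long-lived backends whose open transactions can keep VACUUM
-- from reclaiming dead rows.
SELECT procpid, usename, backend_start, query_start, current_query
FROM pg_stat_activity
ORDER BY query_start;
```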



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> Heikki Linnakangas escribió:
>> We know what kind of a relation we're dealing with in ReadBuffer, so we  
>> could add a flag to BufferDesc to mark heap pages.

> Hmm, I was thinking that it would need access to the catalogs to know
> where the tuples are, but that's certainly not true, so perhaps it could
> be made to work.

The issue in my mind is not so much could bgwriter physically do it
as that it's a violation of module layering.  That has real
consequences, like potential for deadlocks.  It'll become particularly
pressing if we go forward with the plans to get rid of the separate
dedicated buffers for pg_clog etc and have them work in the main
shared-buffer pool.

regards, tom lane



Re: [PERFORM] which ext3 fs type should I use for postgresql

2008-05-15 Thread Guillaume Cottenceau
"Joshua D. Drake"  writes:

> Guillaume Cottenceau wrote:
>> Matthew Wakeling  writes:
>
>> It is still relevant, as with 5% margin, you can afford changing
>> that to 0% with tune2fs, just the time for you to start PG and
>> remove some data by SQL, then shutdown and set the margin to 5%
>> again.
>
> I find that if you actually reach that level of capacity failure it is
> due to lack of management and likely there is much lower hanging fruit
> left over by a lazy dba or sysadmin than having to adjust filesystem
> level parameters.
>
> Manage actively and the above change is absolutely irrelevant.

Of course. I didn't say otherwise. I only say that it's useful in
that case. E.g. if you're using a dedicated partition for PG,
then a good solution is what I describe, rather than horrifyingly
trying to remove some random PG files, or when you cannot
temporarily move some of them and symlink them from the PG partition.
I don't praise that kind of case; it should of course be avoided
by sane management. Bad management is not a reason for hiding
solutions to problems that can happen!

-- 
Guillaume Cottenceau



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Alvaro Herrera
Matthew Wakeling wrote:

Aside from the rest of commentary, a slight clarification:

> So, as I understand it, Postgres works like this:
>
> 1. You begin a transaction. Postgres writes an entry into pg_clog.

Starting a transaction does not write anything to pg_clog.

-- 
Alvaro Herrera                               http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Alvaro Herrera
Heikki Linnakangas escribió:
> Alvaro Herrera wrote:

>> The problem is that the bgwriter does not understand about the content
>> of the pages it is writing -- they're opaque pages for all it knows.  So
>> it cannot touch the hint bits.
>
> We know what kind of a relation we're dealing with in ReadBuffer, so we  
> could add a flag to BufferDesc to mark heap pages.

Hmm, I was thinking that it would need access to the catalogs to know
where the tuples are, but that's certainly not true, so perhaps it could
be made to work.

>> If we had the bitmask in a separate map fork, this could be cheap.
>
> I don't buy that. The point of a hint bit is that it's right there along  
> with the tuple you're looking at. If you have to look at a separate  
> buffer, you might as well just look at clog.

True -- I was confusing this with the idea of having the tuple MVCC
header (xmin, xmax, etc) in a separate fork, which would make the idea
of index-only scans more feasible at the expense of seqscans.

-- 
Alvaro Herrera                               http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [PERFORM] which ext3 fs type should I use for postgresql

2008-05-15 Thread Joshua D. Drake

Guillaume Cottenceau wrote:
> Matthew Wakeling writes:
>
> It is still relevant, as with 5% margin, you can afford changing
> that to 0% with tune2fs, just the time for you to start PG and
> remove some data by SQL, then shutdown and set the margin to 5%
> again.

I find that if you actually reach that level of capacity failure, it is
due to lack of management, and likely there is much lower-hanging fruit
left over by a lazy DBA or sysadmin than having to adjust
filesystem-level parameters.

Manage actively and the above change is absolutely irrelevant.

Joshua D. Drake



Re: [PERFORM] which ext3 fs type should I use for postgresql

2008-05-15 Thread Guillaume Cottenceau
Matthew Wakeling  writes:

> On Thu, 15 May 2008, Guillaume Cottenceau wrote:
>> Also, IIRC when PG writes data up to a full filesystem,
>> postmaster won't be able to then restart if the filesystem is
>> still full (it needs some free disk space for its startup).
>>
>> Or maybe this has been fixed in recent versions?
>
> Ah, the "not enough space to delete file, delete some files and try
> again" problem. Anyway, that isn't relevant to the reserved
> percentage, as that will happen whether or not the filesystem is 5%
> smaller.

It is still relevant, as with 5% margin, you can afford changing
that to 0% with tune2fs, just the time for you to start PG and
remove some data by SQL, then shutdown and set the margin to 5%
again.
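The sequence described above reads roughly like this (device name illustrative; tune2fs applies to ext2/ext3, and the commands must be run as root):

```shell
# Drop the root-reserved margin so the full filesystem gains ~5% headroom.
tune2fs -m 0 /dev/sda5    # device name illustrative

# ...start PostgreSQL, DELETE/TRUNCATE what you can via SQL, stop PostgreSQL...

# Restore the default 5% reserve.
tune2fs -m 5 /dev/sda5
```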

-- 
Guillaume Cottenceau



Re: [PERFORM] which ext3 fs type should I use for postgresql

2008-05-15 Thread Matthew Wakeling

On Thu, 15 May 2008, Guillaume Cottenceau wrote:
> Also, IIRC when PG writes data up to a full filesystem,
> postmaster won't be able to then restart if the filesystem is
> still full (it needs some free disk space for its startup).
>
> Or maybe this has been fixed in recent versions?

Ah, the "not enough space to delete file, delete some files and try again"
problem. Anyway, that isn't relevant to the reserved percentage, as that
will happen whether or not the filesystem is 5% smaller.


Matthew

--
Let's say I go into a field and I hear "baa baa baa". Now, how do I work
out whether that was "baa" followed by "baa baa", or if it was "baa baa"
followed by "baa"?
- Computer Science Lecturer



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Matthew Wakeling

On Thu, 15 May 2008, Heikki Linnakangas wrote:
> There's not much point optimizing something that only helps with
> aborted transactions.

That's fair enough, but this list method is likely to speed up index
writes anyway.

> The general problem with any idea that involves keeping a list of
> changes made in a transaction is that that list will grow big during
> bulk loads, so you'll have to overflow to disk or abandon the list
> approach. Which means that it won't help with bulk loads.

Yeah, it wouldn't be a list of changes for the transaction, it would be
a list of changes since the last checkpoint. Keeping data in memory for
the length of the transaction is doomed to failure, because there is no
bound on its size, so bulk loads are still going to miss out on hint
optimisation.


Matthew

--
for a in past present future; do
 for b in clients employers associates relatives neighbours pets; do
 echo "The opinions here in no way reflect the opinions of my $a $b."
done; done



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Heikki Linnakangas

Alvaro Herrera wrote:
> Greg Smith escribió:
>> On Thu, 15 May 2008, Pavan Deolasee wrote:
>>> I had suggested in the past that whenever we set hint bits for a
>>> tuple, we should check all other tuples in the page and set their
>>> hint bits too to avoid multiple writes of the same page. I guess the
>>> idea got rejected because of lack of benchmarks to prove the benefit.
>>
>> From glancing at http://www.postgresql.org/docs/faqs.TODO.html I got
>> the impression the idea was to have the background writer get involved
>> to help with this particular situation.
>
> The problem is that the bgwriter does not understand about the content
> of the pages it is writing -- they're opaque pages for all it knows.
> So it cannot touch the hint bits.

We know what kind of a relation we're dealing with in ReadBuffer, so we
could add a flag to BufferDesc to mark heap pages.

> If we had the bitmask in a separate map fork, this could be cheap.

I don't buy that. The point of a hint bit is that it's right there along
with the tuple you're looking at. If you have to look at a separate
buffer, you might as well just look at clog.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Heikki Linnakangas

Matthew Wakeling wrote:

On Thu, 15 May 2008, Heikki Linnakangas wrote:
> Is it really safe to update the hint bits in place? If there is a
> power cut in the middle of writing a block, is there a guarantee from
> the disc that the block will never be garbled?


Don't know, to be honest. We've never seen any reports of corrupted 
data that would suggest such a problem, but it doesn't seem impossible 
to me that some exotic storage system might do that.


Hmm. That problem is what WAL full-page-writes is meant to handle, isn't 
it? So basically, if you're telling people that WAL full-page-writes is 
safer than partial WAL, because it avoids updating pages in-place, then 
you shouldn't be updating pages in-place for the hint bits either. You 
can't win!


Full-page-writes protect from torn pages, that is, when one half of an 
update hits the disk but the other one doesn't. In particular, if the 
beginning of the page, which holds the WAL pointer (XLogRecPtr), is 
flushed to disk but the actual changes elsewhere in the page aren't, 
you're in trouble: WAL replay will look at the WAL pointer and think 
that the page doesn't need to be replayed, while the other half of the 
update is still missing.


Hint bits are different. We're only updating a single bit, and it 
doesn't matter from a correctness point of view whether the hint bit 
update hits the disk or not. But what would spell trouble is if the disk 
controller/whatever garbles the whole sector, IOW changes something other 
than the changed bit, while doing the update.


In fact, if the tuple's creating transaction has aborted, then the 
tuple can be vacuumed right there and then before it is even written. 


Not if you have any indexes on the table. To vacuum, you'll have to 
scan all indexes to remove pointers to the tuple.


Ah. Well, would that be so expensive? After all, someone has to do it 
eventually, and these are index entries that have only just been added 
anyway.


Scanning all indexes? Depends on your table of course, but yes it would 
be expensive in general.


An alternative would be to build a "list of changes" in the WAL without 
actually changing the underlying index at all. When reading the index, 
you would read the "list" first (which would be in memory, and in an 
efficient-to-search structure), then read the original index and add the 
two. Then when checkpointing, vet all the changes against known aborted 
transactions before making all the changes to the index together. This 
is likely to speed up index writes quite a bit, and also allow you to 
effectively vacuum aborted tuples before they get written to the disc.


There's not much point optimizing something that only helps with aborted 
transactions.


The general problem with any idea that involves keeping a list of 
changes made in a transaction is that that list will grow big during 
bulk loads, so you'll have to overflow to disk or abandon the list 
approach. Which means that it won't help with bulk loads.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] which ext3 fs type should I use for postgresql

2008-05-15 Thread Guillaume Cottenceau
Craig James writes:

> Matthew Wakeling wrote:
>> Probably of more use are some of the other settings:
>>
>>  -m reserved-blocks-percentage - this reserves a portion of the filesystem
>> that only root can write to. If root has no need for it, you can kill
>> this by setting it to zero. The default is for 5% of the disc to be
>> wasted.
>
> This is not a good idea.  The 5% is NOT reserved for root's
> use, but rather is to prevent severe file fragmentation.  As

Also, IIRC when PG writes data up to a full filesystem, the
postmaster won't then be able to restart while the filesystem is
still full (it needs some free disk space for its startup).

Or maybe this has been fixed in recent versions?

-- 
Guillaume Cottenceau



Re: [PERFORM] which ext3 fs type should I use for postgresql

2008-05-15 Thread Craig James

Matthew Wakeling wrote:

Probably of more use are some of the other settings:

 -m reserved-blocks-percentage - this reserves a portion of the filesystem
that only root can write to. If root has no need for it, you can kill
this by setting it to zero. The default is for 5% of the disc to be
wasted.


This is not a good idea.  The 5% is NOT reserved for root's use, but rather is 
to prevent severe file fragmentation.  As the disk gets full, the remaining 
empty spaces tend to be small spaces scattered all over the disk, meaning that 
even for modest-sized files, the kernel can't allocate contiguous disk blocks.  
If you reduce this restriction to 0%, you are virtually guaranteed poor 
performance when you fill up your disk, since those files that are allocated 
last will be massively fragmented.
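The mechanism can be illustrated with a toy first-fit allocator (purely illustrative Python, not how ext3's allocator actually works): on a nearly full "disk" the free space is many small scattered holes, so a new file is forced into many extents, while on an emptier disk the same file fits in one.

```python
# Toy block allocator, illustrative only: counts how many separate
# extents (fragments) a new file needs, given the free-space layout.

def allocate(free_extents, blocks_needed):
    """Greedy first-fit; returns the number of extents the file uses."""
    pieces = 0
    for size in free_extents:
        if blocks_needed <= 0:
            break
        take = min(size, blocks_needed)
        blocks_needed -= take
        pieces += 1
    return pieces

# Emptier disk: one big contiguous free region -> one extent.
print(allocate([100000], 500))    # 1

# 99%-full disk: free space is 500 scattered 2-block holes -> 250 extents.
print(allocate([2] * 500, 500))   # 250
```

The same 500-block file costs one seek-free extent on the empty disk and hundreds of fragments on the full one, which is exactly the behavior the reserved 5% is there to prevent.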

Worse, the fragmented files that you create remain fragmented even if you clean 
up to get back below the 95% mark.  If Postgres happened to insert a lot of 
data on a 99% full file system, those blocks could be spread all over the 
place, and they'd stay that way forever, even after you cleared some space.

Craig



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Tom Lane
Matthew Wakeling <[EMAIL PROTECTED]> writes:
> Hmm. That problem is what WAL full-page-writes is meant to handle, isn't 
> it? So basically, if you're telling people that WAL full-page-writes is 
> safer than partial WAL, because it avoids updating pages in-place, then 
> you shouldn't be updating pages in-place for the hint bits either. You 
> can't win!

This argument ignores the nature of the data change.  With a hint-bit
update, no data is being shuffled around, so there is no danger from a
partial page write.

A disk that leaves an individual sector corrupt would be a problem,
but I don't think that's a huge risk.  Keep in mind that disks aren't
designed to just stop dead when power dies --- they are made to be able
to park their heads before the juice is entirely gone.  I think it's
reasonable to assume they'll finish writing the sector in progress
before they start parking.

regards, tom lane



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Alvaro Herrera
Greg Smith wrote:
> On Thu, 15 May 2008, Pavan Deolasee wrote:
>
>> I had suggested in the past that whenever we set hint bits for a tuple, 
>> we should check all other tuples in the page and set their hint bits 
>> too to avoid multiple writes of the same page. I guess the idea got 
>> rejected because of lack of benchmarks to prove the benefit.
>
> From glancing at http://www.postgresql.org/docs/faqs.TODO.html I got the  
> impression the idea was to have the background writer get involved to 
> help with this particular situation.

The problem is that the bgwriter does not understand about the content
of the pages it is writing -- they're opaque pages for all it knows.  So
it cannot touch the hint bits.

I agree with Pavan that it's likely that setting hint bits in batches
instead of just for the tuple being examined is a benefit.  However,
it's perhaps not so good to be doing it in a foreground process, because
you're imposing extra cost to the client queries which we want to be as
fast as possible.  Perhaps the thing to do is have a "database-local
bgwriter" which would scan pages and do this kind of change ...
a different kind of process to be launched by autovacuum perhaps.

If we had the bitmask in a separate map fork, this could be cheap.

-- 
Alvaro Herrera                               http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [PERFORM] Update performance degrades over time

2008-05-15 Thread Jeffrey Baker
On Wed, May 14, 2008 at 6:31 PM, Subbiah Stalin-XCGF84
<[EMAIL PROTECTED]> wrote:
> Hi All,
>
> We are doing some load tests with our application running postgres 8.2.4. At
> times we see updates on a table taking longer (around
> 11-16secs) than expected sub-second response time. The table in question is
> getting updated constantly through the load tests. In checking the table
> size including indexes, they seem to be bloated; got it confirmed after
> recreating it (stats below). We have autovacuum enabled with default
> parameters. I thought autovacuum would avoid bloating issues but looks like
> it's not aggressive enough. Wondering if table/index bloating is causing
> update slowness over a period of time. Any ideas how to troubleshoot this
> further?

Sometimes it is necessary to not only VACUUM, but also REINDEX.  If
your update changes an indexed column to a new, distinct value, you
can easily get index bloat.

Also, you should check to see if you have any old, open transactions
on the same instance.  If you do, it's possible that VACUUM will have
no beneficial effect.

-jwb



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Ron Mayer

Matthew Wakeling wrote:

On Thu, 15 May 2008, Luke Lonergan wrote:

...HINT bit optimization, but avoids this whole "write the data,
write it to the log also, then write it again just for good measure"

...
The hint data will be four bits per tuple plus overheads, so it could be 
made very compact, and therefore likely to stay in the cache fairly 
well. 


Does it seem like these HINT bits would be good candidates to move
off to map forks similar to how the visibility map stuff will be handled?

Since (if I understand right) only the hint bits change during the
select(*) it seems a lot less write-IO would happen if such a map
were updated rather than the data pages themselves.



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Matthew Wakeling

On Thu, 15 May 2008, Heikki Linnakangas wrote:
> Is it really safe to update the hint bits in place? If there is a 
> power cut in the middle of writing a block, is there a guarantee from 
> the disc that the block will never be garbled?


Don't know, to be honest. We've never seen any reports of corrupted data 
that would suggest such a problem, but it doesn't seem impossible to me 
that some exotic storage system might do that.


Hmm. That problem is what WAL full-page-writes is meant to handle, isn't 
it? So basically, if you're telling people that WAL full-page-writes is 
safer than partial WAL, because it avoids updating pages in-place, then 
you shouldn't be updating pages in-place for the hint bits either. You 
can't win!


In fact, if the tuple's creating transaction has aborted, then the tuple 
can be vacuumed right there and then before it is even written. 


Not if you have any indexes on the table. To vacuum, you'll have to scan all 
indexes to remove pointers to the tuple.


Ah. Well, would that be so expensive? After all, someone has to do it 
eventually, and these are index entries that have only just been added 
anyway.


I can understand index updating being a bit messy in the middle of a 
checkpoint though, as you would have to write the update to the WAL, which 
you are checkpointing...


So, I don't know exactly how the WAL updates to indexes work, but my guess 
is that it has been implemented as "write the blocks that we would change 
to the WAL". The problem with this is that all the changes to the index 
are done individually, so there's no easy way to "undo" one of them later 
on when you find out that the transaction has been aborted during the 
checkpoint.


An alternative would be to build a "list of changes" in the WAL without 
actually changing the underlying index at all. When reading the index, you 
would read the "list" first (which would be in memory, and in an 
efficient-to-search structure), then read the original index and add the 
two. Then when checkpointing, vet all the changes against known aborted 
transactions before making all the changes to the index together. This is 
likely to speed up index writes quite a bit, and also allow you to 
effectively vacuum aborted tuples before they get written to the disc.


Matthew

--
Vacuums are nothings. We only mention them to let them know we know
they're there.



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Jan de Visser
On Thursday 15 May 2008 03:02:19 Tom Lane wrote:
> "Jan de Visser" <[EMAIL PROTECTED]> writes:
> > Obviously, this issue is tied to the slow count(*) one, as I found out
> > the hard way. Consider the following scenario:
> > * Insert row
> > * Update that row a couple of times
> > * Rinse and repeat many times
> >
> > Now somewhere during that cycle, do a select count(*) just to see
> > where you are. You will be appalled by how slow that is, due to not
> > only the usual 'slow count(*)' reasons. This whole hint bit business
> > makes it even worse, as demonstrated by the fact that running a vacuum
> > before the count(*) makes the latter noticably faster.
>
> Uh, well, you can't blame that entirely on hint-bit updates.  The vacuum
> has simply *removed* two-thirds of the rows in the system, resulting in
> a large drop in the number of rows that the select even has to look at.
>
> It's certainly true that hint-bit updates cost something, but
> quantifying how much isn't easy.  The off-the-cuff answer is to do the
> select count(*) twice and see how much cheaper the second one is.  But
> there are two big holes in that answer: the first is the possible cache
> effects from having already read in the pages, and the second is that
> the follow-up scan gets to avoid the visits to pg_clog that the first
> scan had to make (which after all is the point of the hint bits).
>
> I don't know any easy way to disambiguate the three effects that are at
> work here.  But blaming it all on the costs of writing out hint-bit
> updates is wrong.
>
>   regards, tom lane

True. But it still contributes to the fact that queries sometimes behave in a 
non-deterministic way, which IMHO is the major annoyance when starting to 
work with pgsql. And contrary to other causes (vacuum, checkpoints) this is 
woefully underdocumented.

jan



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Matthew Wakeling

On Thu, 15 May 2008, Luke Lonergan wrote:

BTW - we've removed HINT bit checking in Greenplum DB and improved the
visibility caching, which was enough to provide performance at the same level
as with the HINT bit optimization, but avoids this whole "write the data,
write it to the log also, then write it again just for good measure"
behavior.


This sounds like a good option. I believe I suggested this a few months 
ago, however it was rejected because in the worst case (when the hints are 
not cached), if you're doing an index scan, you can do twice the number of 
seeks as before.


http://archives.postgresql.org/pgsql-performance/2007-12/msg00217.php

The hint data will be four bits per tuple plus overheads, so it could be 
made very compact, and therefore likely to stay in the cache fairly well. 
Each tuple fetched would have to be spaced really far apart in the 
database table in order to exhibit the worst case, because fetching a page 
of hint cache will cause 64kB or so of disc to appear in the disc's 
read-ahead buffer, which will be equivalent to 128MB worth of database 
table (assuming eight tuples per block and no overhead). As soon as you 
access another tuple in the same 128MB bracket, you'll hit the disc 
read-ahead buffer for the hints.
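The coverage estimate can be checked with a quick back-of-the-envelope calculation, using only the assumptions stated above (4 hint bits per tuple, 8 kB heap blocks, eight tuples per block, a 64 kB read-ahead window, no overhead):

```python
# Back-of-the-envelope check of the hint-cache coverage estimate.
# All figures are the assumptions from the email, not measured values.

HINT_BITS_PER_TUPLE = 4
READAHEAD_BYTES = 64 * 1024        # 64 kB of hint data pulled in per seek
BLOCK_BYTES = 8 * 1024             # standard 8 kB heap block
TUPLES_PER_BLOCK = 8               # "eight tuples per block and no overhead"

# How many tuples' hints fit in one read-ahead window?
tuples_covered = READAHEAD_BYTES * 8 // HINT_BITS_PER_TUPLE   # bits -> tuples

# How much heap table does that correspond to?
heap_blocks = tuples_covered // TUPLES_PER_BLOCK
heap_mb = heap_blocks * BLOCK_BYTES // (1024 * 1024)

print(tuples_covered)   # 131072 tuples per 64 kB of hint cache
print(heap_mb)          # 128 (MB), matching the figure in the email
```

So each 64 kB read of the hypothetical hint fork would indeed cover about 128 MB of heap, which is why the worst case (one seek per tuple) requires tuples spaced 128 MB apart.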


On balance, to me it still seems like a good option.

Matthew

--
Those who do not understand Unix are condemned to reinvent it, poorly.
   -- Henry Spencer


Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Heikki Linnakangas

Matthew Wakeling wrote:
Is it really safe to update the hint bits in place? If there is a power 
cut in the middle of writing a block, is there a guarantee from the disc 
that the block will never be garbled?


Don't know, to be honest. We've never seen any reports of corrupted data 
that would suggest such a problem, but it doesn't seem impossible to me 
that some exotic storage system might do that.


Is there a way to make a shortcut and have the hint bits written the 
first time the data is written to the table? One piece of obvious 
low-hanging fruit would be to enhance step five above, so that the 
bgwriter or checkpoint that writes the data to the database table checks 
the pg_clog and writes the correct hint bits.


Yep, that's an idea that's been suggested before. In fact, I seem to 
remember a patch to do just that. Don't remember what happened to it,


In fact, if the tuple's 
creating transaction has aborted, then the tuple can be vacuumed right 
there and then before it is even written. 


Not if you have any indexes on the table. To vacuum, you'll have to scan 
all indexes to remove pointers to the tuple.


However, this idea does not deal well with bulk data loads, where the 
data is checkpointed before transaction is committed or aborted.


Yep, that's the killer :-(.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Matthew Wakeling

On Wed, 14 May 2008, Alvaro Herrera wrote:

Hint bits are used to mark tuples as created and/or deleted by
transactions that are known committed or aborted.  To determine the
visibility of a tuple without such bits set, you need to consult pg_clog
and possibly pg_subtrans, so it is an expensive check.


So, as I understand it, Postgres works like this:

1. You begin a transaction. Postgres writes an entry into pg_clog.
2. You write some tuples. Postgres writes them to the WAL, but doesn't
bother fsyncing.
3. At some point, the bgwriter or a checkpoint may write the tuples to the
database tables, and fsync the lot.
4. You commit the transaction. Postgres alters pg_clog again, writes that
to the WAL, and fsyncs the WAL.
5. If the tuples hadn't already made it to the database tables, then a
checkpoint or bgwriter will do it later on, and fsync the lot.
6. You read the tuples. Postgres reads them from the database table, looks
in pg_clog, notices that the transaction has been committed, and
writes the tuples to the database table again with the hint bits set.
This write is not WAL protected, and is not fsynced.
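The clog-vs-hint-bit trade-off in step 6 can be sketched with a toy model (illustrative Python only, nothing here is real PostgreSQL code: a dict stands in for pg_clog and a counter tracks the simulated lookups):

```python
# Toy model of hint bits: the first scan consults the (simulated) clog
# and caches the result on each tuple; later scans never touch it again.

clog = {100: "committed", 101: "aborted"}   # xid -> transaction status
clog_lookups = 0

def tuple_visible(tup):
    global clog_lookups
    if tup["hint"] is None:                 # hint bit not set yet
        clog_lookups += 1                   # expensive: consult pg_clog
        tup["hint"] = clog[tup["xmin"]]     # set the hint; in PostgreSQL
                                            # this dirties the page, causing
                                            # the un-WAL-logged write above
    return tup["hint"] == "committed"

table = [{"xmin": 100, "hint": None} for _ in range(1000)]

first = sum(tuple_visible(t) for t in table)    # pays 1000 clog lookups
second = sum(tuple_visible(t) for t in table)   # pays none
print(first, second, clog_lookups)              # 1000 1000 1000
```

The first scan pays one clog lookup per tuple (and, in the real system, dirties every page it touches); the second scan reads the cached hints and pays nothing.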

This seems like a good architecture, with some cool characteristics, 
mainly that at no point does Postgres have to hold vast quantities of data 
in memory. I have two questions though:


Is it really safe to update the hint bits in place? If there is a power 
cut in the middle of writing a block, is there a guarantee from the disc 
that the block will never be garbled?


Is there a way to make a shortcut and have the hint bits written the first 
time the data is written to the table? One piece of obvious low-hanging 
fruit would be to enhance step five above, so that the bgwriter or 
checkpoint that writes the data to the database table checks the pg_clog 
and writes the correct hint bits. In fact, if the tuple's creating 
transaction has aborted, then the tuple can be vacuumed right there and 
then before it is even written. For OLTP, almost all the hint bits will be 
written first time, and also the set of transactions that will be looked 
up in the pg_clog will be small (the set of transactions that were active 
since the last checkpoint), so its cache coherency will be good.


However, this idea does not deal well with bulk data loads, where the data 
is checkpointed before transaction is committed or aborted.


Matthew

--
Now, you would have thought these coefficients would be integers, given that
we're working out integer results. Using a fraction would seem really
stupid. Well, I'm quite willing to be stupid here - in fact, I'm going to
use complex numbers. -- Computer Science Lecturer



Re: [PERFORM] which ext3 fs type should I use for postgresql

2008-05-15 Thread Matthew Wakeling

On Thu, 15 May 2008, [EMAIL PROTECTED] wrote:
IIRC postgres likes to do 1M/file, which isn't very large as far as the -T 
setting goes.


ITYF it's actually 1GB/file.

think twice about this. ext2/3 get slow when they fill up (they have 
fragmentation problems when free space gets too small); this 5% that only 
root can use also serves as a buffer against that.


It makes sense to me that the usage pattern of Postgres would be much less 
susceptible to causing fragmentation than normal filesystem usage. Has 
anyone actually tested this and found out?


Matthew

--
Isn't "Microsoft Works" something of a contradiction?



Re: [PERFORM] which ext3 fs type should I use for postgresql

2008-05-15 Thread david

On Thu, 15 May 2008, Matthew Wakeling wrote:


On Thu, 15 May 2008, Philippe Amelant wrote:

using mkfs.ext3 I can use "-T" to tune the filesytem

mkfs.ext3 -T fs_type ...

fs_type are in /etc/mke2fs.conf (on debian)


If you look at that file, you'd see that tuning really doesn't change that 
much. In fact, the only thing it does change (if you avoid "small" and 
"floppy") is the number of inodes available in the filesystem. Since Postgres 
tends to produce few large files, you don't need that many inodes, so the 
"largefile" option may be best. However, note that the number of inodes is a 
hard limit of the filesystem - if you try to create more files on the 
filesystem than there are available inodes, then you will get an out of space 
error even if the filesystem has space left.
The only real benefit of having not many inodes is that you waste a little 
less space, so many admins are pretty generous with this setting.


IIRC postgres likes to do 1M/file, which isn't very large as far as the -T 
setting goes.



Probably of more use are some of the other settings:

-m reserved-blocks-percentage - this reserves a portion of the filesystem
   that only root can write to. If root has no need for it, you can kill
   this by setting it to zero. The default is for 5% of the disc to be
   wasted.


think twice about this. ext2/3 get slow when they fill up (they have 
fragmentation problems when free space gets too small); this 5% that 
only root can use also serves as a buffer against that.



-j turns the filesystem into ext3 instead of ext2 - many people say that
   for Postgres you shouldn't do this, as ext2 is faster.


for the partition with the WAL on it you may as well do ext2 (the WAL is 
written synchronously and sequentially so the journal doesn't help you), 
but for the data partition you may benefit from the journal.


David Lang



Re: [PERFORM] which ext3 fs type should I use for postgresql

2008-05-15 Thread Matthew Wakeling

On Thu, 15 May 2008, Philippe Amelant wrote:

using mkfs.ext3 I can use "-T" to tune the filesytem

mkfs.ext3 -T fs_type ...

fs_type are in /etc/mke2fs.conf (on debian)


If you look at that file, you'd see that tuning really doesn't change that 
much. In fact, the only thing it does change (if you avoid "small" and 
"floppy") is the number of inodes available in the filesystem. Since 
Postgres tends to produce few large files, you don't need that many 
inodes, so the "largefile" option may be best. However, note that the 
number of inodes is a hard limit of the filesystem - if you try to create 
more files on the filesystem than there are available inodes, then you 
will get an out of space error even if the filesystem has space left.
The only real benefit of having not many inodes is that you waste a little 
less space, so many admins are pretty generous with this setting.
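To get a feel for the scale of the difference, here is a quick calculation of inode counts for a 500 GB filesystem. The inode_ratio values (bytes of filesystem per inode) are the ones commonly shipped in /etc/mke2fs.conf, but check your own copy, since distributions can differ:

```python
# Rough inode-count comparison for a 500 GB filesystem under different
# mke2fs -T usage types. inode_ratio values assumed from a typical
# /etc/mke2fs.conf; verify against your distribution's file.

FS_BYTES = 500 * 1024**3

inode_ratio = {
    "default":    16 * 1024,     # one inode per 16 KiB
    "largefile":  1024 * 1024,   # one inode per 1 MiB
    "largefile4": 4 * 1024**2,   # one inode per 4 MiB
}

for name, ratio in inode_ratio.items():
    print(name, FS_BYTES // ratio)
# default ~32.8M inodes, largefile 512K, largefile4 128K -- all of them
# comfortably more than PostgreSQL's relatively few, large files need.
```

Even the stingiest setting leaves six figures of inodes, so for a dedicated PostgreSQL partition the saving is real but the hard inode limit is rarely a practical worry.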


Probably of more use are some of the other settings:

 -m reserved-blocks-percentage - this reserves a portion of the filesystem
that only root can write to. If root has no need for it, you can kill
this by setting it to zero. The default is for 5% of the disc to be
wasted.
 -j turns the filesystem into ext3 instead of ext2 - many people say that
for Postgres you shouldn't do this, as ext2 is faster.

Matthew

--
The surest protection against temptation is cowardice.
 -- Mark Twain



[PERFORM] which ext3 fs type should I use for postgresql

2008-05-15 Thread Philippe Amelant
Hi all,
using mkfs.ext3 I can use "-T" to tune the filesytem

mkfs.ext3 -T fs_type ...

fs_type are in /etc/mke2fs.conf (on debian)

is there a recommended setting for this parameter ???

thanks



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Tom Lane
Greg Smith <[EMAIL PROTECTED]> writes:
> ... To put that another way:  if I've run a manual vacuum, is it true 
> that it will have updated all the hint bits to XMIN_COMMITTED for all the 
> tuples that were all done when the vacuum started?

Any examination whatsoever of a tuple --- whether by vacuum or any
ordinary DML operation --- will update its hint bits to match the
commit/abort status of the inserting/deleting transaction(s) as of
the instant of the examination.  Your statement above is true but
is weaker than reality.

regards, tom lane



Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Tom Lane
"Jan de Visser" <[EMAIL PROTECTED]> writes:
> Obviously, this issue is tied to the slow count(*) one, as I found out
> the hard way. Consider the following scenario:
> * Insert row
> * Update that row a couple of times
> * Rinse and repeat many times

> Now somewhere during that cycle, do a select count(*) just to see
> where you are. You will be appalled by how slow that is, due to not
> only the usual 'slow count(*)' reasons. This whole hint bit business
> makes it even worse, as demonstrated by the fact that running a vacuum
> before the count(*) makes the latter noticably faster.

Uh, well, you can't blame that entirely on hint-bit updates.  The vacuum
has simply *removed* two-thirds of the rows in the system, resulting in
a large drop in the number of rows that the select even has to look at.

It's certainly true that hint-bit updates cost something, but
quantifying how much isn't easy.  The off-the-cuff answer is to do the
select count(*) twice and see how much cheaper the second one is.  But
there are two big holes in that answer: the first is the possible cache
effects from having already read in the pages, and the second is that
the follow-up scan gets to avoid the visits to pg_clog that the first
scan had to make (which after all is the point of the hint bits).

I don't know any easy way to disambiguate the three effects that are at
work here.  But blaming it all on the costs of writing out hint-bit
updates is wrong.

regards, tom lane
