On Mon, 8 Aug 2005, Tom Lane wrote:
What that sounds like to me is a machine with inadequate disk I/O bandwidth.
Your earlier comment that checkpoint drives the machine into the ground
fits right into that theory, too. You said there is "almost no IO-wait"
but are you sure you are measuring that correctly?
Reducing checkpoint_timeout to 600 seconds had a positive effect. Previous
value was 1800 seconds.
We have a spare disk array from the old server and I'm planning to use it
as a tablespace for the comment table (the 100M+ rows one) as Ron
suggested.
Queries accumulate and when checkpointing is over, there can be
something like 400 queries running but over 50% of cpu is just idling.
400 queries? Are you launching 400 separate backends to do that?
Some sort of connection pooling seems like a good idea, if you don't
have it in place already. If the system's effective behavior in the
face of heavy load is to start even more concurrent backends, that
could easily drive things into the ground.
Ok, I implemented connection pooling using pgpool and it increased
performance a lot! We are now delivering about 1500 dynamic pages a second
without problems. Each of the eight single-cpu webservers are running a
pgpool instance with 20 connections.
However, those configuration changes didn't have significant effect to
oprofile results. AtEOXact_CatCache consumes even more cycles. This isn't
a problem right now but it may be in the future...
CPU: AMD64 processors, speed 2190.23 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask
of 0x00 (No unit mask) count 100000
samples % symbol name
1147870 21.1602 AtEOXact_CatCache
187466 3.4558 hash_seq_search
174357 3.2142 AllocSetAlloc
170896 3.1504 nocachegetattr
131724 2.4282 ExecMakeFunctionResultNoSets
125292 2.3097 SearchCatCache
117264 2.1617 StrategyDirtyBufferList
105741 1.9493 hash_search
98245 1.8111 FunctionCall2
97878 1.8043 yyparse
90932 1.6763 LWLockAcquire
83555 1.5403 LWLockRelease
81045 1.4940 _bt_compare
... and so on ...
----->8 Signigicant rows from current postgresql.conf 8<-----
max_connections = 768 # unnecessarily large with connection
pooling
shared_buffers = 15000
work_mem = 2048
maintenance_work_mem = 32768
max_fsm_pages = 1000000
max_fsm_relations = 5000
bgwriter_percent = 2
fsync = true
wal_buffers = 512
checkpoint_segments = 200 # less would probably be enuff with 600sec
timeout
checkpoint_timeout = 600
effective_cache_size = 500000
random_page_cost = 1.5
default_statistics_target = 150
stats_start_collector = true
stats_command_string = true
|\__/|
( oo ) Kari Lavikka - [EMAIL PROTECTED] - (050) 380 3808
__ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _
""
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match