Re: [PERFORM] too many clog files

2008-09-09 Thread Matt Smiley
Alvaro Herrera wrote:
> Move the old clog files back where they were, and run VACUUM FREEZE in
> all your databases.  That should clean up all the old pg_clog files, if
> you're really that desperate.

Has anyone actually seen a CLOG file get removed under 8.2 or 8.3?  How about 
8.1?

I'm probably missing something, but looking at src/backend/commands/vacuum.c 
(under 8.2.9 and 8.3.3), it seems like vac_truncate_clog() scans through *all* 
tuples of pg_database looking for the oldest datfrozenxid.  Won't that always 
be template0, which as far as I know can never be vacuumed (or otherwise 
connected to)?

postgres=# select datname, datfrozenxid, age(datfrozenxid), datallowconn from 
pg_database order by age(datfrozenxid), datname ;
  datname  | datfrozenxid |   age    | datallowconn
-----------+--------------+----------+--------------
 template1 |     36347792 |     3859 | t
 postgres  |     36347733 |     3918 | t
 mss_test  |     36347436 |     4215 | t
 template0 |          526 | 36351125 | f
(4 rows)

I looked at several of my 8.2 databases' pg_clog directories, and they all still
have every sequentially numbered segment, from the first one through the current
segment.  Would it be reasonable for vac_truncate_clog() to skip databases where
datallowconn is false (i.e. template0)?  Looking back at the 8.1.13 code, it does
exactly that:

    if (!dbform->datallowconn)
        continue;

Also, Duan, if you have lots of files under pg_clog, you may be burning through 
transactions faster than necessary.  Do your applications leave autocommit 
turned on?  And since no one else mentioned it, as a work-around for a small 
filesystem you can potentially shut down your database, move the pg_clog 
directory to a separate filesystem, and create a symlink to it under your 
PGDATA directory.  That's not a solution, just a mitigation.
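
For example, the move might look roughly like this (an untested sketch -- the 
paths are placeholders, and the database must be shut down first):

% pg_ctl -D $PGDATA stop
% mv $PGDATA/pg_clog /big_fs/pg_clog
% ln -s /big_fs/pg_clog $PGDATA/pg_clog
% pg_ctl -D $PGDATA start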





Re: [PERFORM] too many clog files

2008-09-09 Thread Matt Smiley
Hi Duan,

As others have said, you should probably attempt to run pg_dump to export your 
database.  If that doesn't work, consider restoring from backup.  If the dump 
does work, you can create a clean PGDATA directory (using initdb like when you 
setup your original installation), and create a fresh copy of your database 
using the dump file.  Then abandon your potentially damaged PGDATA directory.
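
In rough outline, that recovery path might look like the following (a sketch 
only -- adjust paths, ports, and encoding/locale options to your installation):

% pg_dumpall > /backup/alldb.sql
% pg_ctl -D $PGDATA stop
% initdb -D /path/to/new_pgdata
% pg_ctl -D /path/to/new_pgdata start
% psql -f /backup/alldb.sql postgres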

For future reference:

 - The "autovacuum" parameter in postgresql.conf is off by default under 
Postgres 8.1.  You should probably turn it on to ensure regular vacuuming, 
unless you have your own cronjob to do the vacuuming.
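
You can check the relevant settings from psql; as I recall, on 8.1 the 
autovacuum daemon also depends on the row-level statistics collector, so all 
three of these should read "on":

show stats_start_collector ;
show stats_row_level ;
show autovacuum ;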

 - When looking for old transactions, there are two places to check.  The usual 
place is pg_stat_activity.  The second place is "pg_prepared_xacts", where 
prepared transactions are listed.  If there's a prepared transaction in your 
system, it might explain why your old commit-logs aren't being purged.  The 
following query shows both prepared and normal transactions:

select
  l.transactionid,
  age(l.transactionid) as age,  /* measured in number of other transactions 
elapsed, not in terms of time */
  l.pid,
  case when l.pid is null then true else false end as is_prepared,
  a.backend_start,
  p.prepared as time_xact_was_prepared,
  p.gid as prepared_name
from
  pg_locks l
  left outer join pg_stat_activity a on l.pid = a.procpid
  left outer join pg_prepared_xacts p on l.transactionid = p.transaction
where
  l.locktype = 'transactionid'
  and l.mode = 'ExclusiveLock'
  and l.granted
order by age(l.transactionid) desc
;

 transactionid | age | pid  | is_prepared |         backend_start         |    time_xact_was_prepared     |      prepared_name
---------------+-----+------+-------------+-------------------------------+-------------------------------+--------------------------
        316645 |  44 |      | t           |                               | 2008-09-09 00:31:46.724178-07 | my_prepared_transaction1
        316689 |   0 | 6093 | f           | 2008-09-09 00:40:10.928287-07 |                               |
(2 rows)

Note that unless you run this query as a superuser (e.g. "postgres"), the 
columns from pg_stat_activity will only be visible for sessions that belong to 
you.  To rollback this example prepared transaction, you'd type:
  ROLLBACK PREPARED 'my_prepared_transaction1';

Hope this helps!
Matt





Re: [PERFORM] inaccurate stats on large tables

2008-09-08 Thread Matt Smiley
Hi Kiran,

You gave great info on your problem.

First, is this the query you're actually trying to speed up, or is it a 
simplified version?  It looks like the optimizer has already chosen the best 
execution plan for the given query.  Since the query has no joins, we only have 
to consider access paths.  You're fetching 58221/37909009 = 0.15% of the rows, 
so a sequential scan is clearly inappropriate.  A basic index scan is likely to 
incur extra scattered I/O, so a bitmap index scan is favored.

To improve on this query's runtime, you could try any of the following:

 - Reorganize the data to reduce this query's scattered I/O, i.e. cluster the 
table on "paliasorigin_search3_idx" rather than "paliasorigin_alias_casefold_idx" 
(see the sketch after this list).  Bear in mind this may adversely affect other 
queries.

 - Increase the cache hit frequency by ensuring the underlying filesystem cache 
has plenty of RAM (usually the case under Linux) and checking that other 
concurrent queries aren't polluting the cache.  Consider adding RAM if you 
think the working set of blocks required by most queries is larger than the 
combined Postgres and filesystem caches.  If processes other than the db do 
I/O on this machine, count them as resource consumers, too.

 - Restructure the table, partitioning along a column that would be useful for 
pruning whole partitions for your painful queries.  In this case, origin_id or 
tax_id seems like a good bet (see the sketch after this list), but again, 
consider other queries against this table.  38 million rows probably makes your 
table around 2 GB (guessing about 55 bytes/row).  Depending on the size and 
growth rate of the table, it may be time to consider partitioning.  Out of 
curiosity, what runtime are you typically seeing from this query?  The 
explain-analyze ran in 113 ms, which I'm guessing is the effect of caching, not 
the runtime you're trying to improve.

 - Rebuild the indexes on this table.  Under certain usage patterns, btree 
indexes can get horribly bloated.  Rebuilding the indexes returns them to their 
most compact and balanced form.  For example:
    reindex index "paliasorigin_search3_idx";
Apart from the locking and CPU usage during the rebuild, this has no negative 
consequences, so I'd try it before something drastic like partitioning.  First 
review the current size of the index for comparison:
    select pg_size_pretty(pg_relation_size('paliasorigin_search3_idx'));
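
To make the clustering and partitioning suggestions above concrete, here's a 
rough sketch.  Only the table and index names come from your query; the child 
table names, origin_id ranges, and index columns are made up for illustration:

-- Clustering rewrites the table in index order (pre-8.3 syntax shown;
-- on 8.3 it's "cluster paliasorigin using paliasorigin_search3_idx").
cluster paliasorigin_search3_idx on paliasorigin;
analyze paliasorigin;

-- Partitioning on origin_id with 8.x inheritance + constraint exclusion.
-- Existing rows would have to be moved into the children, and new inserts
-- routed to them with a rule or trigger.
set constraint_exclusion = on;   -- normally set in postgresql.conf

create table paliasorigin_origin_1_99
  ( check ( origin_id between 1 and 99 ) )
  inherits ( paliasorigin );

create table paliasorigin_origin_100_199
  ( check ( origin_id between 100 and 199 ) )
  inherits ( paliasorigin );

-- Each child also needs its own copies of the parent's indexes, e.g.:
create index paliasorigin_origin_1_99_search3_idx
  on paliasorigin_origin_1_99 ( origin_id );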

Since you asked specifically about improving the row-count estimate, like the 
previous responder said, you should consider increasing the statistics target.  
This will help if individual columns are being underestimated, but not if the 
overestimate is due to joint variation.  In other words, the optimizer has no 
way to tell whether there is a logical relationship between columns A and B 
such that certain values of B only occur with certain values of A.  Just 
judging from the names, it sounds like origin_id and tax_id might have a 
parent-child relationship, so I thought it was worth mentioning.

Do the columns individually have good estimates?
explain analyze select * from paliasorigin where origin_id=20;
explain analyze select * from paliasorigin where tax_id=9606;

If not, increase the statistics on that column, reanalyze the table, and 
recheck the selectivity estimate:
alter table paliasorigin alter column origin_id set statistics 20;
analyze paliasorigin;
explain analyze select * from paliasorigin where origin_id=20;

Good luck!
Matt





Re: [PERFORM] limit clause breaks query planner?

2008-09-04 Thread Matt Smiley
"Tom Lane" <[EMAIL PROTECTED]> writes:
> I'm not sure offhand whether the existing correlation stats would be of use
> for it, or whether we'd have to get ANALYZE to gather additional data.

Please forgive the tangent, but would it be practical to add support for 
gathering statistics on an arbitrary expression associated with a table, rather 
than just on materialized columns?  For example:
analyze my_tab for expression 'my_func(my_tab.col)' ;
It seems like any time you'd consider using a functional index, this feature 
would let the planner calculate decent selectivity estimates for the 
expression's otherwise opaque data distribution.  The expression might be 
treated as a virtual column on the table; not sure if that helps or hurts.  
Should I post this question on pgsql-hackers?
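
For what it's worth, the closest thing I know of today is to create the 
functional index itself: ANALYZE then gathers a distribution for the indexed 
expression, and as far as I know the planner will use those statistics when 
estimating selectivity for that expression.  A quick sketch, reusing the 
hypothetical names from the example above:

create index my_tab_func_idx on my_tab ( my_func(col) );
analyze my_tab;
explain select * from my_tab where my_func(col) = 42;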





Re: [PERFORM] limit clause breaks query planner?

2008-09-03 Thread Matt Smiley
"Tom Lane" <[EMAIL PROTECTED]> writes:
>  "Matt Smiley" <[EMAIL PROTECTED]> writes:
>  >  So an Index Scan is always going to have a higher cost estimate than
>  >  an equivalent Seq Scan returning the same result rows (unless
>  >  random_page_cost is < 1).  That's why I think the planner is always
>  >  preferring the plan that uses a Seq Scan.
>  
>  If that were the case, we'd never choose an indexscan at all...

You're right, that was a silly guess.

>  It's true that a plain indexscan is not preferred for queries that will
>  return a large fraction of the table.  However, it should be willing to
>  use a bitmap scan for this query, given default cost settings (the
>  default cost settings will cause it to prefer bitmap scan for retrieving
>  up to about a third of the table, in my experience).  I too am confused
>  about why it doesn't prefer that choice in the OP's example.

It looks like the bitmap scan has a higher cost estimate because the entire 
bitmap must be built from the index before the heap scan can begin returning 
rows up the pipeline.  The row-count limit can't be pushed down below the 
bitmap heap scan the way it can be for a plain index scan.

test_8_3_3=# set enable_seqscan = false ;
SET

test_8_3_3=# set enable_indexscan = false ;
SET

test_8_3_3=# explain analyze select * from my_table where a is null and b = 3 limit 15 ;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=17070.22..17071.02 rows=15 width=8) (actual time=606.902..607.086 rows=15 loops=1)
   ->  Bitmap Heap Scan on my_table  (cost=17070.22..69478.96 rows=988217 width=8) (actual time=606.892..606.983 rows=15 loops=1)
         Recheck Cond: (b = 3)
         Filter: (a IS NULL)
         ->  Bitmap Index Scan on idx_b  (cost=0.00..16823.17 rows=109 width=0) (actual time=592.657..592.657 rows=100 loops=1)
               Index Cond: (b = 3)
 Total runtime: 607.340 ms
(7 rows)


>  It would be interesting to alter the random_page_cost setting and see if he
>  gets different results.

Using an unmodified postgresql.conf, the cost estimate for an index-scan was so 
much higher than for a seqscan that random_page_cost had to be set below 0.2 
before the index-scan was preferred.  However, it looks like this was mainly 
because effective_cache_size was too small.  The planner thought the cache was 
only 128 MB, while the complete table + index is (39492 + 21946) pages * 8 
KB/page = about 480 MB.  It makes sense for the cost estimate to be so much 
higher if blocks are expected to be repeatedly re-fetched from disk.  I wonder 
if David's effective_cache_size is too small.
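
For reference, a quick way to compare the two (using the names from this test 
case) is:

show effective_cache_size ;
select pg_size_pretty(pg_relation_size('my_table') + pg_relation_size('idx_b'));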

test_8_3_3=# reset all ;
RESET

test_8_3_3=# explain analyze select * from my_table where a is null and b = 3 limit 15 ;
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..2.50 rows=15 width=8) (actual time=0.036..0.239 rows=15 loops=1)
   ->  Seq Scan on my_table  (cost=0.00..164492.74 rows=988217 width=8) (actual time=0.028..0.138 rows=15 loops=1)
         Filter: ((a IS NULL) AND (b = 3))
 Total runtime: 0.338 ms
(4 rows)

test_8_3_3=# set enable_seqscan = false ;
SET

test_8_3_3=# show random_page_cost ;
 random_page_cost
--
 4
(1 row)

test_8_3_3=# explain analyze select * from my_table where a is null and b = 3 limit 15 ;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..45.99 rows=15 width=8) (actual time=0.051..0.200 rows=15 loops=1)
   ->  Index Scan using idx_b on my_table  (cost=0.00..3029924.36 rows=988217 width=8) (actual time=0.043..0.100 rows=15 loops=1)
         Index Cond: (b = 3)
         Filter: (a IS NULL)
 Total runtime: 0.308 ms
(5 rows)

test_8_3_3=# set random_page_cost = 0.19 ;
SET
test_8_3_3=# explain analyze select * from my_table where a is null and b = 3 limit 15 ;
                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..2.45 rows=15 width=8) (actual time=0.050..0.201 rows=15 loops=1)
   ->  Index Scan using idx_b on my_table  (cost=0.00..161190.65 rows=988217 width=8) (actual time=0.042..0.097 rows=15 loops=1)
         Index Cond: (b = 3)
         Filter: (a IS NULL)
 Total runtime: 0.307 ms
(5 rows)


Now fix effective_cache_size and try again.

test_8_3_3=# reset all ;
RESET

test_8_3_3=# set effect

Re: [PERFORM] limit clause breaks query planner?

2008-09-02 Thread Matt Smiley
Hi David,

Early in this thread, Pavel suggested:

> you should partial index
> 
> create index foo(b) on mytable where a is null;

Rather, you might try the opposite partial index (where a is NOT null) as a 
replacement for the original unqualified index on column A.  This new index 
will be ignored by the query you're trying to tune, but it'll be available to 
the other queries that filter to a non-null value of column A.  (Omitting NULL 
from that index should be ok because you normally wouldn't want to use an index 
when 95% of the table's rows match the filtered key.)

Then you can temporarily disable Seq Scans in your session for just this one 
query, as follows:

SQL> create table my_table ( a int, b int ) ;
CREATE TABLE

SQL> create index idx_a_not_null on my_table ( a ) where a is not null ;
CREATE INDEX

SQL> create index idx_b on my_table ( b ) ;
CREATE INDEX

SQL> insert into my_table (a, b)
select
  case when random() <= 0.95 then null else i end as a,
  mod(i, 10) as b
from generate_series(1, 10000000) s(i)
;
INSERT 0 10000000

SQL> analyze my_table ;
ANALYZE


Review the statistics available to the optimizer:

SQL> select attname, null_frac, n_distinct, most_common_vals, 
most_common_freqs, histogram_bounds, correlation
from pg_stats
where tablename = 'my_table'
order by attname
;
 attname | null_frac | n_distinct |   most_common_vals    |                                most_common_freqs                                 |                                    histogram_bounds                                     | correlation
---------+-----------+------------+-----------------------+-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------+-------------
 a       |     0.945 |         -1 |                       |                                                                                     | {2771,1301755,2096051,3059786,3680728,4653531,5882434,6737141,8240245,9428702,9875768}    |           1
 b       |         0 |         10 | {9,4,3,1,2,6,8,5,7,0} | {0.110333,0.104,0.102333,0.100333,0.100333,0.0996667,0.0986667,0.098,0.096,0.09}   |                                                                                            |    0.127294
(2 rows)

SQL> select relname, reltuples, relpages from pg_class where relname in 
('my_table', 'idx_a_not_null', 'idx_b') order by relname ;
    relname     | reltuples | relpages
----------------+-----------+----------
 idx_a_not_null |    499955 |     1100
 idx_b          |     1e+07 |    21946
 my_table       |     1e+07 |    39492
(3 rows)


Run the test query, first without disabling Seq Scan to show this example 
reproduces the plan you're trying to avoid.

SQL> explain analyze select * from my_table where a is null and b = 5 limit 15 ;
                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..2.66 rows=15 width=8) (actual time=0.070..0.263 rows=15 loops=1)
   ->  Seq Scan on my_table  (cost=0.00..164492.00 rows=929250 width=8) (actual time=0.061..0.159 rows=15 loops=1)
         Filter: ((a IS NULL) AND (b = 5))
 Total runtime: 0.371 ms
(4 rows)


Now run the same query without the Seq Scan option.

SQL> set enable_seqscan = false ;
SET

SQL> explain analyze select * from my_table where a is null and b = 5 limit 15 ;
                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..46.33 rows=15 width=8) (actual time=0.081..0.232 rows=15 loops=1)
   ->  Index Scan using idx_b on my_table  (cost=0.00..2869913.63 rows=929250 width=8) (actual time=0.072..0.130 rows=15 loops=1)
         Index Cond: (b = 5)
         Filter: (a IS NULL)
 Total runtime: 0.341 ms
(5 rows)

SQL> reset enable_seqscan ;
RESET


Yes, it's unsavory to temporarily adjust a session-level parameter to tune a 
single query, but I don't know of a less intrusive way to avoid the SeqScan.  
Here's why I think it might be your simplest option:

As far as I can tell, the plan nodes for accessing the table/index are unaware 
of the LIMIT.  The cost of the Limit node is estimated as the cost of its input 
row-source multiplied by the ratio of requested/returned rows.  For example, 
from the preceding plan output:
    2869913.63 ("Index Scan" upper cost) * (15 row limit / 929250 expected rows) = 46.326 ("Limit" node upper cost)
The underlying plan nodes each assume that all the rows matching their filter 
predicates will be returned up the pipeline; the cost estimate is only reduced 
at the Limit node.  A Seq Scan and an Index Scan (over a complete index) will 
both expect the same number of input rows (pg_class.reltuples).  They also 
produce the same estimate

Re: [PERFORM] Sunfire X4500 recommendations

2007-03-29 Thread Matt Smiley
Hi David,

Thanks for your feedback!  I'm rather a newbie at this, and I do appreciate the 
critique.

First, let me correct myself: The formulas for the risk of losing data when 
you lose 2 or 3 disks shouldn't have included the first term (g/n).  I'll 
give the corrected formulas and tables at the end of the email.


> please explain why you are saying that the risk of losing any 1 disk is 
> 1/n. shouldn't it be probability of failure * n instead?

1/n represents the assumption that all disks have an equal probability of being 
the next one to fail.  This seems like a fair assumption in general for the 
active members of a stripe (not including hot spares).  A possible exception 
would be the parity disks (because reads always skip them and writes always hit 
them), but that's only a consideration if the RAID configuration used dedicated 
disks for parity instead of distributing it across the RAID 5/6 group members.  
Apart from that, whether the workload is write-heavy or read-heavy, sequential 
or scattered, the disks in the stripe ought to handle a roughly equivalent 
number of iops over their lifetime.


> following this logic the risk of losing all 48 disks in a single group of 
> 48 would be 100%

Exactly.  Putting all disks in one group is RAID 0 -- no data protection.  If 
you lose even 1 active member of the stripe, the probability of losing your 
data is 100%.


> also what you are looking for is the probability of the second (and third) 
> disks failing in time X (where X is the time necessary to notice the 
> failure, get a replacement, and rebuild the disk)

Yep, that's exactly what I'm looking for.  That's why I said, "these 
probabilities are only describing the case where we don't have enough time 
between disk failures to recover the array."  My goal wasn't to estimate how 
long time X is.  (It doesn't seem like a generalizable quantity; due partly to 
logistical and human factors, it's unique to each operating environment.)  
Instead, I start with the assumption that time X has been exceeded, and we've 
lost a 2nd (or 3rd) disk in the array.  Given that assumption, I wanted to show 
the probability that the loss of the 2nd disk has caused the stripe to become 
unrecoverable.

We know that RAID 10 and 50 can tolerate the loss of anywhere between 1 and n/g 
disks, depending on how lucky you are.  I wanted to quantify the amount of luck 
required, as a risk management tool.  The duration of time X can be minimized 
with hot spares and attentive administrators, but the risk after exceeding time 
X can only be minimized (as far as I know) by configuring the RAID stripe with 
small enough underlying failure groups.


> the killer is the time needed to rebuild the disk, with multi-TB arrays 
> it's sometimes faster to re-initialize the array and reload from backup 
> than it is to do a live rebuild (the kernel.org servers had a raid failure 
> recently and HPA mentioned that it took a week to rebuild the array, but 
> it would have only taken a couple days to do a restore from backup)

That's very interesting.  I guess the rebuild time also would depend on how 
large the damaged failure group was.  Under RAID 10, for example, I think you'd 
still only have to rebuild 1 disk from its mirror, regardless of how many other 
disks were in the stripe, right?  So shortening the rebuild time may be another 
good motivation to keep the failure groups small.


> add to this the fact that disk failures do not appear to be truly 
> independent from each other statistically (see the recent studies released 
> by google and cmu), and I wouldn't bother with single-parity for a 

I don't think I've seen the studies you mentioned.  Would you cite them please? 
 This may not be typical of everyone's experience, but what I've seen during 
in-house load tests is an equal I/O rate for each disk in my stripe, using 
short-duration sampling intervals to avoid long-term averaging effects.  This 
is what I expected to find, so I didn't delve deeper.

Certainly it's true that some disks may be more heavily burdened than others 
for hours or days, but I wouldn't expect any bias from an application-driven 
access pattern to persist for a significant fraction of a disk's lifespan.  The 
only influence I'd expect to bias the cumulative I/O handled by a disk over its 
entire life would be its role in the RAID configuration.  Hot spares will have 
minimal wear-and-tear until they're activated.  Dedicated parity disks will 
probably live longer than data disks, unless the workload is very heavily 
oriented towards small writes (e.g. logging).


> multi-TB array. If the data is easy to recreate (including from backup) or 
> short lived (say a database of log data that cycles every month or so) I 
> would just do RAID-0 and plan on losing the data on drive failure (this 
> assumes that you can afford the loss of service when this happens). if the 
> data is more important then I'd do dual-parity or more, along with a ho

Re: [PERFORM] Sunfire X4500 recommendations

2007-03-27 Thread Matt Smiley
Hi Dimitri,

First of all, thanks again for the great feedback!

Yes, my I/O load is mostly read operations.  There are some bulk writes done in 
the background periodically throughout the day, but these are not as 
time-sensitive.  I'll have to do some testing to find the best balance of read 
vs. write speed and tolerance of disk failure vs. usable diskspace.

I'm looking forward to seeing the results of your OLTP tests!  Good luck!  
Since I won't be doing that myself, it'll be all new to me.

About disk failure, I certainly agree that increasing the number of disks will 
decrease the average time between disk failures.  Apart from any performance 
considerations, I wanted to get a clear idea of the risk of data loss under 
various RAID configurations.  It's a handy reference, so I thought I'd share it:



The goal is to calculate the probability of data loss when we lose a certain 
number of disks within a short timespan (e.g. losing a 2nd disk before 
replacing+rebuilding the 1st one).  For RAID 10, 50, and Z, we will lose data 
if any disk group (i.e. mirror or parity-group) loses 2 disks.  For RAID 60 
and Z2, we will lose data if 3 disks die in the same parity group.  The parity 
groups can include arbitrarily many disks.  Having larger groups gives us more 
usable diskspace but less protection.  (Naturally we're more likely to lose 2 
disks in a group of 50 than in a group of 5.)

g = number of disks in each group (e.g. mirroring = 2; single-parity = 3 or 
more; dual-parity = 4 or more)
n = total number of disks
risk of losing any 1 disk = 1/n
risk of losing 1 disk from a particular group = g/n
risk of losing 2 disks in the same group = g/n * (g-1)/(n-1)
risk of losing 3 disks in the same group = g/n * (g-1)/(n-1) * (g-2)/(n-2)
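
For example, plugging the mirrored-pair case (g = 2, n = 48) into the two-disk 
formula gives 2/48 * 1/47 = 0.09%, which is the first row of the RAID 10 table 
below.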

For the x4500, we have 48 disks.  If we stripe our data across all those disks, 
then these are our configuration options:

RAID 10 or 50 -- Mirroring or single-parity must lose 2 disks from the same 
group to lose data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
              2          24           48            24              0.09%
              3          16           48            32              0.27%
              4          12           48            36              0.53%
              6           8           48            40              1.33%
              8           6           48            42              2.48%
             12           4           48            44              5.85%
             24           2           48            46             24.47%
             48           1           48            47            100.00%

RAID 60 or Z2 -- Double-parity must lose 3 disks from the same group to lose 
data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
              2          24           48           n/a                n/a
              3          16           48            16              0.01%
              4          12           48            24              0.02%
              6           8           48            32              0.12%
              8           6           48            36              0.32%
             12           4           48            40              1.27%
             24           2           48            44             11.70%
             48           1           48            46            100.00%

So, in terms of fault tolerance:
 - RAID 60 and Z2 always beat RAID 10, since they never risk data loss when 
only 2 disks fail.
 - RAID 10 always beats RAID 50 and Z, since it has the largest number of disk 
groups across which to spread the risk.
 - Having more parity groups increases fault tolerance but decreases usable 
diskspace.

That's all assuming each disk has an equal chance of failure, which is probably 
true since striping should distribute the workload evenly.  And again, these 
probabilities are only describing the case where we don't have enough time 
between disk failures to recover the array.

In terms of performance, I think RAID 10 should always be best for write speed. 
 (Since it doesn't calculate parity, writing a new block doesn't require 
reading the rest of the RAID stripe just to recalculate the parity bits.)  I 
think it's also normally just as fast for reading, since the controller can 
load-balance the pending read requests to both sides of each mirror.







Re: [PERFORM] Sunfire X4500 recommendations

2007-03-23 Thread Matt Smiley
Thanks Dimitri!  That was very educational material!  I'm going to think out 
loud here, so please correct me if you see any errors.

The section on tuning for OLTP transactions was interesting, although my OLAP 
workload will be predominantly bulk I/O over large datasets of 
mostly-sequential blocks.

The NFS+ZFS section talked about the zil_disable control for making zfs ignore 
commits/fsyncs.  Given that Postgres' executor does single-threaded synchronous 
I/O like the tar example, it seems like it might benefit significantly from 
setting zil_disable=1, at least in the case of frequently flushed/committed 
writes.  However, zil_disable=1 sounds unsafe for the datafiles' filesystem, 
and would probably only be acceptable for the xlogs if they're stored on a 
separate filesystem and you're willing to lose recently committed 
transactions.  This sounds pretty similar to just setting fsync=off in 
postgresql.conf, which is easier to change later, so I'll skip the zil_disable 
control.

The RAID-Z section was a little surprising.  It made RAID-Z sound just like 
RAID 50, in that you can customize the trade-off between iops versus usable 
diskspace and fault-tolerance by adjusting the number/size of parity-protected 
disk groups.  The only difference I noticed was that RAID-Z will apparently set 
the stripe size across vdevs (RAID-5s) to be as close as possible to the 
filesystem's block size, to maximize the number of disks involved in 
concurrently fetching each block.  Does that sound about right?

So now I'm wondering what RAID-Z offers that RAID-50 doesn't.  I came up with 2 
things: an alleged affinity for full-stripe writes and (under RAID-Z2) the 
added fault-tolerance of RAID-6's 2nd parity bit (allowing 2 disks to fail per 
zpool).  It wasn't mentioned in this blog, but I've heard that under certain 
circumstances, RAID-Z will magically decide to mirror a block instead of 
calculating parity on it.  I'm not sure how this would happen, and I don't know 
the circumstances that would trigger this behavior, but I think the goal (if it 
really happens) is to avoid the performance penalty of having to read the rest 
of the stripe required to calculate parity.  As far as I know, this is only an 
issue affecting small writes (e.g. single-row updates in an OLTP workload), but 
not large writes (compared to the RAID's stripe size).  Anyway, when I saw the 
filesystem's intent log mentioned, I thought maybe the small writes are 
converted to full-stripe writes by deferring their commit until a full stripe's 
worth of data had been accumulated.  Does that sound plausible?

Are there any other noteworthy perks to RAID-Z, rather than RAID-50?  If not, 
I'm inclined to go with your suggestion, Dimitri, and use zfs like RAID-10 to 
stripe a zpool over a bunch of RAID-1 vdevs.  Even though many of our queries 
do mostly sequential I/O, getting higher seeks/second is more important to us 
than the sacrificed diskspace.

For the record, those blogs also included a link to a very helpful ZFS Best 
Practices Guide:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

To sum up, so far the short list of tuning suggestions for ZFS includes:
 - Use a separate zpool and filesystem for xlogs if your apps write often.
 - Consider setting zil_disable=1 on the xlogs' dedicated filesystem.  ZIL is 
the intent log, and it sounds like disabling it may be like disabling 
journaling.  Previous message threads in the Postgres archives debate whether 
this is safe for the xlogs, but it didn't seem like a conclusive answer was 
reached.
 - Make the filesystem block size (zfs recordsize) match the Postgres block 
size (see the example after this list).
 - Manually adjust vdev_cache.  I think this sets the read-ahead size.  It 
defaults to 64 KB.  For OLTP workload, reduce it; for DW/OLAP maybe increase it.
 - Test various settings for vq_max_pending (until zfs can auto-tune it).  See 
http://blogs.sun.com/erickustarz/entry/vq_max_pending
 - A zpool of mirrored disks should support more seeks/second than RAID-Z, just 
like RAID 10 vs. RAID 50.  However, no single Postgres backend will see better 
than a single disk's seek rate, because the executor currently dispatches only 
1 logical I/O request at a time.
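
For example, with Postgres compiled for 32 KB blocks, the recordsize change 
would look something like this (the filesystem name is just a placeholder):

% zfs set recordsize=32K zpool1/pgdata
% zfs get recordsize zpool1/pgdata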


>>> Dimitri <[EMAIL PROTECTED]> 03/23/07 2:28 AM >>>
On Friday 23 March 2007 03:20, Matt Smiley wrote:
> My company is purchasing a Sunfire x4500 to run our most I/O-bound
> databases, and I'd like to get some advice on configuration and tuning. 
> We're currently looking at: - Solaris 10 + zfs + RAID Z
>  - CentOS 4 + xfs + RAID 10
>  - CentOS 4 + ext3 + RAID 10
> but we're open to other suggestions.
>

Matt,

for Solaris + ZFS you may find answers to all your questions here:

  http://blogs.sun.com/roch/category/ZFS
  http://blogs.sun.com/realneel/entry/zfs_and_databases

Think to

[PERFORM] Sunfire X4500 recommendations

2007-03-22 Thread Matt Smiley
My company is purchasing a Sunfire x4500 to run our most I/O-bound databases, 
and I'd like to get some advice on configuration and tuning.  We're currently 
looking at:
 - Solaris 10 + zfs + RAID Z
 - CentOS 4 + xfs + RAID 10
 - CentOS 4 + ext3 + RAID 10
but we're open to other suggestions.

From previous message threads, it looks like some of you have achieved stellar 
performance under both Solaris 10 U2/U3 with zfs and CentOS 4.4 with xfs.  
Would those of you who posted such results please describe how you tuned the 
OS/fs to yield those figures (e.g. patches, special drivers, read-ahead, 
checksumming, write-through cache settings, etc.)?

Most of our servers currently run CentOS/RedHat, and we have little experience 
with Solaris, but we're not opposed to Solaris if there's a compelling reason 
to switch.  For example, it sounds like zfs snapshots may have a lighter 
performance penalty than LVM snapshots.  We've heard that just using LVM (even 
without active snapshots) imposes a maximum sequential I/O rate of around 600 
MB/s (although we haven't yet reached this limit experimentally).

By the way, we've also heard that Solaris is "more stable" under heavy I/O load 
than Linux.  Have any of you experienced this?  It's hard to put much stock in 
such a blanket statement, but naturally we don't want to introduce 
instabilities.

Thanks in advance for your thoughts!

For reference:

Our database cluster will be 3-6 TB in size.  The Postgres installation will be 
8.1 (at least initially), compiled to use 32 KB blocks (rather than 8 KB).  The 
workload will be predominantly OLAP.  The Sunfire X4500 has 2 dual-core 
Opterons, 16 GB RAM, 48 SATA disks (500 GB/disk * 48 = 24 TB raw -> 12 TB 
usable under RAID 10).

So far, we've seen the X4500 deliver impressive but suboptimal results using 
the out-of-the-box installation of Solaris + zfs.  The Linux testing is in the 
early stages (no xfs, yet), but so far it yields comparatively modest write 
rates and very poor read and rewrite rates.

===
Results under Solaris with zfs:
===

Four concurrent writers:
% time dd if=/dev/zero of=/zpool1/test/50GB-zero1 bs=1024k count=51200 ; time 
sync
% time dd if=/dev/zero of=/zpool1/test/50GB-zero2 bs=1024k count=51200 ; time 
sync
% time dd if=/dev/zero of=/zpool1/test/50GB-zero3 bs=1024k count=51200 ; time 
sync
% time dd if=/dev/zero of=/zpool1/test/50GB-zero4 bs=1024k count=51200 ; time 
sync

Seq Write (bs = 1 MB):  128 + 122 + 131 + 124 = 505 MB/s

Four concurrent readers:
% time dd if=/zpool1/test/50GB-zero1 of=/dev/null bs=1024k
% time dd if=/zpool1/test/50GB-zero2 of=/dev/null bs=1024k
% time dd if=/zpool1/test/50GB-zero3 of=/dev/null bs=1024k
% time dd if=/zpool1/test/50GB-zero4 of=/dev/null bs=1024k

Seq Read (bs = 1 MB):   181 + 177 + 180 + 178 = 716 MB/s


One bonnie++ process:
% bonnie++ -r 16384 -s 32g:32k -f -n0 -d /zpool1/test/bonnie_scratch

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine   Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
thumper1    32G:32k           604173  98 268893  43           543389  59 519.2   3
thumper1,32G:32k,,,604173,98,268893,43,,,543389,59,519.2,3,


4 concurrent synchronized bonnie++ processes:
% bonnie++ -p4
% bonnie++ -r 16384 -s 32g:32k -y -f -n0 -d /zpool1/test/bonnie_scratch
% bonnie++ -r 16384 -s 32g:32k -y -f -n0 -d /zpool1/test/bonnie_scratch
% bonnie++ -r 16384 -s 32g:32k -y -f -n0 -d /zpool1/test/bonnie_scratch
% bonnie++ -r 16384 -s 32g:32k -y -f -n0 -d /zpool1/test/bonnie_scratch
% bonnie++ -p-1

Combined results of 4 sessions:
Seq Output:   124 + 124 + 124 + 140 =  512 MB/s
Rewrite:       93 +  94 +  93 +  96 =  376 MB/s
Seq Input:    192 + 194 + 193 + 197 =  776 MB/s
Random Seek:  327 + 327 + 335 + 332 = 1321 seeks/s


=
Results under CentOS 4 with ext3 and LVM:
=

% bonnie++ -s 32g:32k -f -n0 -d /large_lvm_stripe/test/bonnie_scratch
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine   Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
thumper1.rtkinternal 32G:32k          346595  94  59448  11           132471  12 479.4   2
thumper1.rtkinternal,32G:32k,,,346595,94,59448,11,,,132471,12,479.4,2,



Summary of bonnie++ results:


                            sequential   sequential     sequential   scattered
Test case                   write MB/s   rewrite MB/s   read MB/s    seeks/s
--------------------------  -----------  -------------  -----------  ----------
Sol10+zfs, 1 process        604          269            543          519
Sol10+zfs, 4 processes      512          376            776          1321