Re: [PERFORM] SAN performance mystery

2006-06-19 Thread Tim Allen

Jeff Trout wrote:

On Jun 16, 2006, at 5:11 AM, Tim Allen wrote:
One curious thing is that some postgres backends seem to spend an  
inordinate amount of time in uninterruptible iowait state. I found  a 
posting to this list from December 2004 from someone who reported  
that very same thing. For example, bringing down postgres on the  
customer box requires kill -9, because there are invariably one or  
two processes so deeply uninterruptible as to not respond to a  
politer signal. That indicates something not quite right, doesn't it?


Sounds like there could be a driver/array/kernel bug there that is  
kicking the performance down the tube.

If it were PG's fault it wouldn't be stuck uninterruptible.


That's what I thought. I've advised the customer to upgrade their kernel 
drivers, and to preferably upgrade their kernel as well. We'll see if 
they accept the advice :-|.


Tim

--
---
Tim Allen  [EMAIL PROTECTED]
Proximity Pty Ltd  http://www.proximity.com.au/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] SAN performance mystery

2006-06-19 Thread Tim Allen

Scott Marlowe wrote:

On Thu, 2006-06-15 at 16:50, Tim Allen wrote:

We have a customer who are having performance problems. They have a 
large (36G+) postgres 8.1.3 database installed on an 8-way opteron with 
8G RAM, attached to an EMC SAN via fibre-channel (I don't have details 
of the EMC SAN model, or the type of fibre-channel card at the moment). 
They're running RedHat ES3 (which means a 2.4.something Linux kernel).


They are unhappy about their query performance. We've been doing various 
things to try to work out what we can do. One thing that has been 
apparent is that autovacuum has not been able to keep the database 
sufficiently tamed. A pg_dump/pg_restore cycle reduced the total 
database size from 81G to 36G. Performing the restore took about 23 hours.


Do you have the ability to do any simple IO performance testing, like
with bonnie++ (the old bonnie is not really capable of properly testing
modern equipment, but bonnie++ will give you some idea of the throughput
of the SAN)? Or even just timing a dd write to the SAN?


I've done some timed dd's. The timing results vary quite a bit, but it 
seems you can write to the SAN at about 20MB/s and read from it at about 
12MB/s. Not an entirely scientific test, as I wasn't able to stop 
other activity on the machine, though I don't think much else was 
happening. Certainly not impressive figures compared with our machine 
with the SATA disk (referred to below), which can get 161MB/s copying 
files on the same disk, 48MB/s copying files from the SATA disk to a 
RAID5 array, and 138MB/s copying them back from the array.
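(For the record, the timing was done with plain dd along these lines; the 
mount point and sizes here are illustrative rather than the exact commands 
I used, and the file is deliberately made larger than the 8G of RAM so the 
page cache can't flatter the result:

  # write test - time includes a final sync so buffered data is counted
  time sh -c 'dd if=/dev/zero of=/mnt/san/ddtest bs=8k count=2000000 && sync'

  # drop cached pages, then time reading the file back
  umount /mnt/san && mount /mnt/san
  time dd if=/mnt/san/ddtest of=/dev/null bs=8k

Dividing the roughly 16GB written and read by the elapsed times gives the 
MB/s figures quoted above.)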


The customer is a large organisation, with a large IT department who 
guard their turf carefully, so there is no way I could get away with 
installing any heavier duty testing tools like bonnie++ on their machine.


We tried restoring the pg_dump output to one of our machines, a 
dual-core Pentium D with a single SATA disk, no RAID, I forget how much 
RAM but definitely much less than 8G. The restore took five hours. So it 
would seem that our machine, which on paper should be far less 
impressive than the customer's box, delivers more than four times the I/O 
performance.


To simplify greatly - a single local SATA disk beats the EMC SAN by a 
factor of four.


Is that expected performance, anyone? It doesn't sound right to me. Does 
anyone have any clues about what might be going on? Buggy kernel 
drivers? Buggy kernel, come to think of it? Does a SAN just not provide 
adequate performance for a large database?



Yes, this is not uncommon.  It is very likely that your SATA disk is
lying about fsync.


I guess a sustained write will flood the disk's cache and negate the 
effect of the write-completion dishonesty. But I have no idea how large 
a copy would have to be to do that - can anyone suggest a figure? 
Certainly, the read performance of the SATA disk still beats the SAN, 
and there is no way to lie about read performance.
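(As a rough worked figure, and an assumption about typical hardware rather 
than a measurement of this particular disk: a consumer SATA drive of that 
vintage usually carries something like 8-16MB of on-board write cache, so 
any sustained copy of a few hundred megabytes, let alone a multi-gigabyte 
restore, should swamp the cache and expose the true sustained write rate. 
It's mainly small synchronous writes, such as fsync'd transaction commits, 
that benefit from the dishonesty.)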



What kind of backup are you using?  insert statements or copy
statements?  If insert statements, then the difference is quite
believable.  If copy statements, less so.


A binary pg_dump, which amounts to copy statements, if I'm not mistaken.
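(Concretely, the dump and restore were along these lines; the exact flags 
and database name are from memory, so treat this as an approximation:

  pg_dump -Fc -f customer.dump customerdb
  pg_restore -d customerdb customer.dump

The custom -Fc format restores table data via COPY, just as a plain-text 
dump would.)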


Next time, on their big server, see if you can try a restore with fsync
turned off and see if that makes the restore faster.  Note you should
turn fsync back on after the restore, as running without it is quite
dangerous should you suffer a power outage.

How are you mounting to the EMC SAN?  NFS, iSCSI? Other?


iSCSI, I believe. Some variant of SCSI, anyway, of that I'm certain.
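As for the fsync suggestion: next time we get a restore window on their 
box, a minimal sketch of what we'd try (the data directory path is a guess 
for a stock RHEL install, and the database name is a placeholder) is

  # in $PGDATA/postgresql.conf, before the restore:
  fsync = off

  pg_ctl restart -D /var/lib/pgsql/data
  pg_restore -d customerdb customer.dump
  # then set fsync = on again and restart before production traffic resumes

As you say, leaving fsync off any longer than that would be asking for 
trouble after a power failure.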

The conclusion I'm drawing here is that this SAN does not perform at all 
well, and is not a good database platform. Judging from other people's 
replies, this might be a general property of SANs, or at least of the 
ones that are not stratospherically priced.


Tim

--
---
Tim Allen  [EMAIL PROTECTED]
Proximity Pty Ltd  http://www.proximity.com.au/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] SAN performance mystery

2006-06-19 Thread Tim Allen

John Vincent wrote:

[snipped]
Is that expected performance, anyone? It doesn't sound right to me. Does
anyone have any clues about what might be going on? Buggy kernel
drivers? Buggy kernel, come to think of it? Does a SAN just not provide
adequate performance for a large database?

Tim,

Here are the areas I would look at first if we're considering hardware 
to be the problem:


HBA and driver:
   Since this is an Intel/Linux system, the HBA is PROBABLY a QLogic. I 
would need to know the SAN model to see what the back end of the SAN 
itself is. EMC has some FC-attach models that actually have SATA disks 
underneath. You also might want to look at the cache size of the 
controllers on the SAN.


As I noted in another thread, the HBA is an Emulex LP1050, and they have 
a rather old driver for it. I've recommended that they update ASAP. This 
hasn't happened yet.
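(For anyone else checking the same thing: assuming the card uses the stock 
Emulex lpfc driver module, something along the lines of

  modinfo lpfc | grep -i version
  dmesg | grep -i -e emulex -e lpfc

should show which driver version is actually loaded.)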


I know very little about the SAN itself - the customer hasn't provided 
any information other than the brand name, as they selected it and 
installed it themselves. I shall ask for more information.


   - Something also to note is that EMC provides an add-on called 
PowerPath for load balancing multiple HBAs. If they don't have this, it 
might be worth investigating.


OK, thanks, I'll ask the customer whether they've used PowerPath at all. 
They do seem to have it installed on the machine, but I suppose that 
doesn't guarantee it's being used correctly. However, it looks like they 
have just the one HBA, so, if I've correctly understood what load 
balancing means in this context, it's not going to help; right?
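(If anyone knows better, please correct me, but my understanding is that 
PowerPath's own tool can confirm whether it is actually managing any 
paths, e.g.

  powermt display dev=all

which should list the HBAs and paths it knows about. With only one HBA 
there is nothing to balance across, though multipathing can still matter 
for failover.)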


  - As with anything, disk layout is important. With the lower end IBM 
SAN (DS4000) you actually have to operate at the physical spindle level. On 
our 4300, when I create a LUN, I select the exact disks I want and which 
of the two controllers is the preferred path. On our DS6800, I just ask 
for storage. I THINK all the EMC models are the "ask for storage" type 
of scenario. However, with the 6800, you select your storage across 
extent pools.


Have they done any benchmarking of the SAN outside of postgres? Before 
we settle on a new LUN configuration, we always do the 
dd,umount,mount,dd routine. It's not a perfect test for databases but it 
will help you catch GROSS performance issues.


I've done some dd'ing myself, as described in another thread. The 
results are not at all encouraging - their SAN seems to do about 20MB/s 
or less.



SAN itself:
  - Could the SAN be oversubscribed? How many hosts and LUNs total do 
they have, and what are the queue_depths for those hosts? With the QLogic 
card, you can set the queue depth in the BIOS of the adapter while the 
system is booting up (CTRL-Q, I think). If the system has enough local 
DASD to relocate the database internally, it might be a valid test to do 
so and see if you can isolate the problem to the SAN itself.


The SAN possibly is over-subscribed. Can you suggest any easy ways for 
me to find out? The customer has an IT department who look after their 
SANs, and they're not keen on outsiders poking their noses in. It's hard 
for me to get any direct access to the SAN itself.
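(About the best I can do from the host side, short of SAN access, is to 
watch the device while the load is on, with the usual tools if they happen 
to be installed on their box:

  iostat -x 5
  vmstat 5

Consistently high await/%util on the SAN device while throughput stays low 
would at least point the finger at the array or fabric rather than 
postgres.)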



PG itself:
 
 If you think it's a pgsql configuration issue, I'm guessing you already 
configured postgresql.conf to match theirs (or at least a fraction of 
theirs, since the memory isn't the same?). What about loading a 
from-scratch config file and restarting the tuning process?


The pg configurations are not identical. However, given the differences 
in raw I/O speed observed, it doesn't seem likely that the difference in 
configuration is responsible. Yes, as you guessed, we set more 
conservative options on the less capable box. Doing proper double-blind 
tests on the customer box is difficult, as it is in production and the 
customer has a very low tolerance for downtime.
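(The quickest comparison we can do without downtime is just to diff the 
effective settings from each box, e.g.

  psql -c 'show all' customerdb > settings.customer
  psql -c 'show all' localdb > settings.ours
  diff settings.customer settings.ours

The database names here are placeholders, of course.)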


Just a dump of my thought process from someone who's been spending too 
much time tuning his SAN and postgres lately.


Thanks for all the suggestions, John. I'll keep trying to follow some of 
them up.


Tim

--
---
Tim Allen  [EMAIL PROTECTED]
Proximity Pty Ltd  http://www.proximity.com.au/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] SAN performance mystery

2006-06-16 Thread Tim Allen

Tim Allen wrote:
We have a customer who are having performance problems. They have a 
large (36G+) postgres 8.1.3 database installed on an 8-way opteron with 
8G RAM, attached to an EMC SAN via fibre-channel (I don't have details 
of the EMC SAN model, or the type of fibre-channel card at the moment). 
They're running RedHat ES3 (which means a 2.4.something Linux kernel).


To simplify greatly - a single local SATA disk beats the EMC SAN by a 
factor of four.


Is that expected performance, anyone? It doesn't sound right to me. Does 
anyone have any clues about what might be going on? Buggy kernel 
drivers? Buggy kernel, come to think of it? Does a SAN just not provide 
adequate performance for a large database?


I'd be grateful for any clues anyone can offer,

Tim


Thanks to all who have replied so far. I've learned a few new things in 
the meantime.


Firstly, the fibre-channel card is an Emulex LP1050. The customer seems 
to have rather old drivers for it, so I have recommended that they 
upgrade asap. I've also suggested they might like to upgrade their 
kernel to something recent too (e.g. upgrade to RHEL4), but there's no 
telling whether they'll accept that recommendation.


The fact that SATA drives are wont to lie about write completion, which 
several posters have pointed out, presumably has an effect on write 
performance (i.e. apparent write performance is increased at the cost of 
an increased risk of data loss), but, again presumably, not much of an 
effect on read performance. After loading the customer's database on our 
fairly modest box with the single SATA disk, we also tested select query 
performance, and while we didn't see a factor of four gain, we certainly 
saw that read performance is also substantially better. So the fsync 
issue possibly accounts for part of our factor of four, but not all of 
it. That is, the SAN is still not doing well by comparison, even allowing 
for the presumption that it is more honest.


One curious thing is that some postgres backends seem to spend an 
inordinate amount of time in uninterruptible iowait state. I found a 
posting to this list from December 2004 from someone who reported that 
very same thing. For example, bringing down postgres on the customer box 
requires kill -9, because there are invariably one or two processes so 
deeply uninterruptible as to not respond to a politer signal. That 
indicates something not quite right, doesn't it?
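(For anyone wanting to check for the same symptom: processes stuck in 
uninterruptible sleep show up with a 'D' in the state column, so something 
like

  ps axo pid,stat,wchan,comm | awk '$2 ~ /D/'

will list them, and the wchan column gives a hint as to where in the 
kernel they are stuck, which is part of why I suspect the driver or kernel 
rather than postgres itself.)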


Tim

--
---
Tim Allen  [EMAIL PROTECTED]
Proximity Pty Ltd  http://www.proximity.com.au/

---(end of broadcast)---
TIP 6: explain analyze is your friend


[PERFORM] SAN performance mystery

2006-06-15 Thread Tim Allen
We have a customer who are having performance problems. They have a 
large (36G+) postgres 8.1.3 database installed on an 8-way opteron with 
8G RAM, attached to an EMC SAN via fibre-channel (I don't have details 
of the EMC SAN model, or the type of fibre-channel card at the moment). 
They're running RedHat ES3 (which means a 2.4.something Linux kernel).


They are unhappy about their query performance. We've been doing various 
things to try to work out what we can do. One thing that has been 
apparent is that autovacuum has not been able to keep the database 
sufficiently tamed. A pg_dump/pg_restore cycle reduced the total 
database size from 81G to 36G. Performing the restore took about 23 hours.
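(One way to judge whether the free space map is big enough, assuming I'm 
reading the output correctly, is the summary that a database-wide VACUUM 
VERBOSE prints at the end, which reports how many FSM page slots are 
needed versus max_fsm_pages; roughly:

  psql -c 'vacuum verbose' customerdb 2>&1 | tail

If the number of page slots needed exceeds max_fsm_pages, dead space will 
keep accumulating no matter how often vacuum runs.)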


We tried restoring the pg_dump output to one of our machines, a 
dual-core Pentium D with a single SATA disk, no RAID, I forget how much 
RAM but definitely much less than 8G. The restore took five hours. So it 
would seem that our machine, which on paper should be far less 
impressive than the customer's box, delivers more than four times the I/O 
performance.


To simplify greatly - a single local SATA disk beats the EMC SAN by a 
factor of four.


Is that expected performance, anyone? It doesn't sound right to me. Does 
anyone have any clues about what might be going on? Buggy kernel 
drivers? Buggy kernel, come to think of it? Does a SAN just not provide 
adequate performance for a large database?


I'd be grateful for any clues anyone can offer,

Tim




---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[PERFORM] partial indexes and inference

2006-03-19 Thread Tim Allen
I have a case where it seems the planner should be able to infer more 
from its partial indexes than it is doing. Observe:


px=# select version();
                                 version
--------------------------------------------------------------------------
 PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2
(1 row)

px=# \d pxmdvalue
       Table "store.pxmdvalue"
   Column   |   Type   | Modifiers
------------+----------+-----------
 entityid   | bigint   | not null
 fieldid    | integer  | not null
 value      | text     | not null
 datatypeid | integer  | not null
 tsi        | tsvector |
Indexes:
    "pxmdvalue_pk" PRIMARY KEY, btree (entityid, fieldid)
    "pxmdvalue_atom_val_idx" btree (value) WHERE datatypeid = 22
    "pxmdvalue_bigint_val_idx" btree ((value::bigint)) WHERE datatypeid = 43
    "pxmdvalue_datatypeid_idx" btree (datatypeid)
    "pxmdvalue_int_val_idx" btree ((value::integer)) WHERE datatypeid = 16
    "pxmdvalue_str32_val0_idx" btree (lower(value)) WHERE datatypeid = 2 AND octet_length(value) < 2700
    "pxmdvalue_str32_val1_idx" btree (lower(value) text_pattern_ops) WHERE datatypeid = 2 AND octet_length(value) < 2700
    "pxmdvalue_str_val0_idx" btree (lower(value)) WHERE datatypeid = 85 AND octet_length(value) < 2700
    "pxmdvalue_str_val1_idx" btree (lower(value) text_pattern_ops) WHERE datatypeid = 85 AND octet_length(value) < 2700
    "pxmdvalue_time_val_idx" btree (px_text2timestamp(value)) WHERE datatypeid = 37


px=# explain analyse select * from pxmdvalue where datatypeid = 43 and 
fieldid = 857 and cast(value as bigint) = '1009';


                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on pxmdvalue  (cost=2143.34..2685.74 rows=1 width=245) (actual time=144.411..144.415 rows=1 loops=1)
   Recheck Cond: (((value)::bigint = 1009::bigint) AND (datatypeid = 43))
   Filter: (fieldid = 857)
   ->  BitmapAnd  (cost=2143.34..2143.34 rows=138 width=0) (actual time=144.394..144.394 rows=0 loops=1)
         ->  Bitmap Index Scan on pxmdvalue_bigint_val_idx  (cost=0.00..140.23 rows=1758 width=0) (actual time=0.021..0.021 rows=2 loops=1)
               Index Cond: ((value)::bigint = 1009::bigint)
         ->  Bitmap Index Scan on pxmdvalue_datatypeid_idx  (cost=0.00..2002.85 rows=351672 width=0) (actual time=144.127..144.127 rows=346445 loops=1)
               Index Cond: (datatypeid = 43)
 Total runtime: 144.469 ms
(9 rows)

px=# drop index pxmdvalue_datatypeid_idx;
DROP INDEX
px=# explain analyse select * from pxmdvalue where datatypeid = 43 and 
fieldid = 857 and cast(value as bigint) = '1009';

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using pxmdvalue_bigint_val_idx on pxmdvalue  (cost=0.00..6635.06 rows=1 width=245) (actual time=0.018..0.022 rows=1 loops=1)
   Index Cond: ((value)::bigint = 1009::bigint)
   Filter: (fieldid = 857)
 Total runtime: 0.053 ms
(4 rows)



Notice the two bitmap index scans in the first version of the query. The 
one that hits the pxmdvalue_bigint_val_idx actually subsumes the work of 
the second one, as it is a partial index on the same condition that the 
second bitmap scan is checking. So that second bitmap scan is a complete 
waste of time and effort, afaict. When I remove the 
pxmdvalue_datatypeid_idx index, to prevent it using that second bitmap 
scan, the resulting query is much faster, although its estimated cost is 
rather higher.


Any clues, anyone? Is this indeed a limitation of the query planner, in 
that it doesn't realise that the partial index is all it needs here? Or 
is something else going on that is leading the cost estimation astray?
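(A less drastic way to confirm the same thing, without dropping the index, 
is to disable bitmap scans for the session and re-run the explain, e.g.

px=# set enable_bitmapscan = off;
px=# explain analyse select * from pxmdvalue where datatypeid = 43 and 
fieldid = 857 and cast(value as bigint) = '1009';
px=# set enable_bitmapscan = on;

which should produce the plain index scan shown above. I mention it only 
as a diagnostic; it doesn't answer why the planner prefers the redundant 
BitmapAnd in the first place.)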


Tim

--
---
Tim Allen  [EMAIL PROTECTED]
Proximity Pty Ltd  http://www.proximity.com.au/

---(end of broadcast)---
TIP 6: explain analyze is your friend