Re: [PERFORM] SAN performance mystery
Jeff Trout wrote:
> On Jun 16, 2006, at 5:11 AM, Tim Allen wrote:
>> One curious thing is that some postgres backends seem to spend an
>> inordinate amount of time in uninterruptible iowait state. I found a
>> posting to this list from December 2004 from someone who reported that
>> very same thing. For example, bringing down postgres on the customer
>> box requires kill -9, because there are invariably one or two
>> processes so deeply uninterruptible as to not respond to a politer
>> signal. That indicates something not quite right, doesn't it?
>
> Sounds like there could be a driver/array/kernel bug there that is
> kicking the performance down the tube. If it was PG's fault it wouldn't
> be stuck uninterruptible.

That's what I thought. I've advised the customer to upgrade their kernel drivers, and preferably to upgrade their kernel as well. We'll see if they accept the advice :-|.

Tim

--
Tim Allen          [EMAIL PROTECTED]
Proximity Pty Ltd  http://www.proximity.com.au/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
       http://www.postgresql.org/docs/faq
Re: [PERFORM] SAN performance mystery
Scott Marlowe wrote:
> On Thu, 2006-06-15 at 16:50, Tim Allen wrote:
>> We have a customer who is having performance problems. They have a
>> large (36G+) postgres 8.1.3 database installed on an 8-way opteron
>> with 8G RAM, attached to an EMC SAN via fibre-channel (I don't have
>> details of the EMC SAN model, or the type of fibre-channel card, at
>> the moment). They're running RedHat ES3 (which means a 2.4.something
>> Linux kernel). They are unhappy about their query performance.
>>
>> We've been doing various things to try to work out what we can do.
>> One thing that has been apparent is that autovacuum has not been able
>> to keep the database sufficiently tamed. A pg_dump/pg_restore cycle
>> reduced the total database size from 81G to 36G. Performing the
>> restore took about 23 hours.
>
> Do you have the ability to do any simple IO performance testing, like
> with bonnie++? (The old bonnie is not really capable of properly
> testing modern equipment, but bonnie++ will give you some idea of the
> throughput of the SAN.) Or even just timing a dd write to the SAN?

I've done some timed dd's. The timing results vary quite a bit, but it seems you can write to the SAN at about 20MB/s and read from it at about 12MB/s. Not an entirely scientific test, as I wasn't able to stop other activity on the machine, though I don't think much else was happening. Certainly not impressive figures compared with our machine with the SATA disk (referred to below), which can get 161MB/s copying files on the same disk, and 48MB/s and 138MB/s copying files from the SATA disk respectively to and from a RAID5 array.

The customer is a large organisation, with a large IT department who guard their turf carefully, so there is no way I could get away with installing any heavier-duty testing tools like bonnie++ on their machine.

We tried restoring the pg_dump output to one of our machines, a dual-core Pentium D with a single SATA disk, no RAID, I forget how much RAM but definitely much less than 8G. The restore took five hours.
>> So it would seem that our machine, which on paper should be far less
>> impressive than the customer's box, does more than four times the I/O
>> performance. To simplify greatly: single local SATA disk beats EMC
>> SAN by a factor of four. Is that expected performance, anyone? It
>> doesn't sound right to me. Does anyone have any clues about what
>> might be going on? Buggy kernel drivers? Buggy kernel, come to think
>> of it? Does a SAN just not provide adequate performance for a large
>> database?
>
> Yes, this is not uncommon. It is very likely that your SATA disk is
> lying about fsync.

I guess a sustained write will flood the disk's cache and negate the effect of the write-completion dishonesty, but I have no idea how large a copy would have to be to do that - can anyone suggest a figure? Certainly, the read performance of the SATA disk still beats the SAN, and there is no way to lie about read performance.

> What kind of backup are you using? Insert statements or copy
> statements? If insert statements, then the difference is quite
> believable. If copy statements, less so.

A binary pg_dump, which amounts to copy statements, if I'm not mistaken.

> Next time, on their big server, see if you can try a restore with
> fsync turned off and see if that makes the restore faster. Note you
> should turn fsync back on after the restore, as running without it is
> quite dangerous should you suffer a power outage.
>
> How are you mounting to the EMC SAN? NFS, iSCSI? Other?

iSCSI, I believe. Some variant of SCSI, anyway, of that I'm certain.

The conclusion I'm drawing here is that this SAN does not perform at all well, and is not a good database platform. It's sounding from replies from other people that this might be a general property of SANs, or at least of the ones that are not stratospherically priced.

Tim
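For what it's worth, the fsync suggestion above amounts to a one-line, temporary change in postgresql.conf (the setting name is real; the comments just restate the caveat given above):

```
# postgresql.conf -- for the duration of the bulk restore ONLY.
# Restore the default (fsync = on) before the box goes back into
# production; running without it risks data loss on a power failure.
fsync = off
```

After editing, a reload (e.g. pg_ctl reload) is enough to pick up the change on 8.1.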
Re: [PERFORM] SAN performance mystery
John Vincent wrote:
>> [snipped] Is that expected performance, anyone? It doesn't sound
>> right to me. Does anyone have any clues about what might be going on?
>> Buggy kernel drivers? Buggy kernel, come to think of it? Does a SAN
>> just not provide adequate performance for a large database?
>
> Tim,
>
> Here are the areas I would look at first if we're considering hardware
> to be the problem:
>
> HBA and driver: Since this is an Intel/Linux system, the HBA is
> PROBABLY a qlogic. I would need to know the SAN model to see what the
> backend of the SAN is itself. EMC has some FC-attach models that
> actually have SATA disks underneath. You also might want to look at
> the cache size of the controllers on the SAN.

As I noted in another thread, the HBA is an Emulex LP1050, and they have a rather old driver for it. I've recommended that they update ASAP. This hasn't happened yet. I know very little about the SAN itself - the customer hasn't provided any information other than the brand name, as they selected it and installed it themselves. I shall ask for more information.

> - Something also to note is that EMC provides an add-on called
> PowerPath for load balancing multiple HBAs. If they don't have this,
> it might be worth investigating.

OK, thanks, I'll ask the customer whether they've used PowerPath at all. They do seem to have it installed on the machine, but I suppose that doesn't guarantee it's being used correctly. However, it looks like they have just the one HBA, so, if I've correctly understood what load balancing means in this context, it's not going to help; right?

> - As with anything, disk layout is important. With the lower-end IBM
> SAN (DS4000) you actually have to operate at the physical spindle
> level. On our 4300, when I create a LUN, I select the exact disks I
> want and which of the two controllers is the preferred path. On our
> DS6800, I just ask for storage. I THINK all the EMC models are the
> "ask for storage" type of scenario. However, with the 6800, you select
> your storage across extent pools.
> Have they done any benchmarking of the SAN outside of postgres? Before
> we settle on a new LUN configuration, we always do the
> dd,umount,mount,dd routine. It's not a perfect test for databases but
> it will help you catch GROSS performance issues.

I've done some dd'ing myself, as described in another thread. The results are not at all encouraging - their SAN seems to do about 20MB/s or less.

> SAN itself:
> - Could the SAN be oversubscribed? How many hosts and LUNs total do
> they have, and what are the queue_depths for those hosts? With the
> qlogic card, you can set the queue depth in the BIOS of the adapter
> when the system is booting up. CTRL-Q I think. If the system has
> enough local DASD to relocate the database internally, it might be a
> valid test to do so and see if you can isolate the problem to the SAN
> itself.

The SAN possibly is over-subscribed. Can you suggest any easy ways for me to find out? The customer has an IT department who look after their SANs, and they're not keen on outsiders poking their noses in. It's hard for me to get any direct access to the SAN itself.

> PG itself: If you think it's a pgsql configuration problem, I'm
> guessing you already configured postgresql.conf to match theirs (or at
> least a fraction of theirs, since the memory isn't the same)? What
> about loading a from-scratch config file and restarting the tuning
> process?

The pg configurations are not identical. However, given the differences in raw I/O speed observed, it doesn't seem likely that the difference in configuration is responsible. Yes, as you guessed, we set more conservative options on the less capable box. Doing proper double-blind tests on the customer box is difficult, as it is in production and the customer has a very low tolerance for downtime.

> Just a dump of my thought process from someone who's been spending too
> much time tuning his SAN and postgres lately.

Thanks for all the suggestions, John. I'll keep trying to follow some of them up.
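For concreteness, the dd,umount,mount,dd routine mentioned above can be sketched as below. The directory is a placeholder (it defaults to a scratch directory so the sketch is harmless to run anywhere); on a real LUN the point of the umount/remount between the passes is to empty the page cache, so the read pass measures the array rather than RAM.

```shell
#!/bin/sh
# Crude sequential-throughput check. On a real SAN LUN you would point
# DIR at the LUN's mount point, make the file larger than RAM, and
# umount/remount the filesystem between the write and read passes.
DIR=${DIR:-/tmp}
FILE="$DIR/ddtest.$$"

# Write pass: conv=fsync forces the data to disk before dd finishes,
# which sidesteps write-cache flattery when timing the run.
dd if=/dev/zero of="$FILE" bs=1M count=64 conv=fsync

# (On a real LUN: umount and remount the filesystem here.)
sync

# Read pass.
dd if="$FILE" of=/dev/null bs=1M

rm -f "$FILE"
```

Run it under time(1), or read the MB/s figure GNU dd prints on stderr. It is nowhere near a database workload, but as John says, it catches gross problems - and 20MB/s sequential is gross.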
Re: [PERFORM] SAN performance mystery
Tim Allen wrote:
> We have a customer who is having performance problems. They have a
> large (36G+) postgres 8.1.3 database installed on an 8-way opteron
> with 8G RAM, attached to an EMC SAN via fibre-channel (I don't have
> details of the EMC SAN model, or the type of fibre-channel card, at
> the moment). They're running RedHat ES3 (which means a 2.4.something
> Linux kernel).
>
> To simplify greatly: single local SATA disk beats EMC SAN by a factor
> of four. Is that expected performance, anyone? It doesn't sound right
> to me. Does anyone have any clues about what might be going on? Buggy
> kernel drivers? Buggy kernel, come to think of it? Does a SAN just not
> provide adequate performance for a large database?
>
> I'd be grateful for any clues anyone can offer,
> Tim

Thanks to all who have replied so far. I've learned a few new things in the meantime.

Firstly, the fibrechannel card is an Emulex LP1050. The customer seems to have rather old drivers for it, so I have recommended that they upgrade asap. I've also suggested they might like to upgrade their kernel to something recent too (eg upgrade to RHEL4), but there's no telling whether they'll accept that recommendation.

The fact that SATA drives are wont to lie about write completion, which several posters have pointed out, presumably has an effect on write performance (ie apparent write performance is increased at the cost of an increased risk of data loss), but, again presumably, not much of an effect on read performance. After loading the customer's database on our fairly modest box with the single SATA disk, we also tested select query performance, and while we didn't see a factor-of-four gain, we certainly saw that read performance is also substantially better. So the fsync issue possibly accounts for part of our factor of four, but not all of it. Ie, the SAN is still not doing well by comparison, even allowing for the presumption that it is more honest.
One curious thing is that some postgres backends seem to spend an inordinate amount of time in uninterruptible iowait state. I found a posting to this list from December 2004 from someone who reported that very same thing. For example, bringing down postgres on the customer box requires kill -9, because there are invariably one or two processes so deeply uninterruptible as to not respond to a politer signal. That indicates something not quite right, doesn't it?

Tim
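A quick way to spot those stuck backends from a shell, using standard procps options (the STAT field shows "D" for uninterruptible sleep, and the wchan column hints at which kernel function the process is blocked in):

```shell
# List processes in uninterruptible sleep ("D" state), with the kernel
# function they are waiting in. A backend pinned here points at the
# driver/kernel I/O path rather than at postgres itself.
ps -eo pid,stat,wchan:25,comm | awk 'NR == 1 || $2 ~ /^D/'
```

If the same backends keep showing the same wchan over repeated runs, that is the spot in the kernel or driver to start asking questions about.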
[PERFORM] SAN performance mystery
We have a customer who is having performance problems. They have a large (36G+) postgres 8.1.3 database installed on an 8-way opteron with 8G RAM, attached to an EMC SAN via fibre-channel (I don't have details of the EMC SAN model, or the type of fibre-channel card, at the moment). They're running RedHat ES3 (which means a 2.4.something Linux kernel). They are unhappy about their query performance.

We've been doing various things to try to work out what we can do. One thing that has been apparent is that autovacuum has not been able to keep the database sufficiently tamed. A pg_dump/pg_restore cycle reduced the total database size from 81G to 36G. Performing the restore took about 23 hours.

We tried restoring the pg_dump output to one of our machines, a dual-core Pentium D with a single SATA disk, no RAID, I forget how much RAM but definitely much less than 8G. The restore took five hours. So it would seem that our machine, which on paper should be far less impressive than the customer's box, does more than four times the I/O performance. To simplify greatly: single local SATA disk beats EMC SAN by a factor of four.

Is that expected performance, anyone? It doesn't sound right to me. Does anyone have any clues about what might be going on? Buggy kernel drivers? Buggy kernel, come to think of it? Does a SAN just not provide adequate performance for a large database?

I'd be grateful for any clues anyone can offer,

Tim
[PERFORM] partial indexes and inference
I have a case where it seems the planner should be able to infer more from its partial indexes than it is doing. Observe:

px=# select version();
                                version
-------------------------------------------------------------------------
 PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2
(1 row)

px=# \d pxmdvalue
       Table "store.pxmdvalue"
   Column   |   Type   | Modifiers
------------+----------+-----------
 entityid   | bigint   | not null
 fieldid    | integer  | not null
 value      | text     | not null
 datatypeid | integer  | not null
 tsi        | tsvector |
Indexes:
    "pxmdvalue_pk" PRIMARY KEY, btree (entityid, fieldid)
    "pxmdvalue_atom_val_idx" btree (value) WHERE datatypeid = 22
    "pxmdvalue_bigint_val_idx" btree ((value::bigint)) WHERE datatypeid = 43
    "pxmdvalue_datatypeid_idx" btree (datatypeid)
    "pxmdvalue_int_val_idx" btree ((value::integer)) WHERE datatypeid = 16
    "pxmdvalue_str32_val0_idx" btree (lower(value)) WHERE datatypeid = 2 AND octet_length(value) < 2700
    "pxmdvalue_str32_val1_idx" btree (lower(value) text_pattern_ops) WHERE datatypeid = 2 AND octet_length(value) < 2700
    "pxmdvalue_str_val0_idx" btree (lower(value)) WHERE datatypeid = 85 AND octet_length(value) < 2700
    "pxmdvalue_str_val1_idx" btree (lower(value) text_pattern_ops) WHERE datatypeid = 85 AND octet_length(value) < 2700
    "pxmdvalue_time_val_idx" btree (px_text2timestamp(value)) WHERE datatypeid = 37

px=# explain analyse select * from pxmdvalue where datatypeid = 43 and fieldid = 857 and cast(value as bigint) = '1009';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on pxmdvalue  (cost=2143.34..2685.74 rows=1 width=245) (actual time=144.411..144.415 rows=1 loops=1)
   Recheck Cond: (((value)::bigint = 1009::bigint) AND (datatypeid = 43))
   Filter: (fieldid = 857)
   ->  BitmapAnd  (cost=2143.34..2143.34 rows=138 width=0) (actual time=144.394..144.394 rows=0 loops=1)
         ->  Bitmap Index Scan on pxmdvalue_bigint_val_idx  (cost=0.00..140.23 rows=1758 width=0) (actual time=0.021..0.021 rows=2 loops=1)
               Index Cond: ((value)::bigint = 1009::bigint)
         ->  Bitmap Index Scan on pxmdvalue_datatypeid_idx  (cost=0.00..2002.85 rows=351672 width=0) (actual time=144.127..144.127 rows=346445 loops=1)
               Index Cond: (datatypeid = 43)
 Total runtime: 144.469 ms
(9 rows)

px=# drop index pxmdvalue_datatypeid_idx;
DROP INDEX
px=# explain analyse select * from pxmdvalue where datatypeid = 43 and fieldid = 857 and cast(value as bigint) = '1009';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using pxmdvalue_bigint_val_idx on pxmdvalue  (cost=0.00..6635.06 rows=1 width=245) (actual time=0.018..0.022 rows=1 loops=1)
   Index Cond: ((value)::bigint = 1009::bigint)
   Filter: (fieldid = 857)
 Total runtime: 0.053 ms
(4 rows)

Notice the two bitmap index scans in the first version of the query. The one that hits pxmdvalue_bigint_val_idx actually subsumes the work of the second one, as it is a partial index on the same condition that the second bitmap scan is checking. So that second bitmap scan is a complete waste of time and effort, afaict. When I remove the pxmdvalue_datatypeid_idx index, to prevent it using that second bitmap scan, the resulting query is much faster, although its estimated cost is rather higher.

Any clues, anyone? Is this indeed a limitation of the query planner, in that it doesn't realise that the partial index is all it needs here? Or is something else going on that is leading the cost estimation astray?

Tim