Re: [PERFORM] Anyone running Intel S3700 SSDs?
On 2013-03-08 07:48, CSS wrote: http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review Of most interest to me was this: Anyone have any benchmarks to share? I tested the 800GB disk with diskchecker.pl and it was ok. No benchmarks to share yet, as I'm waiting for the disks to be installed in a server. -- Yeb Havinga http://www.mgrid.net/ Mastering Medical Data -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Two identical systems, radically different performance
On 2012-10-08 23:45, Craig James wrote: This is driving me crazy. A new server, virtually identical to an old one, has 50% of the performance with pgbench. I've checked everything I can think of. The setups (call the servers "old" and "new"): old: 2 x 4-core Intel Xeon E5620 new: 4 x 4-core Intel Xeon E5606 How are the filesystems formatted and mounted (-o nobarrier?) regards Yeb
Re: [PERFORM] SSD, Postgres and safe write cache
On 2012-06-20 16:51, Michal Szymanski wrote: Hi, We started to think about using an SSD drive for our telco system DB. Because we have a lot of "almost" read-only data, I think an SSD is a good candidate for our task. We would like to speed up read operations. I've read a post (http://blog.2ndquadrant.com/intel_ssd_now_off_the_sherr_sh/) about SSDs that have write-safe functionality, and two drives are recommended: the Intel 320 and Vertex 2 Pro. Both drives are rather inexpensive, but both use the older SATA II interface. I tried to find a newer, faster version of the Vertex, because production of the Vertex 2 Pro has stopped, but there is no information about new drives that have similar functionality and are cheap. Can you recommend cheap SSD drives that are suitable for DB needs? We were able to get OCZ Deneva 2's (http://www.oczenterprise.com/downloads/solutions/ocz-deneva2-r-mlc-2.5in_Product_Brief.pdf) from our supplier, which were suggested by OCZ as a replacement for the Vertex 2 Pro's. They're marketed as safe under power failure, and our tests with the diskchecker tool confirmed that. regards, Yeb Havinga
Re: [PERFORM] Parallel Scaling of a pgplsql problem
On 2012-04-26 04:40, Venki Ramachandran wrote: Thanks Tom, clock_timestamp() worked. Appreciate it!!! and Sorry was hurrying to get this done at work and hence did not read through. Can you comment on how you would solve the original problem? Even if I can get the 11 seconds down to 500 ms for one pair, running it for 300k pairs will take multiple hours. How can one write a combination of a bash script/pgplsql code so as to use all 8 cores of a server. I am seeing that this is just executing in one session/process. You want to compare a calculation on the cross product 'employee x employee'. If employee is partitioned into emp1, emp2, ... emp8, the cross product is equal to the union of emp1 x employee, emp2 x employee, .. emp8 x employee. Each of these 8 cross products on partitions can be executed in parallel. I'd look into dblink to execute each of the 8 cross products in parallel, and then union all of those results. http://www.postgresql.org/docs/9.1/static/contrib-dblink-connect.html regards, Yeb
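The partition-and-union argument above can be checked with a tiny sketch outside the database (the names and the distance function are illustrative, not the original poster's schema): splitting one side of the cross product into k chunks and scoring each chunk against the full set covers every pair exactly once, which is what makes it safe to hand each chunk to its own dblink connection.

```python
# Check: employee x employee == union of (partition_i x employee).
# Each partition's job is independent, so each could run in its own
# dblink connection (or process) and the results UNIONed afterwards.
def cross(left, right):
    return {(a, b) for a in left for b in right}

employees = [1, 2, 3, 4, 5, 6, 7, 8]   # stand-in for the employee table
k = 4                                   # number of parallel jobs
partitions = [employees[i::k] for i in range(k)]  # round-robin split

full = cross(employees, employees)
unioned = set().union(*(cross(p, employees) for p in partitions))
assert unioned == full   # the k partial jobs cover every pair exactly once
```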
Re: [PERFORM] Sudden Query slowdown on our Postgresql Server
On 2012-03-23 05:53, Sebastian Melchior wrote: Hi, we already used iostat and iotop during times of the slowdown; there is no sudden drop in I/O workload during the slowdowns, and iowait does not spike and stays as before. So I do not think this is I/O related. As the disks are SSDs, there is also still some "head room" left. I've seen an SSD completely lock up for a dozen seconds or so after giving it a smartctl command to trim a section of the disk. I'm not sure if that was the Vertex 2 Pro disk I was testing or the Intel 710, but it was enough reason for us to not mount filesystems with -o discard. regards, Yeb
Re: [PERFORM] SSD and RAID
On 2012-03-07 01:36, Mark Kirkwood wrote: On 06/03/12 21:17, Yeb Havinga wrote: One thing to note is that linux software raid with md doesn't support discard, which might shorten the drive's expected lifetime. To get some numbers I tested the raid 1 of ssd's setup for mediawear under a PostgreSQL load earlier, see http://archives.postgresql.org/pgsql-general/2011-11/msg00141.php Right, which is a bit of a pain - we are considering either formatting the drive with less capacity and using md RAID 1 or else doing the mirror in LVM to enable a working discard/trim. When I measured the write durability without discard on the enterprise disks, I got numbers that in normal production use would outlive the lifetime of the servers. It would be interesting to see durability numbers for the desktop SSDs, even when partitioned to a part of the disk. regards, Yeb Havinga
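The "outlives the servers" conclusion is just endurance arithmetic of the following kind. The numbers here are hypothetical stand-ins, not the ones measured in the linked thread: suppose the media-wearout indicator dropped 1 point (of 100) after 2 TB of writes, and production writes about 50 GB/day.

```python
# Back-of-the-envelope SSD endurance estimate (hypothetical numbers,
# not the measured ones from the linked pgsql-general thread).
tb_per_point = 2.0                 # TB written per wearout-indicator point
points_left = 99                   # indicator counts down from 100
write_tb_per_day = 50 / 1024.0     # 50 GB/day expressed in TB

days_left = points_left * tb_per_point / write_tb_per_day
years_left = days_left / 365.0
print(round(years_left, 1))        # ~11 years, well past typical server life
```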
Re: [PERFORM] SSD and RAID
On 2012-03-06 09:34, Andrea Suisani wrote: On 03/06/2012 09:17 AM, Yeb Havinga wrote: PS: we applied the same philosophy (different brands) also to motherboards, io controllers and memory, but after testing, we liked one IO controller's software so much more than the other's that we chose to have only one. Also, the stream memory performance of one motherboard showed such a significant regression at higher thread counts that we decided to go for the other brand for all servers. care to share the winning motherboard model? thanks Andrea On http://i.imgur.com/vfmvu.png is a graph of three systems, made with the multi stream scaling test (average of 10 tests, if I remember correctly). The red and blue are 2 x 12-core Opteron 6168 systems with 64 GB DDR3 1333MHz in 8GB dimms. Red is a Tyan S8230, blue is a Supermicro H8DGI-G. We tried a lot of things to rule out the motherboards, such as swapping memory between both systems, ensuring BIOS settings are similar (e.g. ECC mode), and updating to the latest BIOS where possible, but none of those improved the memory performance drop. Both systems were installed with a kickstarted Centos 6.2, so there are no kernel setting differences there either. regards, Yeb -- Yeb Havinga http://www.mgrid.net/ Mastering Medical Data
Re: [PERFORM] SSD and RAID
On 2012-03-05 23:37, Mark Kirkwood wrote: Which brings up the question of whether it should be a pair in RAID 1 or just a single drive? Traditionally this would have been a no-brainer: "Of course you want RAID 1 or RAID 10"! However our experience with SSD failure modes points to firmware bugs as the primary source of trouble - and these are likely to impact both drives (nearly) simultaneously in a RAID 1 configuration. Also the other major issue to watch - flash write limit exhaustion - is likely to hit at the same time for a pair of drives in RAID 1. What are other folks who are using SSDs doing? This is exactly the reason why, in a set of new hardware I'm currently evaluating, we have two different brands of manufacturers for the spindles (behind bbwc for wal, os, archives etc) and for the ssds (on mb sata ports). For the ssd's we've chosen the Intel 710 and OCZ Vertex 2 PRO; however, that last one was EOL and OCZ offered to replace it with the Deneva 2 (http://www.oczenterprise.com/downloads/solutions/ocz-deneva2-r-mlc-2.5in_Product_Brief.pdf). Still waiting for a test Deneva though. One thing to note is that linux software raid with md doesn't support discard, which might shorten the drive's expected lifetime. To get some numbers I tested the raid 1 of ssd's setup for mediawear under a PostgreSQL load earlier, see http://archives.postgresql.org/pgsql-general/2011-11/msg00141.php I would recommend that every ssd considered for production use be tested with diskchecker.pl on a filesystem that's mounted the same way you would mount it for your data (e.g. xfs or ext4 with nobarrier), and also given a mediawear test like the one described in the pgsql-general thread linked above, especially if you're choosing to run on ssds not marketed for enterprise use. regards, Yeb PS: we applied the same philosophy (different brands) also to motherboards, io controllers and memory, but after testing, we liked one IO controller's software so much more than the other's that we chose to have only one. Also, the stream memory performance of one motherboard showed such a significant regression at higher thread counts that we decided to go for the other brand for all servers. -- Yeb Havinga http://www.mgrid.net/ Mastering Medical Data
Re: [PERFORM] Intel 710 pgbench write latencies
On 2011-11-02 16:06, Magnus Hagander wrote: On Wed, Nov 2, 2011 at 16:04, Yeb Havinga wrote: On 2011-11-02 15:06, Kevin Grittner wrote: Yeb Havinga wrote: I'm now contemplating not using the 710 at all. Why should I not buy two 6Gbps SSDs without supercap (e.g. Intel 510 and OCZ Vertex 3 Max IOPS) with an IO controller+BBU? Wouldn't the data be subject to loss between the time the IO controller writes to the SSD and the time it makes it from buffers to flash RAM? Good question. My guess would be no, if the raid controller does 'write-throughs' on the attached disks, and the SSDs don't lie about when they've written to RAM. Don't most SSDs without supercaps lie about the writes, though? I happened to have a Vertex 3, no supercap, available to test this with diskchecker. On an ext4 filesystem (just mounted with noatime, not barriers=off), this happened: # /root/diskchecker.pl -s 192.168.73.1 verify testfile verifying: 0.00% verifying: 30.67% verifying: 78.97% verifying: 100.00% Total errors: 0 So I guess that's about as much as I can test without actually hooking it behind a hardware controller and testing that. I will soon test the 3ware 9750 with the Vertex 3 and Intel 510 - both in the 3ware's ssd compatibility list. More info from testing software raid 1: - with lvm mirroring, discards / trim go through to the disks. This is where the Intel is fast enough, but the vertex 2 pro is busy for ~ 10 seconds. -- Yeb
Re: [PERFORM] Intel 710 pgbench write latencies
On 2011-11-03 15:31, Shaun Thomas wrote: On 11/03/2011 04:38 AM, Yeb Havinga wrote: Both comparable near 10K tps. That's another thing I was wondering about. Why are we talking about Vertex 2 Pro's, anyway? The Vertex 3 Pros post much better results and are still capacitor-backed. Not for sale yet.. -- Yeb
Re: [PERFORM] Intel 710 pgbench write latencies
On 2011-11-02 22:08, Merlin Moncure wrote: On Wed, Nov 2, 2011 at 3:45 PM, Yeb Havinga wrote: Intel latency graph at http://imgur.com/Hh3xI Ocz latency graph at http://imgur.com/T09LG curious: what were the pgbench results in terms of tps? merlin Both comparable near 10K tps. -- Yeb
Re: [PERFORM] Intel 710 pgbench write latencies
On 2011-11-02 16:16, Yeb Havinga wrote: On 2011-11-02 15:26, Merlin Moncure wrote: I would keep at least 20-30% of both drives unpartitioned to leave the controller room to wear level as well as other stuff. I'd try wiping the drives, repartitioning, and repeating your test. I would also compare times through mdadm and directly to the device. Good idea. Reinstalled system - >50% of the drives unpartitioned.
/dev/sdb3 19G 5.0G 13G 29% /ocz
/dev/sda3 19G 4.8G 13G 28% /intel
/dev/sdb3 on /ocz type ext4 (rw,noatime,nobarrier,discard)
/dev/sda3 on /intel type ext4 (rw,noatime,nobarrier,discard)
Again WAL was put in a ramdisk.
pgbench -i -s 300 t # fits in ram
pgbench -c 20 -M prepared -T 300 -l t
Intel latency graph at http://imgur.com/Hh3xI Ocz latency graph at http://imgur.com/T09LG
Re: [PERFORM] Intel 710 pgbench write latencies
On 2011-11-02 15:26, Merlin Moncure wrote: On Wed, Nov 2, 2011 at 8:05 AM, Yeb Havinga wrote: Hello list, An OCZ Vertex 2 PRO and Intel 710 SSD, both 100GB, in a software raid 1 setup. I was pretty convinced this was the perfect solution to run PostgreSQL on SSDs without an IO controller with BBU. No worries about strange firmware bugs because of two different drives, good write endurance of the 710. Access to the smart attributes. Complete control over the disks: nothing hidden by a hardware raid IO layer. Then I did a pgbench test: - bigger than RAM test (~30GB database with 24GB ram) - during the test I removed the Intel 710 and 10 minutes later inserted it again and added it to the array. The pgbench transaction latency graph is here: http://imgur.com/JSdQd With only the OCZ, latencies are acceptable, but with two drives there are latencies up to 3 seconds! (and 11 seconds at disk remove time) Is this due to software raid, or is it the Intel 710? To figure that out I repeated the test, but now removing the OCZ; latency graph at: http://imgur.com/DQa59 (The 12 seconds maximum was at disk remove time.) So the Intel 710 kind of sucks latency wise. Is it because it is also heavily reading, and maybe WAL should not be put on it? I did another test, same as before but * with 5GB database completely fitting in RAM (24GB) * put WAL on a ramdisk * started on the mirror * during the test mdadm --fail on the Intel SSD Latency graph is at: http://imgur.com/dY0Rk So still: with the Intel 710 participating in writes (beginning of graph), some latencies are over 2 seconds; with only the OCZ, max write latencies are near 300ms. I'm now contemplating not using the 710 at all. Why should I not buy two 6Gbps SSDs without supercap (e.g. Intel 510 and OCZ Vertex 3 Max IOPS) with an IO controller+BBU? Benefits: should be faster for all kinds of reads and writes. 
Concerns: TRIM becomes impossible (which was already impossible with md raid1; lvm / dm based mirroring could work), but is TRIM important for a PostgreSQL io load, without e.g. routine TRUNCATEs? Also the write endurance of these drives is probably a lot less than the previous setup. software RAID (mdadm) is currently blocking TRIM. the only way to get TRIM in a raid-ish environment is through LVM mirroring/striping or w/btrfs raid (which is not production ready afaik). Given that, if you do use software raid, it's not a good idea to partition the entire drive because the very first thing the raid driver does is write to the entire device. If that is bad because of a decreased lifetime, I don't think that number of writes is significant - in a few hours of pgbenching, the GBs written are more than 10 times the GB sizes of the drives. Or do you suggest this because then the disk firmware can operate assuming a smaller idema capacity, thereby prolonging the drive life? (i.e. the Intel 710 200GB has 200GB idema capacity but 320GB raw flash). I would keep at least 20-30% of both drives unpartitioned to leave the controller room to wear level as well as other stuff. I'd try wiping the drives, repartitioning, and repeating your test. I would also compare times through mdadm and directly to the device. Good idea. -- Yeb
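The spare-area figures quoted in the exchange above work out as follows: the 710's factory over-provisioning (200 GB IDEMA capacity over 320 GB raw flash) is already larger than the 20-30% of user capacity Merlin suggests leaving unpartitioned on a consumer drive.

```python
# Spare-area arithmetic from the figures in the post.
raw_gb, idema_gb = 320, 200                 # Intel 710 200GB model
factory_spare = 1 - idema_gb / raw_gb
print(round(factory_spare * 100, 1))        # 37.5 (% of raw flash held back)

# versus leaving, say, 25% of a 100 GB consumer drive unpartitioned:
user_spare_gb = 0.25 * 100
print(user_spare_gb)                        # 25.0 GB set aside by hand
```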
Re: [PERFORM] Intel 710 pgbench write latencies
On 2011-11-02 15:06, Kevin Grittner wrote: Yeb Havinga wrote: I'm now contemplating not using the 710 at all. Why should I not buy two 6Gbps SSDs without supercap (e.g. Intel 510 and OCZ Vertex 3 Max IOPS) with an IO controller+BBU? Wouldn't the data be subject to loss between the time the IO controller writes to the SSD and the time it makes it from buffers to flash RAM? Good question. My guess would be no, if the raid controller does 'write-throughs' on the attached disks, and the SSDs don't lie about when they've written to RAM. I'll put this on my to test list for the new setup. -- Yeb
[PERFORM] Intel 710 pgbench write latencies
Hello list, An OCZ Vertex 2 PRO and Intel 710 SSD, both 100GB, in a software raid 1 setup. I was pretty convinced this was the perfect solution to run PostgreSQL on SSDs without an IO controller with BBU. No worries about strange firmware bugs because of two different drives, good write endurance of the 710. Access to the smart attributes. Complete control over the disks: nothing hidden by a hardware raid IO layer. Then I did a pgbench test: - bigger than RAM test (~30GB database with 24GB ram) - during the test I removed the Intel 710 and 10 minutes later inserted it again and added it to the array. The pgbench transaction latency graph is here: http://imgur.com/JSdQd With only the OCZ, latencies are acceptable, but with two drives there are latencies up to 3 seconds! (and 11 seconds at disk remove time) Is this due to software raid, or is it the Intel 710? To figure that out I repeated the test, but now removing the OCZ; latency graph at: http://imgur.com/DQa59 (The 12 seconds maximum was at disk remove time.) So the Intel 710 kind of sucks latency wise. Is it because it is also heavily reading, and maybe WAL should not be put on it? I did another test, same as before but * with 5GB database completely fitting in RAM (24GB) * put WAL on a ramdisk * started on the mirror * during the test mdadm --fail on the Intel SSD Latency graph is at: http://imgur.com/dY0Rk So still: with the Intel 710 participating in writes (beginning of graph), some latencies are over 2 seconds; with only the OCZ, max write latencies are near 300ms. I'm now contemplating not using the 710 at all. Why should I not buy two 6Gbps SSDs without supercap (e.g. Intel 510 and OCZ Vertex 3 Max IOPS) with an IO controller+BBU? Benefits: should be faster for all kinds of reads and writes. Concerns: TRIM becomes impossible (which was already impossible with md raid1; lvm / dm based mirroring could work), but is TRIM important for a PostgreSQL io load, without e.g. routine TRUNCATEs? Also the write endurance of these drives is probably a lot less than the previous setup. Thoughts, ideas are highly appreciated! -- Yeb PS: I checked for proper alignment of partitions as well as md's data offset; all was well. Ext4 filesystem mounted with barrier=0. /proc/sys/vm/dirty_background_bytes set to 17850
Re: [PERFORM] BBU still needed with SSD?
On 2011-07-19 13:37, Florian Weimer wrote: Is this "Total_LBAs_Written"? I got the same name "Total_LBAs_Written" on a 5.39 smartmontools, which was renamed to 241 Lifetime_Writes_GiB after an upgrade to 5.42. Note that this is smartmontools' new interpretation of the values, which happens to match the OCZ tool's interpretation (241: SSD Lifetime writes from host Number of bytes written to SSD: 448 G). So for the Intels it's probably also lifetime writes in GB, but you'd have to check with an Intel smart values reader to be absolutely sure. The values appear to be far too low: 241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 188276 242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 116800 241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 189677 242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 92509 Hmm, that would mean 188TB written. Does that value seem right for your use case? If you'd write 100MB/s sustained, it would take 22 days to reach 188TB. The second set of numbers is from the drive which wears more quickly. It's strange that there's such a large difference in lifetime left, when lifetime writes are so similar. Maybe there are more small md metadata updates on the second disk, but without digging into md's internals it's impossible to say anything constructive about it. Off-topic: new cool tool in smartmontools-5.4x: /usr/sbin/update-smart-drivedb :-) -- Yeb Havinga http://www.mgrid.net/ Mastering Medical Data
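The "22 days" sanity check in the post is worth spelling out: if the raw Total_LBAs_Written value of 188276 is really lifetime gigabytes, a drive writing a sustained 100 MB/s would hit that total in about three weeks, which is why the values look too low for months of uptime.

```python
# Sanity check on the attribute interpretation: raw value 188276,
# read as lifetime GB, at a sustained 100 MB/s write rate.
written_bytes = 188276 * 10**9     # treat the raw SMART value as GB
rate_bytes_s = 100 * 10**6         # 100 MB/s sustained
days = written_bytes / rate_bytes_s / 86400
print(round(days))                 # 22
```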
Re: [PERFORM] BBU still needed with SSD?
On 2011-07-19 12:47, Florian Weimer wrote: It would be interesting to see if the drives also show total xyz written, and if that differs a lot too. Do you know how to check that with smartctl? smartctl -a /dev/ should show all values. If it shows something that looks like garbage, it means that the database of smartmontools doesn't have the correct information yet for these new drives. I know that for the recent OCZ Vertex 2 and 3 SSDs you need at least 5.40 or 5.41, and that's pretty new stuff. (I just happened to install Fedora 15 today and that has smartmontools 5.41, whereas e.g. Scientific Linux 6 has 5.39.) -- Yeb Havinga http://www.mgrid.net/ Mastering Medical Data
Re: [PERFORM] BBU still needed with SSD?
On 2011-07-19 09:56, Florian Weimer wrote: * Yeb Havinga: The biggest drawback of 2 SSD's with supercap in hardware raid 1, is that if they are both new and of the same model/firmware, they'd probably reach the end of their write cycles at the same time, thereby failing simultaneously. I thought so too, but I've got two Intel 320s (I suppose; the reported device model is "SSDSA2CT040G3") in a RAID 1 configuration, and after about a month of testing, one is down to 89 on the media wearout indicator, and the other is still at 96. Both devices are deteriorating, but one at a significantly faster rate. That's great news if this turns out to be generally true. Is it on mdadm software raid? I searched a bit in the mdadm manual for reasons this can be the case. It isn't the occasional check (echo check > /sys/block/md0/md/sync_action), since that seems to do two reads and compare. Another idea was that the layout of the mirror might be different between the disks, but the manual says that the --layout configuration directive is only for RAID 5, 6 and 10, not RAID 1. Then my eye caught --write-behind, the maximum number of outstanding writes; it has a non-zero default value, but is only used if a drive is marked write-mostly. Maybe it is caused by the initial build of the array? But then a 7% difference seems like an awful lot. It would be interesting to see if the drives also show total xyz written, and if that differs a lot too. regards, Yeb Havinga
Re: [PERFORM] BBU still needed with SSD?
On 2011-07-18 03:43, Andy wrote: Hi, Is a BBU still needed with SSDs? An SSD has its own cache, and in certain models such as the Intel 320 that cache is backed by capacitors. So in a sense that cache acts as a BBU that's backed by capacitors instead of batteries. In this case is a BBU still needed? If I put 2 SSDs (with supercap) in software RAID 1, would that be any slower than 2 SSDs in HW RAID 1 with BBU? What are the pros and cons? The biggest drawback of 2 SSD's with supercap in hardware raid 1 is that if they are both new and of the same model/firmware, they'd probably reach the end of their write cycles at the same time, thereby failing simultaneously. You'd have to start the software raid setup with two SSD's that have different amounts of life remaining. regards, Yeb
Re: [PERFORM] Contemplating SSD Hardware RAID
On 2011-06-21 22:10, Yeb Havinga wrote: There's some info buried in http://archives.postgresql.org/pgsql-performance/2011-03/msg00350.php where two Vertex 2 Pro's are compared; the first has been really hammered with pgbench, the second had a few months duty in a workstation. The raw value of SSD Available Reserved Space seems to be a good candidate to watch to go to 0, since the pgbenched drive has 16GB left and the workstation disk 17GB. Would be cool to graph with e.g. symon (http://i.imgur.com/T4NAq.png) I forgot to mention that both the newest firmware of the drives as well as svn versions of smartmontools are advisable, before figuring out what all those strange values mean. It's too bad however that OCZ doesn't let the user choose which firmware to run (the tool always picks the newest), so after every upgrade it'll be a surprise which values are supported, or whether any of the values are reset or interpreted differently. Even though disks in production might not be upgraded eagerly, replacing a faulty drive means that the new one probably needs to be upgraded first, and it would be nice to have a uniform smart value readout for the monitoring tools. -- Yeb Havinga http://www.mgrid.net/ Mastering Medical Data
Re: [PERFORM] Contemplating SSD Hardware RAID
to the host for all data access: 0
194: SSD Temperature Monitoring Current: 128 High: 129 Low: 127
195: SSD ECC On-the-fly Count Normalized Rate: 100
196: SSD Reallocation Event Count Total number of reallocated Flash blocks: 0
201: SSD Uncorrectable Soft Read Error Rate Normalized Rate: 100
204: SSD Soft ECC Correction Rate (RAISE) Normalized Rate: 100
230: SSD Life Curve Status Current state of drive operation based upon the Life Curve: 100
231: SSD Life Left Approximate SSD life Remaining: 100%
241: SSD Lifetime writes from host Number of bytes written to SSD: 162 GB
242: SSD Lifetime reads from host Number of bytes read from SSD: 236 GB
There's some info buried in http://archives.postgresql.org/pgsql-performance/2011-03/msg00350.php where two Vertex 2 Pro's are compared; the first has been really hammered with pgbench, the second had a few months duty in a workstation. The raw value of SSD Available Reserved Space seems to be a good candidate to watch to go to 0, since the pgbenched drive has 16GB left and the workstation disk 17GB. Would be cool to graph with e.g. symon (http://i.imgur.com/T4NAq.png) -- Yeb Havinga http://www.mgrid.net/ Mastering Medical Data
Re: [PERFORM] Contemplating SSD Hardware RAID
On 2011-06-21 09:51, Yeb Havinga wrote: On 2011-06-21 08:33, Greg Smith wrote: On 06/20/2011 11:54 PM, Dan Harris wrote: I'm exploring the combination of an Areca 1880ix-12 controller with 6x OCZ Vertex 3 V3LT-25SAT3 2.5" 240GB SATA III drives in RAID-10. Has anyone tried this combination? What nasty surprise am I overlooking here? I forgot to mention that with an SSD it's important to watch the remaining lifetime. These values can be read with smartctl. When putting the disk behind a hardware raid controller, you might not be able to read them from the OS, and the hardware RAID firmware might be too old to know about the SSD lifetime indicator, or might not show it at all. -- Yeb Havinga http://www.mgrid.net/ Mastering Medical Data
Re: [PERFORM] Contemplating SSD Hardware RAID
On 2011-06-21 08:33, Greg Smith wrote: On 06/20/2011 11:54 PM, Dan Harris wrote: I'm exploring the combination of an Areca 1880ix-12 controller with 6x OCZ Vertex 3 V3LT-25SAT3 2.5" 240GB SATA III drives in RAID-10. Has anyone tried this combination? What nasty surprise am I overlooking here? You can expect database corruption the first time something unexpected interrupts the power to the server. That's nasty, but it's not surprising - that's well documented as what happens when you run PostgreSQL on hardware with this feature set. You have to get a Vertex 3 Pro to get one of the reliable 3rd gen designs from them with a supercap. (I don't think those are even out yet though) We've had reports here of the earlier Vertex 2 Pro being fully stress tested and working out well. I wouldn't even bother with a regular Vertex 3, because I don't see any reason to believe it could be reliable for database use, just like the Vertex 2 failed to work in that role. I've tested the Vertex 2, Vertex 2 Pro and Vertex 3. The Vertex 3 Pro is not yet available. The Vertex 3 I tested with pgbench didn't outperform the Vertex 2 (yes, it was attached to a SATA III port). Also, the Vertex 3 didn't work in my designated system until a firmware upgrade that became available ~2.5 months after I purchased it. The support call I had with OCZ failed to mention it, and by pure coincidence, when I did some more testing at a later time, I ran the firmware upgrade tool (which kind of hides which firmwares are available, if any) and it did an update; after that it was compatible with the designated motherboard. Another disappointment was that after I had purchased the Vertex 3 drive, OCZ announced a max-iops Vertex 3. Did that actually mean I had bought an inferior version? Talk about a bad out-of-the-box experience. -1 OCZ fan boy. When putting such an SSD up for database use I'd only consider a Vertex 2 Pro (for the supercap), paired with another SSD of a different brand with supercap (i.e. the recent Intels). When this is done on a motherboard with > 1 sata controller, you'd have controller redundancy and can also survive single drive failures when a drive wears out. Having two different SSD versions decreases the chance of both wearing out at the same time, and makes you a bit more resilient against firmware bugs. It would be great if there was yet another supercapped SSD brand, with a modified md software raid that reads all three drives at once and compares results, instead of the occasional check. If at least two drives agree on the contents, return the data. -- Yeb Havinga http://www.mgrid.net/ Mastering Medical Data
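The 2-of-3 read voting proposed at the end of that message can be sketched as a decision rule; nothing like this exists in md, so this is purely a toy illustration of the idea: read the same logical block from all three mirrors and return the content at least two drives agree on.

```python
# Toy sketch of 2-of-3 read voting across three mirrored drives.
# This is NOT an existing md feature - just the proposed decision rule.
from collections import Counter

def voted_read(blocks):
    """blocks: the same logical block as read from each of 3 drives."""
    winner, votes = Counter(blocks).most_common(1)[0]
    if votes >= 2:
        return winner           # at least two drives agree
    raise IOError("no two drives agree on this block")

good = b"payload"
assert voted_read([good, good, b"bitrot!"]) == good   # masks one bad drive
assert voted_read([good, good, good]) == good
```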
Re: [PERFORM] Benchmarking a large server
On 2011-05-09 22:32, Chris Hoover wrote: The issue we are running into is how do we benchmark this server, specifically, how do we get valid benchmarks for the Fusion IO card? Normally to eliminate the cache effect, you run iozone and other benchmark suites at 2x the ram. However, we can't do that due to 2TB > 1.3TB. So, does anyone have any suggestions/experiences in benchmarking storage when the storage is smaller than 2x memory? Oracle's Orion test tool has a configurable cache size parameter - it's a separate download and specifically written to benchmark database oltp and olap like io patterns, see http://www.oracle.com/technetwork/topics/index-089595.html -- Yeb Havinga http://www.mgrid.net/ Mastering Medical Data
Re: [PERFORM] Intel SSDs that may not suck
c_Soft_Read_Err_Rate        0x001c 120 120 000 Old_age  Offline - 0/0
204 Soft_ECC_Correct_Rate   0x001c 120 120 000 Old_age  Offline - 0/0
230 Life_Curve_Status       0x0013 100 100 000 Pre-fail Always  - 100
231 SSD_Life_Left           0x0013 100 100 010 Pre-fail Always  - 0
232 Available_Reservd_Space 0x     000 000 010 Old_age  Offline FAILING_NOW 17
233 SandForce_Internal      0x     000 000 000 Old_age  Offline - 128
234 SandForce_Internal      0x0032 000 000 000 Old_age  Always  - 448
235 SuperCap_Health         0x0033 100 100 010 Pre-fail Always  - 0
241 Lifetime_Writes_GiB     0x0032 000 000 000 Old_age  Always  - 448
242 Lifetime_Reads_GiB      0x0032 000 000 000 Old_age  Always  - 192

SMART Error Log not supported
SMART Self-test Log not supported
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

-- Yeb Havinga http://www.mgrid.net/ Mastering Medical Data
Re: [PERFORM] CPUs for new databases
Scott Marlowe wrote: There was an earlier thread with Greg and I in it where we posted the memory bandwidth numbers for that machine and it was insane how much data all 48 cores could pump into / out of memory at the same time.

Yeah, it was insane. Building an economical 'that generation Opteron' database server has been on my wishlist since that thread; my current favorite is the 8-core 6128 Opteron, for $275 at Newegg: http://www.newegg.com/Product/Product.aspx?Item=N82E16819105266

Ah, might as well drop the whole config on my wishlist as well:
* 2x that 8-core processor
* Supermicro H8DGU-F motherboard - 16 DIMM slots, dual socket, dual Intel ethernet and an additional ethernet for IPMI
* 2x KVR1333D3D4R9SK8/32G memory - 4GB DIMMs seem to be at the GB/$ sweet spot at the moment for DDR3
* 1x OCZ Vertex 2 Pro 100GB (there was a thread about this SandForce disk as well: an SSD with a supercap that acts as battery backup)
* maybe another one or two spindled 2.5" drives for archive/backup
* Supermicro 113TQ-563UB chassis

At the time I looked this up, I could buy it for just over €3000.

regards Yeb Havinga

PS: I'm in no way involved with either of the manufacturers, nor one of their fanboys. I'm just interested, like the OP, in what is good hardware/config for a PG-related server.
Re: [PERFORM] Performance on new 64bit server compared to my 32bit desktop
Scott Marlowe wrote: On Tue, Aug 31, 2010 at 6:41 AM, Yeb Havinga wrote: export OMP_NUM_THREADS=4 Then I get the following. The rather weird dip at 5 threads is consistent over multiple tries: I get similar dips on my server. Especially as you make the stream test write a large enough chunk of data to outrun its caches. See attached png.

Interesting graph, especially since the overall impression is a linear-like increase in memory bandwidth when more cores are active. Just curious, what is the 8-core CPU?

-- Yeb
Re: [PERFORM] Performance on new 64bit server compared to my 32bit desktop
Jose Ildefonso Camargo Tolosa wrote: Ok, this could explain the huge difference. I was planning on getting a GigaByte GA-890GPA-UD3H, with a Phenom II X6 and that RAM: Crucial CT2KIT25664BA1339, Crucial BL2KIT25664FN1608, or something better I find when I get enough money (depending on my budget at the moment).

Why not pair an 8-core Magny-Cours ($280 at Newegg, http://www.newegg.com/Product/Product.aspx?Item=N82E16819105266) with a Supermicro ATX board, http://www.supermicro.com/Aplus/motherboard/Opteron6100/SR56x0/H8SGL-F.cfm ($264 at Newegg, http://www.newegg.com/Product/Product.aspx?Item=N82E16813182230&Tpk=H8SGL-F), and some memory?

regards, Yeb Havinga
Re: [PERFORM] Performance on new 64bit server compared to my 32bit desktop
Greg Smith wrote: Yeb Havinga wrote: model name : AMD Phenom(tm) II X4 940 Processor @ 3.00GHz cpu cores : 4 stream compiled with -O3 Function Rate (MB/s) Avg time Min time Max time Triad: 5395.1815 0.0089 0.0089 0.0089 I'm not sure if Yeb's stream was compiled to use OpenMP correctly though, because I'm not seeing "Number of Threads" in his results. Here's what works for me: gcc -O3 -fopenmp stream.c -o stream And then you can set: export OMP_NUM_THREADS=4

Then I get the following. The rather weird dip at 5 threads is consistent over multiple tries:

Number of Threads requested = 1
Function     Rate (MB/s)   Avg time  Min time  Max time
Triad:       5378.7495     0.0089    0.0089    0.0090
Number of Threads requested = 2
Triad:       6596.1140     0.0073    0.0073    0.0073
Number of Threads requested = 3
Triad:       7033.9806     0.0069    0.0068    0.0069
Number of Threads requested = 4
Triad:       7007.2950     0.0069    0.0069    0.0069
Number of Threads requested = 5
Triad:       6553.8133     0.0074    0.0073    0.0074
Number of Threads requested = 6
Triad:       6803.6427     0.0071    0.0071    0.0071
Number of Threads requested = 7
Triad:       6895.6909     0.0070    0.0070    0.0071
Number of Threads requested = 8
Triad:       6931.3018     0.0069    0.0069    0.0070

Other info: DDR2 800MHz ECC memory, MB: 790FX chipset (Asus M4A78-E)

regards, Yeb Havinga
Re: [PERFORM] GPU Accelerated Sorting
Greg Smith wrote: This comes up every year or so. The ability of GPU offloading to help with sorting has to overcome the additional latency that comes from copying everything over to it and then getting all the results back. If you look at the typical types of sorting people see in PostgreSQL, it's hard to find ones that are a) big enough to benefit from being offloaded to the GPU like that, while also being b) not so bottlenecked on disk I/O that speeding up the CPU part matters. And if you need to sort something in that category, you probably just put an index on it instead and call it a day. If you made me make a list of things I'd think would be worthwhile to spend effort improving in PostgreSQL, this would be on the research list, but unlikely to even make my personal top 100 things that are worth fiddling with.

Related is 'Parallelizing query optimization' (http://www.vldb.org/pvldb/1/1453882.pdf), in which they actually experiment with PostgreSQL. Note that their target platform is a general-purpose CPU, not a SIMD GPU processor.

-- Yeb
Re: [PERFORM] Performance on new 64bit server compared to my 32bit desktop
Scott Marlowe wrote: On Mon, Aug 30, 2010 at 1:58 AM, Yeb Havinga wrote: four parallel r...@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# ./a.out & ./a.out & ./a.out & ./a.out You know you can just do "stream 4" to get 4 parallel streams right? Which version is that? The stream.c source contains no argc/argv usage, though Code/Versions/Experimental has a script called Parallel_jobs that spawns n processes. -- Yeb
Re: [PERFORM] Performance on new 64bit server compared to my 32bit desktop
Jose Ildefonso Camargo Tolosa wrote: Also, nowadays, Intel has better performance than AMD, at least when comparing Athlon 64 vs Core2, I'm still saving to get a Phenom II system in order to benchmark them and see how it goes (does anyone have one of these for testing?). r...@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 16 model : 4 model name : AMD Phenom(tm) II X4 940 Processor stepping: 2 cpu MHz : 3000.000 cache size : 512 KB physical id : 0 siblings: 4 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 5 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt bogomips: 6020.46 TLB size: 1024 4K pages clflush size: 64 cache_alignment : 64 address sizes : 48 bits physical, 48 bits virtual power management: ts ttp tm stc 100mhzsteps hwpstate stream compiled with -O3 r...@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# ./a.out - STREAM version $Revision: 5.9 $ - This system uses 8 bytes per DOUBLE PRECISION word. - Array size = 200, Offset = 0 Total memory required = 45.8 MB. Each test is run 10 times, but only the *best* time for each is used. - Printing one line per active thread - Your clock granularity/precision appears to be 1 microseconds. Each test below will take on the order of 5031 microseconds. (= 5031 clock ticks) Increase the size of the arrays if this shows that you are not getting at least 20 clock ticks per test. - WARNING -- The above is only a rough guideline. For best results, please be sure you know the precision of your system timer. 
-------------------------------------------------------------
Function     Rate (MB/s)   Avg time  Min time  Max time
Copy:        5056.0434     0.0064    0.0063    0.0064
Scale:       4950.4916     0.0065    0.0065    0.0065
Add:         5322.0173     0.0091    0.0090    0.0091
Triad:       5395.1815     0.0089    0.0089    0.0089
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

two parallel
r...@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# ./a.out & ./a.out

Function     Rate (MB/s)   Avg time  Min time  Max time
Copy:        2984.2741     0.0108    0.0107    0.0108
Scale:       2945.8261     0.0109    0.0109    0.0110
Add:         3282.4631     0.0147    0.0146    0.0149
Triad:       3321.2893     0.0146    0.0145    0.0148

Function     Rate (MB/s)   Avg time  Min time  Max time
Copy:        2981.4898     0.0108    0.0107    0.0108
Scale:       2943.3067     0.0109    0.0109    0.0109
Add:         3283.8552     0.0147    0.0146    0.0149
Triad:       3313.9634     0.0147    0.0145    0.0148

four parallel
r...@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# ./a.out & ./a.out & ./a.out & ./a.out

Function     Rate (MB/s)   Avg time  Min time  Max time
Copy:        1567.4880     0.0208    0.0204    0.0210
Scale:       1525.3401     0.0211    0.0210    0.0213
Add:         1739.7735     0.0279    0.0276    0.0282
Triad:       1763.4858     0.0274    0.0272    0.0276

Function     Rate (MB/s)   Avg time  Min time  Max time
Copy:        1559.0759     0.0208    0.0205    0.0210
Scale:       1536.2520     0.0211    0.0208    0.0212
Add:         1740.4503     0.0279    0.0276    0.0283
Triad:       1758.4951     0.0276    0.0273    0.0279

Function     Rate (MB/s)   Avg time  Min time  Max time
Copy:        1552.7271     0.0208    0.0206    0.0210
Scale:       1527.5275     0.0211    0.0209    0.0212
Add:         1737.9263     0.0279    0.0276    0.0282
Triad:       1757.3439     0.0276    0.0273    0.0278
-
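The effect shown above — per-process rate dropping as more copies contend for the same memory bus — can be reproduced crudely without STREAM. The sketch below is my own rough analogue, not a calibrated benchmark: it spawns N independent copy loops and sums their rates.

```python
# Crude analogue of running "./a.out & ./a.out & ..." in parallel:
# each worker repeatedly copies a buffer, and the per-worker MB/s are
# summed. Real bandwidth numbers need the C STREAM benchmark; this only
# demonstrates the contention effect.
import multiprocessing as mp
import time

def copy_worker(mb, reps, out):
    src = bytearray(mb << 20)             # mb megabytes of data
    t0 = time.perf_counter()
    for _ in range(reps):
        dst = bytes(src)                  # one read pass + one write pass
    out.put(2 * mb * reps / (time.perf_counter() - t0))  # MB/s moved

def aggregate_mbps(nproc, mb=32, reps=4):
    q = mp.Queue()
    workers = [mp.Process(target=copy_worker, args=(mb, reps, q))
               for _ in range(nproc)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sum(q.get() for _ in workers)

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"{n} parallel: {aggregate_mbps(n):8.0f} MB/s total")
```

Expect the total to grow much more slowly than the process count once the memory bus saturates, mirroring the two- and four-way runs above.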
Re: [PERFORM] write barrier question
Samuel Gendler wrote: When running pgbench on a db which fits easily into RAM (10% of RAM = -s 380), I see transaction counts a little less than 5K. When I go to 90% of RAM (-s 3420), transaction rate dropped to around 1000 (at a fairly wide range of concurrencies). At that point, I decided to investigate the performance impact of write barriers.

At 90% of RAM you're probably reading data as well, not only writing. Watching iostat -xk 1 or vmstat 1 during a test should confirm this. To find the maximum database size that fits comfortably in RAM you could try out http://github.com/gregs1104/pgbench-tools - my experience with it is that it takes less than 10 minutes to set up and run, and after some time you get rewarded with nice pictures! :-)

regards, Yeb Havinga
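For a back-of-the-envelope translation between RAM size and pgbench scale factor, something like the following helps. The ~15 MB per scale unit is an assumed rule of thumb, not a measured constant — verify against pg_database_size() on your own build:

```python
# Rough helper for picking a pgbench -s value that targets a fraction of
# RAM. Assumes ~15 MB of database per scale unit, which is only a rule
# of thumb -- check the real size with pg_database_size('test').
MB_PER_SCALE = 15

def scale_for(ram_gb, fraction, mb_per_scale=MB_PER_SCALE):
    """Scale factor so the pgbench db occupies `fraction` of `ram_gb`."""
    return int(ram_gb * 1024 * fraction / mb_per_scale)

# The post's numbers (-s 380 at 10% of RAM) are roughly consistent with
# a machine in the neighborhood of 57 GB of RAM:
print(scale_for(57, 0.10))   # close to the post's -s 380
print(scale_for(57, 0.90))   # in the same ballpark as -s 3420
```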
Re: [PERFORM] Completely un-tuned Postgresql benchmark results: SSD vs desktop HDD
Michael March wrote: If anyone is interested I just completed a series of benchmarks of stock PostgreSQL running on a normal HDD vs an SSD. If you don't want to read the post, the summary is that SSDs are 5 to 7 times faster than a 7200RPM HDD drive under a pgbench load. http://it-blog.5amsolutions.com/2010/08/performance-of-postgresql-ssd-vs.html Is this what everyone else is seeing?

I tested an SSD with a capacitor and posted conclusions here: http://archives.postgresql.org/pgsql-performance/2010-07/msg00449.php

regards, Yeb Havinga
Re: [PERFORM] Testing Sandforce SSD
Yeb Havinga wrote: Hannu Krosing wrote: Did it fit in shared_buffers, or system cache? Database was ~5GB, server has 16GB, shared buffers was set to 1920MB. I first noticed this several years ago, when doing a COPY to a large table with indexes took noticeably longer (2-3 times longer) when the indexes were in system cache than when they were in shared_buffers. I read this as a hint: try increasing shared_buffers. I'll redo the pgbench run with increased shared_buffers.

Shared buffers raised from 1920MB to 3520MB:

pgbench -v -l -c 20 -M prepared -T 1800 test
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 20
duration: 1800 s
number of transactions actually processed: 12971714
tps = 7206.244065 (including connections establishing)
tps = 7206.349947 (excluding connections establishing)

:-)
Re: [PERFORM] Testing Sandforce SSD
Hannu Krosing wrote: Did it fit in shared_buffers, or system cache? Database was ~5GB, server has 16GB, shared buffers was set to 1920MB. I first noticed this several years ago, when doing a COPY to a large table with indexes took noticeably longer (2-3 times longer) when the indexes were in system cache than when they were in shared_buffers. I read this as a hint: try increasing shared_buffers. I'll redo the pgbench run with increased shared_buffers. so the test is actually how fast the ssd can capture sequential WAL writes and fsync without barriers, mixed with an occasional checkpoint with random write IO on another partition). Since the WAL writing is the same for both block_size setups, I decided to compare random writes to a file of 5GB with Oracle's Orion tool: Are you sure that you are not writing full WAL pages? I'm not sure I understand this question. Do you have any stats on how much WAL is written for the 8kb and 4kb test cases? Would some iostat -xk 1 for each partition suffice? And for other disk I/O during the tests? Nonexistent.

regards, Yeb Havinga
Re: [PERFORM] Testing Sandforce SSD
Scott Marlowe wrote: On Mon, Aug 2, 2010 at 6:07 PM, Greg Smith wrote: Josh Berkus wrote: That doesn't make much sense unless there's some special advantage to a 4K blocksize with the hardware itself. Given that pgbench is always doing tiny updates to blocks, I wouldn't be surprised if switching to smaller blocks helps it in a lot of situations if one went looking for them. Also, as you point out, pgbench runtime varies wildly enough that 10% would need more investigation to really prove that means something. But I think Yeb has done plenty of investigation into the most interesting part here, the durability claims.

Please note that the 10% was on a slower CPU. On a more recent CPU the difference was 47%, based on tests that ran for an hour. That's why I absolutely agree with Merlin Moncure that more testing in this department is welcome, preferably by others, since after all I could be on the payroll of OCZ :-)

I looked a bit into Bonnie++ but fail to see how I could do a test that somehow matches the PostgreSQL setup during the pgbench tests (db that fits in memory, so the test is actually how fast the ssd can capture sequential WAL writes and fsync without barriers, mixed with an occasional checkpoint with random write IO on another partition). Since the WAL writing is the same for both block_size setups, I decided to compare random writes to a file of 5GB with Oracle's Orion tool:

=== 4K test summary ===
ORION VERSION 11.1.0.7.0
Commandline: -testname test -run oltp -size_small 4 -size_large 1024 -write 100
This maps to this test:
Test: test
Small IO size: 4 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Random IOs
Simulated Array Type: CONCAT
Write: 100%
Cache Size: Not Entered
Duration for each Data Point: 60 seconds
Small Columns:, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
Large Columns:, 0
Total Data Points: 21
Name: /mnt/data/5gb Size: 524288
1 FILEs found.
Maximum Small IOPS=86883 @ Small=8 and Large=0
Minimum Small Latency=0.01 @ Small=1 and Large=0

=== 8K test summary ===
ORION VERSION 11.1.0.7.0
Commandline: -testname test -run oltp -size_small 8 -size_large 1024 -write 100
This maps to this test:
Test: test
Small IO size: 8 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Random IOs
Simulated Array Type: CONCAT
Write: 100%
Cache Size: Not Entered
Duration for each Data Point: 60 seconds
Small Columns:, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
Large Columns:, 0
Total Data Points: 21
Name: /mnt/data/5gb Size: 524288
1 FILEs found.
Maximum Small IOPS=48798 @ Small=11 and Large=0
Minimum Small Latency=0.02 @ Small=1 and Large=0

Running the tests for longer helps a lot on reducing the noisy results. Also letting them run longer means that the background writer and autovacuum start getting involved, so the test becomes somewhat more realistic. Yes, that's why I did a lot of the TPC-B tests with -T 3600 so they'd run for an hour (also for the 4K vs 8K blocksize in postgres).

regards, Yeb Havinga
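One way to compare the two summaries above is to multiply peak IOPS by the I/O size, which puts the 4K and 8K runs on a common throughput scale:

```python
# Convert the Orion peak small-IO results above into MB/s so the 4K and
# 8K runs can be compared on throughput instead of raw IOPS.
def iops_to_mbps(iops, block_kb):
    return iops * block_kb / 1024

mbps_4k = iops_to_mbps(86883, 4)   # ~339 MB/s
mbps_8k = iops_to_mbps(48798, 8)   # ~381 MB/s
print(f"4K: {mbps_4k:.0f} MB/s, 8K: {mbps_8k:.0f} MB/s")
```

So although the 4K run posts nearly twice the IOPS, the 8K run actually moves somewhat more data per second.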
Re: [PERFORM] Testing Sandforce SSD
Merlin Moncure wrote: On Fri, Jul 30, 2010 at 11:01 AM, Yeb Havinga wrote: Postgres settings: 8.4.4 --with-blocksize=4 I saw about 10% increase in performance compared to 8KB blocksizes. That's very interesting -- we need more testing in that department...

Definitely - that 10% number was on the older hardware (the Core 2 E6600). After reading my post and the 185MBps with 18500 reads/s number I was a bit suspicious whether I had done the tests on the new hardware with 4K, because 185MBps / 18500 reads/s is ~10KB per read, so I thought that's a lot closer to 8KB than 4KB. I checked with show block_size and it was 4K. Then I redid the tests on the new server with the default 8KB blocksize and got about 4700 tps (TPC-B/300)... 6700/4700 ≈ 1.43. So it seems that on newer hardware, the difference is larger than 10%.

regards, Yeb Havinga
Re: [PERFORM] Testing Sandforce SSD
Greg Smith wrote: Greg Smith wrote: Note that not all of the Sandforce drives include a capacitor; I hope you got one that does! I wasn't aware any of the SF drives with a capacitor on them were even shipping yet, all of the ones I'd seen were the chipset that doesn't include one still. Haven't checked in a few weeks though. Answer my own question here: the drive Yeb got was the brand spanking new OCZ Vertex 2 Pro, selling for $649 at Newegg for example: http://www.newegg.com/Product/Product.aspx?Item=N82E16820227535 and with the supercapacitor listed right in the main product specifications there. This is officially the first inexpensive (relatively) SSD with a battery-backed write cache built into it. If Yeb's test results prove it works as it's supposed to under PostgreSQL, I'll be happy to finally have a moderately priced SSD I can recommend to people for database use. And I fear I'll be out of excuses to avoid buying one as a toy for my home system.

Hello list,

After a week of testing I think I can answer the question above: does it work like it's supposed to under PostgreSQL? YES

The drive I have tested is the $435 50GB OCZ Vertex 2 Pro, http://www.newegg.com/Product/Product.aspx?Item=N82E16820227534

* it is safe to mount filesystems with barrier off, since it has a 'supercap backed cache'. That data is not lost is confirmed by a dozen power-switch-off tests while running either diskchecker.pl or pgbench.
* the above implies it's also safe to use this SSD with barriers, though that will perform less well, since this drive obeys write-through commands.
* the highest pgbench tps number for the TPC-B test for a scale 300 database (~5GB) I could get was over 6700. Judging from the iostat average util of ~40% on the xlog partition, I believe that this number is limited by other factors than the SSD, like CPU, core count, core MHz, memory size/speed, and 8.4 pgbench without threads. Unfortunately I don't have faster or more-core machines available for testing right now.
* pgbench numbers for a larger-than-RAM database, read only, were over 25000 tps (details are at the end of this post), during which iostat reported ~18500 read iops and 100% utilization.
* pgbench max reported latencies are 20% of comparable BBWC setups.
* how reliable it is over time, and how it performs over time, I cannot say, since I tested it only for a week.

regards, Yeb Havinga

PS: of course all claims I make here are without any warranty. All information in this mail is for reference purposes; I do not claim it is suitable for your database setup.

Some info on configuration:
BOOT_IMAGE=/boot/vmlinuz-2.6.32-22-server elevator=deadline
quad core AMD Phenom(tm) II X4 940 Processor at 3.0GHz
16GB RAM 667MHz DDR2

Disk/filesystem settings:
Model Family: OCZ Vertex SSD
Device Model: OCZ VERTEX2-PRO
Firmware Version: 1.10

hdparm: did not change standard settings: write cache is on, as well as readahead.
hdparm -AW /dev/sdc
/dev/sdc:
look-ahead = 1 (on)
write-caching = 1 (on)

Untuned ext4 filesystem. Mount options:
/dev/sdc2 on /data type ext4 (rw,noatime,nodiratime,relatime,barrier=0,discard)
/dev/sdc3 on /xlog type ext4 (rw,noatime,nodiratime,relatime,barrier=0,discard)

Note the -o discard: this means use of automatic SSD trimming on a new Linux kernel. Also, per core per filesystem there now is a [ext4-dio-unwrit] process - which suggests something like 'directio'? I haven't investigated this any further.
Sysctl (copied from a larger RAM database machine):
kernel.core_uses_pid = 1
fs.file-max = 327679
net.ipv4.ip_local_port_range = 1024 65000
kernel.msgmni = 2878
kernel.msgmax = 8192
kernel.msgmnb = 65536
kernel.sem = 250 32000 100 142
kernel.shmmni = 4096
kernel.sysrq = 1
kernel.shmmax = 33794121728
kernel.shmall = 16777216
net.core.rmem_default = 262144
net.core.rmem_max = 2097152
net.core.wmem_default = 262144
net.core.wmem_max = 262144
fs.aio-max-nr = 3145728
vm.swappiness = 0
vm.dirty_background_ratio = 3
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
vm.dirty_ratio = 15

Postgres settings: 8.4.4 --with-blocksize=4 I saw about 10% increase in performance compared to 8KB blocksizes.

Postgresql.conf: changed from the default config are:
maintenance_work_mem = 480MB # pgtune wizard 2010-07-25
checkpoint_completion_target = 0.9 # pgtune wizard 2010-07-25
effective_cache_size = 5632MB # pgtune wizard 2010-07-25
work_mem = 512MB # pgtune wizard 2010-07-25
wal_buffers = 8MB # pgtune wizard 2010-07-25
checkpoint_segments = 128 # pgtune said 16 here
shared_buffers = 1920MB # pgtune wizard 2010-07-25
max_connections = 100

initdb with data on sda2 and xlog on sda3, C locale

Read write test on ~5GB database:
$ pgbench -v -c 20 -M prepared -T 3600 test
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prep
Re: [PERFORM] Testing Sandforce SSD
Yeb Havinga wrote: Michael Stone wrote: On Mon, Jul 26, 2010 at 03:23:20PM -0600, Greg Spiegelberg wrote: I know I'm talking development now but is there a case for a pg_xlog block device to remove the file system overhead and guaranteeing your data is written sequentially every time? If you dedicate a partition to xlog, you already get that in practice with no extra development. Due to the LBA remapping of the SSD, I'm not sure if putting files that are sequentially written in a different partition (together with e.g. tables) would make a difference: in the end the SSD will have a set of new blocks in its buffer and somehow arrange them into sets of 128KB or 256KB writes for the flash chips. See also http://www.anandtech.com/show/2899/2 But I ran out of ideas to test, so I'm going to test it anyway.

Same machine config as mentioned before, with data and xlog on separate partitions, ext3 with barrier off (safe on this SSD).

pgbench -c 10 -M prepared -T 3600 -l test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 10
duration: 3600 s
number of transactions actually processed: 10856359
tps = 3015.560252 (including connections establishing)
tps = 3015.575739 (excluding connections establishing)

This is about 25% faster than data and xlog combined on the same filesystem. Below is output from iostat -xk 1 -p /dev/sda, which shows per-second per-partition statistics. sda2 is data, sda3 is xlog. In the third second a checkpoint seems to start.
avg-cpu:  %user %nice %system %iowait %steal %idle
          63.50  0.00   30.50    2.50   0.00   3.50

Device:  rrqm/s   wrqm/s    r/s     w/s      rkB/s   wkB/s     avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00  6518.00  36.00  2211.00   148.00  35524.00    31.75     0.28     0.12   0.11  25.00
sda1       0.00     2.00   0.00     5.00     0.00    636.00   254.40     0.03     6.00   2.00   1.00
sda2       0.00   218.00  36.00    40.00   148.00   1032.00    31.05     0.00     0.00   0.00   0.00
sda3       0.00  6298.00   0.00  2166.00     0.00  33856.00    31.26     0.25     0.12   0.12  25.00

avg-cpu:  %user %nice %system %iowait %steal %idle
          60.50  0.00   37.50    0.50   0.00   1.50

Device:  rrqm/s   wrqm/s    r/s     w/s      rkB/s   wkB/s     avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00  6514.00  33.00  2283.00   140.00  35188.00    30.51     0.32     0.14   0.13  29.00
sda1       0.00     0.00   0.00     3.00     0.00     12.00     8.00     0.00     0.00   0.00   0.00
sda2       0.00     0.00  33.00     2.00   140.00      8.00     8.46     0.03     0.86   0.29   1.00
sda3       0.00  6514.00   0.00  2278.00     0.00  35168.00    30.88     0.29     0.13   0.13  29.00

avg-cpu:  %user %nice %system %iowait %steal %idle
          33.00  0.00   34.00   18.00   0.00  15.00

Device:  rrqm/s   wrqm/s    r/s     w/s      rkB/s   wkB/s     avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00  3782.00   7.00  7235.00    28.00  44068.00    12.18    69.52     9.46   0.09  62.00
sda1       0.00     0.00   0.00     1.00     0.00      4.00     8.00     0.00     0.00   0.00   0.00
sda2       0.00   322.00   7.00  6018.00    28.00  25360.00     8.43    69.22    11.33   0.08  47.00
sda3       0.00  3460.00   0.00  1222.00     0.00  18728.00    30.65     0.30     0.25   0.25  30.00

avg-cpu:  %user %nice %system %iowait %steal %idle
           9.00  0.00   36.00   22.50   0.00  32.50

Device:  rrqm/s   wrqm/s    r/s     w/s      rkB/s   wkB/s     avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00  1079.00   3.00     0.00    12.00  49060.00     8.83   120.64    10.95   0.08  86.00
sda1       0.00     2.00   0.00     2.00     0.00    320.00   320.00     0.12    60.00  35.00   7.00
sda2       0.00    30.00   3.00 10739.00    12.00  43076.00     8.02   120.49    11.30   0.08  83.00
sda3       0.00  1047.00   0.00   363.00     0.00   5640.00    31.07     0.03     0.08   0.08   3.00

avg-cpu:  %user %nice %system %iowait %steal %idle
          62.00  0.00   31.00    2.00   0.00   5.00

Device:  rrqm/s   wrqm/s    r/s     w/s      rkB/s   wkB/s     avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00  6267.00  51.00  2493.00   208.00  35040.00    27.71     1.80     0.71   0.12  31.00
sda1       0.00     0.00   0.00     3.00     0.00     12.00     8.00     0.00     0.00   0.00   0.00
sda2       0.00   123.00  51.00   344.00   208.00   1868.00    10.51     1.50     3.80   0.10   4.00
sda3       0.00  6144.00   0.00  2146.00     0.00  33160.00    30.90     0.30     0.14   0.14  30.00
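Eyeballing dumps like the one above gets tedious quickly; a small helper can pull out just the write throughput and utilization per partition. This is my own sketch — the column positions assume the 11-column extended layout iostat -xk prints above:

```python
# Extract wkB/s and %util for chosen devices from `iostat -xk` output.
# Column positions assume the extended layout shown above:
# Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
def parse_iostat(text, devices=("sda2", "sda3")):
    stats = {}
    for line in text.splitlines():
        cols = line.split()
        if len(cols) >= 12 and cols[0] in devices:
            stats[cols[0]] = {"wkB/s": float(cols[6]),
                              "util": float(cols[11])}
    return stats

sample = "sda3 0.00 6298.00 0.00 2166.00 0.00 33856.00 31.26 0.25 0.12 0.12 25.00"
print(parse_iostat(sample))  # {'sda3': {'wkB/s': 33856.0, 'util': 25.0}}
```

Feeding it a whole capture makes it easy to see, for instance, that the xlog partition (sda3) sits around 25-30% util outside checkpoints.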
Re: [PERFORM] Testing Sandforce SSD
Michael Stone wrote: On Mon, Jul 26, 2010 at 03:23:20PM -0600, Greg Spiegelberg wrote: I know I'm talking development now but is there a case for a pg_xlog block device to remove the file system overhead and guaranteeing your data is written sequentially every time? If you dedicate a partition to xlog, you already get that in practice with no extra development.

Due to the LBA remapping of the SSD, I'm not sure if putting files that are sequentially written in a different partition (together with e.g. tables) would make a difference: in the end the SSD will have a set of new blocks in its buffer and somehow arrange them into sets of 128KB or 256KB writes for the flash chips. See also http://www.anandtech.com/show/2899/2 But I ran out of ideas to test, so I'm going to test it anyway.

regards, Yeb Havinga
Re: [PERFORM] Slow query using the Cube contrib module.
Liviu Mirea-Ghiban wrote: My question is: Why is it doing a Bitmap Heap Scan / Recheck Cond? I've executed dozens of such queries and not once did the rechecking remove any rows. Is there any way to disable it, or do you have any other suggestions for optimizations (because I'm all out of ideas)?

It's probably because the index nodes store data values with a lossy compression, which means that the index scan returns more rows than wanted, and those in turn are filtered out by the rechecking. See the comments for the 'RECHECK' parameter of CREATE OPERATOR CLASS (http://www.postgresql.org/docs/8.4/static/sql-createopclass.html). It's unwise to alter this behaviour without taking a look at, or modifying, the underlying index implementation. The GiST index scan part could perhaps be made a bit faster by using a smaller blocksize, but I'm not sure if or how the recheck part can be improved. Maybe rewrite the top query to not do bitmap heap scans in subqueries or inner loops?

regards, Yeb Havinga
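The lossy-index-plus-recheck behaviour described above can be modelled with a toy example (plain Python, nothing to do with the actual GiST implementation): the "index" stores only a coarse grid cell per point, so an index scan returns a superset of matches that the recheck then filters against the exact values.

```python
# Toy model of a lossy index: each point is indexed only by the coarse
# grid cell it falls in (the "compressed" index entry). A range scan on
# the index returns every point whose cell overlaps the query box -- a
# superset -- and the recheck against exact coordinates drops the rest.
CELL = 10  # grid cell size: the lossiness of the "index"

def cell(p):
    return (p[0] // CELL, p[1] // CELL)

def index_scan(points, lo, hi):
    """Lossy scan: candidates whose cell overlaps the query box."""
    lo_c, hi_c = cell(lo), cell(hi)
    return [p for p in points
            if lo_c[0] <= cell(p)[0] <= hi_c[0]
            and lo_c[1] <= cell(p)[1] <= hi_c[1]]

def recheck(candidates, lo, hi):
    """Exact filter, like the Recheck Cond on the bitmap heap scan."""
    return [p for p in candidates
            if lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]]

points = [(3, 4), (8, 9), (12, 13), (55, 60)]
cand = index_scan(points, (0, 0), (5, 5))  # includes the false positive (8, 9)
exact = recheck(cand, (0, 0), (5, 5))      # only (3, 4) survives
print(cand, exact)
```

When the indexed values happen to align with the query boxes, the recheck removes nothing — which matches the poster's observation — but the executor still has to perform it, because the index alone cannot prove exactness.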
Re: [PERFORM] Testing Sandforce SSD
Yeb Havinga wrote: To get similar *average* performance results you'd need to put about 4 drives and a BBU into a server. The Please forget this question, I now see it in the mail I'm replying to. Sorry for the spam! -- Yeb
Re: [PERFORM] Testing Sandforce SSD
Greg Smith wrote: Yeb Havinga wrote: Please remember that particular graphs are from a read/write pgbench run on a bigger-than-RAM database that ran for some time (so with checkpoints), on a *single* $435 50GB drive without BBU raid controller. To get similar *average* performance results you'd need to put about 4 drives and a BBU into a server. The worst-case latency on that solution is pretty bad though, when a lot of random writes are queued up; I suspect that's where the SSD will look much better. By the way: if you want to run a lot more tests in an organized fashion, that's what http://github.com/gregs1104/pgbench-tools was written to do. That will spit out graphs by client and by scale showing how sensitive the test results are to each.

Got it, running the default config right now. When you say 'comparable to a small array' - could you give a ballpark figure for 'small'?

regards, Yeb Havinga

PS: Some update on the testing: I did some ext3, ext4, xfs, jfs and also ext2 tests on the just-in-memory read/write test (scale 300). No real winners or losers, though ext2 isn't really faster, and the manual need for fix (y) during boot makes it impractical in its standard configuration. I did some poweroff tests with barriers explicitly off in ext3, ext4 and xfs; still all recoveries went ok.
Re: [PERFORM] Testing Sandforce SSD
Matthew Wakeling wrote: Apologies, I was interpreting the graph as the latency of the device, not all the layers in-between as well. There isn't any indication in the email with the graph as to what the test conditions or software are. That info was in the email preceding the graph mail, but I see now I forgot to mention it was an 8.4.4 Postgres version. regards, Yeb Havinga
Re: [PERFORM] Testing Sandforce SSD
Matthew Wakeling wrote: On Sun, 25 Jul 2010, Yeb Havinga wrote: Graph of TPS at http://tinypic.com/r/b96aup/3 and latency at http://tinypic.com/r/x5e846/3 Does your latency graph really have milliseconds as the y axis? Yes If so, this device is really slow - some requests have a latency of more than a second! I try to just give the facts. Please remember that those particular graphs are from a read/write pgbench run on a bigger-than-RAM database that ran for some time (so with checkpoints), on a *single* $435 50GB drive without a BBU raid controller. Also, this is a picture with a few million points: the ones above 200ms are perhaps a hundred, and hence make up a very small fraction. So far I'm pretty impressed with this drive. Let's be fair to OCZ and the SandForce guys and not shoot from the hip with claims like "really slow" unless they're backed by a graphed pgbench run together with its cost, so we can compare numbers with numbers. regards, Yeb Havinga
Re: [PERFORM] Testing Sandforce SSD
Yeb Havinga wrote: Greg Smith wrote: Put it on ext3, toggle on noatime, and move on to testing. The overhead of the metadata writes is the least of the problems when doing write-heavy stuff on Linux. I ran a pgbench run and power failure test during pgbench with a 3 year old computer

On the same config, more tests: scale 10 read-only and read/write tests. Note: only 240 s.

starting vacuum...end.
transaction type: SELECT only
scaling factor: 10
query mode: prepared
number of clients: 10
duration: 240 s
number of transactions actually processed: 8208115
tps = 34197.109896 (including connections establishing)
tps = 34200.658720 (excluding connections establishing)

y...@client45:~$ pgbench -c 10 -l -M prepared -T 240 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: prepared
number of clients: 10
duration: 240 s
number of transactions actually processed: 809271
tps = 3371.147020 (including connections establishing)
tps = 3371.518611 (excluding connections establishing)

-- scale 300 (just fits in RAM) read-only and read/write tests

pgbench -c 10 -M prepared -T 300 -S test
starting vacuum...end.
transaction type: SELECT only
scaling factor: 300
query mode: prepared
number of clients: 10
duration: 300 s
number of transactions actually processed: 9219279
tps = 30726.931095 (including connections establishing)
tps = 30729.692823 (excluding connections establishing)

The test above doesn't really test the drive but shows the CPU/RAM limit.

pgbench -c 10 -l -M prepared -T 3600 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 10
duration: 3600 s
number of transactions actually processed: 8838200
tps = 2454.994217 (including connections establishing)
tps = 2455.012480 (excluding connections establishing)

-- scale 2000

pgbench -c 10 -M prepared -T 300 -S test
starting vacuum...end.
transaction type: SELECT only
scaling factor: 2000
query mode: prepared
number of clients: 10
duration: 300 s
number of transactions actually processed: 755772
tps = 2518.547576 (including connections establishing)
tps = 2518.762476 (excluding connections establishing)

So the test above tests the random seek performance. Iostat on the drive showed a steady just over 4000 read io's/s:

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          11.39   0.00    13.37    60.40    0.00  14.85

Device:  rrqm/s  wrqm/s      r/s   w/s     rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00    0.00  4171.00  0.00  60624.00   0.00     29.07     11.81   2.83   0.24 100.00
sdb        0.00    0.00     0.00  0.00      0.00   0.00      0.00      0.00   0.00   0.00   0.00

pgbench -c 10 -l -M prepared -T 24000 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 2000
query mode: prepared
number of clients: 10
duration: 24000 s
number of transactions actually processed: 30815691
tps = 1283.979098 (including connections establishing)
tps = 1283.980446 (excluding connections establishing)

Note the duration of several hours. No long waits occurred - of this last test the latency png is at http://yfrog.com/f/0vlatencywp/ and the TPS graph at http://yfrog.com/f/b5tpsp/ regards, Yeb Havinga
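As a quick sanity check on the long run above: pgbench's reported tps should simply be transactions processed divided by duration. A small Python check, with the numbers copied from the output above:

```python
# Cross-check pgbench's reported tps against transactions / duration
# for the scale-2000 read/write run quoted above (24000 s run).
transactions = 30815691
duration_s = 24000

tps = transactions / duration_s
print(round(tps, 2))  # ~1283.99, matching pgbench's reported ~1283.98
```

This is the same arithmetic pgbench does internally (modulo connection-establishment accounting), which is why the "including" and "excluding" figures differ only slightly on long runs.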
Re: [PERFORM] Testing Sandforce SSD
Yeb Havinga wrote: 8GB DDR2 something.. (lots of details removed) Graph of TPS at http://tinypic.com/r/b96aup/3 and latency at http://tinypic.com/r/x5e846/3 Thanks http://www.westnet.com/~gsmith/content/postgresql/pgbench.htm for the gnuplot and psql scripts!
Re: [PERFORM] Testing Sandforce SSD
Greg Smith wrote: Put it on ext3, toggle on noatime, and move on to testing. The overhead of the metadata writes is the least of the problems when doing write-heavy stuff on Linux.

I ran a pgbench run and power failure test during pgbench with a 3 year old computer:

8GB DDR ?
Intel Core 2 duo 6600 @ 2.40GHz
Intel Corporation 82801IB (ICH9) 2 port SATA IDE Controller
64 bit 2.6.31-22-server (Ubuntu karmic), kernel option elevator=deadline
sysctl options besides increasing shm:
fs.file-max=327679
fs.aio-max-nr=3145728
vm.swappiness=0
vm.dirty_background_ratio = 3
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
vm.dirty_ratio = 15

Filesystem on SSD with postgresql data: ext3 mounted with noatime,nodiratime,relatime

Postgresql cluster: did initdb with C locale. Data and pg_xlog together on the same ext3 filesystem. Changed in postgresql.conf: settings with pgtune for OLTP and 15 connections:

maintenance_work_mem = 480MB # pgtune wizard 2010-07-25
checkpoint_completion_target = 0.9 # pgtune wizard 2010-07-25
effective_cache_size = 5632MB # pgtune wizard 2010-07-25
work_mem = 512MB # pgtune wizard 2010-07-25
wal_buffers = 8MB # pgtune wizard 2010-07-25
checkpoint_segments = 31 # pgtune said 16 here
shared_buffers = 1920MB # pgtune wizard 2010-07-25
max_connections = 15 # pgtune wizard 2010-07-25

Initialized with scale 800, which is about 12GB. I deliberately went beyond the in-RAM size for this machine (that would be ~ 5GB), so random reads would weigh in the result. Then let pgbench run the TPC-B-like benchmark with -M prepared, 10 clients and -T 3600 (one hour), and after that loaded the logfile in a db and did some queries. Then realized the pgbench result page was not in the screen buffer anymore so I cannot copy it here, but hey, those can be edited as well, right ;-)

select count(*), count(*)/3600, avg(time), stddev(time) from log;
  count  | ?column? |          avg          |     stddev
---------+----------+-----------------------+----------------
 4939212 |     1372 | 7282.8581978258880161 | 11253.96967962
(1 row)

Judging from the latencies in the logfiles I did not experience serious lagging (time is in microseconds):

select * from log order by time desc limit 3;
 client_id | tx_no |  time   | file_no |   epoch    | time_us
-----------+-------+---------+---------+------------+---------
         3 | 33100 | 1229503 |       0 | 1280060345 |  866650
         9 | 39990 | 1077519 |       0 | 1280060345 |  858702
         2 | 55323 | 1071060 |       0 | 1280060519 |  750861
(3 rows)

select * from log order by time desc limit 3 OFFSET 1000;
 client_id | tx_no  |  time  | file_no |   epoch    | time_us
-----------+--------+--------+---------+------------+---------
         5 | 262466 | 245953 |       0 | 1280062074 |  513789
         1 | 267519 | 245867 |       0 | 1280062074 |  513301
         7 | 273662 | 245532 |       0 | 1280062078 |  378932
(3 rows)

select * from log order by time desc limit 3 OFFSET 1;
 client_id | tx_no  | time  | file_no |   epoch    | time_us
-----------+--------+-------+---------+------------+---------
         5 | 123011 | 82854 |       0 | 1280061036 |  743986
         6 | 348967 | 82853 |       0 | 1280062687 |  776317
         8 | 439789 | 82848 |       0 | 1280063109 |  552928
(3 rows)

Then I started pgbench again with the same settings, let it run for a few minutes, and in another console did CHECKPOINT and then turned off the power. After restarting, the database recovered without a problem:

LOG: database system was interrupted; last known up at 2010-07-25 10:14:15 EDT
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at F/98008610
LOG: record with zero length at F/A2BAC040
LOG: redo done at F/A2BAC010
LOG: last completed transaction was at log time 2010-07-25 10:14:16.151037-04

regards, Yeb Havinga
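The aggregates in the query result above are internally consistent, which is a useful cross-check when reading pgbench logs. With the numbers copied from that result (one-hour run, 10 clients):

```python
# Sanity-check the pgbench log aggregates quoted above.
count = 4939212          # transactions in the one-hour log
duration_s = 3600
avg_time_us = 7282.858   # average per-transaction latency, microseconds

tps = count / duration_s
print(int(tps))                      # 1372, as the count(*)/3600 column shows

# With 10 clients and no think time, aggregate tps * average latency should
# come out near 10 "busy client seconds" per wall-clock second.
clients_busy = tps * avg_time_us / 1e6
print(round(clients_busy, 1))        # ~10.0
```

The second check (Little's law, essentially) is a quick way to verify a latency log and a tps figure belong to the same run.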
Re: [PERFORM] Testing Sandforce SSD
Yeb Havinga wrote: Yeb Havinga wrote: diskchecker: running 37 sec, 4.47% coverage of 500 MB (1468 writes; 39/s) Total errors: 0 :-) OTOH, I now notice the 39 writes/s .. If that means ~ 39 tps... bummer. When playing with it a bit more, I couldn't get the test_file to be created in the right place on the test system. It turns out I had the diskchecker config switched, and 39 writes/s was the speed of the not-rebooted system, sorry. I did several diskchecker.pl tests this time with the testfile on the SSD; none of the tests returned an error :-) Writes/s start low but quickly converge to a number in the range of 1200 to 1800. The writes diskchecker does are 16kB writes. Making these 4kB writes does not increase writes/s. 32kB seems a little less, 64kB is about two thirds of the initial writes/s, and 128kB is half. So no BBU speeds here for writes, but still a ~ factor 10 improvement in iops over a rotating SATA disk. regards, Yeb Havinga PS: hdparm showed the write cache was on. I did tests with both ext2 and xfs; the xfs tests I did with both barrier and nobarrier.
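The write-size scaling reported above can be turned into rough throughput figures. A small sketch - the ~1500 writes/s midpoint and the scaling factors are taken from the message; the derived MB/s values are just arithmetic, not measured numbers:

```python
# Rough throughput estimate from the diskchecker write rates quoted above.
# Baseline: ~1500 fsync'd writes/s at 16 kB (midpoint of the 1200-1800 range).
base_wps = 1500
scaling = {            # writes/s relative to the 16 kB baseline, as reported
    4:   1.0,          # 4 kB: no increase in writes/s
    16:  1.0,
    64:  2 / 3,        # about two thirds of baseline
    128: 0.5,          # about half of baseline
}

for size_kb in sorted(scaling):
    mb_per_s = base_wps * scaling[size_kb] * size_kb / 1024
    print(f"{size_kb:>3} kB writes: ~{mb_per_s:5.1f} MB/s")
```

The pattern - writes/s flat below 16 kB, then falling sub-linearly as size grows - is consistent with a per-fsync overhead dominating small writes, while larger writes start to be bandwidth-bound.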
Re: [PERFORM] Testing Sandforce SSD
Yeb Havinga wrote: diskchecker: running 37 sec, 4.47% coverage of 500 MB (1468 writes; 39/s) Total errors: 0 :-) OTOH, I now notice the 39 writes/s .. If that means ~ 39 tps... bummer.
Re: [PERFORM] Testing Sandforce SSD
Greg Smith wrote: Note that not all of the Sandforce drives include a capacitor; I hope you got one that does! I wasn't aware any of the SF drives with a capacitor on them were even shipping yet, all of the ones I'd seen were the chipset that doesn't include one still. Haven't checked in a few weeks though. I think I did, it was expensive enough, though while ordering it's very easy to order the wrong one, all names on the product category page look the same. (OCZ Vertex 2 Pro) * How to test for power failure? I've had good results using one of the early programs used to investigate this class of problems: http://brad.livejournal.com/2116715.html?page=2 A great tool, thanks for the link! diskchecker: running 34 sec, 4.10% coverage of 500 MB (1342 writes; 39/s) diskchecker: running 35 sec, 4.24% coverage of 500 MB (1390 writes; 39/s) diskchecker: running 36 sec, 4.35% coverage of 500 MB (1427 writes; 39/s) diskchecker: running 37 sec, 4.47% coverage of 500 MB (1468 writes; 39/s) didn't get 'ok' from server (11387 316950), msg=[] = Connection reset by peer at ./diskchecker.pl line 132. Here's where I removed the power and left it off for about a minute, then started again and did the verify: y...@a:~$ ./diskchecker.pl -s client45.eemnes verify test_file verifying: 0.00% Total errors: 0 :-) This was on ext2. * What filesystem to use on the SSD? To minimize writes and maximize chance for seeing errors I'd choose ext2 here. I don't consider there to be any reason to deploy any part of a PostgreSQL database on ext2. The potential for downtime if the fsck doesn't happen automatically far outweighs the minimal performance advantage you'll actually see in real applications. Hmm.. wouldn't that apply to other filesystems as well? I know that JFS also won't mount if booted unclean; it somehow needs a marker from the fsck. Don't know for ext3, xfs etc.
All of the benchmarks showing large gains for ext2 over ext3 I have seen have been synthetic, not real database performance; the internal ones I've run using things like pgbench do not show a significant improvement. (Yes, I'm already working on finding time to publicly release those findings) The reason I'd choose ext2 on the SSD was mainly to decrease the number of writes, not for performance. Maybe I should ultimately do tests for both journalled and ext2 filesystems and compare the amount of data written per x pgbench transactions. Put it on ext3, toggle on noatime, and move on to testing. The overhead of the metadata writes is the least of the problems when doing write-heavy stuff on Linux. Will surely do and post the results. thanks, Yeb Havinga
[PERFORM] Testing Sandforce SSD
Hello list, Probably like many others I've wondered why no SSD manufacturer puts a small BBU on an SSD drive. Triggered by Greg Smith's mail http://archives.postgresql.org/pgsql-performance/2010-02/msg00291.php here, and also anandtech's review at http://www.anandtech.com/show/2899/1 (see page 6 for pictures of the capacitor), I ordered a SandForce drive and this week it finally arrived. And now I have to test it, and was wondering about some things like: * How to test for power failure? I thought of running, on the same machine, a parallel pgbench setup on two clusters where one runs with data and wal on a rotating disk, the other on the SSD, both without BBU controller. Then turn off power. Do that a few times. The problem in this scenario is that even if the SSD showed no data loss while the rotating disk did for a few times, a dozen tests without failure isn't actually proof that the drive can write its complete buffer to disk after power failure. * How long should the power be turned off? A minute? 15 minutes? * What filesystem to use on the SSD? To minimize writes and maximize the chance of seeing errors I'd choose ext2 here. For the sake of not comparing apples with pears I'd have to go with ext2 on the rotating data disk as well. Do you guys have any more ideas to properly 'feel this disk at its teeth'? regards, Yeb Havinga
Re: [PERFORM] Query optimization problem
Hello Zotov, Somehow the equivalence d2.basedon=d1.id is not used in the slow query, probably because the equivalence constant would have to be used inside a non-base expression (the OR). You can see that the equivalence values *are* used by changing the OR to an AND and comparing both queries. The only thing you can do to guarantee the planner has all information in cases like this is to explicitly name the equivalence inside the OR expressions, e.g.

SELECT d1.ID, d2.ID FROM DocPrimary d1 JOIN DocPrimary d2 ON d2.BasedOn=d1.ID WHERE (d1.ID=234409763 and d2.basedon=234409763) or (d2.ID=234409763);

regards, Yeb Havinga PS: the analyze time of the slow query showed 0.007ms?

Zotov wrote: I have a query:

SELECT d1.ID, d2.ID FROM DocPrimary d1 JOIN DocPrimary d2 ON d2.BasedOn=d1.ID WHERE (d1.ID=234409763) or (d2.ID=234409763)

I think the QO (Query Optimizer) could make it faster (now it seq scans, and on a million records it takes 7 sec):

SELECT d1.ID, d2.ID FROM DocPrimary d1 JOIN DocPrimary d2 ON d2.BasedOn=d1.ID WHERE (d2.BasedOn=234409763) or (d2.ID=234409763)

-- Slow Query --
test=# EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS off) SELECT d1.ID, d2.ID
test-# FROM DocPrimary d1
test-# JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
test-# WHERE (d1.ID=234409763) or (d2.ID=234409763);

QUERY PLAN
Hash Join (cost=58.15..132.35 rows=2 width=8) (actual time=0.007..0.007 rows=0 loops=1)
  Output: d1.id, d2.id
  Hash Cond: (d2.basedon = d1.id)
  Join Filter: ((d1.id = 234409763) OR (d2.id = 234409763))
  -> Seq Scan on public.docprimary d2 (cost=0.00..31.40 rows=2140 width=8) (actual time=0.002..0.002 rows=0 loops=1)
     Output: d2.id, d2.basedon
  -> Hash (cost=31.40..31.40 rows=2140 width=4) (never executed)
     Output: d1.id
     -> Seq Scan on public.docprimary d1 (cost=0.00..31.40 rows=2140 width=4) (never executed)
        Output: d1.id

-- Fast Query --
test=# EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS off) SELECT d1.ID, d2.ID
test-# FROM DocPrimary d1
test-# JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
test-# 
WHERE (d2.BasedOn=234409763) or (d2.ID=234409763);

QUERY PLAN
Nested Loop (cost=8.60..58.67 rows=12 width=8) (actual time=0.026..0.026 rows=0 loops=1)
  Output: d1.id, d2.id
  -> Bitmap Heap Scan on public.docprimary d2 (cost=8.60..19.31 rows=12 width=8) (actual time=0.023..0.023 rows=0 loops=1)
     Output: d2.id, d2.basedon
     Recheck Cond: ((d2.basedon = 234409763) OR (d2.id = 234409763))
     -> BitmapOr (cost=8.60..8.60 rows=12 width=0) (actual time=0.018..0.018 rows=0 loops=1)
        -> Bitmap Index Scan on basedon_idx (cost=0.00..4.33 rows=11 width=0) (actual time=0.008..0.008 rows=0 loops=1)
           Index Cond: (d2.basedon = 234409763)
        -> Bitmap Index Scan on id_pk (cost=0.00..4.26 rows=1 width=0) (actual time=0.003..0.003 rows=0 loops=1)
           Index Cond: (d2.id = 234409763)
  -> Index Scan using id_pk on public.docprimary d1 (cost=0.00..3.27 rows=1 width=4) (never executed)
     Output: d1.id, d1.basedon
     Index Cond: (d1.id = d2.basedon)

PGver: PostgreSQL 9.0b x86
OS: Win7 x64

- Create table query: -
CREATE TABLE docprimary (
  id integer NOT NULL,
  basedon integer,
  CONSTRAINT id_pk PRIMARY KEY (id)
);
CREATE INDEX basedon_idx ON docprimary USING btree (basedon);
Re: [PERFORM] Two different execution plan for the same request
JOUANIN Nicolas (44) wrote: The strange thing is that this morning explain analyze now gives a much better duration: There were no modifications made to the database except a restart yesterday evening and a vacuumdb --analyze run at night. If the earlier bad query was run on a freshly imported database that was not ANALYZEd, that would explain the different, and likely bad, plan. If you want to know for sure this is the cause, instead of e.g. faulty hardware, you could verify by redoing the import + query without ANALYZE. regards, Yeb Havinga
Re: [PERFORM] Two different execution plan for the same request
JOUANIN Nicolas (44) wrote: Hi, I've trouble with some SQL requests which have different execution plans when run on two different servers. One server is the development environment, the other one is the pre-production env. Both servers run postgreSQL 8.3.0 on Linux and: - both databases contain the same data (pg_dump/pg_restore between servers) - instances have the same configuration parameters - vacuum and analyze are run every day. The only difference I can see is the hardware. The pre-production env. has more RAM, more CPU and a RAID5 disk array through LVM where data are stored. Hello Jouanin, Could you give some more information following the guidelines from http://wiki.postgresql.org/wiki/SlowQueryQuestions ? Essential are the contents of both conf files (comments may be removed). regards, Yeb Havinga
Re: [PERFORM] order by slowing down a query by 80 times
Rajesh Kumar Mallah wrote: Dear List, just by removing the order by co_name reduces the query time dramatically from ~ 9 sec to 63 ms. Can anyone please help. The 63 ms query result is probably useless since it returns a limit of 25 rows from an unordered result. It is not surprising that this is fast. The pain is here: Index Scan using profile_master_co_name on profile_master b (cost=0.00..1125295.59 rows=6968 width=25) (actual time=0.097..9193.154 rows=2212 loops=1) Filter: ((co_name IS NOT NULL) AND ((co_name_vec)::tsvector @@ to_tsquery('manufacturer'::text))) It looks like seq_scans are disabled, since the index scan has only a filter expression but not an index cond. regards, Yeb Havinga Regds Rajesh Kumar Mallah. explain analyze SELECT * from ( SELECT a.profile_id,a.userid,a.amount,a.category_id,a.catalog_id,a.keywords,b.co_name from general.catalogs a join general.profile_master using(profile_id) where 1=1 and co_name_vec @@ to_tsquery('manufacturer') and b.co_name is not null and a.ifmain is true ) as c order by co_name limit 25 offset 0; Limit (cost=0.00..3659.13 rows=25 width=129) (actual time=721.075..9241.105 rows=25 loops=1) -> Nested Loop (cost=0.00..1215772.28 rows=8307 width=476) (actual time=721.073..9241.050 rows=25 loops=1) -> Nested Loop (cost=0.00..1208212.37 rows=8307 width=476) (actual time=721.052..9240.037 rows=25 loops=1) -> Nested Loop (cost=0.00..1204206.26 rows=6968 width=472) (actual time=721.032..9239.516 rows=25 loops=1) -> Nested Loop (cost=0.00..1154549.19 rows=6968 width=471) (actual time=721.012..9236.523 rows=25 loops=1) -> Index Scan using profile_master_co_name on profile_master b (cost=0.00..1125295.59 rows=6968 width=25) (actual time=0.097..9193.154 rows=2212 loops=1) Filter: ((co_name IS NOT NULL) AND ((co_name_vec)::tsvector @@ to_tsquery('manufacturer'::text))) -> Index Scan using catalog_master_profile_id_fkindex on catalog_master (cost=0.00..4.19 rows=1 width=446) (actual time=0.016..0.016 rows=0 loops=2212) Index 
Cond: (catalog_master.profile_id = b.profile_id) Filter: ((catalog_master.hide IS FALSE) AND ((catalog_master.hosting_status)::text = 'ACTIVE'::text)) -> Index Scan using profile_master_profile_id_pkey on profile_master (cost=0.00..7.11 rows=1 width=9) (actual time=0.105..0.105 rows=1 loops=25) Index Cond: (profile_master.profile_id = catalog_master.profile_id) -> Index Scan using catalog_categories_pkey_catalog_id_category_id on catalog_categories (cost=0.00..0.56 rows=1 width=8) (actual time=0.014..0.015 rows=1 loops=25) Index Cond: (catalog_categories.catalog_id = catalog_master.catalog_id) Filter: (catalog_categories.ifmain IS TRUE) -> Index Scan using web_category_master_pkey on web_category_master (cost=0.00..0.90 rows=1 width=4) (actual time=0.034..0.036 rows=1 loops=25) Index Cond: (web_category_master.category_id = catalog_categories.category_id) Filter: ((web_category_master.status)::text = 'ACTIVE'::text) Total runtime: 9241.304 ms explain analyze SELECT * from ( SELECT a.profile_id,a.userid,a.amount,a.category_id,a.catalog_id,a.keywords,b.co_name from general.catalogs a join general.profile_master b using(profile_id) where 1=1 and co_name_vec @@ to_tsquery('manufacturer') and b.co_name is not null and a.ifmain is true ) as c limit 25 offset 0; QUERY PLAN -- Limit (cost=0.00..358.85 rows=25 width=476) (actual time=0.680..63.176 rows=25 loops=1) -> Nested Loop (cost=0.00..119238.58 rows=8307 width=476) (actual time=0.677..63.139 rows=25 loops=1) -> Nested Loop (cost=0.00..111678.66 rows=8307 width=476) (actual time=0.649..62.789 rows=25 loops=1) -> Nested Loop (cost=0.00..107672.56 rows=6968 width=472) (actual time=0.626..62.436 rows=25 loops=1) -> Nested Loop (cost=0.00..58015.49 rows=6968 width=471) (actual time=0.606..62.013 rows=25 loops=1) -> Index Scan using profile_master_co_name_vec on profile_master b (cost=0.00..28761.89 rows=6968 width=25) (actual time=0.071..50.576 rows=1160 loops=1) Index Cond: ((co_name_vec)::tsvector @@ 
to_tsquery('manufacturer'::text)) Fil
Re: [PERFORM] sudden spurt in swap utilization (was:cpu bound postgresql setup.)
Rajesh Kumar Mallah wrote: A scary phenomenon is being exhibited by the server, which is that the server is suddenly slurping all the swap: 8 1 4192912 906164 6100 2787364000 2277 858 13440 16235 63 8 19 10 0 I reduced work_mem from 4GB to 2GB to 512MB (now). I clearly remember that this abnormal consumption of swap was NOT there even when work_mem was 4GB. eg during happier times swap utilisation was: http://pastebin.com/bnE1pFZ9 the question is what's making postgres slurp the swap? i am posting my current postgresql.conf once again. # cat postgresql.conf | grep -v "^\s*#" | grep -v "^\s*$" listen_addresses = '*' # what IP address(es) to listen on; port = 5432 # (change requires restart) max_connections = 300 # (change requires restart) Hello Rajesh, In contrast with e.g. shared_buffers and effective_cache_size, work_mem is an amount of memory per 'thing' (e.g. order/group by) that wants some working memory, so even a single backend can use several pieces of work_mem-sized memory. Looking at your postgresql.conf, other memory values seem a bit too high as well for a 32GB RAM server. It is probably a good idea to use pgtune (on pgfoundry) to get some reasonable ballpark settings for your hardware. regards, Yeb Havinga
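The per-operation nature of work_mem explains why swap can fill even after lowering the setting. A rough worst-case sketch - the 300 connections and the 512MB/2GB/4GB values come from the message above, while the "two memory-hungry operations per query" multiplier is purely an illustrative assumption:

```python
# Rough worst-case memory demand from work_mem: it is allocated per
# sort/hash operation, per backend, not once per server.
connections = 300
ops_per_query = 2           # hypothetical: e.g. one sort + one hash join

def worst_case_gb(work_mem_mb):
    """Upper-bound work_mem demand in GB if every connection runs a query."""
    return connections * ops_per_query * work_mem_mb / 1024

for wm_mb in (512, 2048, 4096):
    print(f"work_mem={wm_mb}MB -> up to ~{worst_case_gb(wm_mb):.0f} GB")
```

Even at 512MB the theoretical ceiling (~300 GB here) dwarfs a 32GB server; in practice only a fraction of connections are active at once, but the arithmetic shows why a burst of concurrent sorts can push the box into swap.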
Re: [PERFORM] B-Heaps
Kevin Grittner wrote: Yeb Havinga wrote: concerning gist indexes: 1) with larger block sizes and hence, larger # entries per gist page, results in more generic keys of those pages. This in turn results in a greater number of hits, when the index is queried, so a larger part of the index is scanned. NB this has nothing to do with caching / cache sizes; it holds for every IO model. Tests performed by me showed performance improvements of over 200%. Since then implementing a speedup has been on my 'want to do list'. As I recall, the better performance in your tests was with *smaller* GiST pages, right? (The above didn't seem entirely clear on that.) Yes, making pages smaller made index scanning faster. -- Yeb
Re: [PERFORM] B-Heaps
Greg Smith wrote: Matthew Wakeling wrote: This sort of thing has been fairly well researched at an academic level, but has not been implemented in that many real world situations. I would encourage its use in Postgres. I guess, but don't forget that work on PostgreSQL is driven by what problems people are actually running into. There's a long list of performance improvements sitting in the TODO list waiting for people to find time to work on them, ones that we're quite certain are useful. That anyone is going to chase after any of these speculative ideas from academic research instead of one of those is unlikely. Your characterization of the potential speed up here is "Using a proper tree inside the index page would improve the CPU usage of the index lookups", which seems quite reasonable. Regardless, when I consider "is that something I have any reason to suspect is a bottleneck on common workloads?", I don't think of any, and return to working on one of the things I already know is. There are two different things concerning gist indexes: 1) larger block sizes, and hence larger # entries per gist page, result in more generic keys on those pages. This in turn results in a greater number of hits when the index is queried, so a larger part of the index is scanned. NB this has nothing to do with caching / cache sizes; it holds for every IO model. Tests performed by me showed performance improvements of over 200%. Since then implementing a speedup has been on my 'want to do list'. 2) there are several approaches to get the # entries per page down. Two have been suggested in the thread referred to by Matthew (virtual pages (but how to order these?) and a tree within a page). It is interesting to see whether ideas from Prokop's cache oblivious algorithms match with this problem to find a suitable virtual page format.
regards, Yeb Havinga
Re: [PERFORM] B-Heaps
Greg Smith wrote: Eliot Gable wrote: Just curious if this would apply to PostgreSQL: http://queue.acm.org/detail.cfm?id=1814327 It's hard to take this seriously at all when it's so ignorant of actual research in this area. Take a look at http://www.cc.gatech.edu/~bader/COURSES/UNM/ece637-Fall2003/papers/BFJ01.pdf for a second Interesting paper, thanks for the reference! PostgreSQL is modeling a much more complicated situation where there are many levels of caches, from CPU to disk. When executing a query, the database tries to manage that by estimating the relative costs for CPU operations, row operations, sequential disk reads, and random disk reads. Those fundamental operations are then added up to build more complicated machinery like sorting. To minimize query execution cost, various query plans are considered, the cost computed for each one, and the cheapest one gets executed. This has to take into account a wide variety of subtle tradeoffs related to whether memory should be used for things that would otherwise happen on disk. There are three primary ways to search for a row, three main ways to do a join, and two for how to sort, and they all need to have cost estimates made for them that balance CPU time, memory, and disk access. Do you think that the cache oblivious algorithm described in the paper could speed up index scans that hit the disk in the Postgres (and OS/hardware) multi-level memory case? (so that e.g. random_page_cost could go down?) regards, Yeb Havinga
Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?
David Jarvis wrote: Hi, Yeb. This is starting to go back to the design I used with MySQL: * YEAR_REF - Has year and station * MONTH_REF - Has month, category, and year reference * MEASUREMENT - Has month reference, amount, and day Normalizing by date parts was fast. Partitioning the tables by year won't do much good -- users will probably choose 1900 to 2009, predominately. Ok, in that case it is a bad idea. I thought about splitting the data by station by category, but that's ~73000 tables. My understanding is that PostgreSQL uses files per index, which would be messy at the OS level (Linux 2.6.31). Even by station alone is 12139 tables, which might be tolerable for now, but with an order of magnitude more stations on the distant horizon, it will not scale. Yes, I've read a few times now that PG's partitioning doesn't scale beyond a few 100 partitions. I also thought about splitting the data by station district by category -- there are 79 districts, yielding 474 child tables, which is ~575000 rows per child table. Most of the time I'd imagine only one or two districts would be selected. (Again, hard to know exactly.) I agree with Matthew Wakeling in a different post: it's probably wise to first see how fast things can get by using indexes. Only if that fails to be fast might partitioning be an option. (Though sequentially scanning 0.5M rows is not cheap.) I experimented a bit with a doy and year function.
-- note: leap year fix must still be added
create or replace function doy(timestamptz) RETURNS float8
as 'select extract(doy from $1);' language sql immutable strict;

create or replace function year(timestamptz) RETURNS float8
as 'select extract(year from $1);' language sql immutable strict;

\d parent
           Table "public.parent"
 Column |           Type           | Modifiers
--------+--------------------------+-----------
 t      | timestamp with time zone |
 y      | smallint                 |
Indexes:
    "doy_i" btree (doy(t))
    "year_i" btree (year(t))

A plan like the following is probably what you want

test=# explain select * from parent where doy(t) between 10 and 20 and year(t) between 1900 and 2009;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on parent  (cost=9.95..14.97 rows=1 width=10)
   Recheck Cond: ((year(t) >= 1900::double precision) AND (year(t) <= 2009::double precision) AND (doy(t) >= 10::double precision) AND (doy(t) <= 20::double precision))
   ->  BitmapAnd  (cost=9.95..9.95 rows=1 width=0)
         ->  Bitmap Index Scan on year_i  (cost=0.00..4.85 rows=10 width=0)
               Index Cond: ((year(t) >= 1900::double precision) AND (year(t) <= 2009::double precision))
         ->  Bitmap Index Scan on doy_i  (cost=0.00..4.85 rows=10 width=0)
               Index Cond: ((doy(t) >= 10::double precision) AND (doy(t) <= 20::double precision))
(7 rows)

regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
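The "leap year fix" noted in the comment above matters because extract(doy) assigns a different number to the same calendar date in leap and non-leap years, so a plain doy(t) index compares different dates across years. A quick illustration (Python rather than SQL, purely as a sketch):

```python
from datetime import date

def doy(d):
    # Day-of-year, same numbering as extract(doy from ...) in PostgreSQL.
    return d.timetuple().tm_yday

# March 1st lands on a different day-of-year in a leap year, so the
# same doy value denotes different calendar dates in different years.
print(doy(date(2009, 3, 1)))  # 60 (2009 is not a leap year)
print(doy(date(2008, 3, 1)))  # 61 (2008 is a leap year)
```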
Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?
Matthew Wakeling wrote: On Fri, 21 May 2010, Yeb Havinga wrote: For time based data I would for sure go for year based indexing. On the contrary, most of the queries seem to be over many years, but rather restricting on the time of year. Therefore, partitioning by month or some other per-year method would seem sensible. The fact is that at the time I wrote my mail, I had not read a specification of the distribution of parameters (or I missed it). That's why the sentence in my mail just before the one you quoted said: "the partitioning is only useful for speed, if it matches how your queries select data.". In most of the databases I've worked with, the recent data was queried most (accounting, medical), but I can see that for climate analysis this might be different. Regarding the leap year problem, you might consider creating a modified day of year field, which always assumes that the year contains a leap day. Then a given number always resolves to a given date, regardless of year. If you then partition (or index) on that field, then you may get a benefit. Shouldn't it be just the other way around - assume all years are non-leap years for the doy part field to be indexed? regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
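Either convention can work; the point is that the mapping from (month, day) to a number must be year-independent. A minimal sketch of the "assume every year contains a leap day" variant (my own illustration, not code from the thread):

```python
from datetime import date

# Cumulative days before each month in a leap year (February = 29 days).
CUM = [0, 31, 60, 91, 121, 152, 182, 213, 244, 274, 305, 335]

def leap_doy(d):
    # Day-of-year as if every year contained Feb 29: a given number
    # always resolves to the same (month, day), regardless of year.
    # In non-leap years the number 60 (Feb 29) simply never occurs.
    return CUM[d.month - 1] + d.day

# Same calendar date -> same number, in leap and non-leap years alike.
print(leap_doy(date(2008, 3, 1)), leap_doy(date(2009, 3, 1)))  # 61 61
```

Indexing or partitioning on such a field avoids the shift that plain extract(doy) introduces after February in leap years.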
Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?
There is a thing that might lead to confusion in the previous post: create or replace function yearmod(int) RETURNS int as 'select (($1 >> 2) %32);' language sql immutable strict; is equivalent to create or replace function yearmod(int) RETURNS int as 'select (($1 / 4) %32);' language sql immutable strict; and the latter is the function that was used with all the other output (it can be seen inlined in the explain output). I did not catch this until after the post. regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
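For the non-negative year values used here the two definitions do agree, since a right shift by 2 equals integer division by 4 for non-negative integers. A quick check (Python sketch):

```python
# For non-negative integers, an arithmetic right shift by 2 equals
# integer division by 4, so both yearmod definitions give equal results.
for y in range(3000):
    assert (y >> 2) % 32 == (y // 4) % 32

# The values used in the partitioning example below:
print((1970 >> 2) % 32, (1980 >> 2) % 32)  # 12 15
```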
Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?
) inherits(parent);
create table child32(check (yearmod(y)=31)) inherits(parent);
(32 rows)

Copy and paste output of this query in psql to create child tables. Example with period 1970 to 1980:

test=# explain select * from parent where yearmod(y) between yearmod(1970) and yearmod(1980);
                                QUERY PLAN
---------------------------------------------------------------------------
 Result  (cost=0.00..305.00 rows=50 width=10)
   ->  Append  (cost=0.00..305.00 rows=50 width=10)
         ->  Seq Scan on parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
         ->  Seq Scan on child13 parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
         ->  Seq Scan on child14 parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
         ->  Seq Scan on child15 parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
         ->  Seq Scan on child16 parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
(12 rows)

This works: query for 11 consecutive years hits only 4 from 31.
But BETWEEN fails for yearmod ranges that wrap around the 31 boundary, as happens here between 1910 and 1920:

test=# explain select * from parent where yearmod(y) between yearmod(1910) and yearmod(1920);
                 QUERY PLAN
--------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
   One-Time Filter: false
(2 rows)

So for the wraparound case we need a CASE:

test=# explain select * from parent where case when yearmod(1910) <= yearmod(1920) then yearmod(y) between yearmod(1910) and yearmod(1920) else (yearmod(y) >= yearmod(1910) or yearmod(y) <= yearmod(1920)) end;
                                QUERY PLAN
---------------------------------------------------------------------------
 Result  (cost=0.00..305.00 rows=5665 width=10)
   ->  Append  (cost=0.00..305.00 rows=5665 width=10)
         ->  Seq Scan on parent  (cost=0.00..61.00 rows=1133 width=10)
               Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
         ->  Seq Scan on child1 parent  (cost=0.00..61.00 rows=1133 width=10)
               Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
         ->  Seq Scan on child30 parent  (cost=0.00..61.00 rows=1133 width=10)
               Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
         ->  Seq Scan on child31 parent  (cost=0.00..61.00 rows=1133 width=10)
               Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
         ->  Seq Scan on child32 parent  (cost=0.00..61.00 rows=1133 width=10)
               Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
(12 rows)

This should work for all year ranges, and I think it is a good solution for partitioning on year with a fixed number of partitions. From the optimizer's perspective I wonder what the best access path for this kind of query would be if there were no partitions, building on ideas from one of Thom Brown's first replies (indexes on year and doy) and Tom Lane's remark about the leap year problem. Suppose leap years did not exist: with an index on year and a different index on doy, a BitmapAnd of scans of both the year and doy indexes could provide an optimal path.
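The CASE expression above can be read as a single wraparound membership test; a sketch of the same logic in Python (illustrative only, mirroring the yearmod definition from earlier in the thread):

```python
def yearmod(y):
    # Same as the SQL yearmod(int): (y / 4) % 32 for non-negative years.
    return (y // 4) % 32

def yearmod_in_range(y, lo_year, hi_year):
    # Mirrors the CASE: an ordinary BETWEEN when the range does not
    # wrap past 31, otherwise the OR of the two tails of the ring.
    lo, hi = yearmod(lo_year), yearmod(hi_year)
    m = yearmod(y)
    if lo <= hi:
        return lo <= m <= hi
    return m >= lo or m <= hi

# 1910..1920 wraps: yearmod(1910) = 29 but yearmod(1920) = 0,
# so a plain BETWEEN would be trivially false.
print(yearmod(1910), yearmod(1920))           # 29 0
print(yearmod_in_range(1915, 1910, 1920))     # True
```

Note that, as with the SQL version, this is a necessary condition used for partition pruning: all rows in the requested years satisfy it, and only a few partitions do.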
Maybe this would still be possible, if the leap year problem could be 'fixed' by an additional condition in the where clause that filters out the surplus records. regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?
Tom Lane wrote: David Jarvis writes: I was hoping to eliminate this part of the query: (cast(extract( YEAR FROM m.taken ) + greatest( -1 * sign( (extract( YEAR FROM m.taken )||'-12-31')::date - (extract( YEAR FROM m.taken )||'-01-01')::date ), 0 ) AS text)||'-12-31')::date That uses functions to create the dates, which is definitely the problem. Well, it's not the functions per se that's the problem, it's the lack of a useful index on the expression. But as somebody remarked upthread, that expression doesn't look correct at all. Doesn't the whole greatest() subexpression reduce to a constant? That somebody was probably me. I still think the whole BETWEEN expression is a tautology, and a small test did not provide a counterexample. In the select below, everything but the initial select was copy/pasted:

create table m (taken timestamptz);
insert into m values (now());
insert into m values ('1900-12-31');
insert into m values ('2000-04-06');
select m.taken BETWEEN
  /* Start date. */
  (extract( YEAR FROM m.taken )||'-01-01')::date AND
  /* End date. Calculated by checking to see if the end date wraps into the next year. If it does, then add 1 to the current year. */
  (cast(extract( YEAR FROM m.taken ) + greatest( -1 * sign( (extract( YEAR FROM m.taken )||'-12-31')::date - (extract( YEAR FROM m.taken )||'-01-01')::date ), 0 ) AS text)||'-12-31')::date
from m;
 ?column?
----------
 t
 t
 t
(3 rows)

Another thing: IF the climate measurements table is partitioned on time (e.g. each year?), then a function-based index on the year part of m.taken is useless, pardon my French. I'm not sure it is partitioned that way, but it is an interesting thing to inspect, and perhaps the query can be rewritten to use constraint exclusion. regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
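The reduction can also be seen by hand: Dec 31 minus Jan 1 of the same year is always positive, so the sign() is 1, greatest(-1 * 1, 0) is 0, and the end date is therefore always Dec 31 of taken's own year, making the BETWEEN trivially true. A sketch of that arithmetic (Python, illustrative):

```python
from datetime import date

def end_year_offset(year):
    # Mirrors the SQL: greatest(-1 * sign(dec31 - jan1), 0)
    diff_days = (date(year, 12, 31) - date(year, 1, 1)).days  # always > 0
    sign = (diff_days > 0) - (diff_days < 0)                  # always 1
    return max(-1 * sign, 0)                                  # always 0

# The offset is 0 for every year, so the computed end date is always
# Dec 31 of the row's own year and the BETWEEN is a tautology.
print({end_year_offset(y) for y in (1900, 2000, 2008, 2009)})  # {0}
```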
Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?
Hello David, The table aggregates 237 million rows from its child tables. The sluggishness comes from this part of the query: m.taken BETWEEN /* Start date. */ (extract( YEAR FROM m.taken )||'-01-01')::date AND /* End date. Calculated by checking to see if the end date wraps into the next year. If it does, then add 1 to the current year. */ (cast(extract( YEAR FROM m.taken ) + greatest( -1 * sign( (extract( YEAR FROM m.taken )||'-12-31')::date - (extract( YEAR FROM m.taken )||'-01-01')::date ), 0 ) AS text)||'-12-31')::date Either I had too little coffee and completely misunderstand this expression, or it is always true and can be omitted. Could you explain a bit what this part tries to do, and maybe also show its original counterpart in the source database? regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] old server, new server, same performance
Kevin Grittner wrote: Piotr Legiecki wrote: Why there is no difference in database speed between those two machines? Could you post the contents of the postgresql.conf files for both (stripped of comments) and explain what you're using for your benchmarks? In particular, it would be interesting to know how many concurrent connections are active running what mix of queries. It would also be interesting to know how many disks there are in the new server, and the size of the database (select pg_size_pretty(pg_database_size('yourdb'))). regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Planner not using column limit specified for one column for another column equal to first
Tom Lane wrote: Віталій Тимчишин writes: I've thought, and someone on this list has told me, that this should be done automatically. No, that's not true. We do make deductions about transitive equalities, ie, given WHERE a=b AND b=c the planner will infer a=c and use that if it's helpful. We don't make deductions about inequalities such as a>c. In theory there's enough information available to do so, but overall trying to do that would probably waste more cycles than it would save. You'd need a lot of expensive new planner infrastructure, and in the vast majority of queries it wouldn't produce anything very helpful. New expensive planner infrastructure to infer a>c from a>b and b>c, yes. But I wonder if something like Leibniz's principle of identity holds for members of the same equivalence class, e.g. if x and y are both members of the same EC, then for every predicate P, P(x) iff P(y). Probably not for every predicate (like varno = 2 or attname='x'), but for query evaluation the objects denoted by the variables are the same, since that is the standard meaning of the = operator. I cannot think of any standard (btree) operator where 'Leibniz' would fail in this case. regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Planner not using column limit specified for one column for another column equal to first
Віталій Тимчишин wrote: BTW: Changing the slow query to inner joins does not make it fast I'm interested to see the query and plan of the slow query with inner joins. Another thing: it seems that the number of rows guessed is far off from the actual number of rows. Is the number 500 artificial, or are your statistics old or the histogram/MCV lists too small? Nope, I suppose this is because of the limit. If I remove the limit, the estimations are quite correct. There are ~6 million rows in each table. Yes, that makes sense. regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Planner not using column limit specified for one column for another column equal to first
Віталій Тимчишин wrote: Hello. I have a query that performs very poorly because there is a limit on a join column that is not applied to the other columns:

select * from company this_
left outer join company_tag this_1_ on this_.id=this_1_.company_id
left outer join company_measures companymea2_ on this_.id=companymea2_.company_id
left outer join company_descr ces3_ on this_.id=ces3_.company_id
where this_1_.tag_id = 7 and this_.id>5000 and this_1_.company_id>5000
order by this_.id asc limit 1000;

(plan1.txt) Total runtime: 7794.692 ms

At the same time, if I apply the limit (>5000) to the other columns in the query itself, it works like a charm:

select * from company this_
left outer join company_tag this_1_ on this_.id=this_1_.company_id
left outer join company_measures companymea2_ on this_.id=companymea2_.company_id
left outer join company_descr ces3_ on this_.id=ces3_.company_id
where this_1_.tag_id = 7 and this_.id>5000 and this_1_.company_id>5000
and companymea2_.company_id>5000 and ces3_.company_id>5000
order by this_.id asc limit 1000;

(plan2.txt) Total runtime: 27.547 ms

I've thought, and someone on this list has told me, that this should be done automatically. Yes, if you have a=b and b=c in a query, then the optimizer figures out that a=c as well (a, b and c are then members of the same equivalence class). However, the two queries are not the same, since the joins you're using are outer joins. In the first, it's possible that records are returned for company records with no matching ces3_ record; the ces3_ columns are NULL in that case. In the second query no NULL ces3_ information may be returned. Another thing: it seems that the number of rows guessed is far off from the actual number of rows. Is the number 500 artificial, or are your statistics old or the histogram/MCV lists too small? regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Some question
Scott Marlowe wrote: 2010/3/31 Ľubomír Varga: Hi, straight to my "problem": If I try to select constant 1 from a table with two rows, it will be something like this:

explain SELECT * FROM t_route WHERE t_route.route_type_fk = (SELECT id FROM t_route_type WHERE type = 2) limit 4;

"Limit  (cost=1.02..1.91 rows=4 width=2640)"
"  InitPlan"
"    ->  Seq Scan on t_route_type  (cost=0.00..1.02 rows=1 width=8)"
"          Filter: ("type" = 2)"
"  ->  Seq Scan on t_route  (cost=0.00..118115.25 rows=535090 width=2640)"
"        Filter: (route_type_fk = $0)"

Looking at this it looks like you're using prepared queries, which can't make as good of a decision as regular queries because the values are opaque to the planner. Can you provide us with the output of explain analyze of that query? ISTM that the initplan 'outputs' id as $0, so it is not a prepared query. Maybe EXPLAIN ANALYZE VERBOSE of the query reveals that better. But both plans show seqscans of the large table, so it is surprising that the performance is different if the filter expression uses the same values. Are you sure the output of SELECT id FROM t_route_type WHERE type = 2 is equal to 1? regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] mysql to postgresql, performance questions
Scott Marlowe wrote: On Thu, Mar 25, 2010 at 2:29 PM, Pierre C wrote: Hannu Krosing wrote: Pulling the plug should not corrupt a postgreSQL database, unless it was using disks which lie about write caching. Didn't we recently put the old wives' tale of 'the disks lied' to bed, in favour of actually admitting that some well known filesystems and software RAID systems have had trouble with their write barriers? I put a cheap UPS on the home server (which uses software RAID) precisely because I don't really trust that stuff, and there is also the RAID5 write hole... and maybe the RAID1 write hole too... and installing a UPS takes less time than actually figuring out if the system is power-loss-safe. Very true, a UPS might not cover every possible failure mode, but it sure takes care of an awful lot of the common ones. Yeah, but the original post was about mythtv boxes, which usually do not have UPSes. My suggestion about proper setup of the WAL was based on some experience of my own. What I did was probably the fastest path to corrupt database files: a diskless mythtv box that booted from the fileserver in the attic (with UPS btw), but I was too lazy (after x days of lirc / xorg / ivtv / rtc / xmltv etc work) to move the default configured mysql database from the mythtv box (with the root filesystem and also mysql on the nfs mount) to a mysql running on the fileserver itself. On top of that I had nfs mounted async for speed. Really, after x days of configuration to get things running (my wife thinks it's hobby time but it really isn't), all that is on your mind is: does it work well enough? Fine, then iron out non-essential things when they pop up, and if the db becomes corrupt, I had database backups. In the end I had a few times a corrupt table that was always easily repaired with the mysqlcheck tool.
Based on this experience I do not think that reliability alone will convince mythtv developers/users to switch to postgresql, and besides that, as a developer and user myself, it's always in a way funny to see how creative people can be in finding ways to not properly use (your) software ;-) regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] mysql to postgresql, performance questions
Yeb Havinga wrote: Greg Smith wrote: Tom Lane wrote: So has anyone looked at porting MythTV to PG? Periodically someone hacks together something that works, last big effort I'm aware of was in 2006, and then it bit rots away. I'm sure we'd get some user uptake on the result--MySQL corruption is one of the top ten causes of a MythTV system crashing. It would be the same with PG, unless the pg cluster configuration shipped with mythtv came with a properly configured WAL - I had corrupted tables (and a personal wiki entry about them; I forgot to add "how to fix the corrupted tables", sorry; this is about the other mysql database in my house) *only* when I sometimes took the risk of not shutting down the machine properly when e.g. the remote was missing. regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] mysql to postgresql, performance questions
Greg Smith wrote: Tom Lane wrote: So has anyone looked at porting MythTV to PG? Periodically someone hacks together something that works, last big effort I'm aware of was in 2006, and then it bit rots away. I'm sure we'd get some user uptake on the result--MySQL corruption is one of the top ten causes of a MythTV system crashing. It would be the same with PG, unless the pg cluster configuration shipped with mythtv came with a properly configured WAL - I had corrupted tables (and a personal wiki entry about them; this is the other mysql database in my house) *only* when I sometimes took the risk of not shutting down the machine properly when e.g. the remote was missing. regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] GiST index performance
Matthew Wakeling wrote: On Sat, 20 Mar 2010, Yeb Havinga wrote: The gist virtual pages would then match more the original blocksizes that were used in Guttman's R-tree paper (first google result, then figure 4.5). Since the nature/characteristics of the underlying datatypes and keys is not changed, it might be that with the disk pages getting larger, gist indexing has therefore become unexpectedly inefficient. Yes, that is certainly a factor. For example, the page size for bioseg which we use here is 130 entries, which is very excessive, and doesn't allow very deep trees. On the other hand, it means that a single disc seek performs quite a lot of work. Yeah, I only did in-memory fitting tests and wondered about increased IOs. However I bet that even for bigger-than-RAM databases, the benefit of having to fan out to fewer pages still outweighs the overly general non-leaf nodes and might still result in fewer disk IOs. I redid some earlier benchmarking with other datatypes with a 1kB block size and also multicolumn gist, and the multicolumn variant had an even greater benefit than the single column indexes, for both equality and range scans (execution times down to 20% of the original). If gist is important to you, I really recommend doing a test with 1kB blocks. regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] GiST index performance
On Fri, Mar 19, 2010 at 10:16 PM, Kenneth Marshall wrote: > Hi Yeb, > > I have not looked at the gist code, but would it be possible to > make virtual pages that have a size that is 1/power-of-2 * blocksize. > Then the leaf node could be 1/8 or even 1/16 the size of the full > pagesize. > Hello Ken, The gist virtual pages would then match more the original blocksizes that were used in Guttman's R-tree paper (first google result, then figure 4.5). Since the nature/characteristics of the underlying datatypes and keys is not changed, it might be that with the disk pages getting larger, gist indexing has therefore become unexpectedly inefficient. But I am also not really into the core-gist code, but do have a motivation to dive into it (more than 200% performance increase in Mathew's test case). However I'd like to verify for community support before working on it. Maybe Theodor or Oleg could say something about how easy or hard it is to do? regards, Yeb Havinga > > Regards, > Ken > > On Fri, Mar 19, 2010 at 09:49:30PM +0100, Yeb Havinga wrote: > > Yeb Havinga wrote: > >> > >> Since the gistpagesize is derived from the database blocksize, it might > be > >> wise to set the blocksize low for this case, I'm going to play with this > a > >> bit more. > > Ok, one last mail before it turns into spam: with a 1KB database > blocksize, > > the query now runs in 30 seconds (original 70 on my machine, shared > buffers > > 240MB). > > The per inner loop access time now 24 microsecs compared to on my machine > > original 74 microsecs with 8KB size and 8 for the btree scan. 
Not a bad > > speedup with such a simple parameter :-) > > > > postgres=# EXPLAIN ANALYSE SELECT * FROM a, b WHERE a.a BETWEEN b.b AND > b.b > > + 2; > >QUERY PLAN > > > > > --- > > Nested Loop (cost=0.00..4169159462.20 rows=09777668 width=8) (actual > > time=0.184..29540.355 rows=297 loops=1) > > -> Seq Scan on b (cost=0.00..47037.62 rows=62 width=4) (actual > > time=0.024..1783.484 rows=100 loops=1) > > -> Index Scan using a_a on a (cost=0.00..2224.78 rows=14 width=4) > > (actual time=0.021..0.024 rows=3 loops=100) > > Index Cond: ((a.a >= b.b) AND (a.a <= (b.b + 2))) > > Total runtime: 30483.303 ms > > (5 rows) > > > > > > postgres=# select gist_stat('a_a'); > > gist_stat > > --- > > Number of levels: 5 + > > Number of pages: 47618 + > > Number of leaf pages: 45454 + > > Number of tuples: 1047617 + > > Number of invalid tuples: 0 + > > Number of leaf tuples: 100 + > > Total size of tuples: 21523756 bytes+ > > Total size of leaf tuples: 20545448 bytes+ > > Total size of index: 48760832 bytes+ > > (1 row) > > > > > > -- > > Sent via pgsql-performance mailing list ( > pgsql-performance@postgresql.org) > > To make changes to your subscription: > > http://www.postgresql.org/mailpref/pgsql-performance > > >
Re: [PERFORM] GiST index performance
Yeb Havinga wrote: Since the gistpagesize is derived from the database blocksize, it might be wise to set the blocksize low for this case, I'm going to play with this a bit more. Ok, one last mail before it turns into spam: with a 1KB database blocksize, the query now runs in 30 seconds (originally 70 on my machine, shared buffers 240MB). The per inner loop access time is now 24 microseconds, compared to the original 74 microseconds on my machine with the 8KB blocksize, and 8 for the btree scan. Not a bad speedup with such a simple parameter :-)

postgres=# EXPLAIN ANALYSE SELECT * FROM a, b WHERE a.a BETWEEN b.b AND b.b + 2;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..4169159462.20 rows=09777668 width=8) (actual time=0.184..29540.355 rows=297 loops=1)
   ->  Seq Scan on b  (cost=0.00..47037.62 rows=62 width=4) (actual time=0.024..1783.484 rows=100 loops=1)
   ->  Index Scan using a_a on a  (cost=0.00..2224.78 rows=14 width=4) (actual time=0.021..0.024 rows=3 loops=100)
         Index Cond: ((a.a >= b.b) AND (a.a <= (b.b + 2)))
 Total runtime: 30483.303 ms
(5 rows)

postgres=# select gist_stat('a_a');
                 gist_stat
-------------------------------------------
 Number of levels:          5             +
 Number of pages:           47618         +
 Number of leaf pages:      45454         +
 Number of tuples:          1047617       +
 Number of invalid tuples:  0             +
 Number of leaf tuples:     100           +
 Total size of tuples:      21523756 bytes+
 Total size of leaf tuples: 20545448 bytes+
 Total size of index:       48760832 bytes+
(1 row)

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] GiST index performance
Yeb Havinga wrote: Matthew Wakeling wrote: A second quite distinct issue is the general performance of GiST indexes which is also mentioned in the old thread linked from Open Items. For that, we have a test case at http://archives.postgresql.org/pgsql-performance/2009-04/msg00276.php for btree_gist indexes. I have a similar example with the bioseg GiST index. I have completely reimplemented the same algorithms in Java for algorithm investigation and instrumentation purposes, and it runs about a hundred times faster than in Postgres. I think this is a problem, and I'm willing to do some investigation to try and solve it. More gist performance.. Some background: I've been involved in developing several datatypes that make use of gist indexes (we just published a paper on it, see http://arxiv.org/abs/1003.3370), that's the main reason I'm very much interested in possible improvements in gist index speed. One of the datatypes was 'very badly' indexable: non-leaf pages were getting very general keys, so in the end we could see from the scanning times compared to sequential scans that the whole index was being scanned. One thing I remember was the idea that somehow it would be nice if the depth of the gist tree could be fiddled with: in that case keys of non-leaf nodes would be more specific again. In the original Guttman R-tree paper there was mention of a parameter that determined the size of entries in nodes and thereby indirectly the depth of the tree. I missed that in the PostgreSQL gist api. One thing GiST scanning does very often is key comparisons. So another approach is to try to limit those, and again this might be possible by increasing the depth / decreasing the number of entries per page. I just did a test where in gistfitpage the gistpagesize was divided by 5, and something similar in gistnospace. Scan time before adjustment: about 70 seconds. After adjustment: 45 seconds. With gist_stat from the gevel package I confirmed that the depth was now 4 (originally 3).
Drawback: bigger index size, because pages are not filled completely anymore. The explain shows (actual time=0.030..0.032) for the inner loop times, which is less than half of the original scan time. Since the gistpagesize is derived from the database blocksize, it might be wise to set the blocksize low for this case; I'm going to play with this a bit more. regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] GiST index performance
Yeb Havinga wrote: Yeb Havinga wrote: Matthew Wakeling wrote: Matthew Wakeling wrote: A second quite distinct issue is the general performance of GiST indexes which is also mentioned in the old thread linked from Open Items. For that, we have a test case at http://archives.postgresql.org/pgsql-performance/2009-04/msg00276.php for btree_gist indexes. I have a similar example with the bioseg GiST index. I have completely reimplemented the same algorithms in Java for algorithm investigation and instrumentation purposes, and it runs about a hundred times faster than in Postgres. I think this is a problem, and I'm willing to do some investigation to try and solve it. I have not made any progress on this issue. I think Oleg and Teodor would be better placed working it out. All I can say is that I implemented the exact same indexing algorithm in Java, and it performed 100 times faster than Postgres. Now, Postgres has to do a lot of additional work, like mapping the index onto disc, locking pages, and abstracting to plugin user functions, so I would expect some difference - I'm not sure 100 times is reasonable though. I tried to do some profiling, but couldn't see any one section of code that was taking too much time. Not sure what I can further do. Looked in the code a bit more - only the index nodes are compressed at index creation; the consistent function does not compress queries, so no pallocs there. However, when running Matthew's example from http://archives.postgresql.org/pgsql-performance/2009-04/msg00276.php with the gist index, the coverage shows in gistget.c: 100 palloc0's of gistsearchstack at line 152 and 2010982 palloc's, also of the gistsearchstack, on line 342. Two pfrees are also hit a lot: line 195: 1010926 times for a stackentry, and line 293: 200056 times. My $0.02 is that the pain is here.
My knowledge of gistget or the other sources in access/gist is zero, but couldn't it be possible to determine the maximum needed size of the stack and then allocate it at once and use a pop/push kind of api? Wasted some time today on a ghost chase... I thought that removing the millions of pallocs would help, so I wrote an alternative to the gistsearchstack stack, only to find out that it was not the index scanning itself that caused millions of pallocs, but the scan being in the inner loop that was called 100 times. The actual scanning time was not changed significantly. The actual scanning time in my vm is for btree (actual time=0.006..0.008) and gist (actual time=0.071..0.074). An error in my searchstack alternative caused pages to be scanned twice, returning twice the number of rows (6 instead of 3 each time). This resulted in a likewise increase in ms (actual time=0.075..0.150). Somewhere I hit something that costs ~0.070 ms twice. For a single index scan, 0.070 ms startup time for gist vs 0.006 for btree doesn't seem like a big problem, but yeah, when calling it a million times... regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] GiST index performance
Yeb Havinga wrote: Matthew Wakeling wrote: Matthew Wakeling wrote: A second quite distinct issue is the general performance of GiST indexes which is also mentioned in the old thread linked from Open Items. For that, we have a test case at http://archives.postgresql.org/pgsql-performance/2009-04/msg00276.php for btree_gist indexes. I have a similar example with the bioseg GiST index. I have completely reimplemented the same algorithms in Java for algorithm investigation and instrumentation purposes, and it runs about a hundred times faster than in Postgres. I think this is a problem, and I'm willing to do some investigation to try and solve it. I have not made any progress on this issue. I think Oleg and Teodor would be better placed working it out. All I can say is that I implemented the exact same indexing algorithm in Java, and it performed 100 times faster than Postgres. Now, Postgres has to do a lot of additional work, like mapping the index onto disc, locking pages, and abstracting to plugin user functions, so I would expect some difference - I'm not sure 100 times is reasonable though. I tried to do some profiling, but couldn't see any one section of code that was taking too much time. Not sure what I can further do. Hello Matthew and list, A lot of time is spent in gistget.c code, and there are a lot of functioncall5's to the gist's consistent function, which is out of sight for gprof. Something different but related, since it is also gist: we noticed before that gist indexes that use a compressed form for index entries suffer from repeated compress calls on query operands (see http://archives.postgresql.org/pgsql-hackers/2009-05/msg00078.php). The btree_gist int4 compress function calls the generic gbt_num_compress, which does a palloc. Maybe this palloc is also hit a lot when scanning the index, because the constants that are queried with are repeatedly compressed and palloc'd. 
Looked in the code a bit more - only the index nodes are compressed at index creation; the consistent function does not compress queries, so no pallocs there. However, when running Matthew's example from http://archives.postgresql.org/pgsql-performance/2009-04/msg00276.php with the gist index, the coverage shows in gistget.c: 100 palloc0's of gistsearchstack at line 152 and 2010982 palloc's, also of the gistsearchstack, on line 342. Two pfrees are also hit a lot: line 195: 1010926 of a stackentry and line 293: 200056 times. My $0.02 is that the pain is here. My knowledge of gistget or the other sources in access/gist is zero, but couldn't it be possible to determine the maximum needed size of the stack and then allocate it at once and use a pop/push kind of API? regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] GiST index performance
Matthew Wakeling wrote: Matthew Wakeling wrote: A second quite distinct issue is the general performance of GiST indexes which is also mentioned in the old thread linked from Open Items. For that, we have a test case at http://archives.postgresql.org/pgsql-performance/2009-04/msg00276.php for btree_gist indexes. I have a similar example with the bioseg GiST index. I have completely reimplemented the same algorithms in Java for algorithm investigation and instrumentation purposes, and it runs about a hundred times faster than in Postgres. I think this is a problem, and I'm willing to do some investigation to try and solve it. I have not made any progress on this issue. I think Oleg and Teodor would be better placed working it out. All I can say is that I implemented the exact same indexing algorithm in Java, and it performed 100 times faster than Postgres. Now, Postgres has to do a lot of additional work, like mapping the index onto disc, locking pages, and abstracting to plugin user functions, so I would expect some difference - I'm not sure 100 times is reasonable though. I tried to do some profiling, but couldn't see any one section of code that was taking too much time. Not sure what I can further do. Hello Matthew and list, A lot of time is spent in gistget.c code, and there are a lot of functioncall5's to the gist's consistent function, which is out of sight for gprof. Something different but related, since it is also gist: we noticed before that gist indexes that use a compressed form for index entries suffer from repeated compress calls on query operands (see http://archives.postgresql.org/pgsql-hackers/2009-05/msg00078.php). The btree_gist int4 compress function calls the generic gbt_num_compress, which does a palloc. Maybe this palloc is also hit a lot when scanning the index, because the constants that are queried with are repeatedly compressed and palloc'd. 
regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Bad query plan inside EXISTS clause
Yeb Havinga wrote: Kenneth Marshall wrote: EXISTS matches NULLs too and since they are not indexed a sequential scan is needed to check for them. Try using IN instead. This is nonsense in more than one way. Hit ctrl-return a bit too soon - EXISTS does not match NULLs but a set of records, which is either empty or not empty. Also, it is possible to index table columns with NULLs, and then the indexes can still be used. Besides filtering record sets with expressions, indexes are also used for ordering. There the effect of indexes with NULLs can be seen: where to put them, before or after the non-NULLs? So indexes can be used perfectly well in conjunction with NULLs. I found the original mail rather intriguing and played with an example myself a bit, but could not repeat the behaviour (9.0 devel version); in my case the EXISTS used an index. Maybe it has something to do with the fact that the planner estimates it will return 5 rows, even when the actual numbers list only 1 hit. In the EXISTS case, it can stop at the first hit. In the select-all-rows case, it must return all rows. Maybe a better plan emerges with better statistics? regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
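A minimal sketch of the point about indexes and NULLs (the table and column names here are made up, not from the original thread): b-tree indexes do store NULLs, and since 8.3 you can even choose where they sort relative to non-NULL values.

```sql
-- Hypothetical example: NULLs are indexed, and their position in the
-- index order is controllable.
CREATE TABLE t (x integer);
CREATE INDEX t_x_nulls_first ON t (x NULLS FIRST);

-- An ORDER BY with a matching NULLS placement can use the index:
SELECT x FROM t ORDER BY x NULLS FIRST;
```

The default for ascending order is NULLS LAST, which is why the placement question ("before or after the non-NULLs?") comes up at all.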
Re: [PERFORM] Bad query plan inside EXISTS clause
Kenneth Marshall wrote: EXISTS matches NULLs too and since they are not indexed a sequential scan is needed to check for them. Try using IN instead. This is nonsense in more than one way. regards Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Strange workaround for slow query
... a better plan might arise when the join is changed to inner. I do agree with this assessment. I'm sort of improving the performance of an existing implementation of ours, and I don't know why they chose LEFT OUTER. I did, however, test things with INNER as well, with the same results, so I decided to stick with what I encountered in the existing implementation. But it's on my mind as well ;-) Thanks for the formatted output. The difference in speed is caused by the first query having to read 50k rows from an index, with only eventtype_id = 71 as filter expression, while the second has the extra knowledge from the scan of events_event_types in the nestloop's outer loop, which returns 0 rows in all cases and is hence a lot faster, even though that scan is called 16 times. But the big question is: why does the planner choose plan 1 in the first case, and how to fix that? My $0.02 would be to 'help' the planner find a better plan. Of course you did ANALYZE, but maybe another idea is to increase default_statistics_target if it is still at the default value of 10. More info at http://www.postgresql.org/docs/8.3/static/runtime-config-query.html. And also to 'help' the planner: I'd just change the query to an inner join in this case, since there cannot be null tuples in the right-hand side here. regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
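The two ways of raising the statistics target mentioned above can be sketched as follows (the value 100 is just an example; 10 was the pre-8.4 default):

```sql
-- Raise the default for the whole session, then re-analyze:
SET default_statistics_target = 100;
ANALYZE events_events;

-- Or raise it only for the column the planner misestimates:
ALTER TABLE events_events ALTER COLUMN eventtype_id SET STATISTICS 100;
ANALYZE events_events;
```

The per-column form is usually preferable: it keeps ANALYZE cheap for everything else while giving the planner a finer-grained histogram where it matters.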
Re: [PERFORM] Strange workaround for slow query
sverha...@wps-nl.com wrote: Hi,

EXPLAIN SELECT * FROM events_events LEFT OUTER JOIN events_event_types ON eventType_id=events_event_types.id WHERE severity=20 AND (eventType_id IN (71)) ORDER BY datetime DESC limit 50;

 QUERY PLAN
 ---
 Limit  (cost=3.23..200.31 rows=50 width=131)
   ->  Nested Loop  (cost=3.23..49139.16 rows=12466 width=131)
         ->  Index Scan Backward using events_events_eventtype_id_datetime_ind on events_events  (cost=0.00..48886.61 rows=12466 width=93)
               Index Cond: (eventtype_id = 71)
         ->  Materialize  (cost=3.23..3.24 rows=1 width=38)
               ->  Seq Scan on events_event_types  (cost=0.00..3.23 rows=1 width=38)
                     Filter: ((id = 71) AND (severity = 20))

EXPLAIN SELECT * FROM events_events LEFT OUTER JOIN events_event_types ON eventType_id=events_event_types.id WHERE severity=70 AND (eventType_id IN (71, 999)) ORDER BY datetime DESC LIMIT 50;

 QUERY PLAN
 ---
 Limit  (cost=27290.24..27290.37 rows=50 width=131)
   ->  Sort  (cost=27290.24..27303.15 rows=5164 width=131)
         Sort Key: events_events.datetime
         ->  Nested Loop  (cost=22.95..27118.70 rows=5164 width=131)
               ->  Seq Scan on events_event_types  (cost=0.00..3.02 rows=17 width=38)
                     Filter: (severity = 70)
               ->  Bitmap Heap Scan on events_events  (cost=22.95..1589.94 rows=408 width=93)
                     Recheck Cond: ((events_events.eventtype_id = ANY ('{71,999}'::bigint[])) AND (events_events.eventtype_id = events_event_types.id))
                     ->  Bitmap Index Scan on test_events_events_eventtype_id_severity_ind  (cost=0.00..22.85 rows=408 width=0)
                           Index Cond: ((events_events.eventtype_id = ANY ('{71,999}'::bigint[])) AND (events_events.eventtype_id = events_event_types.id))

Thanks - I'm sorry that I was not more specific earlier, but what would be *really* helpful is the output of EXPLAIN ANALYZE, since that also shows actual time, # rows and # loops of the inner nestloop. I'm wondering though why you do a left outer join. 
From the \d output in the previous mail, events_event.eventtype_id has a not null constraint and a fk to events_event_types.id, so an inner join would be appropriate. Outer joins limit the number of join orders the planner considers, so a better plan might arise when the join is changed to inner. regards Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
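For the record, the rewrite meant here is simply the first query from the previous mail with the LEFT OUTER dropped:

```sql
SELECT *
  FROM events_events
  JOIN events_event_types ON eventType_id = events_event_types.id
 WHERE severity = 20
   AND eventType_id IN (71)
 ORDER BY datetime DESC
 LIMIT 50;
```

Because eventtype_id is NOT NULL and has a foreign key to events_event_types.id, this returns exactly the same rows as the outer-join version, but leaves the planner free to consider more join orders.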
Re: [PERFORM] Strange workaround for slow query
Hello Sander, Can you post the explain plan output of these queries? SELECT * FROM events_events LEFT OUTER JOIN events_event_types ON eventType_id=events_event_types.id WHERE severity=20 AND (eventType_id IN (71)) ORDER BY datetime DESC limit 50; SELECT * FROM events_events LEFT OUTER JOIN events_event_types ON eventType_id=events_event_types.id WHERE severity=70 AND (eventType_id IN (71, 999)) ORDER BY datetime DESC LIMIT 50; regards Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Testing FusionIO
Devrim GÜNDÜZ wrote: Hi, I have a FusionIO drive Cool!! to test for a few days. I already ran iozone and bonnie++ against it. Does anyone have more suggestions for it? Oracle has a tool to test drives specifically for database kinds of loads, called Orion - it's free software and comes with a good manual. Download without registration etc. at http://www.oracle.com/technology/software/tech/orion/index.html Quickstart: create a file named 'fusion.lun' with the device name, e.g. /dev/sda1. Invoke the orion tool with something like -run advanced -testname fusion -num_disks 50 -size_small 4 -size_large 1024 -type rand -simulate concat -verbose -write 25 -duration 15 -matrix detailed -cache_size 256 Cache size is in MBs, but not so important for random io. num_disks doesn't have to match the physical disks; it's used by the tool to determine how large the test matrix should be. E.g. 1 disk gives a small matrix with a small number of concurrent io requests. So I set it to 50. Another idea: pgbench? regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] 10K vs 15k rpm for analytics
Francisco Reyes wrote: Yeb Havinga writes: ...with two 3ware controllers. Also, I am not sure if it is wise to put the WAL on the same logical disk as the indexes, If I only have two controllers, would it then be better to put WAL on the first along with all the data, and the indexes on the external? Especially since the external enclosure will have 15K rpm vs 10K rpm in the internal. It sounds like you're going to create a single logical unit / raid array on each of the controllers. Depending on the number of disks, this is a bad idea, because if you read/write data sequentially, all drive heads will be aligned to roughly the same place on the disks. If another process wants to read/write as well, this will interfere and bring down both iops and mbps. However, with three concurrent users.. hmm, but then again, queries will scan multiple tables/indexes so there will be mixed io to several locations. What would be interesting is to see what the MBps maximum of a single controller is, then calculate how many disks are needed to feed it, which would give a figure for the number of disks per logical unit. The challenge with having a few logical units / raid arrays available is how to divide the data over them (with tablespaces). What is good for your physical data depends on the schema and queries that are most important. For 2 relations and 2 indexes and 4 arrays, it would be clear. There's not much general to say here, except: do not mix table or index data with the wal. In other words: if you could make a separate raid array for the wal (a 2-disk raid1 is probably good enough), that would be ok, and it doesn't matter on which controller or enclosure it happens, because io to that disk is not mixed with the data io. Thanks for pointing that out. With any luck I will actually be able to do some tests on the new hardware. The current one I literally gave a few hours of stress testing and then had to put in production right away. 
I've heard that before ;-) If you do get around to do some tests, I'm interested in the results / hard numbers. regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
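Dividing the data over separate arrays with tablespaces, as suggested above, could look like the following sketch (mount points and the example table are hypothetical, not from this thread):

```sql
-- One tablespace per raid array; paths are made-up examples.
CREATE TABLESPACE data_ts  LOCATION '/mnt/array1/pg';
CREATE TABLESPACE index_ts LOCATION '/mnt/array2/pg';

-- Heap and index data then land on different spindles:
CREATE TABLE measurements (id bigint, payload text) TABLESPACE data_ts;
CREATE INDEX measurements_id_idx ON measurements (id) TABLESPACE index_ts;
```

Note the WAL is not managed through tablespaces; it is typically moved to its own raid1 by relocating (symlinking) the pg_xlog directory.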
Re: [PERFORM] 10K vs 15k rpm for analytics
Francisco Reyes wrote: Going with a 3Ware SAS controller. Have some external enclosures with 16 15Krpm drives. They are older 15K rpms, but they should be good enough. Since the 15K rpms usually have better transactions per second I will put WAL and indexes in the external enclosure. It sounds like you have a lot of hardware around - my guess is it would be worthwhile to do a test setup with one server hooked up with two 3ware controllers. Also, I am not sure if it is wise to put the WAL on the same logical disk as the indexes, but that is maybe for a different thread (it is unwise to mix random and sequential io, and the wal also has demands when it comes to write cache). regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] 10K vs 15k rpm for analytics
Scott Marlowe wrote: On Tue, Mar 2, 2010 at 1:51 PM, Yeb Havinga wrote: With 24 drives it'll probably be the controller that is the limiting factor of bandwidth. Our HP SAN controller with 28 15K drives delivers 170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0. So I'd go for the 10K drives and put the saved money towards the controller (or maybe more than one controller). That's horrifically bad numbers for that many drives. I can get those numbers for write performance on a RAID-6 on our office server. I wonder what's making your SAN setup so slow? Pre scriptum: A few minutes ago I mailed detailed information in the same thread but as a reply to an earlier response - it tells more about the setup and gives results of a raid1+0 test. I just have to react to "horrifically bad" and "slow" :-) : The HP SAN can do raid5 on 28 disks, also at about 155MBps: 28 disks divided into 7 logical units with raid5; orion results are below. Please note that this time I did sequential large io. The mixed read/write MBps maximum here is comparable: around 155MBps. 
regards, Yeb Havinga ORION VERSION 10.2.0.1.0 Commandline: -run advanced -testname msa -num_disks 24 -size_small 4 -size_large 1024 -type seq -simulate concat -verbose -write 50 -duration 15 -matrix detailed -cache_size 256 This maps to this test: Test: msa Small IO size: 4 KB Large IO size: 1024 KB IO Types: Small Random IOs, Large Sequential Streams Number of Concurrent IOs Per Stream: 4 Force streams to separate disks: No Simulated Array Type: CONCAT Write: 50% Cache Size: 256 MB Duration for each Data Point: 15 seconds Small Columns:, 0, 1, 2, 3, 4, 5, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90, 96,102,108, 114,120 Large Columns:, 0, 1, 2, 3, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48 Total Data Points: 416 Name: /dev/sda1 Size: 109256361984 Name: /dev/sdb1 Size: 109256361984 Name: /dev/sdc1 Size: 109256361984 Name: /dev/sdd1 Size: 109256361984 Name: /dev/sde1 Size: 109256361984 Name: /dev/sdf1 Size: 109256361984 Name: /dev/sdg1 Size: 109256361984 7 FILEs found. Maximum Large MBPS=157.75 @ Small=0 and Large=1 Maximum Small IOPS=3595 @ Small=66 and Large=1 Minimum Small Latency=2.81 @ Small=1 and Large=0 MBPS matrix Large/Small, 0, 1, 2, 3, 4, 5, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90, 96,102,108,114,120 1, 157.75, 156.47, 153.56, 153.45, 144.87, 141.78, 140.60, 112.45, 95.23, 72.80, 80.59, 36.91, 29.76, 42.86, 41.82, 33.87, 34.07, 45.62, 42.97, 26.37, 42.85, 45.49, 44.47, 37.26, 45.67, 36.18 2, 137.58, 128.48, 125.78, 133.85, 120.12, 127.86, 127.05, 119.26, 121.23, 115.00, 117.88, 114.35, 108.61, 106.55, 83.50, 78.61, 92.67, 96.01, 44.02, 70.60, 62.84, 46.52, 69.18, 51.84, 57.19, 59.62 3, 143.10, 134.92, 139.30, 138.62, 137.85, 146.17, 140.41, 138.48, 76.00, 138.17, 123.48, 137.45, 126.51, 137.11, 91.94, 90.33, 129.97, 45.35, 115.92, 89.60, 137.22, 72.46, 89.95, 77.40, 119.17, 82.09 4, 138.47, 133.74, 129.99, 122.33, 126.75, 125.22, 132.30, 120.41, 125.88, 132.21, 96.92, 115.70, 131.65, 66.34, 114.06, 113.62, 116.91, 96.97, 
98.69, 127.16, 116.67, 111.53, 128.97, 92.38, 118.14, 78.31 8, 126.59, 127.92, 115.51, 125.02, 123.29, 111.94, 124.31, 125.71, 134.48, 126.40, 127.93, 125.36, 121.75, 121.75, 127.17, 116.51, 121.44, 121.12, 112.32, 121.55, 127.93, 124.86, 118.04, 114.59, 121.72, 114.79 12, 112.40, 122.58, 107.61, 125.42, 128.04, 123.80, 127.17, 127.70, 122.37, 96.52, 115.36, 124.49, 124.07, 129.31, 124.62, 124.23, 105.58, 123.55, 115.67, 120.59, 125.61, 123.57, 121.43, 121.45, 121.44, 113.64 16, 108.88, 119.79, 123.80, 120.55, 120.02, 121.66, 125.71, 122.19, 125.77, 122.27, 119.55, 118.44, 120.51, 104.66, 97.55, 115.43, 101.45, 108.99, 122.30, 100.45, 105.82, 119.56, 121.26, 126.59, 119.54, 115.09 20, 103.88, 122.95, 115.86, 114.59, 121.13, 108.52, 116.90, 121.10, 113.91, 108.20, 111.51, 125.64, 117.57, 120.86, 117.66, 100.40, 104.88, 103.15, 98.10, 104.86, 104.69, 102.99, 121.81, 107.22, 122.68, 106.43 24, 102.64, 102.33, 112.95, 110.63, 108.00, 111.53, 124.33, 103.17, 108.16, 112.63, 97.42, 106.22, 102.54, 117.46, 100.66, 99.01, 104.46, 99.02, 116.02, 112.49, 119.05, 104.03, 102.40, 102.44, 111.15, 99.51 28, 101.12, 102.76, 114.14, 109.72, 120.63, 118.09, 119.85, 113.80, 116.58, 110.24, 101.45, 110.31, 116.06, 112.04, 121.63, 91.26, 98.88, 101.55, 104.51, 116.43, 112.98, 119.46, 120.08, 109.46, 106.29, 96.69 32, 103.41, 117.33, 101.33, 102.29, 102.58, 116.18,
Re: [PERFORM] 10K vs 15k rpm for analytics
Greg Smith wrote: Yeb Havinga wrote: With 24 drives it'll probably be the controller that is the limiting factor of bandwidth. Our HP SAN controller with 28 15K drives delivers 170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0. You should be able to clear 1GB/s on sequential reads with 28 15K drives in a RAID10, given proper read-ahead adjustment. I get over 200MB/s out of the 3-disk RAID0 on my home server without even trying hard. Can you share what HP SAN controller you're using? Yeah, I should have mentioned a bit more, to allow for a better picture of the apples and pears. The controller is the built-in controller of the HP MSA1000 SAN - with 14 disks, plus an extra 14 disks from an MSA30. It is connected through a 2Gbit/s fibrechannel adapter - that should give up to roughly 250MB/s bandwidth, maybe a bit less due to frame overhead and the Gb/GB difference. The controller has 256MB cache. It is three years old, however HP still sells it. I performed a few dozen tests with Oracle's free and standalone Orion tool (http://www.oracle.com/technology/software/tech/orion/index.html) with different raid and controller settings, where I varied:
- controller read/write cache ratio
- logical unit layout (like one big raidx; 3 luns with raid10, giving a stripe width of 4 disks and 4 hot spares; 7 luns with raid10)
- stripe size set to maximum
- load type (random or sequential large io)
- linux io scheduler (deadline / cfq etc)
- fibre channel adapter queue depth
- ratio between reads and writes by orion - our production application has about 25% writes
- the short stroking that is talked about further in this thread, by only using one partition of about 30% size on each disk
- etc
My primary goal was large IOPS for our typical load: mostly OLTP. The Orion tool tests in a matrix with on one axis the # concurrent small io's and on the other axis the # concurrent large io's. Its output numbers are also in a matrix, with MBps, iops and latency. 
I put several of these numbers in matlab to produce 3d pictures, and that showed some interesting stuff - it's probably bad netiquette to post one of those pictures here. One of the striking things was that I could see something that looked like a mountain whose top was neatly cut off - my guess: controller maximum reached. Below is the output data of a recent test, where a 4Gbit/s fc adapter was connected. From these numbers I conclude that in our setup the controller is maxed out at 155MB/s for raid 1+0 *with this setup*. In a test run I constructed to try and see what the maximum MBps of the controller would be - 100% reads, sequential large io - that went to 170MBps. I'm particularly proud of the iops of this test. Please note: the large load is random, not sequential! So to come back to my original claim: the controller is important when you have 24 disks. I believe I have backed up this claim with this mail. Also please note that for our setup - a database with a lot of concurrent users on a medium-size database (~160GB) - random IO is what we needed, and for this purpose the HP MSA has proved rock solid. But the setup that Francisco mentioned is different: a few users doing mostly sequential IO. For that load, our setup is far from optimal, mainly because of the (single) controller. 
regards, Yeb Havinga ORION VERSION 10.2.0.1.0 Commandline: -run advanced -testname r10-7 -num_disks 24 -size_small 4 -size_large 1024 -type rand -simulate concat -verbose -write 25 -duration 15 -matrix detailed -cache_size 256 This maps to this test: Test: r10-7 Small IO size: 4 KB Large IO size: 1024 KB IO Types: Small Random IOs, Large Random IOs Simulated Array Type: CONCAT Write: 25% Cache Size: 256 MB Duration for each Data Point: 15 seconds Small Columns:, 0, 1, 2, 3, 4, 5, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90, 96,102,108, 114,120 Large Columns:, 0, 1, 2, 3, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48 Total Data Points: 416 Name: /dev/sda1 Size: 72834822144 Name: /dev/sdb1 Size: 72834822144 Name: /dev/sdc1 Size: 72834822144 Name: /dev/sdd1 Size: 72834822144 Name: /dev/sde1 Size: 72834822144 Name: /dev/sdf1 Size: 72834822144 Name: /dev/sdg1 Size: 72834822144 7 FILEs found. Maximum Large MBPS=155.05 @ Small=2 and Large=48 Maximum Small IOPS=6261 @ Small=120 and Large=0 Minimum Small Latency=3.93 @ Small=1 and Large=0 Below the MBps matrix - hope this reads well in email clients?? Large/Small, 0, 1, 2, 3, 4, 5, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90, 96,102,108,114,120 1, 76.60, 74.87, 73.24, 70.66, 70.45, 68.36, 67.58, 59.63,
Re: [PERFORM] 10K vs 15k rpm for analytics
Francisco Reyes wrote: Does anyone have any experience doing analytics with postgres? In particular, whether 10K rpm drives are good enough vs using 15K rpm, over 24 drives. Price difference is $3,000. Rarely ever have more than 2 or 3 connections to the machine. So far from what I have seen, throughput is more important than TPS for the queries we do. Usually we end up doing sequential scans to do summaries/aggregates. With 24 drives it'll probably be the controller that is the limiting factor of bandwidth. Our HP SAN controller with 28 15K drives delivers 170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0. So I'd go for the 10K drives and put the saved money towards the controller (or maybe more than one controller). regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Query slowing down significantly??
Rainer Pruy wrote: Thanks for the hint. I should have been considering that in the first place. (But the obvious is easily left unrecognised..) The prepared statement gives:

 QUERY PLAN
 ---
 Nested Loop  (cost=0.00..25.18 rows=2 width=175) (actual time=36.116..49.998 rows=1 loops=1)
   ->  Index Scan using x_context_01 on context c  (cost=0.00..10.76 rows=2 width=67) (actual time=0.029..6.947 rows=12706 loops=1)
         Index Cond: ((contextid)::text = $1)
   ->  Index Scan using x_fk_context_hierarchy_02 on context_hierarchy h  (cost=0.00..7.20 rows=1 width=108) (actual time=0.003..0.003 rows=0 loops=12706)
         Index Cond: (h.contextidx = c.idx)
         Filter: (((h.hierarchyname)::text = $2) AND (h.parentidx = $3))
 Total runtime: 50.064 ms
(7 rows)

And that is quite a bad plan given the current distribution of values. Another approach might be to rewrite the recursion over your hierarchy with the new-in-8.4 WITH RECURSIVE option in SQL queries. The possible gains there are way beyond anything you can accomplish by optimizing recursive functions. Regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
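As a sketch of the WITH RECURSIVE rewrite, using the table and column names visible in the plan above (the starting value 'root' and the selected columns are made-up assumptions):

```sql
-- Walk the whole hierarchy below one context in a single statement,
-- instead of one recursive function call per level.
WITH RECURSIVE subtree AS (
    SELECT c.idx, c.contextid
      FROM context c
     WHERE c.contextid = 'root'        -- hypothetical starting node
    UNION ALL
    SELECT c.idx, c.contextid
      FROM context c
      JOIN context_hierarchy h ON h.contextidx = c.idx
      JOIN subtree s           ON h.parentidx  = s.idx
)
SELECT * FROM subtree;
```

The planner sees the whole traversal at once, so each level becomes one set-oriented join instead of thousands of separate index probes driven from PL code.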
Re: [PERFORM] Slow query: table iteration (8.3)
Glenn Maynard wrote: CREATE FUNCTION high_score_for_user(user_id int) RETURNS SETOF INTEGER LANGUAGE SQL AS $$ SELECT t.id FROM test t WHERE t.user_id = $1 ORDER BY t.score DESC LIMIT 1 $$; SELECT high_score_for_user(u.id) FROM test_users u; runs in 100ms. It doesn't solve your problem without changing the result format, but what about:

aap=# explain select u.id, ARRAY(select t.id from test t where t.user_id=u.id order by t.score desc limit 2) as high from test_users u;

 QUERY PLAN
 ---
 Seq Scan on test_users u  (cost=0.00..3290.84 rows=1000 width=4)
   SubPlan 1
     ->  Limit  (cost=0.00..3.28 rows=2 width=8)
           ->  Index Scan using test_2 on test t  (cost=0.00..1637.92 rows=1000 width=8)
                 Index Cond: (user_id = $0)
(5 rows)

 id |    high
----+-------------
  1 | {641,896}
  2 | {1757,1167}
  3 | {2765,2168}
  4 | {3209,3674}
  5 | {4479,4993}

regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Why primary key index are not using in joining?
AI Rumman wrote: explain analyze select * from vtiger_crmentity inner JOIN vtiger_users ON vtiger_users.id = vtiger_crmentity.smownerid where vtiger_crmentity.deleted = 0 ;

 QUERY PLAN
 ---
 Hash Join  (cost=3665.17..40019.25 rows=640439 width=1603) (actual time=115.613..3288.436 rows=638081 loops=1)
   Hash Cond: ("outer".smownerid = "inner".id)
   ->  Bitmap Heap Scan on vtiger_crmentity  (cost=3646.54..30394.02 rows=640439 width=258) (actual time=114.763..986.504 rows=638318 loops=1)
         Recheck Cond: (deleted = 0)
         ->  Bitmap Index Scan on vtiger_crmentity_deleted_idx  (cost=0.00..3646.54 rows=640439 width=0) (actual time=107.851..107.851 rows=638318 loops=1)
               Index Cond: (deleted = 0)
   ->  Hash  (cost=18.11..18.11 rows=211 width=1345) (actual time=0.823..0.823 rows=211 loops=1)
         ->  Seq Scan on vtiger_users  (cost=0.00..18.11 rows=211 width=1345) (actual time=0.005..0.496 rows=211 loops=1)
 Total runtime: 3869.022 ms

A sequential scan is occurring on the vtiger_users table while it has a primary key index on id. Could anyone please tell me why? From the list of indexes you also supplied, it seems to me you very much want index scanning, the reason being that 4 secs is too slow? The seqscan is not the reason for that - the main reason is that you process almost all rows of the crmentity table. I bet that if you add a LIMIT, or add a clause that selects only a specific vtiger_user, the plan looks different on the access to the crmentity table as well as the kind of join; however, if your application really needs to process the 600k rows, I'm not sure it can get any faster than that. Perhaps it would help a bit to shorten the SELECT * to only the attributes you really need. Regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Immutable table functions
Luiz Angelo Daros de Luca wrote: I have a directed graph, or better, a tree in postgresql 8.3. One table holds the nodes and another one the connections. Given any node, I need to get all nodes below it (down to the leaves) that have a relation with anotherTable. Also, these connections change over time, so I have a connectionDate and a disconnectionDate for each connection (which might be null to represent an open interval). This way, I wrote a pgsql function (I renamed the tables and columns to generic names). These are the tables and the function: Hello Luiz, If you could upgrade to 8.4, you could use WITH RECURSIVE - my experience is that it is several orders of magnitude faster than recursive functions. http://developer.postgresql.org/pgdocs/postgres/queries-with.html regards, Yeb Havinga -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
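A sketch of what that could look like for the nodes/connections schema described above. The table names follow the email; the column names (parent_id, child_id, connectiondate, disconnectiondate) and the start node 42 are assumptions for illustration:

```sql
-- All descendants of node 42 through connections valid at now().
WITH RECURSIVE descendants AS (
    SELECT n.id
      FROM nodes n
     WHERE n.id = 42                                   -- hypothetical start node
    UNION ALL
    SELECT c.child_id
      FROM connections c
      JOIN descendants d ON c.parent_id = d.id
     WHERE c.connectiondate <= now()
       AND (c.disconnectiondate IS NULL OR c.disconnectiondate > now())
)
SELECT id FROM descendants;
```

The temporal filter sits in the recursive term, so expired connections prune entire subtrees instead of being filtered out afterwards.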
Re: [PERFORM] Slow query: table iteration (8.3)
Glenn Maynard wrote: The function version: CREATE FUNCTION high_score_for_user(user_id int) RETURNS SETOF INTEGER LANGUAGE SQL AS $$ SELECT t.id FROM test t WHERE t.user_id = $1 ORDER BY t.score DESC LIMIT 1 $$; SELECT high_score_for_user(u.id) FROM test_users u; runs in 100ms. Hi Glenn, About cached plans of SQL functions: from the source of functions.c

00067 /*
00068  * An SQLFunctionCache record is built during the first call,
00069  * and linked to from the fn_extra field of the FmgrInfo struct.
00070  *
00071  * Note that currently this has only the lifespan of the calling query.
00072  * Someday we might want to consider caching the parse/plan results longer
00073  * than that.
00074  */

So it is planned at every call of SELECT high_score_for_user(u.id) FROM test_users u; and the cache is used between each row of test_users. The plan is with a parameter, which means the optimizer could not make use of an actual value during planning. However, your test case is clever in the sense that there is an index on user_id and score, and the sql function has an ORDER BY that matches the index, so the planner can avoid a sort by accessing the test table using the index. In this particular case, that means that the plan is optimal: no unneeded tuples are processed, and the (function) plan complexity is logarithmic in the size of the test relation; you can't get it any better than that. In short: the lack of an actual parameter in the test case did not result in an inferior plan. So using a dynamically constructed query string in pl/pgsql to 'force' replanning during iteration cannot be faster than this sql function. It is possible to make the performance of this function worse by disabling indexscans: CREATE FUNCTION high_score_for_user(user_id int) RETURNS SETOF INTEGER LANGUAGE SQL AS $$ SELECT t.id FROM test t WHERE t.user_id = $1 ORDER BY t.score DESC LIMIT 1 $$ SET enable_indexscan = off; Now the query time with test_users is over a second. 
So maybe the converse could also be true in your production setup, using the same technique.

regards,
Yeb Havinga
Re: [PERFORM] Slow query: table iteration (8.3)
Glenn Maynard wrote:
> On Mon, Feb 1, 2010 at 6:15 AM, Yeb Havinga wrote:
>> Stomp_steps is analyzed to 2902 rows but when you run the query the
>> actual rows are 0. This means that the highscore function is not
>> called or the number 0 is incorrect.
>
> This SELECT returns 0 rows: it calls the function 1500 times, and each
> time it returns no data, because there simply aren't any results for
> these parameters.

Hmm.. first posting on a pg mailing list and I make this mistake.. What an introduction :-[

Checked the source, and indeed for every plan node the number of tuples that result from it is counted. In most cases this is the number of records that match the qualifiers (where clause / join conditions), so that was in my head: actual rows = rows that match the where clause, and without a where clause I'd expect the actual rows to reflect the total number of rows in the table. But with a set-returning function this number is something completely different.

>> The truth might be that you got that result by explaining the query
>> in the function with actual parameter values. This plan differs from
>> the one that is made when the function is called from sql and is
>> planned (once) without parameters, and in that case the plan is
>> probably different.
>
> Yeah. It would help a lot if EXPLAIN could show query plans of
> functions used by the statement and not just the top-level query.

Like subplans are, yes. Sounds like a great feature.

> Squinting at the output, it definitely looks like a less optimized
> plan; it's using a SEQSCAN instead of BITMAPHEAPSCAN. (I've attached
> the output.) Does the planner not optimize functions based on context?

I believe it does for (re)binding of parameter values to prepared statements, but not in the case of an sql function. To test an idea, there might be a workaround where you could write a pl/pgsql function that builds a string with the query and the actual parameter values and executes that new query every time.
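Such a wrapper might be sketched as follows (untested; the function name is hypothetical, the query body is taken from the function shown elsewhere in this thread, and the 8.3-compatible FOR ... IN EXECUTE loop is used):

```sql
-- Sketch: inline the actual values into the query string, so EXECUTE
-- plans it with those values on every call.
CREATE FUNCTION highscores_dynamic(steps_id int, card_id int, cnt int)
RETURNS SETOF integer LANGUAGE plpgsql AS $$
DECLARE
    result integer;
BEGIN
    FOR result IN EXECUTE
        'SELECT r.id FROM stomp_round r'
        || ' WHERE r.steps_id = '          || steps_id
        || ' AND r.user_card_id = '        || card_id
        || ' ORDER BY r.score DESC LIMIT ' || cnt
    LOOP
        RETURN NEXT result;
    END LOOP;
    RETURN;
END;
$$;
```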
It's not as pretty as a sql function, but it would give an idea of how fast things would run with each loop replanned. Another idea: maybe you could 'hint' the planner at planning time of the sql function by giving it some extra SET commands (like SET search_path, but then SET enable_seqscan = off). I don't know if planning of the sql function occurs in the environment given by its SET commands, but it's worth a try. Again, certainly not pretty.

> I can't see how to apply WITH to this. Non-recursive WITH seems like
> syntax sugar that doesn't do anything a plain SELECT can't do, and I
> don't think what I'm doing here can be done with a regular SELECT.

WITH indeed is not a solution, because the WITH query is executed once, so it cannot take a parameter. What about a window function over a join of stomp_steps and stomp_round, with PARTITION BY on steps_id and user_card_id, ORDER BY score, and row_number() <= your third parameter? From the docs I read that window functions cannot be part of the WHERE clause: an extra subselect level is then needed to filter the correct row numbers.

Regards,
Yeb Havinga
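That window-function query might be sketched like this (8.4+ syntax, untested; the values 591 and 1 are taken from the example call earlier in the thread):

```sql
-- Sketch: top-N scores per (steps_id, user_card_id) via row_number().
-- Window functions cannot appear in WHERE, hence the extra subselect.
SELECT id
FROM (
    SELECT r.id,
           row_number() OVER (PARTITION BY r.steps_id, r.user_card_id
                              ORDER BY r.score DESC) AS rn
    FROM stomp_steps s
    JOIN stomp_round r ON r.steps_id = s.id
    WHERE r.user_card_id = 591
) ranked
WHERE rn <= 1;   -- 1 = the 'count' parameter of the original function
```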
Re: [PERFORM] Slow query: table iteration (8.3)
Glenn Maynard wrote:
> Hitting a performance issue that I'm not sure how to diagnose.
>
>     SELECT highscores_for_steps_and_card(s.id, 591, 1) FROM stomp_steps s;
>
>     Seq Scan on stomp_steps s  (cost=0.00..793.52 rows=2902 width=4)
>                                (actual time=26509.919..26509.919 rows=0 loops=1)
>     Total runtime: 26509.972 ms

Hello Glenn,

Stomp_steps is analyzed to 2902 rows, but when you run the query the actual rows are 0. This means that the highscore function is not called, or the number 0 is incorrect. Suppose the number of rows is 2900; then 26 seconds means roughly 9 ms per function call. That is still a lot, compared to the 0.054 ms EXPLAIN ANALYZE result below. The truth might be that you got that result by explaining the query in the function with actual parameter values. This plan differs from the one that is made when the function is called from sql, where it is planned (once) without parameter values, and in that case the plan is probably different. A way to check the plan of that query is to turn on debug_print_plan and watch the server log. It takes a bit of getting used to. The plan starts with CONTEXT: SQL function "functionname" during startup, and is also recognizable because in the opexpr (operator expression) one of the operands is a parameter. Important is the total cost of the top plan node (the Limit). I know 8.3 is mentioned in the subject, but I think a WITH query (http://www.postgresql.org/docs/8.4/interactive/queries-with.html) could be a good solution to your problem and may be worth trying out, if you have the possibility to try 8.4.
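The debug_print_plan route could be sketched as follows (the log-watching step is manual; the CONTEXT line shown is the marker described above):

```sql
-- Sketch: have the server log plan trees, including those of queries
-- inside SQL functions, then run the outer query and read the log.
SET debug_print_plan = on;
SELECT highscores_for_steps_and_card(s.id, 591, 1) FROM stomp_steps s;
-- In the server log, the function's plan is the one accompanied by:
--   CONTEXT:  SQL function "highscores_for_steps_and_card" during startup
```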
Regards,
Yeb Havinga

> The inner function looks like this:
>
>     CREATE FUNCTION highscores_for_steps_and_card(steps_id int,
>         card_id int, count int) RETURNS SETOF INTEGER LANGUAGE SQL AS $$
>         SELECT r.id FROM stomp_round r
>         WHERE ($1 IS NULL OR r.steps_id = $1)
>           AND ($2 IS NULL OR r.user_card_id = $2)
>         ORDER BY r.score DESC LIMIT $3
>     $$
>
>     Limit  (cost=13.12..13.12 rows=1 width=8) (actual time=0.054..0.054 rows=0 loops=1)
>       ->  Sort  (cost=13.12..13.12 rows=1 width=8) (actual time=0.051..0.051 rows=0 loops=1)
>             Sort Key: score
>             Sort Method:  quicksort  Memory: 17kB
>             ->  Bitmap Heap Scan on stomp_round r  (cost=9.09..13.11 rows=1 width=8) (actual time=0.036..0.036 rows=0 loops=1)
>                   Recheck Cond: ((280 = steps_id) AND (user_card_id = 591))
>                   ->  BitmapAnd  (cost=9.09..9.09 rows=1 width=0) (actual time=0.032..0.032 rows=0 loops=1)
>                         ->  Bitmap Index Scan on stomp_round_steps_id  (cost=0.00..4.40 rows=20 width=0) (actual time=0.030..0.030 rows=0 loops=1)
>                               Index Cond: (280 = steps_id)
>                         ->  Bitmap Index Scan on stomp_round_user_card_id  (cost=0.00..4.44 rows=25 width=0) (never executed)
>                               Index Cond: (user_card_id = 591)
>     Total runtime: 0.153 ms
>     (12 rows)
>
> stomp_steps has about 1500 rows, so it finds 1500 high scores, one for
> each stage. I expected scalability issues from this on a regular drive,
> since it'll be doing a ton of index seeking when not working out of
> cache, so I expected to need to change to an SSD at some point (when it
> no longer easily fits in cache). However, I/O doesn't seem to be the
> bottleneck yet. If I run it several times, it consistently takes 26
> seconds. The entire database is in OS cache (find | xargs cat: 250ms).
>
> I'm not sure why the full query (26s) is orders of magnitude slower
> than 1500 * 0.150 ms (225 ms). It's not a very complex query, and I'd
> hope it's not being re-planned every iteration through the loop. Any
> thoughts?
> Using SELECT to iterate over a table like this is very useful (and I
> don't know any practical alternative), but it's difficult to profile
> since it doesn't play nice with EXPLAIN ANALYZE.