Re: [PERFORM] Anyone running Intel S3700 SSDs?

2013-03-08 Thread Yeb Havinga

On 2013-03-08 07:48, CSS wrote:


http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review

Of most interest to me was this:


Anyone have any benchmarks to share?

I tested the 800GB disk with diskchecker.pl and it was ok. No benchmarks 
to share yet, as I'm waiting for the disks to be installed in a server.


--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Two identical systems, radically different performance

2012-10-09 Thread Yeb Havinga

On 2012-10-08 23:45, Craig James wrote:
This is driving me crazy.  A new server, virtually identical to an old 
one, has 50% of the performance with pgbench.  I've checked everything 
I can think of.


The setups (call the servers "old" and "new"):

old: 2 x 4-core Intel Xeon E5620
new: 4 x 4-core Intel Xeon E5606


How are the filesystems formatted and mounted (-o nobarrier?)

regards
Yeb





Re: [PERFORM] SSD, Postgres and safe write cache

2012-06-25 Thread Yeb Havinga

On 2012-06-20 16:51, Michal Szymanski wrote:

Hi,
We started to think about using an SSD drive for our telco system DB. Because we have 
a lot of "almost" read-only data, I think an SSD is a good candidate for our task. We would 
like to speed up read operations.
I've read a post (http://blog.2ndquadrant.com/intel_ssd_now_off_the_sherr_sh/) 
about SSDs that have write-safe functionality; two drives are recommended there, 
the Intel 320 and the Vertex 2 Pro. Both are rather inexpensive, but both use the older 
SATA II interface.
I tried to find a newer, faster version of the Vertex, because production of the Vertex 2 Pro 
has stopped, but there is no information about new drives that have similar 
functionality and are cheap. Can you recommend cheap SSD drives that are 
suitable for DB needs?
We were able to get OCZ Deneva 2's 
(http://www.oczenterprise.com/downloads/solutions/ocz-deneva2-r-mlc-2.5in_Product_Brief.pdf) 
from our supplier, which were suggested by OCZ as a replacement for the 
Vertex 2 Pros. They're marketed as safe under power failure, and our 
tests with the diskchecker tool confirmed that.


regards,
Yeb Havinga




Re: [PERFORM] Parallel Scaling of a pgplsql problem

2012-04-25 Thread Yeb Havinga

On 2012-04-26 04:40, Venki Ramachandran wrote:
Thanks Tom, clock_timestamp() worked. Appreciate it! Sorry, I was 
hurrying to get this done at work and hence did not read through.


Can you comment on how you would solve the original problem? Even if I 
can get the 11 seconds down to 500 ms for one pair, running it for 
300k pairs will take multiple hours. How can one write a combination 
of a bash script and pgplsql code so as to use all 8 cores of a server? I 
am seeing that this is just executing in one session/process.


You want to compare a calculation on the cross product 'employee x 
employee'. If employee is partitioned into emp1, emp2, ... emp8, the 
cross product is equal to the union of emp1 x employee, emp2 x employee, 
.. emp8 x employee. Each of these 8 cross products on partitions can be 
executed in parallel. I'd look into dblink to execute each of the 8 
cross products in parallel, and then union all of those results.


http://www.postgresql.org/docs/9.1/static/contrib-dblink-connect.html
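The divide-and-union idea can be sketched outside the database as follows. This is a hypothetical illustration, not the actual dblink SQL: the `score` function stands in for the expensive per-pair pgplsql calculation, and each concurrent worker plays the role of one dblink connection computing its partition's cross join.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical pairwise score; stands in for the expensive per-pair pgplsql function.
def score(a, b):
    return abs(a - b)

def partition_work(part, everyone):
    # One "emp_i x employee" cross product -- what a single dblink session computes.
    return [(a, b, score(a, b)) for a, b in product(part, everyone)]

employees = list(range(16))
nparts = 8
partitions = [employees[i::nparts] for i in range(nparts)]

# Run the 8 partition-level cross products concurrently, then union the results.
with ThreadPoolExecutor(max_workers=nparts) as pool:
    futures = [pool.submit(partition_work, p, employees) for p in partitions]
    results = [row for f in futures for row in f.result()]

# The union of the partitioned cross products equals the full cross product.
assert len(results) == len(employees) ** 2
```

In PostgreSQL, the real parallelism would come from each dblink connection being a separate backend process, so the 8 sessions can genuinely use 8 cores.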

regards,
Yeb



Re: [PERFORM] Sudden Query slowdown on our Postgresql Server

2012-03-23 Thread Yeb Havinga

On 2012-03-23 05:53, Sebastian Melchior wrote:

Hi,

we already used iostat and iotop during times of the slowdown; there is no 
sudden drop in I/O workload during the slowdowns, and iowait does not 
spike but stays as before.
So I do not think that this is I/O related. As the disks are SSDs, there is also still 
some "head room" left.


I've seen an SSD completely lock up for a dozen seconds or so after 
giving it a smartctl command to trim a section of the disk. I'm not sure 
if that was the Vertex 2 Pro disk I was testing or the Intel 710, but it was 
enough reason for us to not mount filesystems with -o discard.
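An alternative to inline discard is a periodic batched trim. This is a sketch, assuming util-linux's fstrim and a filesystem/device pair that supports TRIM; the `/ocz` mountpoint is just an example:

```shell
# Batched alternative to 'mount -o discard': trim unused blocks in one pass
# at a quiet moment, instead of taking discard stalls during production load.
fstrim -v /ocz    # -v reports how many bytes were trimmed

# e.g. run weekly from cron:
# 0 4 * * 0  /sbin/fstrim /ocz
```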


regards,
Yeb






Re: [PERFORM] SSD and RAID

2012-03-07 Thread Yeb Havinga

On 2012-03-07 01:36, Mark Kirkwood wrote:

On 06/03/12 21:17, Yeb Havinga wrote:



One thing to note is that linux software raid with md doesn't support 
discard, which might shorten the drive's expected lifetime. To get 
some numbers I tested the raid 1 of ssd's setup for mediawear under a 
PostgreSQL load earlier, see 
http://archives.postgresql.org/pgsql-general/2011-11/msg00141.php





Right, which is a bit of a pain - we are considering either formatting 
the drive with less capacity and using md RAID 1 or else doing the 
mirror in LVM to enable a working discard/trim.


When I measured the write durability without discard on the enterprise 
disks, I got numbers that in normal production use would outlive the 
lifetime of the servers. It would be interesting to see durability 
numbers for the desktop SSDs, even when partitioned to a part of the disk.


regards,
Yeb Havinga




Re: [PERFORM] SSD and RAID

2012-03-06 Thread Yeb Havinga

On 2012-03-06 09:34, Andrea Suisani wrote:

On 03/06/2012 09:17 AM, Yeb Havinga wrote:


PS: we applied the same philosophy (different brands) also to 
motherboards, IO controllers and memory, but after testing we liked 
one IO controller's software so much more than the other's that we chose 
to use only one. Also, the stream memory performance of one motherboard 
showed such a significant regression at the higher thread counts that 
we decided to go for the other brand for all servers.




care to share the winning motherboard model?

thanks
Andrea



At http://i.imgur.com/vfmvu.png is a graph of three systems, made with 
the multi-stream scaling test (average of 10 runs, if I remember correctly).


Red and blue are 2 x 12-core Opteron 6168 systems with 64 GB of DDR3-1333 
in 8 GB DIMMs.


Red is a Tyan S8230
Blue is a Supermicro H8DGI-G

We tried a lot of things to rule out the motherboards, such as swapping 
memory between the two systems, ensuring the BIOS settings were similar 
(e.g. ECC mode), and updating to the latest BIOS where possible, but none 
of that fixed the memory performance drop. Both systems were installed 
with kickstarted CentOS 6.2, so there are no kernel setting differences 
either.


regards,
Yeb

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data




Re: [PERFORM] SSD and RAID

2012-03-06 Thread Yeb Havinga

On 2012-03-05 23:37, Mark Kirkwood wrote:
Which brings up the question of should it be a pair in RAID 1 or just 
a single drive? Traditionally this would have been a no-brainer "Of 
course you want RAID 1 or RAID 10"! However our experience with SSD 
failure modes points to firmware bugs as primary source of trouble - 
and these are likely to impact both drives (nearly) simultaneously in 
a RAID 1 configuration. Also the other major issue to watch - flash 
write limit exhaustion - is also likely to hit at the same time for a 
pair of drives in RAID 1.


What are other folks who are using SSDs doing?


This is exactly the reason why, in a set of new hardware I'm currently 
evaluating, we chose two different manufacturers for the spindles 
(behind BBWC for WAL, OS, archives etc.) and for the SSDs (on motherboard 
SATA ports). For the SSDs we chose the Intel 710 and the OCZ Vertex 2 PRO; 
however, that last one was EOL and OCZ offered to replace it with the Deneva 2 
(http://www.oczenterprise.com/downloads/solutions/ocz-deneva2-r-mlc-2.5in_Product_Brief.pdf). 
Still waiting for a test Deneva though.


One thing to note is that linux software raid with md doesn't support 
discard, which might shorten the drive's expected lifetime. To get some 
numbers I tested the raid 1 of ssd's setup for mediawear under a 
PostgreSQL load earlier, see 
http://archives.postgresql.org/pgsql-general/2011-11/msg00141.php


I would recommend that every SSD 
considered for production use be tested with diskchecker.pl on a 
filesystem that's mounted the same way you would mount it for your data 
(e.g. xfs or ext4 with nobarrier), and also given a media-wear test like 
the one described in the linked pgsql-general thread above, especially if 
you're choosing to run on non-enterprise-marketed SSDs.


regards,
Yeb

PS: we applied the same philosophy (different brands) also to 
motherboards, io controllers and memory, but after testing, we liked one 
IO controllers software so much more than the other so we chose to have 
only one. Also stream memory performance of one motherboard showed a 
significant performance regression in the higher thread counts that we 
decided to go for the other brand for all servers.


--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data





Re: [PERFORM] Intel 710 pgbench write latencies

2011-11-03 Thread Yeb Havinga

On 2011-11-02 16:06, Magnus Hagander wrote:

On Wed, Nov 2, 2011 at 16:04, Yeb Havinga  wrote:

On 2011-11-02 15:06, Kevin Grittner wrote:

Yeb Havinga wrote:


I'm now contemplating not using the 710 at all. Why should I not
buy two 6Gbps SSDs without supercap (e.g. Intel 510 and OCZ Vertex
3 Max IOPS) with a IO controller+BBU?

Wouldn't the data be subject to loss between the time the IO
controller writes to the SSD and the time it makes it from buffers
to flash RAM?

Good question. My guess would be no, if the raid controller does
'write-throughs' on the attached disks, and the SSD's don't lie about when
they've written to RAM.

Don't most SSDs without supercaps lie about the writes, though?



I happened to have a Vertex 3, no supercap, available to test this with 
diskchecker. On an ext4 filesystem (just mounted with noatime, not 
barriers=off), this happened:


# /root/diskchecker.pl -s 192.168.73.1 verify testfile
 verifying: 0.00%
 verifying: 30.67%
 verifying: 78.97%
 verifying: 100.00%
Total errors: 0

So I guess that's about as much as I can test without actually hooking 
it behind a hardware controller and test that. I will soon test the 
3ware 9750 with Vertex 3 and Intel 510 - both in the 3ware's ssd 
compatibility list.


More info from testing software raid 1:
- with lvm mirroring, discards / trim go through to the disks. This is 
where the Intel is fast enough, but the Vertex 2 Pro stays busy for ~10 
seconds.


-- Yeb




Re: [PERFORM] Intel 710 pgbench write latencies

2011-11-03 Thread Yeb Havinga

On 2011-11-03 15:31, Shaun Thomas wrote:

On 11/03/2011 04:38 AM, Yeb Havinga wrote:


Both comparable near 10K tps.


That's another thing I was wondering about. Why are we talking about 
Vertex 2 Pro's, anyway? The Vertex 3 Pros post much better results and 
are still capacitor-backed.




Not for sale yet.

-- Yeb




Re: [PERFORM] Intel 710 pgbench write latencies

2011-11-03 Thread Yeb Havinga

On 2011-11-02 22:08, Merlin Moncure wrote:

On Wed, Nov 2, 2011 at 3:45 PM, Yeb Havinga  wrote:

Intel latency graph at http://imgur.com/Hh3xI
Ocz latency graph at http://imgur.com/T09LG

curious: what were the pgbench results in terms of tps?

merlin


Both comparable near 10K tps.

-- Yeb




Re: [PERFORM] Intel 710 pgbench write latencies

2011-11-02 Thread Yeb Havinga

On 2011-11-02 16:16, Yeb Havinga wrote:

On 2011-11-02 15:26, Merlin Moncure wrote:


I would keep at least 20-30% of both drives unpartitioned to leave the
controller room to wear-level and do other stuff.  I'd try
wiping the drives, repartitioning, and repeating your test.  I would
also compare times through mdadm and directly to the device.


Good idea.


Reinstalled the system, with > 50% of each drive unpartitioned:
/dev/sdb3  19G  5.0G   13G  29% /ocz
/dev/sda3  19G  4.8G   13G  28% /intel
/dev/sdb3 on /ocz type ext4 (rw,noatime,nobarrier,discard)
/dev/sda3 on /intel type ext4 (rw,noatime,nobarrier,discard)

Again WAL was put in a ramdisk.

pgbench -i -s 300 t # fits in ram
pgbench -c 20 -M prepared -T 300 -l  t

Intel latency graph at http://imgur.com/Hh3xI
Ocz latency graph at http://imgur.com/T09LG
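Latency graphs like the ones linked above can be produced from the per-transaction log that `pgbench -l` writes. A minimal sketch of that post-processing, assuming the 9.x-era log line layout (client_id, transaction_no, latency in microseconds, file_no, epoch seconds, microseconds); the sample lines are made up for illustration:

```python
# Parse pgbench -l per-transaction log lines and find latency spikes.
# Assumed line format: client_id txn_no latency_us file_no sec usec
sample = """\
0 201 2341 0 1320240000 123456
1 202 2989213 0 1320240002 654321
2 203 1870 0 1320240002 700000
"""

latencies_ms = []
for line in sample.splitlines():
    fields = line.split()
    latencies_ms.append(int(fields[2]) / 1000.0)  # microseconds -> milliseconds

worst = max(latencies_ms)
print(f"max latency: {worst:.0f} ms over {len(latencies_ms)} transactions")
```

Plotting (latency, timestamp) pairs from such a parse is enough to reproduce the spike graphs discussed in this thread.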





Re: [PERFORM] Intel 710 pgbench write latencies

2011-11-02 Thread Yeb Havinga

On 2011-11-02 15:26, Merlin Moncure wrote:

On Wed, Nov 2, 2011 at 8:05 AM, Yeb Havinga  wrote:

Hello list,

A OCZ Vertex 2 PRO and Intel 710 SSD, both 100GB, in a software raid 1
setup. I was pretty convinced this was the perfect solution to run
PostgreSQL on SSDs without a IO controller with BBU. No worries for strange
firmware bugs because of two different drives, good write endurance of the
710. Access to the smart attributes. Complete control over the disks:
nothing hidden by a hardware raid IO layer.

Then I did a pgbench test:
- bigger than RAM test (~30GB database with 24GB ram)
- and during that test I removed the Intel 710.
- during the test I removed the 710 and 10 minutes later inserted it again
and added it to the array.

The pgbench transaction latency graph is here: http://imgur.com/JSdQd

With only the OCZ, latencies are acceptable but with two drives, there are
latencies up to 3 seconds! (and 11 seconds at disk remove time) Is this due
to software raid, or is it the Intel 710? To figure that out I repeated the
test, but now removing the OCZ, latency graph at: http://imgur.com/DQa59
(The 12 seconds maximum was at disk remove time.)

So the Intel 710 kind of sucks latency wise. Is it because it is also
heavily reading, and maybe WAL should not be put on it?

I did another test, same as before but
* with 5GB database completely fitting in RAM (24GB)
* put WAL on a ramdisk
* started on the mirror
* during the test mdadm --fail on the Intel SSD

Latency graph is at: http://imgur.com/dY0Rk

So still: with Intel 710 participating in writes (beginning of graph), some
latencies are over 2 seconds, with only the OCZ, max write latencies are
near 300ms.

I'm now contemplating not using the 710 at all. Why should I not buy two
6Gbps SSDs without supercap (e.g. Intel 510 and OCZ Vertex 3 Max IOPS) with
a IO controller+BBU?

Benefits: should be faster for all kinds of reads and writes.
Concerns: TRIM becomes impossible (which was already impossible with md
raid1, lvm / dm based mirroring could work) but is TRIM important for a
PostgreSQL io load, without e.g. routine TRUNCATES? Also the write endurance
of these drives is probably a lot less than previous setup.

software RAID (mdadm) is currently blocking TRIM.  the only way to
get TRIM in a raid-ish environment is through LVM mirroring/striping
or w/btrfs raid (which is not production ready afaik).

Given that, if you do use software raid, it's not a good idea to
partition the entire drive because the very first thing the raid
driver does is write to the entire device.


If that is bad because of a decreased lifetime, I don't think this 
number of writes is significant: in a few hours of pgbenching, the GBs 
written are more than 10 times the GB sizes of the drives. Or do you 
suggest this because the disk firmware can then operate assuming a 
smaller IDEMA capacity, thereby prolonging the drive's life? (i.e. the 
Intel 710 200GB has 200GB IDEMA capacity but 320GB of raw flash).
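The spare-area arithmetic behind that question, using the figures quoted above. A sketch: the 100 GB partitioned size is just an example value, not from the thread.

```python
# Intel 710 200GB figures quoted above: 200 GB IDEMA capacity, 320 GB raw flash.
raw_gb = 320
idema_gb = 200

spare_gb = raw_gb - idema_gb          # flash the firmware can wear-level over
op_factory = spare_gb / idema_gb      # factory over-provisioning ratio

# Partitioning only part of the visible capacity raises the effective spare
# area, which is the suggested mechanism for prolonging drive life.
partitioned_gb = 100                  # hypothetical: use only half the drive
op_effective = (raw_gb - partitioned_gb) / partitioned_gb

print(f"factory OP: {op_factory:.0%}, "
      f"effective OP at {partitioned_gb} GB used: {op_effective:.0%}")
```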



I would keep at least 20-30% of both drives unpartitioned to leave the
controller room to wear-level and do other stuff.  I'd try
wiping the drives, repartitioning, and repeating your test.  I would
also compare times through mdadm and directly to the device.


Good idea.

-- Yeb




Re: [PERFORM] Intel 710 pgbench write latencies

2011-11-02 Thread Yeb Havinga

On 2011-11-02 15:06, Kevin Grittner wrote:

Yeb Havinga  wrote:


I'm now contemplating not using the 710 at all. Why should I not
buy two 6Gbps SSDs without supercap (e.g. Intel 510 and OCZ Vertex
3 Max IOPS) with a IO controller+BBU?


Wouldn't the data be subject to loss between the time the IO
controller writes to the SSD and the time it makes it from buffers
to flash RAM?


Good question. My guess would be no, if the raid controller does 
'write-throughs' on the attached disks, and the SSD's don't lie about 
when they've written to RAM.


I'll put this on my to test list for the new setup.

-- Yeb




[PERFORM] Intel 710 pgbench write latencies

2011-11-02 Thread Yeb Havinga

Hello list,

An OCZ Vertex 2 PRO and an Intel 710 SSD, both 100GB, in a software RAID 1 
setup. I was pretty convinced this was the perfect solution to run 
PostgreSQL on SSDs without an IO controller with BBU: no worries about 
strange firmware bugs because of two different drives, good write 
endurance of the 710, access to the SMART attributes, and complete 
control over the disks, with nothing hidden by a hardware RAID IO layer.


Then I did a pgbench test:
- bigger than RAM (~30GB database with 24GB of RAM)
- during the test I removed the Intel 710, and 10 minutes later inserted 
it again and re-added it to the array.


The pgbench transaction latency graph is here: http://imgur.com/JSdQd

With only the OCZ, latencies are acceptable, but with two drives there 
are latencies of up to 3 seconds! (and 11 seconds at disk-remove time) Is 
this due to software raid, or is it the Intel 710? To figure that out I 
repeated the test, but now removing the OCZ; latency graph at: 
http://imgur.com/DQa59 (The 12 seconds maximum was at disk-remove time.)


So the Intel 710 kind of sucks latency wise. Is it because it is also 
heavily reading, and maybe WAL should not be put on it?


I did another test, same as before but
* with 5GB database completely fitting in RAM (24GB)
* put WAL on a ramdisk
* started on the mirror
* during the test mdadm --fail on the Intel SSD

Latency graph is at: http://imgur.com/dY0Rk

So still: with Intel 710 participating in writes (beginning of graph), 
some latencies are over 2 seconds, with only the OCZ, max write 
latencies are near 300ms.


I'm now contemplating not using the 710 at all. Why should I not buy two 
6Gbps SSDs without supercap (e.g. Intel 510 and OCZ Vertex 3 Max IOPS) 
with a IO controller+BBU?


Benefits: should be faster for all kinds of reads and writes.
Concerns: TRIM becomes impossible (which was already impossible with md 
raid1, lvm / dm based mirroring could work) but is TRIM important for a 
PostgreSQL io load, without e.g. routine TRUNCATES? Also the write 
endurance of these drives is probably a lot less than previous setup.


Thoughts, ideas are highly appreciated!
-- Yeb

PS:
I checked for proper alignment of the partitions as well as md's data 
offset; all was well.

Ext4 filesystem mounted with barrier=0
/proc/sys/vm/dirty_background_bytes set to 17850
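For reference, the writeback setting above can be applied at runtime or persisted. A sketch, assuming the procps sysctl tool; the 17850 value is simply the one used in this test:

```shell
# Apply the dirty-page writeback threshold from the test above at runtime:
echo 17850 > /proc/sys/vm/dirty_background_bytes

# ...or persist it across reboots in the equivalent sysctl form:
echo "vm.dirty_background_bytes = 17850" >> /etc/sysctl.conf
sysctl -p
```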





Re: [PERFORM] BBU still needed with SSD?

2011-07-19 Thread Yeb Havinga

On 2011-07-19 13:37, Florian Weimer wrote:

Is this "Total_LBAs_Written"?
I got the same name "Total_LBAs_Written" with smartmontools 5.39; it 
was renamed to 241 Lifetime_Writes_GiB after an upgrade to 5.42. Note that 
this is smartmontools' new interpretation of the values, which happens to 
match the OCZ tool's interpretation (241: SSD lifetime writes from 
host. Number of bytes written to SSD: 448 GB). So for the Intels 
it's probably also lifetime writes in GB, but you'd have to check with an 
Intel SMART value reader to be absolutely sure.

   The values appear to be far too low:

241 Total_LBAs_Written  0x0032   100   100   000   Old_age   Always   -   188276
242 Total_LBAs_Read     0x0032   100   100   000   Old_age   Always   -   116800

241 Total_LBAs_Written  0x0032   100   100   000   Old_age   Always   -   189677
242 Total_LBAs_Read     0x0032   100   100   000   Old_age   Always   -   92509
Hmm, that would mean 188TB written. Does that value seem right for your 
use case? At a sustained 100MB/s of writes, it would take 22 days to 
reach 188TB.
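The 22-day figure above follows directly from the raw attribute value, under the assumption that attribute 241 counts lifetime writes in GB:

```python
# If attribute 241 is lifetime writes in GB, the raw value 188276 is ~188 TB.
total_written_gb = 188276
rate_mb_s = 100                                  # assumed sustained write rate

seconds = total_written_gb * 1000 / rate_mb_s    # GB -> MB, then MB / (MB/s)
days = seconds / 86400
print(f"~{days:.0f} days of sustained {rate_mb_s} MB/s writes")
```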

The second set of numbers is from the drive which wears more quickly.
It's strange that there's such a large difference in lifetime left when 
the lifetime writes are so similar. Maybe there are more small md metadata 
updates on the second disk, but without digging into md's internals it's 
impossible to say anything constructive about it.


Off-topic: new cool tool in smartmontools-5.4x: 
/usr/sbin/update-smart-drivedb :-)


--

Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data




Re: [PERFORM] BBU still needed with SSD?

2011-07-19 Thread Yeb Havinga

On 2011-07-19 12:47, Florian Weimer wrote:



It would be interesting to see if the drives also show total xyz
written, and if that differs a lot too.

Do you know how to check that with smartctl?
smartctl -a /dev/ should show all values. If it shows 
something that looks like garbage, it means that the database of 
smartmontools doesn't have the correct information yet for these new 
drives. I know that for the recent OCZ Vertex 2 and 3 SSDs you 
need at least 5.40 or 5.41, and that's pretty new stuff. (I just happened 
to install Fedora 15 today, which has smartmontools 5.41, whereas e.g. 
Scientific Linux 6 has 5.39.)


--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data




Re: [PERFORM] BBU still needed with SSD?

2011-07-19 Thread Yeb Havinga

On 2011-07-19 09:56, Florian Weimer wrote:

* Yeb Havinga:


The biggest drawback of 2 SSD's with supercap in hardware raid 1, is
that if they are both new and of the same model/firmware, they'd
probably reach the end of their write cycles at the same time, thereby
failing simultaneously.

I thought so too, but I've got two Intel 320s (I suppose, the report
device model is "SSDSA2CT040G3") in a RAID 1 configuration, and after
about a month of testing, one is down to 89 on the media wearout
indicator, and the other is still at 96.  Both devices are
deteriorating, but one at a significantly faster rate.
That's great news if this turns out to be generally true. Is it on mdadm 
software raid?


I searched a bit in the mdadm manual for reasons this could be the case. 
It isn't the occasional check (echo check > 
/sys/block/md0/md/sync_action), since that seems to do two reads and 
compare. Another idea was that the layout of the mirror might differ 
between the drives, but the manual says the --layout configuration 
directive is only for RAID 5, 6 and 10, not RAID 1. Then my eye caught 
--write-behind, the maximum number of outstanding writes: it has a 
non-zero default value, but write-behind is only done if a drive is 
marked write-mostly.


Maybe it is caused by the initial build of the array? But then a 7% 
difference seems like an awful lot.


It would be interesting to see if the drives also show total xyz 
written, and if that differs a lot too.


regards,
Yeb Havinga




Re: [PERFORM] BBU still needed with SSD?

2011-07-18 Thread Yeb Havinga

On 2011-07-18 03:43, Andy wrote:

Hi,

Is BBU still needed with SSD?

SSD has its own cache. And in certain models such as Intel 320 that cache is 
backed by capacitors. So in a sense that cache acts as a BBU that's backed by 
capacitors instead of batteries.

In this case is BBU still needed? If I put 2 SSD

+with supercap?

  in software RAID 1, would that be any slower than 2 SSD in HW RAID 1 with 
BBU? What are the pros and cons?
The biggest drawback of 2 SSDs with supercap in hardware RAID 1 is 
that if they are both new and of the same model/firmware, they'd 
probably reach the end of their write cycles at the same time, thereby 
failing simultaneously. You'd have to start the software raid setup with 
two SSDs that have different amounts of life left.


regards,
Yeb





Re: [PERFORM] Contemplating SSD Hardware RAID

2011-06-21 Thread Yeb Havinga

On 2011-06-21 22:10, Yeb Havinga wrote:



There's some info buried in 
http://archives.postgresql.org/pgsql-performance/2011-03/msg00350.php 
where two Vertex 2 Pros are compared; the first has been really 
hammered with pgbench, the second had a few months' duty in a 
workstation. The raw value of SSD Available Reserved Space seems to be 
a good candidate to watch approach 0, since the pgbenched drive has 
16GB left and the workstation disk 17GB. Would be cool to graph with 
e.g. symon (http://i.imgur.com/T4NAq.png).




I forgot to mention that both the newest firmware for the drives and an 
svn version of smartmontools are advisable before trying to figure out 
what all those strange values mean. It's too bad, however, that OCZ 
doesn't let the user choose which firmware to run (the tool always picks 
the newest), so after every upgrade it's a surprise which values are 
supported, or whether any of the values are reset or interpreted 
differently. Even if disks in production aren't upgraded eagerly, 
replacing a faulty drive means that the replacement probably needs to be 
upgraded first, and it would be nice to have a uniform SMART value 
readout for the monitoring tools.


--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data




Re: [PERFORM] Contemplating SSD Hardware RAID

2011-06-21 Thread Yeb Havinga
to the host for all data access: 0
194: SSD Temperature Monitoring             Current: 128  High: 129  Low: 127
195: SSD ECC On-the-fly Count               Normalized Rate: 100
196: SSD Reallocation Event Count           Total number of reallocated Flash blocks: 0
201: SSD Uncorrectable Soft Read Error Rate Normalized Rate: 100
204: SSD Soft ECC Correction Rate (RAISE)   Normalized Rate: 100
230: SSD Life Curve Status                  Current state of drive operation based upon the Life Curve: 100
231: SSD Life Left                          Approximate SSD life remaining: 100%
241: SSD Lifetime writes from host          Number of bytes written to SSD: 162 GB
242: SSD Lifetime reads from host           Number of bytes read from SSD: 236 GB



There's some info burried in 
http://archives.postgresql.org/pgsql-performance/2011-03/msg00350.php 
where two Vertex 2 pro's are compared; the first has been really 
hammered with pgbench, the second had a few months duty in a 
workstation. The raw value of SSD Available Reserved Space seems to be a 
good candidate to watch to go to 0, since the pgbenched-drive has 16GB 
left and the workstation disk 17GB. Would be cool to graph with e.g. 
symon (http://i.imgur.com/T4NAq.png)


--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data




Re: [PERFORM] Contemplating SSD Hardware RAID

2011-06-21 Thread Yeb Havinga

On 2011-06-21 09:51, Yeb Havinga wrote:

On 2011-06-21 08:33, Greg Smith wrote:

On 06/20/2011 11:54 PM, Dan Harris wrote:

I'm exploring the combination of an Areca 1880ix-12 controller with 
6x OCZ Vertex 3 V3LT-25SAT3 2.5" 240GB SATA III drives in RAID-10.  
Has anyone tried this combination?  What nasty surprise am I 
overlooking here?


I forgot to mention that with an SSD it's important to watch the 
remaining lifetime. These values can be read with smartctl. When putting 
the disk behind a hardware RAID controller, you might not be able to 
read them from the OS, and the hardware RAID firmware might be too old to 
know about the SSD lifetime indicator, or might not show it at all.


--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data




Re: [PERFORM] Contemplating SSD Hardware RAID

2011-06-21 Thread Yeb Havinga

On 2011-06-21 08:33, Greg Smith wrote:

On 06/20/2011 11:54 PM, Dan Harris wrote:

I'm exploring the combination of an Areca 1880ix-12 controller with 
6x OCZ Vertex 3 V3LT-25SAT3 2.5" 240GB SATA III drives in RAID-10.  
Has anyone tried this combination?  What nasty surprise am I 
overlooking here?


You can expect database corruption the first time something unexpected 
interrupts the power to the server.  That's nasty, but it's not 
surprising--that's well documented as what happens when you run 
PostreSQL on hardware with this feature set.  You have to get a Vertex 
3 Pro to get one of the reliable 3rd gen designs from them with a 
supercap.  (I don't think those are even out yet though)  We've had 
reports here of the earlier Vertex 2 Pro being fully stress tested and 
working out well.  I wouldn't even bother with a regular Vertex 3, 
because I don't see any reason to believe it could be reliable for 
database use, just like the Vertex 2 failed to work in that role.




I've tested the Vertex 2, the Vertex 2 Pro and the Vertex 3. The Vertex 3 
Pro is not yet available. The Vertex 3 I tested with pgbench didn't 
outperform the Vertex 2 (yes, it was attached to a SATA III port). Also, 
the Vertex 3 didn't work in my designated system until a firmware 
upgrade that became available ~2.5 months after I purchased it. The 
support call I had with OCZ failed to mention it; by pure 
coincidence, when I did some more testing at a later time, I ran the 
firmware upgrade tool (which somewhat hides which firmwares are 
available, if any) and it did an update. After that, the drive was 
compatible with the designated motherboard.


Another disappointment was that after I had purchased the Vertex 3 
drive, OCZ announced a max-iops vertex 3. Did that actually mean I 
bought an inferior version? Talk about a bad out-of-the-box 
experience. -1 OCZ fanboy.


When putting such a SSD up for database use I'd only consider a vertex 2 
pro (for the supercap), paired with another SSD of a different brand 
with supercap (i.e. the recent intels). When this is done on a 
motherboard with > 1 sata controller, you'd have controller redundancy 
and can also survive single drive failures when a drive wears out. 
Having two different SSD versions decreases the chance of both wearing 
out at the same time, and makes you a bit more resilient against firmware 
bugs. It would be great if there was yet another supercapped SSD brand, 
with a modified md software raid that reads all three drives at once and 
compares results, instead of the occasional check. If at least two 
drives agree on the contents, return the data.
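The 2-out-of-3 voting read described above is simple enough to sketch. The following is only a toy illustration of the idea, not md internals; the device paths and the whole-block read are assumptions:

```python
def vote_read(paths, offset, length):
    """Toy 2-out-of-3 voting read: fetch the same block from three
    devices and return the copy that at least two of them agree on."""
    blocks = []
    for path in paths:
        with open(path, "rb") as f:
            f.seek(offset)
            blocks.append(f.read(length))
    for block in blocks:
        # majority vote: any block value occurring on >= 2 devices wins
        if blocks.count(block) >= 2:
            return block
    raise IOError("no two devices agree on block at offset %d" % offset)
```

A real implementation would live in the md layer and handle rewriting the disagreeing copy; this only shows the voting step.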


--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data




Re: [PERFORM] Benchmarking a large server

2011-05-09 Thread Yeb Havinga

On 2011-05-09 22:32, Chris Hoover wrote:


The issue we are running into is how do we benchmark this server, 
specifically, how do we get valid benchmarks for the Fusion IO card? 
 Normally to eliminate the cache effect, you run iozone and other 
benchmark suites at 2x the ram.  However, we can't do that due to 2TB 
> 1.3TB.


So, does anyone have any suggestions/experiences in benchmarking 
storage when the storage is smaller then 2x memory?


Oracle's Orion test tool has a configurable cache size parameter - it's 
a separate download and specifically written to benchmark database oltp 
and olap like io patterns, see 
http://www.oracle.com/technetwork/topics/index-089595.html


--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data




Re: [PERFORM] Intel SSDs that may not suck

2011-03-29 Thread Yeb Havinga
c_Soft_Read_Err_Rate  0x001c   120   120   000Old_age   
Offline  -   0/0
204 Soft_ECC_Correct_Rate   0x001c   120   120   000Old_age   
Offline  -   0/0
230 Life_Curve_Status   0x0013   100   100   000Pre-fail  
Always   -   100
231 SSD_Life_Left   0x0013   100   100   010Pre-fail  
Always   -   0
232 Available_Reservd_Space 0x   000   000   010Old_age   
Offline  FAILING_NOW 17
233 SandForce_Internal  0x   000   000   000Old_age   
Offline  -   128
234 SandForce_Internal  0x0032   000   000   000Old_age   
Always   -   448
235 SuperCap_Health 0x0033   100   100   010Pre-fail  
Always   -   0
241 Lifetime_Writes_GiB 0x0032   000   000   000Old_age   
Always   -   448
242 Lifetime_Reads_GiB  0x0032   000   000   000Old_age   
Always   -   192


SMART Error Log not supported
SMART Self-test Log not supported
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
100  Not_testing
200  Not_testing
300  Not_testing
400  Not_testing
500  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data




Re: [PERFORM] CPUs for new databases

2010-10-27 Thread Yeb Havinga

Scott Marlowe wrote:

There was an earlier thread with
Greg and I in it where we posted the memory bandwidth numbers for that
machine and it was insane how much data all 48 cores could pump into /
out of memory at the same time.
  
Yeah, it was insane. Building an economical 'that generation opteron' 
database server has been on my wishlist since that thread, my current 
favorite is the 8-core 6128 opteron, for $275,- at newegg 
http://www.newegg.com/Product/Product.aspx?Item=N82E16819105266


Ah might as well drop the whole config on my wishlist as well:

2 times that 8 core processor
Supermicro H8DGU-F motherboard - 16 dimm slots, dual socket, dual Intel 
ethernet and additional ethernet for IPMI.
2 times KVR1333D3D4R9SK8/32G memory - 4GB dimms seem to be at the GB/$ 
sweet spot at the moment for DDR3
1 time OCZ Vertex 2 Pro 100GB (there was a thread about this sandforce 
disk as well: a SSD with supercap that acts as battery backup)

maybe another one or two spindled 2.5" drives for archive/backup.
Supermicro 113TQ-563UB chassis

At the time I looked this up, I could buy it for just over €3000,-

regards
Yeb Havinga

PS: I'm in no way involved with either of the manufacturers, nor one of 
their fanboys. I'm just interested, like the OP, in what good 
hardware/config looks like for a PG-related server.





Re: [PERFORM] Performance on new 64bit server compared to my 32bit desktop

2010-08-31 Thread Yeb Havinga

Scott Marlowe wrote:

On Tue, Aug 31, 2010 at 6:41 AM, Yeb Havinga  wrote:
  

export OMP_NUM_THREADS=4
  

Then I get the following. The rather weird dip at 5 threads is consistent
over multiple tries:




I get similar dips on my server.  Especially as you make the stream
test write a large enough chunk of data to outrun its caches.

See attached png.
  
Interesting graph, especially since the overall trend is a linear-like 
increase in memory bandwidth when more cores are active.


Just curious, what is the 8-core cpu?

-- Yeb




Re: [PERFORM] Performance on new 64bit server compared to my 32bit desktop

2010-08-31 Thread Yeb Havinga

Jose Ildefonso Camargo Tolosa wrote:

Ok, this could explain the huge difference.  I was planing on getting
GigaByte GA-890GPA-UD3H, with a Phenom II X6 and that ram: Crucial
CT2KIT25664BA1339, Crucial BL2KIT25664FN1608, or something better I
find when I get enough money (depending on my budget at the moment).
  
Why not pair an 8-core Magny-Cours ($280,- at newegg 
http://www.newegg.com/Product/Product.aspx?Item=N82E16819105266) with a 
supermicro ATX board 
http://www.supermicro.com/Aplus/motherboard/Opteron6100/SR56x0/H8SGL-F.cfm 
($264 at newegg 
http://www.newegg.com/Product/Product.aspx?Item=N82E16813182230&Tpk=H8SGL-F) 
and some memory?


regards,
Yeb Havinga




Re: [PERFORM] Performance on new 64bit server compared to my 32bit desktop

2010-08-31 Thread Yeb Havinga

Greg Smith wrote:

Yeb Havinga wrote:

model name  : AMD Phenom(tm) II X4 940 Processor @ 3.00GHz
cpu cores : 4
stream compiled with -O3
Function  Rate (MB/s)   Avg time Min time Max time
Triad:   5395.1815   0.0089   0.0089   0.0089
I'm not sure if Yeb's stream was compiled to use OpenMP correctly though, 
because I'm not seeing "Number of Threads" in his results.  Here's 
what works for me:


 gcc -O3 -fopenmp stream.c -o stream

And then you can set:

export OMP_NUM_THREADS=4
Then I get the following. The rather weird dip at 5 threads is 
consistent over multiple tries:


Number of Threads requested = 1
Function  Rate (MB/s)   Avg time Min time Max time
Triad:   5378.7495   0.0089   0.0089   0.0090

Number of Threads requested = 2
Function  Rate (MB/s)   Avg time Min time Max time
Triad:   6596.1140   0.0073   0.0073   0.0073

Number of Threads requested = 3
Function  Rate (MB/s)   Avg time Min time Max time
Triad:   7033.9806   0.0069   0.0068   0.0069

Number of Threads requested = 4
Function  Rate (MB/s)   Avg time Min time Max time
Triad:   7007.2950   0.0069   0.0069   0.0069

Number of Threads requested = 5
Function  Rate (MB/s)   Avg time Min time Max time
Triad:   6553.8133   0.0074   0.0073   0.0074

Number of Threads requested = 6
Function  Rate (MB/s)   Avg time Min time Max time
Triad:   6803.6427   0.0071   0.0071   0.0071

Number of Threads requested = 7
Function  Rate (MB/s)   Avg time Min time Max time
Triad:   6895.6909   0.0070   0.0070   0.0071

Number of Threads requested = 8
Function  Rate (MB/s)   Avg time Min time Max time
Triad:   6931.3018   0.0069   0.0069   0.0070

Other info: DDR2 800MHz ECC memory
MB: 790FX chipset (Asus m4a78-e)

regards,
Yeb Havinga




Re: [PERFORM] GPU Accelerated Sorting

2010-08-30 Thread Yeb Havinga

Greg Smith wrote:
This comes up every year or so.  The ability of GPU offloading to help 
with sorting has to overcome the additional latency that comes from 
copying everything over to it and then getting all the results back.  
If you look at the typical types of sorting people see in PostgreSQL, 
it's hard to find ones that are a) big enough to benefit from being 
offloaded to the GPU like that, while also being b) not so 
bottlenecked on disk I/O that speeding up the CPU part matters.  And 
if you need to sort something in that category, you probably just put 
an index on it instead and call it a day.


If you made me make a list of things I'd think would be worthwhile to 
spend effort improving in PostgreSQL, this would be on the research 
list, but unlikely to even make my personal top 100 things that are 
worth fiddling with.
Related is 'Parallelizing query optimization' 
(http://www.vldb.org/pvldb/1/1453882.pdf) in which they actually 
experiment with PostgreSQL. Note that their target platform is a 
general-purpose CPU, not a SIMD GPU processor.


-- Yeb




Re: [PERFORM] Performance on new 64bit server compared to my 32bit desktop

2010-08-30 Thread Yeb Havinga

Scott Marlowe wrote:

On Mon, Aug 30, 2010 at 1:58 AM, Yeb Havinga  wrote:
  

four parallel
r...@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# ./a.out & ./a.out & ./a.out
& ./a.out



You know you can just do "stream 4" to get 4 parallel streams right?
  
Which version is that? The stream.c source contains no argc/argv usage, 
though Code/Versions/Experimental has a script called Parallel_jobs that 
spawns n processes.


-- Yeb




Re: [PERFORM] Performance on new 64bit server compared to my 32bit desktop

2010-08-30 Thread Yeb Havinga

Jose Ildefonso Camargo Tolosa wrote:

Also, nowadays, Intel has better performance than AMD, at least when
comparing Athlon 64 vs Core2, I'm still saving to get a Phenom II
system in order to benchmark them and see how it goes (does anyone
have one of these for testing?).

r...@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# cat /proc/cpuinfo
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 16
model   : 4
model name  : AMD Phenom(tm) II X4 940 Processor
stepping: 2
cpu MHz : 3000.000
cache size  : 512 KB
physical id : 0
siblings: 4
core id : 0
cpu cores   : 4
apicid  : 0
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 5
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext 
fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good 
nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm 
extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt

bogomips: 6020.46
TLB size: 1024 4K pages
clflush size: 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate


stream compiled with -O3

r...@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# ./a.out
-
STREAM version $Revision: 5.9 $
-
This system uses 8 bytes per DOUBLE PRECISION word.
-
Array size = 200, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-
Printing one line per active thread
-
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 5031 microseconds.
  (= 5031 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-
Function  Rate (MB/s)   Avg time Min time Max time
Copy:5056.0434   0.0064   0.0063   0.0064
Scale:   4950.4916   0.0065   0.0065   0.0065
Add: 5322.0173   0.0091   0.0090   0.0091
Triad:   5395.1815   0.0089   0.0089   0.0089
-
Solution Validates
-

two parallel
r...@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# ./a.out & ./a.out

-
Function  Rate (MB/s)   Avg time Min time Max time
Copy:2984.2741   0.0108   0.0107   0.0108
Scale:   2945.8261   0.0109   0.0109   0.0110
Add: 3282.4631   0.0147   0.0146   0.0149
Triad:   3321.2893   0.0146   0.0145   0.0148
-
Function  Rate (MB/s)   Avg time Min time Max time
Copy:2981.4898   0.0108   0.0107   0.0108
Scale:   2943.3067   0.0109   0.0109   0.0109
Add: 3283.8552   0.0147   0.0146   0.0149
Triad:   3313.9634   0.0147   0.0145   0.0148


four parallel
r...@p:~/ff/www.cs.virginia.edu/stream/FTP/Code# ./a.out & ./a.out & 
./a.out & ./a.out


-
Function  Rate (MB/s)   Avg time Min time Max time
Copy:1567.4880   0.0208   0.0204   0.0210
Scale:   1525.3401   0.0211   0.0210   0.0213
Add: 1739.7735   0.0279   0.0276   0.0282
Triad:   1763.4858   0.0274   0.0272   0.0276
-
Function  Rate (MB/s)   Avg time Min time Max time
Copy:1559.0759   0.0208   0.0205   0.0210
Scale:   1536.2520   0.0211   0.0208   0.0212
Add: 1740.4503   0.0279   0.0276   0.0283
Triad:   1758.4951   0.0276   0.0273   0.0279
-
Function  Rate (MB/s)   Avg time Min time Max time
Copy:1552.7271   0.0208   0.0206   0.0210
Scale:   1527.5275   0.0211   0.0209   0.0212
Add: 1737.9263   0.0279   0.0276   0.0282
Triad:   1757.3439   0.0276   0.0273   0.0278
-
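As a sanity check on the STREAM numbers above: with Array size = 2,000,000 doubles, Triad (a = b + s*c) touches three arrays per iteration, i.e. 48 MB (STREAM counts decimal megabytes), so the reported rate should simply be 48 MB divided by the minimum time. A small sketch of that arithmetic:

```python
def stream_triad_rate(n_elements, min_time_s):
    """STREAM Triad moves 3 arrays of 8-byte doubles per iteration;
    STREAM reports rates in decimal MB/s (1 MB = 1e6 bytes)."""
    return 3 * n_elements * 8 / 1e6 / min_time_s
```

For the single-process run, stream_triad_rate(2_000_000, 0.0089) comes out near the reported 5395 MB/s (the small difference is because the minimum time is rounded to four decimals).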

Re: [PERFORM] write barrier question

2010-08-18 Thread Yeb Havinga

Samuel Gendler wrote:

When running pgbench on a db which fits easily into RAM (10% of RAM =
-s 380), I see transaction counts a little less than 5K.  When I go to
90% of RAM (-s 3420), transaction rate dropped to around 1000 ( at a
fairly wide range of concurrencies).  At that point, I decided to
investigate the performance impact of write barriers.
At 90% of RAM you're probably reading data as well, not only writing. 
Watching iostat -xk 1 or vmstat 1 during a test should confirm this. To 
find the maximum database size that fits comfortably in RAM you could 
try out http://github.com/gregs1104/pgbench-tools - my experience with 
it is that it takes less than 10 minutes to set up and run, and after some 
time you get rewarded with nice pictures! :-)
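To pick a starting scale factor before running such a sweep, a rough rule of thumb is ~15 MB per pgbench scale unit (100,000 pgbench_accounts rows). The helper below is only a back-of-the-envelope sketch built on that assumption:

```python
def pgbench_scale_for(ram_gb, fraction, mb_per_scale=15):
    """Estimate the pgbench -s value giving a database of roughly
    fraction * RAM, assuming ~15 MB of on-disk data per scale unit."""
    return int(ram_gb * 1024 * fraction / mb_per_scale)
```

For example, pgbench_scale_for(56, 0.1) lands near the -s 380 the original poster used for 10% of RAM, which is consistent with a machine in the ~56 GB range.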


regards,
Yeb Havinga



Re: [PERFORM] Completely un-tuned Postgresql benchmark results: SSD vs desktop HDD

2010-08-08 Thread Yeb Havinga

Michael March wrote:
If anyone is interested I just completed a series of benchmarks of 
stock PostgreSQL running on a normal HDD vs an SSD.  

If you don't want to read the post, the summary is that SSDs are 5 to 
7 times faster than a 7200RPM HDD drive under a pgbench load.


http://it-blog.5amsolutions.com/2010/08/performance-of-postgresql-ssd-vs.html

Is this what everyone else is seeing?
I tested a SSD with a capacitor and posted conclusions here 
http://archives.postgresql.org/pgsql-performance/2010-07/msg00449.php


regards,
Yeb Havinga




Re: [PERFORM] Testing Sandforce SSD

2010-08-03 Thread Yeb Havinga

Yeb Havinga wrote:

Hannu Krosing wrote:

Did it fit in shared_buffers, or system cache ?
  

Database was ~5GB, server has 16GB, shared buffers was set to 1920MB.

I first noticed this several years ago, when doing a COPY to a large
table with indexes took noticeably longer (2-3 times longer) when the
indexes were in system cache than when they were in shared_buffers.
  
I read this as a hint: try increasing shared_buffers. I'll redo the 
pgbench run with increased shared_buffers.

Shared buffers raised from 1920MB to 3520MB:

pgbench -v -l -c 20 -M prepared -T 1800 test
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 20
duration: 1800 s
number of transactions actually processed: 12971714
tps = 7206.244065 (including connections establishing)
tps = 7206.349947 (excluding connections establishing)

:-)



Re: [PERFORM] Testing Sandforce SSD

2010-08-03 Thread Yeb Havinga

Hannu Krosing wrote:

Did it fit in shared_buffers, or system cache ?
  

Database was ~5GB, server has 16GB, shared buffers was set to 1920MB.

I first noticed this several years ago, when doing a COPY to a large
table with indexes took noticeably longer (2-3 times longer) when the
indexes were in system cache than when they were in shared_buffers.
  
I read this as a hint: try increasing shared_buffers. I'll redo the 
pgbench run with increased shared_buffers.
so the test is actually how fast the ssd can capture 
sequential WAL writes and fsync without barriers, mixed with an 
occasional checkpoint with random write IO on another partition). Since 
the WAL writing is the same for both block_size setups, I decided to 
compare random writes to a file of 5GB with Oracle's Orion tool:



Are you sure that you are not writing full WAL pages ?
  

I'm not sure I understand this question.

Do you have any stats on how much WAL is written for 8kb and 4kb test
cases ?
  

Would some iostat -xk 1 for each partition suffice?

And for other disk i/o during the tests ?
  

Not existent.

regards,
Yeb Havinga




Re: [PERFORM] Testing Sandforce SSD

2010-08-03 Thread Yeb Havinga

Scott Marlowe wrote:

On Mon, Aug 2, 2010 at 6:07 PM, Greg Smith  wrote:
  

Josh Berkus wrote:


That doesn't make much sense unless there's some special advantage to a
4K blocksize with the hardware itself.
  

Given that pgbench is always doing tiny updates to blocks, I wouldn't be
surprised if switching to smaller blocks helps it in a lot of situations if
one went looking for them.  Also, as you point out, pgbench runtime varies
around wildly enough that 10% would need more investigation to really prove
that means something.  But I think Yeb has done plenty of investigation into
the most interesting part here, the durability claims.

Please note that the 10% was on a slower CPU. On a more recent CPU the 
difference was 47%, based on tests that ran for an hour. That's why I 
absolutely agree with Merlin Moncure that more testing in this 
department is welcome, preferably by others, since after all I could be 
on the payroll of OCZ :-)


I looked a bit into Bonnie++ but failed to see how I could do a test that 
somehow matches the PostgreSQL setup during the pgbench tests (db that 
fits in memory, so the test is actually how fast the ssd can capture 
sequential WAL writes and fsync without barriers, mixed with an 
occasional checkpoint with random write IO on another partition). Since 
the WAL writing is the same for both block_size setups, I decided to 
compare random writes to a file of 5GB with Oracle's Orion tool:


=== 4K test summary 
ORION VERSION 11.1.0.7.0

Commandline:
-testname test -run oltp -size_small 4 -size_large 1024 -write 100

This maps to this test:
Test: test
Small IO size: 4 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Random IOs
Simulated Array Type: CONCAT
Write: 100%
Cache Size: Not Entered
Duration for each Data Point: 60 seconds
Small Columns:,  1,  2,  3,  4,  5,  6,  
7,  8,  9, 10, 11, 12, 13, 14, 15, 
16, 17, 18, 19, 20

Large Columns:,  0
Total Data Points: 21

Name: /mnt/data/5gb Size: 524288
1 FILEs found.

Maximum Small IOPS=86883 @ Small=8 and Large=0
Minimum Small Latency=0.01 @ Small=1 and Large=0

=== 8K test summary 

ORION VERSION 11.1.0.7.0

Commandline:
-testname test -run oltp -size_small 8 -size_large 1024 -write 100

This maps to this test:
Test: test
Small IO size: 8 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Random IOs
Simulated Array Type: CONCAT
Write: 100%
Cache Size: Not Entered
Duration for each Data Point: 60 seconds
Small Columns:,  1,  2,  3,  4,  5,  6,  
7,  8,  9, 10, 11, 12, 13, 14, 15, 
16, 17, 18, 19, 20

Large Columns:,  0
Total Data Points: 21

Name: /mnt/data/5gb Size: 524288
1 FILEs found.

Maximum Small IOPS=48798 @ Small=11 and Large=0
Minimum Small Latency=0.02 @ Small=1 and Large=0
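One caveat when comparing the two summaries above: IOPS alone favors the 4K run, but multiplying by the I/O size shows the 8K run actually moves more data per second. A quick conversion sketch (treating Orion's "KB" as KiB is an assumption):

```python
def iops_to_mib_s(iops, io_kib):
    """Convert an IOPS figure at a given I/O size (KiB) to MiB/s."""
    return iops * io_kib / 1024.0

four_k  = iops_to_mib_s(86883, 4)   # the 4K run: ~339 MiB/s
eight_k = iops_to_mib_s(48798, 8)   # the 8K run: ~381 MiB/s
```

So roughly half the IOPS at double the size still wins on raw bandwidth; worth remembering whenever IOPS numbers at different block sizes are compared.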

Running the tests for longer helps a lot on reducing the noisy
results.  Also letting them runs longer means that the background
writer and autovacuum start getting involved, so the test becomes
somewhat more realistic.
  
Yes, that's why I did a lot of the TPC-B tests with -T 3600 so they'd 
run for an hour (also for the 4K vs 8K blocksize tests in Postgres).


regards,
Yeb Havinga




Re: [PERFORM] Testing Sandforce SSD

2010-08-02 Thread Yeb Havinga

Merlin Moncure wrote:

On Fri, Jul 30, 2010 at 11:01 AM, Yeb Havinga  wrote:
  

Postgres settings:
8.4.4
--with-blocksize=4
I saw about 10% increase in performance compared to 8KB blocksizes.



That's very interesting -- we need more testing in that department...
  
Definitely - that 10% number was on the older hardware (the Core 2 
E6600). After reading my post and the 185MBps with 18500 reads/s number 
I was a bit suspicious whether I did the tests on the new hardware with 
4K, because 185MBps / 18500 reads/s is ~10KB / read, so I thought that's 
a lot closer to 8KB than 4KB. I checked with show block_size and it was 
4K. Then I redid the tests on the new server with the default 8KB 
blocksize and got about 4700 tps (TPC-B/300)... 67/47 ≈ 1.43. So it 
seems that on newer hardware, the difference is larger than 10%.
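For reference, the two back-of-the-envelope figures in the paragraph above work out as follows (the tps values are the approximate ones quoted, so the ratio is approximate too):

```python
# average read size implied by the throughput and read-rate figures
avg_kb_per_read = 185_000 / 18_500   # ~10 KB per read at the 4K block size

# 4K-blocksize tps vs 8K-blocksize tps on the newer hardware
speedup_4k_vs_8k = 6700 / 4700       # ~1.43x
```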


regards,
Yeb Havinga




Re: [PERFORM] Testing Sandforce SSD

2010-07-30 Thread Yeb Havinga

Greg Smith wrote:

Greg Smith wrote:
Note that not all of the Sandforce drives include a capacitor; I hope 
you got one that does!  I wasn't aware any of the SF drives with a 
capacitor on them were even shipping yet, all of the ones I'd seen 
were the chipset that doesn't include one still.  Haven't checked in 
a few weeks though.


Answer my own question here:  the drive Yeb got was the brand spanking 
new OCZ Vertex 2 Pro, selling for $649 at Newegg for example:  
http://www.newegg.com/Product/Product.aspx?Item=N82E16820227535 and 
with the supercapacitor listed right in the main production 
specifications there.  This is officially the first inexpensive 
(relatively) SSD with a battery-backed write cache built into it.  If 
Yeb's test results prove it works as it's supposed to under 
PostgreSQL, I'll be happy to finally have a moderately priced SSD I 
can recommend to people for database use.  And I fear I'll be out of 
excuses to avoid buying one as a toy for my home system.



Hello list,

After a week of testing I think I can answer the question above: does it 
work like it's supposed to under PostgreSQL?


YES

The drive I have tested is the $435,- 50GB OCZ Vertex 2 Pro, 
http://www.newegg.com/Product/Product.aspx?Item=N82E16820227534


* it is safe to mount filesystems with barriers off, since it has a 
'supercap-backed cache'. That no data is lost was confirmed by a dozen 
power-off tests while running either diskchecker.pl or pgbench.
* the above implies it's also safe to use this SSD with barriers, though 
that will perform worse, since this drive honors write-through commands.
* the highest pgbench tps number for the TPC-B test for a scale 300 
database (~5GB) I could get was over 6700. Judging from the iostat 
average util of ~40% on the xlog partition, I believe that this number 
is limited by other factors than the SSD, like CPU, core count, core 
MHz, memory size/speed, 8.4 pgbench without threads. Unfortunately I 
don't have faster machines with more cores available for testing right now.
* pgbench numbers for a larger-than-RAM database, read-only, were over 
25000 tps (details are at the end of this post), during which iostat 
reported ~18500 read iops and 100% utilization.

* pgbench max reported latencies are 20% of comparable BBWC setups.
* how reliable it is over time, and how it performs over time I cannot 
say, since I tested it only for a week.


regards,
Yeb Havinga

PS: of course all claims I make here are without any warranty. All 
information in this mail is for reference purposes, I do not claim it is 
suitable for your database setup.


Some info on configuration:
BOOT_IMAGE=/boot/vmlinuz-2.6.32-22-server  elevator=deadline
quad core AMD Phenom(tm) II X4 940 Processor on 3.0GHz
16GB RAM 667MHz DDR2

Disk/ filesystem settings.
Model Family: OCZ Vertex SSD
Device Model: OCZ VERTEX2-PRO
Firmware Version: 1.10

hdparm: did not change standard settings: write cache is on, as well as 
readahead.

hdparm -AW /dev/sdc
/dev/sdc:
look-ahead=  1 (on)
write-caching =  1 (on)

Untuned ext4 filesystem.
Mount options
/dev/sdc2 on /data type ext4 
(rw,noatime,nodiratime,relatime,barrier=0,discard)
/dev/sdc3 on /xlog type ext4 
(rw,noatime,nodiratime,relatime,barrier=0,discard)
Note the -o discard: this enables automatic SSD trimming (TRIM) on a 
recent Linux kernel.
Also, per core per filesystem there now is an [ext4-dio-unwrit] process - 
which suggests something like 'directio'? I haven't investigated this any 
further.
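For completeness, the equivalent persistent /etc/fstab entries for the mount output above would look roughly like this (the device names and pass numbers are assumptions carried over from the mount listing; barrier=0 is only safe because of the supercap):

```
/dev/sdc2  /data  ext4  noatime,nodiratime,barrier=0,discard  0  2
/dev/sdc3  /xlog  ext4  noatime,nodiratime,barrier=0,discard  0  2
```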


Sysctl:
(copied from a larger RAM database machine)
kernel.core_uses_pid = 1
fs.file-max = 327679
net.ipv4.ip_local_port_range = 1024 65000
kernel.msgmni = 2878
kernel.msgmax = 8192
kernel.msgmnb = 65536
kernel.sem = 250 32000 100 142
kernel.shmmni = 4096
kernel.sysrq = 1
kernel.shmmax = 33794121728
kernel.shmall = 16777216
net.core.rmem_default = 262144
net.core.rmem_max = 2097152
net.core.wmem_default = 262144
net.core.wmem_max = 262144
fs.aio-max-nr = 3145728
vm.swappiness = 0
vm.dirty_background_ratio = 3
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
vm.dirty_ratio = 15

Postgres settings:
8.4.4
--with-blocksize=4
I saw about 10% increase in performance compared to 8KB blocksizes.

Postgresql.conf:
changed from default config are:
maintenance_work_mem = 480MB # pgtune wizard 2010-07-25
checkpoint_completion_target = 0.9 # pgtune wizard 2010-07-25
effective_cache_size = 5632MB # pgtune wizard 2010-07-25
work_mem = 512MB # pgtune wizard 2010-07-25
wal_buffers = 8MB # pgtune wizard 2010-07-25
checkpoint_segments = 128 # pgtune said 16 here
shared_buffers = 1920MB # pgtune wizard 2010-07-25
max_connections = 100

initdb with data on sda2 and xlog on sda3, C locale

Read write test on ~5GB database:
$ pgbench -v -c 20 -M prepared -T 3600 test
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prep

Re: [PERFORM] Testing Sandforce SSD

2010-07-28 Thread Yeb Havinga

Yeb Havinga wrote:

Michael Stone wrote:

On Mon, Jul 26, 2010 at 03:23:20PM -0600, Greg Spiegelberg wrote:
I know I'm talking development now but is there a case for a pg_xlog 
block

device to remove the file system overhead and guaranteeing your data is
written sequentially every time?


If you dedicate a partition to xlog, you already get that in practice 
with no extra development.
Due to the LBA remapping of the SSD, I'm not sure if putting files 
that are sequentially written in a different partition (together with 
e.g. tables) would make a difference: in the end the SSD will have a 
set of new blocks in its buffer and somehow arrange them into sets of 
128KB or 256KB writes for the flash chips. See also 
http://www.anandtech.com/show/2899/2


But I ran out of ideas to test, so I'm going to test it anyway.
Same machine config as mentioned before, with data and xlog on separate 
partitions, ext3 with barrier off (save on this SSD).


pgbench -c 10 -M prepared -T 3600 -l test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 10
duration: 3600 s
number of transactions actually processed: 10856359
tps = 3015.560252 (including connections establishing)
tps = 3015.575739 (excluding connections establishing)

This is about 25% faster than data and xlog combined on the same filesystem.

Below is output from iostat -xk 1 -p /dev/sda, which shows each second 
per partition statistics.

sda2 is data, sda3 is xlog. In the third second a checkpoint seems to start.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
 63.500.00   30.502.500.003.50

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s 
avgrq-sz avgqu-sz   await  svctm  %util
sda   0.00  6518.00   36.00 2211.00   148.00 35524.00
31.75 0.280.12   0.11  25.00
sda1  0.00 2.000.005.00 0.00   636.00   
254.40 0.036.00   2.00   1.00
sda2  0.00   218.00   36.00   40.00   148.00  1032.00
31.05 0.000.00   0.00   0.00
sda3  0.00  6298.000.00 2166.00 0.00 33856.00
31.26 0.250.12   0.12  25.00


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
 60.500.00   37.500.500.001.50

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s 
avgrq-sz avgqu-sz   await  svctm  %util
sda   0.00  6514.00   33.00 2283.00   140.00 35188.00
30.51 0.320.14   0.13  29.00
sda1  0.00 0.000.003.00 0.0012.00 
8.00 0.000.00   0.00   0.00
sda2  0.00 0.00   33.002.00   140.00 8.00 
8.46 0.030.86   0.29   1.00
sda3  0.00  6514.000.00 2278.00 0.00 35168.00
30.88 0.290.13   0.13  29.00


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
 33.000.00   34.00   18.000.00   15.00

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s 
avgrq-sz avgqu-sz   await  svctm  %util
sda   0.00  3782.007.00 7235.0028.00 44068.00
12.1869.529.46   0.09  62.00
sda1  0.00 0.000.001.00 0.00 4.00 
8.00 0.000.00   0.00   0.00
sda2  0.00   322.007.00 6018.0028.00 25360.00 
8.4369.22   11.33   0.08  47.00
sda3  0.00  3460.000.00 1222.00 0.00 18728.00
30.65 0.300.25   0.25  30.00


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
  9.000.00   36.00   22.500.00   32.50

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s 
avgrq-sz avgqu-sz   await  svctm  %util
sda   0.00  1079.003.00 0.0012.00 49060.00 
8.83   120.64   10.95   0.08  86.00
sda1  0.00 2.000.002.00 0.00   320.00   
320.00 0.12   60.00  35.00   7.00
sda2  0.0030.003.00 10739.0012.00 43076.00 
8.02   120.49   11.30   0.08  83.00
sda3  0.00  1047.000.00  363.00 0.00  5640.00
31.07 0.030.08   0.08   3.00


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
 62.000.00   31.002.000.005.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  6267.00   51.00 2493.00   208.00 35040.00    27.71     1.80    0.71   0.12  31.00
sda1              0.00     0.00    0.00    3.00     0.00    12.00     8.00     0.00    0.00   0.00   0.00
sda2              0.00   123.00   51.00  344.00   208.00  1868.00    10.51     1.50    3.80   0.10   4.00
sda3              0.00  6144.00    0.00 2146.00     0.00 33160.00    30.90     0.30    0.14   0.14  30.00



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Testing Sandforce SSD

2010-07-28 Thread Yeb Havinga

Michael Stone wrote:

On Mon, Jul 26, 2010 at 03:23:20PM -0600, Greg Spiegelberg wrote:
I know I'm talking development now but is there a case for a pg_xlog 
block

device to remove the file system overhead and guaranteeing your data is
written sequentially every time?


If you dedicate a partition to xlog, you already get that in practice 
with no extra devlopment.
Due to the LBA remapping of the SSD, I'm not sure if putting files that 
are sequentially written in a different partition (together with e.g. 
tables) would make a difference: in the end the SSD will have a set of 
new blocks in its buffer and somehow arrange them into sets of 128KB or 
256KB writes for the flash chips. See also 
http://www.anandtech.com/show/2899/2
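For intuition about why partition placement may not matter here, a toy model of a log-structured controller (my assumption for illustration; real SandForce firmware also compresses and wear-levels, so this is only a sketch):

```python
# Toy model: a log-structured SSD controller appends incoming logical
# writes to a buffer and flushes full flash-sized chunks. Only the
# total write volume matters, not which partition/LBA the writes hit.
import math

FLASH_WRITE_KB = 128  # assumed size of one flash program operation


def flash_ops(write_sizes_kb):
    """Flash program operations needed for a stream of logical writes."""
    return math.ceil(sum(write_sizes_kb) / FLASH_WRITE_KB)


# 1000 scattered 4 kB writes vs. the same 4000 kB written as 16 kB
# sequential chunks: the controller issues the same number of flash ops.
print(flash_ops([4] * 1000))  # -> 32
print(flash_ops([16] * 250))  # -> 32
```

Under this model, separating pg_xlog onto its own partition changes nothing for the flash chips; whether that holds for the real drive is exactly what the tests have to show.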


But I ran out of ideas to test, so I'm going to test it anyway.

regards,
Yeb Havinga




Re: [PERFORM] Slow query using the Cube contrib module.

2010-07-27 Thread Yeb Havinga

Liviu Mirea-Ghiban wrote:


My question is: Why is it doing a Bitmap Heap Scan / Recheck Cond? 
I've executed dozens of such queries and not once did the rechecking 
remove any rows. Is there any way to disable it, or do you have any 
other suggestions for optimizations (because I'm all out of ideas)?
It's probably because the index nodes store data values with a lossy 
compression, which means that the index scan returns more rows than 
wanted, and those in turn are filtered out by the recheck. See the 
comments for the 'RECHECK' parameter of CREATE OPERATOR CLASS 
(http://www.postgresql.org/docs/8.4/static/sql-createopclass.html). It's 
unwise to alter this behaviour without taking a look at / modifying the 
underlying index implementation. The GiST index scan part could perhaps 
be made a bit faster by using a smaller block size, but I'm not sure if 
or how the recheck part can be improved. Maybe rewrite the top query 
so it does not do bitmap heap scans in subqueries or inner loops?
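The lossy-scan-plus-recheck behaviour can be mimicked with a toy example (plain Python, not the actual GiST code; the decade-granularity "index" is purely illustrative):

```python
# Toy illustration of a lossy index plus recheck: the index stores each
# value only at a coarse granularity, so an index scan returns candidate
# rows that must be rechecked against the exact condition.
GRANULARITY = 10  # lossy key: values are indexed by their decade

rows = [3, 7, 12, 15, 18, 23, 31, 36, 44]
index = {}  # coarse key -> row values
for v in rows:
    index.setdefault(v // GRANULARITY, []).append(v)


def lossy_scan(lo, hi):
    """Return every row whose coarse key *might* satisfy lo <= v <= hi."""
    out = []
    for key in range(lo // GRANULARITY, hi // GRANULARITY + 1):
        out.extend(index.get(key, []))
    return out


candidates = lossy_scan(14, 22)                   # decade keys 1 and 2
exact = [v for v in candidates if 14 <= v <= 22]  # the "recheck"

print(candidates)  # [12, 15, 18, 23] -- superset returned by the index
print(exact)       # [15, 18] -- rows surviving the recheck
```

Even when the recheck happens to remove no rows for particular queries, the scan cannot know that in advance, which is why it cannot simply be disabled.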


regards,
Yeb Havinga



Re: [PERFORM] Testing Sandforce SSD

2010-07-26 Thread Yeb Havinga

Yeb Havinga wrote:
To get similar *average* performance results you'd need to put about 
4 drives and a BBU into a server.  The 


Please forget this question, I now see it in the mail I'm replying to. 
Sorry for the spam!


-- Yeb




Re: [PERFORM] Testing Sandforce SSD

2010-07-26 Thread Yeb Havinga

Greg Smith wrote:

Yeb Havinga wrote:
Please remember that particular graphs are from a read/write pgbench 
run on a bigger than RAM database that ran for some time (so with 
checkpoints), on a *single* $435 50GB drive without BBU raid controller.


To get similar *average* performance results you'd need to put about 4 
drives and a BBU into a server.  The worst-case latency on that 
solution is pretty bad though, when a lot of random writes are queued 
up; I suspect that's where the SSD will look much better.


By the way:  if you want to run a lot more tests in an organized 
fashion, that's what http://github.com/gregs1104/pgbench-tools was 
written to do.  That will spit out graphs by client and by scale 
showing how sensitive the test results are to each.

Got it, running the default config right now.

When you say 'comparable to a small array' - could you give a ballpark 
figure for 'small'?


regards,
Yeb Havinga

PS: Some update on the testing: I did some ext3, ext4, xfs, jfs and also 
ext2 tests on the just-in-memory read/write test (scale 300). No real 
winners or losers, though ext2 isn't really faster, and the manual need 
to answer fix (y) to the boot-time fsck makes it impractical in its 
standard configuration. I did some poweroff tests with barriers 
explicitly off in ext3, ext4 and xfs; still all recoveries went ok.





Re: [PERFORM] Testing Sandforce SSD

2010-07-26 Thread Yeb Havinga

Matthew Wakeling wrote:
Apologies, I was interpreting the graph as the latency of the device, 
not all the layers in-between as well. There isn't any indication in 
the email with the graph as to what the test conditions or software are.
That info was in the email preceding the graph mail, but I see now I 
forgot to mention it was an 8.4.4 postgres version.


regards,
Yeb Havinga




Re: [PERFORM] Testing Sandforce SSD

2010-07-26 Thread Yeb Havinga

Matthew Wakeling wrote:

On Sun, 25 Jul 2010, Yeb Havinga wrote:
Graph of TPS at http://tinypic.com/r/b96aup/3 and latency at 
http://tinypic.com/r/x5e846/3


Does your latency graph really have milliseconds as the y axis?

Yes
If so, this device is really slow - some requests have a latency of 
more than a second!
I try to just give the facts. Please remember that those particular 
graphs are from a read/write pgbench run on a bigger-than-RAM database 
that ran for some time (so with checkpoints), on a *single* $435 50GB 
drive without a BBU raid controller. Also, this is a picture with a few 
million points: the ones above 200ms are perhaps a hundred and hence 
make up a very small fraction.
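That fraction is easy to quantify from the pgbench -l logfile; a sketch, assuming the 8.4-era per-transaction log format (client_id tx_no time file_no epoch time_us, with the third column being the transaction latency in microseconds):

```python
# Sketch: fraction of pgbench transactions slower than a threshold,
# parsed from a -l logfile where column 3 is the latency in microseconds.
def latency_stats(lines, threshold_ms=200.0):
    lat_ms = [int(line.split()[2]) / 1000.0 for line in lines if line.strip()]
    slow = sum(1 for l in lat_ms if l > threshold_ms)
    return len(lat_ms), slow, slow / len(lat_ms)


# Hypothetical log lines for illustration (not from the actual run):
sample = [
    "0 1 1543 0 1280060345 866650",
    "1 1 980 0 1280060345 858702",
    "0 2 250431 0 1280060346 750861",  # one outlier above 200 ms
    "1 2 1201 0 1280060346 513789",
]
total, slow, frac = latency_stats(sample)
print(total, slow, frac)  # -> 4 1 0.25
```

On a real multi-million-line logfile, a hundred outliers above 200 ms would give a fraction in the 1e-5 range.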


So far I'm pretty impressed with this drive. Let's be fair to OCZ and the 
SandForce guys and not shoot from the hip with things like "really slow", 
without that being backed by a graphed pgbench run together with its 
cost, so we can compare numbers with numbers.


regards,
Yeb Havinga




Re: [PERFORM] Testing Sandforce SSD

2010-07-26 Thread Yeb Havinga

Yeb Havinga wrote:

Greg Smith wrote:
Put it on ext3, toggle on noatime, and move on to testing.  The 
overhead of the metadata writes is the least of the problems when 
doing write-heavy stuff on Linux.
I ran a pgbench run and power failure test during pgbench with a 3 
year old computer



On the same config more tests.

scale 10 read only and read/write tests. note: only 240 s.

starting vacuum...end.
transaction type: SELECT only
scaling factor: 10
query mode: prepared
number of clients: 10
duration: 240 s
number of transactions actually processed: 8208115
tps = 34197.109896 (including connections establishing)
tps = 34200.658720 (excluding connections establishing)

y...@client45:~$ pgbench -c 10 -l -M prepared -T 240 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: prepared
number of clients: 10
duration: 240 s
number of transactions actually processed: 809271
tps = 3371.147020 (including connections establishing)
tps = 3371.518611 (excluding connections establishing)

--
scale 300 (just fits in RAM) read only and read/write tests

pgbench -c 10 -M prepared -T 300 -S test
starting vacuum...end.
transaction type: SELECT only
scaling factor: 300
query mode: prepared
number of clients: 10
duration: 300 s
number of transactions actually processed: 9219279
tps = 30726.931095 (including connections establishing)
tps = 30729.692823 (excluding connections establishing)

The test above doesn't really test the drive but shows the CPU/RAM limit.

pgbench -c 10 -l -M prepared -T 3600 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 10
duration: 3600 s
number of transactions actually processed: 8838200
tps = 2454.994217 (including connections establishing)
tps = 2455.012480 (excluding connections establishing)

--
scale 2000

pgbench -c 10 -M prepared -T 300 -S test
starting vacuum...end.
transaction type: SELECT only
scaling factor: 2000
query mode: prepared
number of clients: 10
duration: 300 s
number of transactions actually processed: 755772
tps = 2518.547576 (including connections establishing)
tps = 2518.762476 (excluding connections establishing)

So the test above tests the random seek performance. Iostat on the drive 
showed a steady just over 4000 read io's/s:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
 11.390.00   13.37   60.400.00   14.85
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00 4171.00    0.00 60624.00     0.00    29.07    11.81    2.83   0.24 100.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00


pgbench -c 10 -l -M prepared -T 24000 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 2000
query mode: prepared
number of clients: 10
duration: 24000 s
number of transactions actually processed: 30815691
tps = 1283.979098 (including connections establishing)
tps = 1283.980446 (excluding connections establishing)

Note the duration of several hours. No long waits occurred - of this 
last test the latency png is at http://yfrog.com/f/0vlatencywp/ and the 
TPS graph at http://yfrog.com/f/b5tpsp/


regards,
Yeb Havinga




Re: [PERFORM] Testing Sandforce SSD

2010-07-25 Thread Yeb Havinga

Yeb Havinga wrote:


8GB DDR2 something..

(lots of details removed)

Graph of TPS at http://tinypic.com/r/b96aup/3 and latency at 
http://tinypic.com/r/x5e846/3


Thanks http://www.westnet.com/~gsmith/content/postgresql/pgbench.htm for 
the gnuplot and psql scripts!





Re: [PERFORM] Testing Sandforce SSD

2010-07-25 Thread Yeb Havinga

Greg Smith wrote:
Put it on ext3, toggle on noatime, and move on to testing.  The 
overhead of the metadata writes is the least of the problems when 
doing write-heavy stuff on Linux.
I ran a pgbench run and power failure test during pgbench with a 3 year 
old computer


8GB DDR ?
Intel Core 2 duo 6600 @ 2.40GHz
Intel Corporation 82801IB (ICH9) 2 port SATA IDE Controller
64 bit 2.6.31-22-server (Ubuntu karmic), kernel option elevator=deadline
sysctl options besides increasing shm:
fs.file-max=327679
fs.aio-max-nr=3145728
vm.swappiness=0
vm.dirty_background_ratio = 3
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
vm.dirty_ratio = 15

Filesystem on SSD with postgresql data: ext3 mounted with 
noatime,nodiratime,relatime
Postgresql cluster: did initdb with C locale. Data and pg_xlog together 
on the same ext3 filesystem.


Changed in postgresql.conf: settings with pgtune for OLTP and 15 connections
maintenance_work_mem = 480MB # pgtune wizard 2010-07-25
checkpoint_completion_target = 0.9 # pgtune wizard 2010-07-25
effective_cache_size = 5632MB # pgtune wizard 2010-07-25
work_mem = 512MB # pgtune wizard 2010-07-25
wal_buffers = 8MB # pgtune wizard 2010-07-25
checkpoint_segments = 31 # pgtune said 16 here
shared_buffers = 1920MB # pgtune wizard 2010-07-25
max_connections = 15 # pgtune wizard 2010-07-25

Initialized with scale 800, which is about 12GB. I deliberately went 
beyond the in-RAM size for this machine (that would be ~5GB), so random 
reads would weigh in the result. Then let pgbench run the tcp benchmark 
with -M prepared, 10 clients and -T 3600 (one hour); after that I loaded 
the logfile in a db and did some queries. Then realized the pgbench 
result page was not in the screen buffer anymore so I cannot copy it 
here, but hey, those can be edited as well right ;-)
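As a sanity check on "about 12GB": each pgbench scale unit is 100,000 pgbench_accounts rows, roughly 15 MB including index overhead (a rule of thumb, not exact; it varies with version and fillfactor):

```python
# Rough pgbench database size estimate from the scaling factor.
# 15 MB per scale unit is an assumed rule of thumb, not a measurement.
MB_PER_SCALE_UNIT = 15  # ~100,000 accounts rows plus index, approx.


def pgbench_size_gb(scale):
    return scale * MB_PER_SCALE_UNIT / 1024.0


print(round(pgbench_size_gb(800), 1))  # -> 11.7, matching "about 12GB"
print(round(pgbench_size_gb(300), 1))  # -> 4.4, the "just fits in RAM" scale
```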


select count(*),count(*)/3600,avg(time),stddev(time) from log;
  count  | ?column? |          avg          |     stddev
---------+----------+-----------------------+----------------
 4939212 |     1372 | 7282.8581978258880161 | 11253.96967962
(1 row)

Judging from the latencies in the logfiles I did not experience serious 
lagging (time is in microseconds):


select * from log order by time desc limit 3;
 client_id | tx_no |  time   | file_no |   epoch    | time_us
-----------+-------+---------+---------+------------+---------
         3 | 33100 | 1229503 |       0 | 1280060345 |  866650
         9 | 39990 | 1077519 |       0 | 1280060345 |  858702
         2 | 55323 | 1071060 |       0 | 1280060519 |  750861
(3 rows)

select * from log order by time desc limit 3 OFFSET 1000;
 client_id | tx_no  |  time  | file_no |   epoch    | time_us
-----------+--------+--------+---------+------------+---------
         5 | 262466 | 245953 |       0 | 1280062074 |  513789
         1 | 267519 | 245867 |       0 | 1280062074 |  513301
         7 | 273662 | 245532 |       0 | 1280062078 |  378932
(3 rows)

select * from log order by time desc limit 3 OFFSET 1;
 client_id | tx_no  | time  | file_no |   epoch    | time_us
-----------+--------+-------+---------+------------+---------
         5 | 123011 | 82854 |       0 | 1280061036 |  743986
         6 | 348967 | 82853 |       0 | 1280062687 |  776317
         8 | 439789 | 82848 |       0 | 1280063109 |  552928
(3 rows)

Then I started pgbench again with the same setting, let it run for a few 
minutes and in another console did CHECKPOINT and then turned off power. 
After restarting, the database recovered without a problem.


LOG:  database system was interrupted; last known up at 2010-07-25 
10:14:15 EDT
LOG:  database system was not properly shut down; automatic recovery in 
progress

LOG:  redo starts at F/98008610
LOG:  record with zero length at F/A2BAC040
LOG:  redo done at F/A2BAC010
LOG:  last completed transaction was at log time 2010-07-25 
10:14:16.151037-04


regards,
Yeb Havinga



Re: [PERFORM] Testing Sandforce SSD

2010-07-24 Thread Yeb Havinga

Yeb Havinga wrote:

Yeb Havinga wrote:
diskchecker: running 37 sec, 4.47% coverage of 500 MB (1468 writes; 
39/s)

Total errors: 0

:-)

OTOH, I now notice the 39 writes/s.. If that means ~39 tps... bummer.
When playing with it a bit more, I couldn't get the test_file to be 
created in the right place on the test system. It turns out I had the 
diskchecker config switched, and 39 writes/s was the speed of the 
not-rebooted system; sorry.


I did several diskchecker.pl tests this time with the testfile on the 
SSD, none of the tests have returned an error :-)


Writes/s start low but quickly converge to a number in the range of 1200 
to 1800. The writes diskchecker does are 16kB writes. Making these 4kB 
writes does not increase writes/s. 32kB seems a little less, 64kB is 
about two thirds of the initial writes/s and 128kB is half.


So no BBU speeds here for writes, but still ~ factor 10 improvement of 
iops for a rotating SATA disk.
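Multiplying writes/s by the write size turns those figures into throughput; the iops numbers below are assumed midpoints of the ranges reported above, not fresh measurements:

```python
# Sketch: convert the ballpark diskchecker writes/s figures into MB/s.
observed = {  # write size in kB -> approximate writes/s (assumed midpoints)
    4: 1500,
    16: 1500,
    64: 1000,  # "about two thirds"
    128: 750,  # "half"
}

for size_kb, iops in sorted(observed.items()):
    mb_s = size_kb * iops / 1024.0
    print(f"{size_kb:>3} kB writes: ~{iops} iops = ~{mb_s:.0f} MB/s")
```

So although larger writes lower the iops, the aggregate write bandwidth still grows with write size under these numbers.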


regards,
Yeb Havinga

PS: hdparm showed the write cache was on. I did tests with both ext2 and 
xfs; the xfs tests I did with both barrier and nobarrier.





Re: [PERFORM] Testing Sandforce SSD

2010-07-24 Thread Yeb Havinga

Yeb Havinga wrote:

diskchecker: running 37 sec, 4.47% coverage of 500 MB (1468 writes; 39/s)
Total errors: 0

:-)

OTOH, I now notice the 39 writes/s.. If that means ~39 tps... bummer.





Re: [PERFORM] Testing Sandforce SSD

2010-07-24 Thread Yeb Havinga

Greg Smith wrote:
Note that not all of the Sandforce drives include a capacitor; I hope 
you got one that does!  I wasn't aware any of the SF drives with a 
capacitor on them were even shipping yet, all of the ones I'd seen 
were the chipset that doesn't include one still.  Haven't checked in a 
few weeks though.
I think I did; it was expensive enough, though while ordering it's very 
easy to order the wrong one, as all names on the product category page 
look the same. (OCZ Vertex 2 Pro)

* How to test for power failure?


I've had good results using one of the early programs used to 
investigate this class of problems:  
http://brad.livejournal.com/2116715.html?page=2

A great tool, thanks for the link!

 diskchecker: running 34 sec, 4.10% coverage of 500 MB (1342 writes; 39/s)
 diskchecker: running 35 sec, 4.24% coverage of 500 MB (1390 writes; 39/s)
 diskchecker: running 36 sec, 4.35% coverage of 500 MB (1427 writes; 39/s)
 diskchecker: running 37 sec, 4.47% coverage of 500 MB (1468 writes; 39/s)
didn't get 'ok' from server (11387 316950), msg=[] = Connection reset by 
peer at ./diskchecker.pl line 132.


here's where I removed the power and left it off for about a minute. 
Then I started it again and did the verify:


y...@a:~$ ./diskchecker.pl -s client45.eemnes verify test_file
verifying: 0.00%
Total errors: 0

:-)
this was on ext2

* What filesystem to use on the SSD? To minimize writes and maximize 
chance for seeing errors I'd choose ext2 here. 


I don't consider there to be any reason to deploy any part of a 
PostgreSQL database on ext2.  The potential for downtime if the fsck 
doesn't happen automatically far outweighs the minimal performance 
advantage you'll actually see in real applications.
Hmm.. wouldn't that apply to other filesystems as well? I know that JFS 
also won't mount if booted unclean; it somehow needs a marker from 
fsck. Don't know about ext3, xfs etc.
All of the benchmarks showing large gains for ext2 over ext3 I have 
seen been synthetic, not real database performance; the internal ones 
I've run using things like pgbench do not show a significant 
improvement.  (Yes, I'm already working on finding time to publicly 
release those findings)
The reason I'd choose ext2 on the SSD was mainly to decrease the number 
of writes, not for performance. Maybe I should ultimately do tests for 
both journalled and ext2 filesystems and compare the amount of data per 
x pgbench transactions.
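One way to compare the amount of data written per x transactions on Linux is to diff /proc/diskstats snapshots taken around a pgbench run; a sketch with hypothetical numbers (field 10 of each line is sectors written, 512 bytes per sector):

```python
# Sketch: bytes written to a device between two snapshots of
# /proc/diskstats. Field layout: major minor name reads reads_merged
# sectors_read ms_reading writes writes_merged sectors_written ...
SECTOR = 512


def sectors_written(diskstats_text, device):
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) > 9 and fields[2] == device:
            return int(fields[9])
    raise ValueError(f"device {device!r} not found")


# Synthetic before/after snapshots for sda (hypothetical numbers):
before = "8 0 sda 1200 30 45000 900 5000 120 800000 4000 0 2000 5000"
after = "8 0 sda 1210 30 45100 905 7400 150 1200000 6100 0 2600 7000"

delta = sectors_written(after, "sda") - sectors_written(before, "sda")
print(delta * SECTOR / 1024 / 1024)  # -> 195.3125 (MB written in between)
```

Dividing that delta by the number of pgbench transactions in the same interval would give bytes written per transaction, per filesystem.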
Put it on ext3, toggle on noatime, and move on to testing.  The 
overhead of the metadata writes is the least of the problems when 
doing write-heavy stuff on Linux.

Will surely do and post the results.

thanks,
Yeb Havinga



[PERFORM] Testing Sandforce SSD

2010-07-24 Thread Yeb Havinga

Hello list,

Probably like many others I've wondered why no SSD manufacturer puts a 
small BBU on an SSD drive. Triggered by Greg Smith's mail 
http://archives.postgresql.org/pgsql-performance/2010-02/msg00291.php 
here, and also anandtech's review at 
http://www.anandtech.com/show/2899/1 (see page 6 for pictures of the 
capacitor), I ordered a SandForce drive and this week it finally arrived.


And now I have to test it and was wondering about some things like

* How to test for power failure? I thought by running, on the same 
machine, a parallel pgbench setup on two clusters where one runs with 
data and wal on a rotating disk, the other on the SSD, both without a 
BBU controller. Then turn off power. Do that a few times. The problem in 
this scenario is that even if the SSD showed no data loss while the 
rotating disk did a few times, a dozen tests without failure isn't 
actually proof that the drive can write its complete buffer to disk 
after power failure.


* How long should the power be turned off? A minute? 15 minutes?

* What filesystem to use on the SSD? To minimize writes and maximize 
chance for seeing errors I'd choose ext2 here. For the sake of not 
comparing apples with pears I'd have to go with ext2 on the rotating 
data disk as well.


Do you guys have any more ideas to properly put this disk through its paces?

regards,
Yeb Havinga




Re: [PERFORM] Query optimization problem

2010-07-15 Thread Yeb Havinga

Hello Zotov,

Somehow the equivalence d2.basedon=d1.id is not used in the slow query, 
probably because the equivalence constant value would be used inside a 
non-base expression (the OR). You can see that the equivalence values 
*are* used by changing the OR to an AND and comparing both queries. The 
only thing you can do to guarantee the planner has all the information in 
cases like this is to explicitly name the equivalence inside the OR 
expressions, e.g.


SELECT d1.ID, d2.ID
FROM DocPrimary d1
  JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
WHERE (d1.ID=234409763 and d2.basedon=234409763) or (d2.ID=234409763) ;

regards,
Yeb Havinga

PS: the analyze time of the slow query showed 0.007ms?

Zotov wrote:

I have a query:

 SELECT d1.ID, d2.ID
 FROM DocPrimary d1
   JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
 WHERE (d1.ID=234409763) or (d2.ID=234409763)

i think what QO(Query Optimizer) can make it faster (now it seq scan 
and on

million records works 7 sec)

 SELECT d1.ID, d2.ID
 FROM DocPrimary d1
   JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
 WHERE (d2.BasedOn=234409763) or (d2.ID=234409763)


 --
 Slow Query
 --
 test=# EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS off )SELECT 
d1.ID,

 d2.ID
 test-# FROM DocPrimary d1
 test-#   JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
 test-# WHERE (d1.ID=234409763) or (d2.ID=234409763);
 QUERY PLAN
  

   Hash Join  (cost=58.15..132.35 rows=2 width=8) (actual 
time=0.007..0.007

 rows=0 loops=1)
 Output: d1.id, d2.id
 Hash Cond: (d2.basedon = d1.id)
 Join Filter: ((d1.id = 234409763) OR (d2.id = 234409763))
 ->   Seq Scan on public.docprimary d2  (cost=0.00..31.40 rows=2140
 width=8) (actual time=0.002..0.002 rows=0 loops=1)
   Output: d2.id, d2.basedon
 ->   Hash  (cost=31.40..31.40 rows=2140 width=4) (never executed)
   Output: d1.id
   ->   Seq Scan on public.docprimary d1  (cost=0.00..31.40 
rows=2140

 width=4) (never executed)
 Output: d1.id

 --
 Fast Query
 --
 test=# EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS off )SELECT 
d1.ID,

 d2.ID
 test-# FROM DocPrimary d1
 test-#   JOIN DocPrimary d2 ON d2.BasedOn=d1.ID
 test-# WHERE (d2.BasedOn=234409763) or (d2.ID=234409763);
 QUERY PLAN
 - 

   Nested Loop  (cost=8.60..58.67 rows=12 width=8) (actual 
time=0.026..0.026

 rows=0 loops=1)
 Output: d1.id, d2.id
 ->   Bitmap Heap Scan on public.docprimary d2  (cost=8.60..19.31 
rows=12

 width=8) (actual time=0.023..0.023 rows=0 loops=1)
   Output: d2.id, d2.basedon
   Recheck Cond: ((d2.basedon = 234409763) OR (d2.id = 
234409763))

   ->   BitmapOr  (cost=8.60..8.60 rows=12 width=0) (actual
 time=0.018..0.018 rows=0 loops=1)
 ->   Bitmap Index Scan on basedon_idx  (cost=0.00..4.33
 rows=11 width=0) (actual time=0.008..0.008 rows=0 loops=1)
   Index Cond: (d2.basedon = 234409763)
 ->   Bitmap Index Scan on id_pk  (cost=0.00..4.26 rows=1
 width=0) (actual time=0.003..0.003 rows=0 loops=1)
   Index Cond: (d2.id = 234409763)
 ->   Index Scan using id_pk on public.docprimary d1  
(cost=0.00..3.27

 rows=1 width=4) (never executed)
   Output: d1.id, d1.basedon
   Index Cond: (d1.id = d2.basedon)



PGver: PostgreSQL 9.0b x86
OS: Win7 x64

-
Create table query:
-

CREATE TABLE docprimary
(
  id integer NOT NULL,
  basedon integer,
  CONSTRAINT id_pk PRIMARY KEY (id)
);
CREATE INDEX basedon_idx
  ON docprimary
  USING btree
  (basedon);







Re: [PERFORM] Two different execution plan for the same request

2010-07-07 Thread Yeb Havinga

JOUANIN Nicolas (44) wrote:


The strange thing is that this morning explain analyze now gives a 
much better duration :


There were no modifications made on the database except a restart 
yesterday evening and a vacuumdb --analyse that ran at night.
If the earlier bad query was run on a freshly imported database that was 
not ANALYZEd, it would explain the different and likely bad plan. If you 
want to know for sure that this is the cause, instead of e.g. faulty 
hardware, you could verify by redoing the import + query without analyze.


regards,
Yeb Havinga




Re: [PERFORM] Two different execution plan for the same request

2010-07-07 Thread Yeb Havinga

JOUANIN Nicolas (44) wrote:

Hi,

I have trouble with some SQL requests which have different execution 
plans when run on two different servers. One server is the development 
environment, the other one is the pre-production env.

Both servers run postgreSQL 8.3.0 on Linux and:
 - both databases contain the same data (pg_dump/pg_restore between 
servers)
 - instances have the same configuration parameters
 - vacuum and analyze are run every day.
The only difference I can see is the hardware. The pre-production 
env. has more RAM, more CPU and a RAID5 disk array through LVM where 
data are stored.

Hello Jouanin,

Could you give some more information following the guidelines from 
http://wiki.postgresql.org/wiki/SlowQueryQuestions ?


Essential are the contents from both conf files (comments may be removed).

regards,
Yeb Havinga




Re: [PERFORM] order by slowing down a query by 80 times

2010-06-28 Thread Yeb Havinga

Rajesh Kumar Mallah wrote:

Dear List,

just by removing the order by co_name reduces the query time dramatically
from  ~ 9 sec  to 63 ms. Can anyone please help.
The 63 ms query result is probably useless since it returns a limit of 
25 rows from an unordered result. It is not surprising that this is fast.


The pain is here:
Index Scan using profile_master_co_name on profile_master b  
(cost=0.00..1125295.59 rows=6968 width=25) (actual time=0.097..9193.154 
rows=2212 loops=1)
Filter: ((co_name IS NOT NULL) AND 
((co_name_vec)::tsvector @@ to_tsquery('manufacturer'::text)))



It looks like seq_scans are disabled, since the index scan has only a 
filter expression but not an index cond.


regards,
Yeb Havinga



Regds
Rajesh Kumar Mallah.


explain analyze SELECT * from   ( SELECT  
a.profile_id,a.userid,a.amount,a.category_id,a.catalog_id,a.keywords,b.co_name  
from general.catalogs a join general.profile_master using(profile_id) 
where  1=1  and co_name_vec @@   to_tsquery('manufacturer')  and  
b.co_name is not null and a.ifmain is true ) as c order by co_name 
limit 25 offset 0;



Limit  (cost=0.00..3659.13 rows=25 width=129) (actual 
time=721.075..9241.105 rows=25 loops=1)
   ->  Nested Loop  (cost=0.00..1215772.28 rows=8307 width=476) 
(actual time=721.073..9241.050 rows=25 loops=1)
 ->  Nested Loop  (cost=0.00..1208212.37 rows=8307 width=476) 
(actual time=721.052..9240.037 rows=25 loops=1)
   ->  Nested Loop  (cost=0.00..1204206.26 rows=6968 
width=472) (actual time=721.032..9239.516 rows=25 loops=1)
 ->  Nested Loop  (cost=0.00..1154549.19 rows=6968 
width=471) (actual time=721.012..9236.523 rows=25 loops=1)
   ->  Index Scan using profile_master_co_name 
on profile_master b  (cost=0.00..1125295.59 rows=6968 width=25) 
(actual time=0.097..9193.154 rows=2212 loops=1)
 Filter: ((co_name IS NOT NULL) AND 
((co_name_vec)::tsvector @@ to_tsquery('manufacturer'::text)))
   ->  Index Scan using 
catalog_master_profile_id_fkindex on catalog_master  (cost=0.00..4.19 
rows=1 width=446) (actual time=0.016..0.016 rows=0 loops=2212)
 Index Cond: 
(catalog_master.profile_id = b.profile_id)
 Filter: ((catalog_master.hide IS 
FALSE) AND ((catalog_master.hosting_status)::text = 'ACTIVE'::text))
 ->  Index Scan using 
profile_master_profile_id_pkey on profile_master  (cost=0.00..7.11 
rows=1 width=9) (actual time=0.105..0.105 rows=1 loops=25)
   Index Cond: (profile_master.profile_id = 
catalog_master.profile_id)
   ->  Index Scan using 
catalog_categories_pkey_catalog_id_category_id on catalog_categories  
(cost=0.00..0.56 rows=1 width=8) (actual time=0.014..0.015 rows=1 
loops=25)
 Index Cond: (catalog_categories.catalog_id = 
catalog_master.catalog_id)

 Filter: (catalog_categories.ifmain IS TRUE)
 ->  Index Scan using web_category_master_pkey on 
web_category_master  (cost=0.00..0.90 rows=1 width=4) (actual 
time=0.034..0.036 rows=1 loops=25)
   Index Cond: (web_category_master.category_id = 
catalog_categories.category_id)
   Filter: ((web_category_master.status)::text = 
'ACTIVE'::text)

Total runtime: 9241.304 ms

explain analyze SELECT * from   ( SELECT  
a.profile_id,a.userid,a.amount,a.category_id,a.catalog_id,a.keywords,b.co_name  
from general.catalogs a join general.profile_master b 
using(profile_id) where  1=1  and co_name_vec @@   
to_tsquery('manufacturer')  and  b.co_name is not null and a.ifmain is 
true ) as c  limit 25 offset 0;
  
QUERY PLAN   


--
 Limit  (cost=0.00..358.85 rows=25 width=476) (actual 
time=0.680..63.176 rows=25 loops=1)
   ->  Nested Loop  (cost=0.00..119238.58 rows=8307 width=476) (actual 
time=0.677..63.139 rows=25 loops=1)
 ->  Nested Loop  (cost=0.00..111678.66 rows=8307 width=476) 
(actual time=0.649..62.789 rows=25 loops=1)
   ->  Nested Loop  (cost=0.00..107672.56 rows=6968 
width=472) (actual time=0.626..62.436 rows=25 loops=1)
 ->  Nested Loop  (cost=0.00..58015.49 rows=6968 
width=471) (actual time=0.606..62.013 rows=25 loops=1)
   ->  Index Scan using 
profile_master_co_name_vec on profile_master b  (cost=0.00..28761.89 
rows=6968 width=25) (actual time=0.071..50.576 rows=1160 loops=1)
 Index Cond: ((co_name_vec)::tsvector 
@@ to_tsquery('manufacturer'::text))

 Fil

Re: [PERFORM] sudden spurt in swap utilization (was:cpu bound postgresql setup.)

2010-06-25 Thread Yeb Havinga

Rajesh Kumar Mallah wrote:

A scary phenomenon is being exhibited by the server, which is that the 
server is suddenly slurping all the swap:
  
 8  1 4192912 906164   6100 2787364000  2277   858 13440 16235

63  8 19 10  0

I reduced work_mem from 4GB to 2GB to 512MB (now). I clearly remember that this
abnormal consumption of swap was NOT there even when work_mem was 4GB.
eg during happier times swap utilisation was: http://pastebin.com/bnE1pFZ9
  
the question is what's making postgres slurp the swap? I am posting my

current postgresql.conf
once again.

# cat postgresql.conf  | grep -v "^\s*#" | grep -v "^\s*$"
listen_addresses = '*'  # what IP address(es) to listen on;
port = 5432 # (change requires restart)
max_connections = 300   # (change requires restart)
  

Hello Rajesh,

In contrast with e.g. shared_buffers and effective_cache_size, work_mem 
is the amount of memory per 'thing' (e.g. order/group by) that wants some 
working memory, so even a single backend can use several pieces of 
work_mem-sized memory.

Looking at your postgresql.conf, other memory values seem a bit too high 
as well for a 32GB RAM server. It is probably a good idea to use pgtune 
(on pgfoundry) to get some reasonable ballpark settings for your hardware.
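To see why several pieces of work_mem add up quickly, a worst-case sketch using the posted max_connections = 300 (the two-memory-nodes-per-query multiplier is an assumption for illustration; complex queries can use more):

```python
# Sketch: worst-case work_mem exposure. Every sort/hash node in every
# running backend may allocate up to work_mem, so a theoretical ceiling
# is connections * memory_nodes_per_query * work_mem.
def worst_case_gb(max_connections, work_mem_mb, mem_nodes_per_query=2):
    return max_connections * mem_nodes_per_query * work_mem_mb / 1024.0


print(worst_case_gb(300, 4096))  # work_mem = 4GB   -> 2400.0 GB (!)
print(worst_case_gb(300, 512))   # work_mem = 512MB -> 300.0 GB
print(worst_case_gb(300, 16))    # a saner value    -> 9.375 GB
```

Even the reduced 512MB setting can in theory exceed the machine's 32GB of RAM many times over, which is consistent with the sudden swapping.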


regards,
Yeb Havinga




Re: [PERFORM] B-Heaps

2010-06-18 Thread Yeb Havinga

Kevin Grittner wrote:

Yeb Havinga  wrote:
 
  

concerning gist indexes:

1) larger block sizes, and hence more entries per gist
page, result in more generic keys on those pages. This in turn
results in a greater number of hits, when the index is queried, so
a larger part of the index is scanned. NB this has nothing to do
with caching / cache sizes; it holds for every IO model. Tests
performed by me showed performance improvements of over 200%.
Since then implementing a speedup has been on my 'want to do
list'.

 
As I recall, the better performance in your tests was with *smaller*

GiST pages, right?  (The above didn't seem entirely clear on that.)
  

Yes, making pages smaller made index scanning faster.

-- Yeb




Re: [PERFORM] B-Heaps

2010-06-18 Thread Yeb Havinga

Greg Smith wrote:

Matthew Wakeling wrote:
This sort of thing has been fairly well researched at an academic 
level, but has not been implemented in that many real world 
situations. I would encourage its use in Postgres.


I guess, but don't forget that work on PostgreSQL is driven by what 
problems people are actually running into.  There's a long list of 
performance improvements sitting in the TODO list waiting for people 
to find time to work on them, ones that we're quite certain are 
useful.  That anyone is going to chase after any of these speculative 
ideas from academic research instead of one of those is unlikely.  
Your characterization of the potential speed up here is "Using a 
proper tree inside the index page would improve the CPU usage of the 
index lookups", which seems quite reasonable.  Regardless, when I 
consider "is that something I have any reason to suspect is a 
bottleneck on common workloads?", I don't think of any, and return to 
working on one of things I already know is instead.



There are two different things concerning gist indexes:

1) larger block sizes, and hence more entries per gist page, result in 
more generic keys on those pages. This in turn results in a greater 
number of hits when the index is queried, so a larger part of the index 
is scanned. NB this has nothing to do with caching / cache sizes; it 
holds for every IO model. Tests performed by me showed performance 
improvements of over 200%. Since then, implementing a speedup has been 
on my 'want to do' list.


2) there are several approaches to get the number of entries per page 
down. Two have been suggested in the thread referred to by Matthew 
(virtual pages (but how to order these?) and a tree within a page). It 
is interesting to see if ideas from Prokop's cache-oblivious algorithms 
match this problem, to find a suitable virtual page format.


regards,
Yeb Havinga




Re: [PERFORM] B-Heaps

2010-06-15 Thread Yeb Havinga

Greg Smith wrote:

Eliot Gable wrote:
Just curious if this would apply to PostgreSQL: 
http://queue.acm.org/detail.cfm?id=1814327


It's hard to take this seriously at all when it's so ignorant of 
actual research in this area.  Take a look at 
http://www.cc.gatech.edu/~bader/COURSES/UNM/ece637-Fall2003/papers/BFJ01.pdf 
for a second

Interesting paper, thanks for the reference!
PostgreSQL is modeling a much more complicated situation where there 
are many levels of caches, from CPU to disk.  When executing a query, 
the database tries to manage that by estimating the relative costs for 
CPU operations, row operations, sequential disk reads, and random disk 
reads.  Those fundamental operations are then added up to build more 
complicated machinery like sorting.  To minimize query execution cost, 
various query plans are considered, the cost computed for each one, 
and the cheapest one gets executed.  This has to take into account a 
wide variety of subtle tradeoffs related to whether memory should be 
used for things that would otherwise happen on disk.  There are three 
primary ways to search for a row, three main ways to do a join, two 
for how to sort, and they all need to have cost estimates made for 
them that balance CPU time, memory, and disk access.
Do you think that the cache-oblivious algorithm described in the paper 
could speed up index scans that hit disk, in Postgres's (and the 
OS/hardware's) multi-level memory case? (so that e.g. random_page_cost 
could go down?)
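As an aside, the cost arithmetic Greg describes can be sketched for the simplest case, a sequential scan. This is a toy version only, using the default seq_page_cost and cpu_tuple_cost and ignoring qual evaluation (cpu_operator_cost); it is not the planner's actual code:

```python
# Toy sequential-scan cost: disk component plus per-row CPU component.
# Defaults mirror postgresql.conf: seq_page_cost = 1.0, cpu_tuple_cost = 0.01.
def seqscan_cost(pages, tuples, seq_page_cost=1.0, cpu_tuple_cost=0.01):
    return pages * seq_page_cost + tuples * cpu_tuple_cost

# a 1000-page, 100000-row table:
print(seqscan_cost(1000, 100000))  # 2000.0
```

The planner computes such numbers for every candidate plan and picks the cheapest.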


regards,
Yeb Havinga



Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-21 Thread Yeb Havinga

David Jarvis wrote:

Hi, Yeb.

This is starting to go back to the design I used with MySQL:

* YEAR_REF - Has year and station
* MONTH_REF - Has month, category, and year reference
* MEASUREMENT - Has month reference, amount, and day

Normalizing by date parts was fast. Partitioning the tables by year 
won't do much good -- users will probably choose 1900 to 2009, 
predominately.

Ok, in that case it is a bad idea.
I thought about splitting the data by station by category, but that's 
~73000 tables. My understanding is that PostgreSQL uses files per 
index, which would be messy at the OS level (Linux 2.6.31). Even by 
station alone is 12139 tables, which might be tolerable for now, but 
with an order of magnitude more stations on the distant horizon, it 
will not scale.
Yes, I've read a few times now that PG's partitioning doesn't scale 
beyond a few hundred partitions.
I also thought about splitting the data by station district by 
category -- there are 79 districts, yielding 474 child tables, which 
is ~575000 rows per child table. Most of the time I'd imagine only one 
or two districts would be selected. (Again, hard to know exactly.)
I agree with Matthew Wakeling in a different post: it's probably wise to 
first see how fast things can get by using indexes. Only if that fails 
to be fast is partitioning an option. (Though sequentially scanning 0.5M 
rows is not cheap.)


I experimented a bit with a doy and year function.

-- note: leap year fix must still be added
create or replace function doy(timestamptz) RETURNS float8
as 'select extract(doy from $1);'
language sql
immutable
strict;
create or replace function year(timestamptz) RETURNS float8
as 'select extract(year from $1);'
language sql
immutable
strict;

\d parent
Table "public.parent"
Column |   Type   | Modifiers
+--+---
t  | timestamp with time zone |
y  | smallint |
Indexes:
   "doy_i" btree (doy(t))
   "year_i" btree (year(t))

A plan like the following is probably what you want

test=# explain select * from parent where doy(t) between 10 and 20 and 
year(t) between 1900 and 2009;
  
                              QUERY PLAN
-----------------------------------------------------------------------------
Bitmap Heap Scan on parent  (cost=9.95..14.97 rows=1 width=10)
  Recheck Cond: ((year(t) >= 1900::double precision) AND (year(t) <= 
2009::double precision) AND (doy(t) >= 10::double precision) AND (doy(t) 
<= 20::double precision))

  ->  BitmapAnd  (cost=9.95..9.95 rows=1 width=0)
->  Bitmap Index Scan on year_i  (cost=0.00..4.85 rows=10 width=0)
  Index Cond: ((year(t) >= 1900::double precision) AND 
(year(t) <= 2009::double precision))

->  Bitmap Index Scan on doy_i  (cost=0.00..4.85 rows=10 width=0)
  Index Cond: ((doy(t) >= 10::double precision) AND (doy(t) 
<= 20::double precision))

(7 rows)

regards,
Yeb Havinga






Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-21 Thread Yeb Havinga

Matthew Wakeling wrote:

On Fri, 21 May 2010, Yeb Havinga wrote:

For time based data I would for sure go for year based indexing.


On the contrary, most of the queries seem to be over many years, but 
rather restricting on the time of year. Therefore, partitioning by 
month or some other per-year method would seem sensible.
The fact is that at the time I wrote my mail, I had not read a 
specification of the distribution of parameters (or I missed it). That's 
why the sentence of my mail before the one you quoted said: "the 
partitioning is only useful for speed, if it matches how your queries 
select data." In most of the databases I've worked with, the recent data 
was queried most (accounting, medical), but I can see that for climate 
analysis this might be different.
Regarding the leap year problem, you might consider creating a 
modified day of year field, which always assumes that the year 
contains a leap day. Then a given number always resolves to a given 
date, regardless of year. If you then partition (or index) on that 
field, then you may get a benefit.
Shouldn't it be just the other way around - assume all years are 
non-leap years for the doy part field to be indexed?
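Matthew's modified day-of-year can be sketched as follows. Using leap-year month lengths (his direction) gives every (month, day) pair a unique, year-independent index; with non-leap lengths, Feb 29 would share an index with Mar 1, which may or may not matter for range queries. A sketch only, not tested against the actual measurement data:

```python
# Days per month assuming every year is a leap year; cumulative offsets
# give each calendar date a fixed index, regardless of year.
LEAP_MONTH_DAYS = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def fixed_doy(month, day):
    return sum(LEAP_MONTH_DAYS[:month - 1]) + day

print(fixed_doy(1, 1))    # 1
print(fixed_doy(3, 1))    # 61 -- same number in leap and non-leap years
print(fixed_doy(12, 31))  # 366
```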


regards,
Yeb Havinga




Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-21 Thread Yeb Havinga

There is a thing that might lead to confusion in the previous post:

create or replace function yearmod(int) RETURNS int
as 'select (($1 >> 2) %32);'
language sql
immutable
strict;

is equivalent with

create or replace function yearmod(int) RETURNS int
as 'select (($1 / 4) %32);'
language sql
immutable
strict;

and that is the function that was used with all the other output (it can 
be seen inlined in the explain output). I did not catch this until after 
the post.


regards,
Yeb Havinga





Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-21 Thread Yeb Havinga
) inherits(parent);
create table child32(check (yearmod(y)=31)) inherits(parent);
(32 rows)

Copy and paste output of this query in psql to create child tables.

Example with period 1970 to 1980:

test=# explain select * from parent where yearmod(y) between 
yearmod(1970) and yearmod(1980);
                              QUERY PLAN
-----------------------------------------------------------------------------

Result  (cost=0.00..305.00 rows=50 width=10)
  ->  Append  (cost=0.00..305.00 rows=50 width=10)
->  Seq Scan on parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
        ->  Seq Scan on child13 parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
        ->  Seq Scan on child14 parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
        ->  Seq Scan on child15 parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
        ->  Seq Scan on child16 parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
(12 rows)

This works: query for 11 consecutive years hits only 4 from 31.

But the between fails for yearmods that wrap the 31 boundary, what 
happens here between 1910 and 1920


test=# explain select * from parent where yearmod(y) between 
yearmod(1910) and yearmod(1920);
   QUERY PLAN   
--

Result  (cost=0.00..0.01 rows=1 width=0)
  One-Time Filter: false
(2 rows)

So for the wraparound case we need a CASE:

test=# explain select * from parent where case when yearmod(1910) <= 
yearmod(1920)

then yearmod(y) between yearmod(1910) and yearmod(1920)
else (yearmod(y) >= yearmod(1910) or yearmod(y) <= yearmod(1920)) end;
                              QUERY PLAN
-----------------------------------------------------------------------------

Result  (cost=0.00..305.00 rows=5665 width=10)
  ->  Append  (cost=0.00..305.00 rows=5665 width=10)
->  Seq Scan on parent  (cost=0.00..61.00 rows=1133 width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
        ->  Seq Scan on child1 parent  (cost=0.00..61.00 rows=1133 width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
        ->  Seq Scan on child30 parent  (cost=0.00..61.00 rows=1133 width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
        ->  Seq Scan on child31 parent  (cost=0.00..61.00 rows=1133 width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
        ->  Seq Scan on child32 parent  (cost=0.00..61.00 rows=1133 width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
(12 rows)

This should work for all year ranges and I think is a good solution for 
partitioning on year with a fixed amount of partitions.
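The arithmetic above, including the wraparound case the CASE handles, can be checked with a quick sketch (the numbers match the filters in the two plans above):

```python
def yearmod(y):
    return (y // 4) % 32  # same as the SQL yearmod() function above

def in_partition_range(y, lo_year, hi_year):
    lo, hi = yearmod(lo_year), yearmod(hi_year)
    if lo <= hi:                                  # plain BETWEEN case
        return lo <= yearmod(y) <= hi
    return yearmod(y) >= lo or yearmod(y) <= hi   # wrapped past 31 -> 0

assert (yearmod(1970), yearmod(1980)) == (12, 15)  # plan 1: >= 12 AND <= 15
assert (yearmod(1910), yearmod(1920)) == (29, 0)   # plan 2: >= 29 OR <= 0
assert in_partition_range(1915, 1910, 1920)
assert not in_partition_range(1975, 1910, 1920)
```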


From the optimizer's perspective I wonder what the best access path for 
this kind of query would be (if there were no partitions). Building on 
ideas from one of Thom Brown's first replies, with indexes on year and 
doy, and Tom Lane's remark about the leap year problem: suppose leap 
years did not exist; then with an index on year and a different index on 
doy, a bitmap AND of scans of both indexes could provide an optimal 
path. Maybe this would still be possible if the leap year problem could 
be 'fixed' by an additional condition in the where clause that filters 
the surplus records.


regards,
Yeb Havinga




Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Yeb Havinga

Tom Lane wrote:

David Jarvis  writes:
  

I was hoping to eliminate this part of the query:
(cast(extract( YEAR FROM m.taken ) + greatest( -1 *
  sign(
(extract( YEAR FROM m.taken )||'-12-31')::date -
(extract( YEAR FROM m.taken )||'-01-01')::date ), 0
) AS text)||'-12-31')::date

That uses functions to create the dates, which is definitely the problem.



Well, it's not the functions per se that's the problem, it's the lack of
a useful index on the expression.  But as somebody remarked upthread,
that expression doesn't look correct at all.  Doesn't the whole
greatest() subexpression reduce to a constant?
  
That somebody was probably me. I still think the whole BETWEEN 
expression is a tautology. A small test did not provide a 
counterexample. In the select below everything but the select was 
copy/pasted.


create table m (taken timestamptz);
insert into m values (now());
insert into m values ('1900-12-31');
insert into m values ('2000-04-06');
select m.taken BETWEEN
   /* Start date. */
 (extract( YEAR FROM m.taken )||'-01-01')::date AND
   /* End date. Calculated by checking to see if the end date wraps
 into the next year. If it does, then add 1 to the current year.
   */
   (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
 sign(
   (extract( YEAR FROM m.taken )||'-12-31')::date -
   (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
   ) AS text)||'-12-31')::date from m;
 ?column?
--
t
t
t
(3 rows)
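The same check in Python makes it easier to see why: dec31 - jan1 is always positive, so the sign is 1, the greatest() term is max(-1, 0) = 0, and the end date is always Dec 31 of the row's own year, hence the tautology:

```python
from datetime import date

def wrapped_end_year(d):
    # mirrors the SQL: year + greatest(-1 * sign(dec31 - jan1), 0)
    diff_days = (date(d.year, 12, 31) - date(d.year, 1, 1)).days  # 364 or 365
    sign = (diff_days > 0) - (diff_days < 0)                      # always 1
    return d.year + max(-1 * sign, 0)                             # year + 0

for d in (date(1900, 12, 31), date(2000, 4, 6), date(2010, 5, 20)):
    assert wrapped_end_year(d) == d.year
    # so: taken BETWEEN Jan 1 and Dec 31 of its own year -- always true
    assert date(d.year, 1, 1) <= d <= date(wrapped_end_year(d), 12, 31)
```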

Another thing: IF the climate measurements table is partitioned on time 
(e.g. each year?), then a function-based index on the year part of 
m.taken is useless, pardon my French. I'm not sure if it is partitioned 
that way, but it is an interesting thing to inspect, and perhaps the 
query could be rewritten to use constraint exclusion.


regards,
Yeb Havinga




Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Yeb Havinga

Hello David,
The table aggregates 237 million rows from its child tables. The 
sluggishness comes from this part of the query:


  m.taken BETWEEN
/* Start date. */
  (extract( YEAR FROM m.taken )||'-01-01')::date AND
/* End date. Calculated by checking to see if the end date wraps
  into the next year. If it does, then add 1 to the current year.
*/
(cast(extract( YEAR FROM m.taken ) + greatest( -1 *
  sign(
(extract( YEAR FROM m.taken )||'-12-31')::date -
(extract( YEAR FROM m.taken )||'-01-01')::date ), 0
) AS text)||'-12-31')::date
Either I have had too little coffee and completely misunderstand this 
expression, or it is always true and can be omitted. Could you explain a 
bit what this part tries to do, and maybe also show its original 
counterpart in the source database?


regards,
Yeb Havinga




Re: [PERFORM] old server, new server, same performance

2010-05-14 Thread Yeb Havinga

Kevin Grittner wrote:

Piotr Legiecki  wrote:
 
  

Why there is no difference in database speed between those two
machines?

 
Could you post the contents of the postgresql.conf files for both

(stripped of comments) and explain what you're using for your
benchmarks?  In particular, it would be interesting to know how many
concurrent connections are active running what mix of queries.
  
It would be also interesting to know how many disks are there in the new 
server, and the size of the database (select  
pg_size_pretty(pg_database_size('yourdb'))).


regards,
Yeb Havinga




Re: [PERFORM] Planner not using column limit specified for one column for another column equal to first

2010-04-16 Thread Yeb Havinga

Tom Lane wrote:

Віталій Тимчишин writes:
  

I've thought and someone in this list've told me that this should be done
automatically.



No, that's not true.  We do make deductions about transitive equalities,
ie, given WHERE a=b AND b=c the planner will infer a=c and use that if
it's helpful.  We don't make deductions about inequalities such as a>c.
In theory there's enough information available to do so, but overall
trying to do that would probably waste more cycles than it would save.
You'd need a lot of expensive new planner infrastructure, and in the
vast majority of queries it wouldn't produce anything very helpful.
  
New expensive planner infrastructure to support inferring a>c from a>b 
and b>c, yes.


But I wonder if something like Leibniz's principle of identity holds for 
members of the same equivalence class, e.g. if x and y are both members 
of the same EC, then for every predicate P, P(x) iff P(y). Probably not 
for every predicate (like varno = 2 or attname='x'), but for query 
evaluation the objects denoted by the variables are the same, since that 
is the standard meaning of the = operator. I cannot think of any 
standard (btree) operator where 'Leibniz' would fail in this case.
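The transitive-equality deduction Tom describes can be sketched with a tiny union-find; the planner's real EquivalenceClass machinery is of course far more involved, this only shows the idea:

```python
def equivalence_classes(equalities):
    # union-find: variables joined by '=' end up with the same representative
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in equalities:
        parent[find(a)] = find(b)
    return find

find = equivalence_classes([("a", "b"), ("b", "c")])
assert find("a") == find("c")  # a=b AND b=c lets the planner use a=c
```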


regards,
Yeb Havinga





Re: [PERFORM] Planner not using column limit specified for one column for another column equal to first

2010-04-16 Thread Yeb Havinga

Віталій Тимчишин wrote:


BTW: Changing slow query to inner joins do not make it fast

I'm interested to see the query and plan of the slow query with inner joins.




Another thing is it seems that the number of rows guessed is far
off from the actual number of rows; is the number 500
artificial, or are your statistics old or your histograms/MCVs too small?


Nope, I suppose this is because of limit. If I remove the limit, the 
estimations are quite correct. There are ~6 millions of row in each table.

Yes, that makes sense.

regards,
Yeb Havinga




Re: [PERFORM] Planner not using column limit specified for one column for another column equal to first

2010-04-16 Thread Yeb Havinga

Віталій Тимчишин wrote:

Hello.

I have a query that performs very poor because there is a limit on 
join column that is not applied to other columns:


select * from company this_ left outer join company_tag this_1_ on 
this_.id=this_1_.company_id left outer join company_measures 
companymea2_ on this_.id=companymea2_.company_id left outer join 
company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id 
= 7 and this_.id>5000 
and this_1_.company_id>5000

order by this_.id asc limit 1000;

(plan1.txt)
Total runtime: 7794.692 ms

At the same time if I apply the limit (>5000) to other columns in 
query itself it works like a charm:


select * from company this_ left outer join company_tag this_1_ on 
this_.id=this_1_.company_id left outer join company_measures 
companymea2_ on this_.id=companymea2_.company_id left outer join 
company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id 
= 7 and this_.id>5000 
and this_1_.company_id>5000

and companymea2_.company_id>5000 and ces3_.company_id>5000
order by this_.id asc limit 1000;

(plan2.txt)
Total runtime: 27.547 ms

I've thought and someone in this list've told me that this should be 
done automatically.
Yes, if you have a=b and b=c in a query, then the optimizer figures out 
that a=c as well (a, b and c are then members of the same equivalence class).


However, both queries are not the same, since the joins you're using are 
outer joins. In the first it's possible that records are returned for 
company records with no matching ces3_ record; the ces3_ columns are 
null in that case. In the second query no NULL ces3_ information may be 
returned.
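A sketch of why the predicate cannot simply be copied to the nullable side of a left join (toy data, not the actual company schema):

```python
def left_outer_join(left, right, lkey, rkey):
    index = {}
    for r in right:
        index.setdefault(r[rkey], []).append(r)
    # unmatched left rows are kept, null-extended with None
    return [(l, m) for l in left for m in index.get(l[lkey], [None])]

company = [{"id": 6000}, {"id": 7000}]
measures = [{"company_id": 6000}]          # no measures row for company 7000

joined = left_outer_join(company, measures, "id", "company_id")
assert ({"id": 7000}, None) in joined      # null-extended row survives

# adding companymea2_.company_id > 5000 discards that row -> different result
filtered = [(l, m) for l, m in joined if m is not None and m["company_id"] > 5000]
assert ({"id": 7000}, None) not in filtered
```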


Another thing is it seems that the number of rows guessed is far off 
from the actual number of rows; is the number 500 artificial, or are 
your statistics old or your histograms/MCVs too small?


regards,
Yeb Havinga




Re: [PERFORM] Some question

2010-04-07 Thread Yeb Havinga

Scott Marlowe wrote:

2010/3/31 Ľubomír Varga :
  

Hi, stright to my "problem":
If I try to select constant 1 from table with two rows, it will be something
like this:

explain
SELECT * FROM t_route
   WHERE t_route.route_type_fk = (SELECT id FROM t_route_type WHERE type = 
2)
   limit 4;

"Limit  (cost=1.02..1.91 rows=4 width=2640)"
"  InitPlan"
"->  Seq Scan on t_route_type  (cost=0.00..1.02 rows=1 width=8)"
"  Filter: ("type" = 2)"
"  ->  Seq Scan on t_route  (cost=0.00..118115.25 rows=535090 width=2640)"
"Filter: (route_type_fk = $0)"




Looking at this it looks like you're using prepared queries, which
can't make as good of a decision as regular queries because the values
are opaque to the planner.

Can you provide us with the output of explain analyze of that query?
  
ISTM that the initplan 'outputs' id as $0, so it is not a prepared 
query. Maybe EXPLAIN ANALYZE VERBOSE of the query reveals that better. 
But both plans show seqscans of the large table, so it is surprising 
that the performance is different if the filter expression uses the 
same values. Are you sure the output of SELECT id FROM t_route_type 
WHERE type = 2 is equal to 1?


regards,
Yeb Havinga




Re: [PERFORM] mysql to postgresql, performance questions

2010-03-25 Thread Yeb Havinga

Scott Marlowe wrote:

On Thu, Mar 25, 2010 at 2:29 PM, Pierre C  wrote:
  

Hannu Krosing wrote:
  

Pulling the plug should not corrupt a postgreSQL database, unless it was
using disks which lie about write caching.



Didn't we recently put the old wives' 'the disks lied' tale to bed in
favour of actually admitting that some well known filesystems and software
raid systems have had trouble with their write barriers?
  

I put a cheap UPS on the home server (which uses Software RAID) precisely
because I don't really trust that stuff, and there is also the RAID5 write
hole... and maybe the RAID1 write hole too... and installing a UPS takes
less time that actually figuring out if the system is power-loss-safe.



Very true, a UPS might not cover every possible failure mode, but it
sure takes care of an aweful lot of the common ones.
  
Yeah, but the original post was about mythtv boxes, which usually do not 
have UPSes. My suggestion about proper setup of the WAL was based on 
some experience of my own. What I did was probably the fastest path to 
corrupt database files: a diskless mythtv box that booted from the 
fileserver in the attic (with UPS btw), but I was too lazy (after x days 
of lirc / xorg / ivtv / rtc / xmltv etc work) to move the default 
configured mysql database from the mythtv box (with the root filesystem 
and also mysql on the nfs mount) to a mysql running on the fileserver 
itself. On top of that I had nfs mounted async for speed. Really, after 
x days of configuration to get things running (my wife thinks it's hobby 
time but it really isn't), all that is on your mind is: does it work 
well enough? Fine, I'll iron out non-essential things when they pop up, 
and if the db becomes corrupt, I had database backups. In the end I had 
a few times a corrupt table that was always easily repaired with the 
mysqlcheck tool.


Based on this experience I do not think that reliability alone will 
convince mythtv developers/users to switch to postgresql, and besides 
that, as a developer and user myself, it's always in a way funny to see 
how creative people can be in finding ways to not properly use (your) 
software ;-)


regards,
Yeb Havinga




Re: [PERFORM] mysql to postgresql, performance questions

2010-03-24 Thread Yeb Havinga

Yeb Havinga wrote:

Greg Smith wrote:

Tom Lane wrote:

So has anyone looked at porting MythTV to PG?
  


Periodically someone hacks together something that works, last big 
effort I'm aware of was in 2006, and then it bit rots away.  I'm sure 
we'd get some user uptake on the result--MySQL corruption is one of 
the top ten cause of a MythTV system crashing.
It would be the same with PG, unless the pg cluster configuration with 
mythtv would come with a properly configured WAL - I had corrupted 
tables (and a personal wiki entry  

forgot to add "how to fix the corrupted tables", sorry
(the other mysql database in my house) *only* when I sometimes took 
the risk of not shutting down the machine properly when e.g. the 
remote was missing).


regards,
Yeb Havinga





Re: [PERFORM] mysql to postgresql, performance questions

2010-03-24 Thread Yeb Havinga

Greg Smith wrote:

Tom Lane wrote:

So has anyone looked at porting MythTV to PG?
  


Periodically someone hacks together something that works, last big 
effort I'm aware of was in 2006, and then it bit rots away.  I'm sure 
we'd get some user uptake on the result--MySQL corruption is one of 
the top ten cause of a MythTV system crashing.
It would be the same with PG, unless the pg cluster configuration with 
mythtv would come with a properly configured WAL - I had corrupted 
tables (and a personal wiki entry  (the other mysql database in my 
house) *only* when I sometimes took the risk of not shutting down the 
machine properly when e.g. the remote was missing).


regards,
Yeb Havinga



Re: [PERFORM] GiST index performance

2010-03-22 Thread Yeb Havinga

Matthew Wakeling wrote:

On Sat, 20 Mar 2010, Yeb Havinga wrote:
The gist virtual pages would then match more the original blocksizes 
that
were used in Guttman's R-tree paper (first google result, then figure 
4.5).
Since the nature/characteristics of the underlying datatypes and keys 
is not
changed, it might be that with the disk pages getting larger, gist 
indexing

has therefore become unexpectedly inefficient.


Yes, that is certainly a factor. For example, the page size for bioseg 
which we use here is 130 entries, which is very excessive, and doesn't 
allow very deep trees. On the other hand, it means that a single disc 
seek performs quite a lot of work.
Yeah, I only did in-memory fitting tests and wondered about increased 
IOs. However, I bet that even for bigger-than-RAM dbs, the benefit of 
having to fan out to fewer pages still outweighs the over-general 
non-leaf nodes and might still result in fewer disk IOs. I redid some 
earlier benchmarking with other datatypes with a 1kB block size, and 
also multicolumn gist, and the multicolumn variant had an even greater 
benefit than the single column indexes, both for equality and range 
scans (like execution times down to 20% of original). If gist is 
important to you, I really recommend doing a test with 1kB blocks.
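Back-of-envelope (ignoring fill factor and non-leaf overhead, and assuming entries per page shrink roughly in proportion to page size, which is an assumption): tree depth grows with log base fanout, so shrinking the page only adds a level or two while making each non-leaf key far less general:

```python
import math

def gist_levels(leaf_tuples, entries_per_page):
    # levels such that fanout^levels covers all leaf tuples (root included)
    return max(1, math.ceil(math.log(leaf_tuples, entries_per_page)))

# ~130 entries/page at 8kB (the bioseg figure above) vs ~16 at 1kB, 1M rows:
print(gist_levels(10**6, 130))  # 3
print(gist_levels(10**6, 16))   # 5 -- matches gist_stat's "Number of levels: 5"
```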


regards,
Yeb Havinga



Re: [PERFORM] GiST index performance

2010-03-20 Thread Yeb Havinga
On Fri, Mar 19, 2010 at 10:16 PM, Kenneth Marshall  wrote:

> Hi Yeb,
>
> I have not looked at the gist code, but would it be possible to
> make virtual pages that have a size that is 1/power-of-2 * blocksize.
> Then the leaf node could be 1/8 or even 1/16 the size of the full
> pagesize.
>

Hello Ken,

The gist virtual pages would then more closely match the original block sizes
that were used in Guttman's R-tree paper (first google result, then figure 4.5).
Since the nature/characteristics of the underlying datatypes and keys is not
changed, it might be that with the disk pages getting larger, gist indexing
has become unexpectedly inefficient.

I am not really into the core gist code either, but I do have a motivation
to dive into it (more than 200% performance increase in Matthew's test case).
However, I'd like to check for community support before working on it.

Maybe Theodor or Oleg could say something about how easy or hard it is to
do?

regards,
Yeb Havinga


>
> Regards,
> Ken
>
> On Fri, Mar 19, 2010 at 09:49:30PM +0100, Yeb Havinga wrote:
> > Yeb Havinga wrote:
> >>
> >> Since the gistpagesize is derived from the database blocksize, it might
> be
> >> wise to set the blocksize low for this case, I'm going to play with this
> a
> >> bit more.
> > Ok, one last mail before it turns into spam: with a 1KB database
> blocksize,
> > the query now runs in 30 seconds (original 70 on my machine, shared
> buffers
> > 240MB).
> > The per inner loop access time now 24 microsecs compared to on my machine
> > original 74 microsecs with 8KB size and 8 for the btree scan. Not a bad
> > speedup with such a simple parameter :-)
> >
> > postgres=# EXPLAIN ANALYSE SELECT * FROM a, b WHERE a.a BETWEEN b.b AND
> b.b
> > + 2;
> >QUERY PLAN
> >
> >
> ---
> > Nested Loop  (cost=0.00..4169159462.20 rows=09777668 width=8) (actual
> > time=0.184..29540.355 rows=297 loops=1)
> >   ->  Seq Scan on b  (cost=0.00..47037.62 rows=62 width=4) (actual
> > time=0.024..1783.484 rows=100 loops=1)
> >   ->  Index Scan using a_a on a  (cost=0.00..2224.78 rows=14 width=4)
> > (actual time=0.021..0.024 rows=3 loops=100)
> > Index Cond: ((a.a >= b.b) AND (a.a <= (b.b + 2)))
> > Total runtime: 30483.303 ms
> > (5 rows)
> >
> >
> > postgres=# select gist_stat('a_a');
> > gist_stat
> > ---
> > Number of levels:  5 +
> > Number of pages:   47618 +
> > Number of leaf pages:  45454 +
> > Number of tuples:  1047617   +
> > Number of invalid tuples:  0 +
> > Number of leaf tuples: 100   +
> > Total size of tuples:  21523756 bytes+
> > Total size of leaf tuples: 20545448 bytes+
> > Total size of index:   48760832 bytes+
> > (1 row)
> >
> >
> > --
> > Sent via pgsql-performance mailing list (
> pgsql-performance@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-performance
> >
>


Re: [PERFORM] GiST index performance

2010-03-19 Thread Yeb Havinga

Yeb Havinga wrote:


Since the gistpagesize is derived from the database blocksize, it 
might be wise to set the blocksize low for this case, I'm going to 
play with this a bit more.
Ok, one last mail before it turns into spam: with a 1KB database 
blocksize, the query now runs in 30 seconds (original 70 on my machine, 
shared buffers 240MB).
The per-inner-loop access time is now 24 microsecs, compared to the
original 74 microsecs with 8KB block size on my machine, and 8 for the
btree scan. Not a bad speedup for such a simple parameter :-)


postgres=# EXPLAIN ANALYSE SELECT * FROM a, b WHERE a.a BETWEEN b.b AND 
b.b + 2;
                                  QUERY PLAN
---
Nested Loop  (cost=0.00..4169159462.20 rows=09777668 width=8) 
(actual time=0.184..29540.355 rows=297 loops=1)
  ->  Seq Scan on b  (cost=0.00..47037.62 rows=62 width=4) (actual 
time=0.024..1783.484 rows=100 loops=1)
  ->  Index Scan using a_a on a  (cost=0.00..2224.78 rows=14 
width=4) (actual time=0.021..0.024 rows=3 loops=100)

Index Cond: ((a.a >= b.b) AND (a.a <= (b.b + 2)))
Total runtime: 30483.303 ms
(5 rows)


postgres=# select gist_stat('a_a');
gist_stat 
---

Number of levels:  5 +
Number of pages:   47618 +
Number of leaf pages:  45454 +
Number of tuples:  1047617   +
Number of invalid tuples:  0 +
Number of leaf tuples: 100   +
Total size of tuples:  21523756 bytes+
Total size of leaf tuples: 20545448 bytes+
Total size of index:   48760832 bytes+

(1 row)


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] GiST index performance

2010-03-19 Thread Yeb Havinga

Yeb Havinga wrote:

Matthew Wakeling wrote:
A second quite distinct issue is the general performance of GiST 
indexes
which is also mentioned in the old thread linked from Open Items. 
For

that, we have a test case at
http://archives.postgresql.org/pgsql-performance/2009-04/msg00276.php 
for
btree_gist indexes. I have a similar example with the bioseg GiST 
index. I
have completely reimplemented the same algorithms in Java for 
algorithm
investigation and instrumentation purposes, and it runs about a 
hundred
times faster than in Postgres. I think this is a problem, and I'm 
willing

to do some investigation to try and solve it.

More gist performance..

Some background: I've been involved in developing several datatypes that 
make use of gist indexes (we just published a paper on it, see 
http://arxiv.org/abs/1003.3370), that's the main reason I'm very much 
interested in possible improvements in gist index speed. One of the 
datatypes was 'very badly' indexable: non leaf pages were getting very 
general keys, so in the end we could see from the scanning times 
compared to sequential scans that the whole index was being scanned. One 
thing I remember was the idea that somehow it would be nice if the depth 
of the gist tree could be fiddled with: in that case keys of non leaf 
nodes would be more specific again. In the original Guttman R-tree paper 
there was mention of a parameter that determined the size of entries in 
nodes and thereby indirectly the depth of the tree. I missed that in the 
PostgreSQL gist api.


One thing Gist scanning does very often is key comparisons. So another 
approach is to try to limit those, and again this might be possible by 
increasing the depth / decreasing the number of entries per page. I just did a 
test where in gistfitpage the gistpagesize was divided by 5 and 
something similar in gistnospace.


Scantime before adjustment: about 70 seconds.
After adjustment: 45 seconds.

With gist_stat from the gevel package confirmed that the depth was now 4 
(original 3). Drawback: bigger index size because pages are not filled 
completely anymore.


The explain shows  (actual time=0.030..0.032) for the inner loop times, 
which is less than half of the original scan time.


Since the gistpagesize is derived from the database blocksize, it might 
be wise to set the blocksize low for this case, I'm going to play with 
this a bit more.


regards,
Yeb Havinga





Re: [PERFORM] GiST index performance

2010-03-19 Thread Yeb Havinga

Yeb Havinga wrote:

Yeb Havinga wrote:

Matthew Wakeling wrote:

Matthew Wakeling wrote:
A second quite distinct issue is the general performance of GiST 
indexes

which is also mentioned in the old thread linked from Open Items. For
that, we have a test case at
http://archives.postgresql.org/pgsql-performance/2009-04/msg00276.php 
for
btree_gist indexes. I have a similar example with the bioseg GiST 
index. I
have completely reimplemented the same algorithms in Java for 
algorithm
investigation and instrumentation purposes, and it runs about a 
hundred
times faster than in Postgres. I think this is a problem, and I'm 
willing

to do some investigation to try and solve it.
I have not made any progress on this issue. I think Oleg and Teodor 
would be better placed working it out. All I can say is that I 
implemented the exact same indexing algorithm in Java, and it 
performed 100 times faster than Postgres. Now, Postgres has to do a 
lot of additional work, like mapping the index onto disc, locking 
pages, and abstracting to plugin user functions, so I would expect 
some difference - I'm not sure 100 times is reasonable though. I 
tried to do some profiling, but couldn't see any one section of code 
that was taking too much time. Not sure what I can further do.
Looked in the code a bit more - only the index nodes are compressed at 
index creation; the consistent function does not compress queries, so 
no pallocs there. However, when running Matthew's example from 
http://archives.postgresql.org/pgsql-performance/2009-04/msg00276.php 
with the gist index, the coverage shows in gistget.c: 100 palloc0's 
of gistsearchstack at line 152 and 2010982 pallocs, also of the 
gistsearchstack, on line 342. Two pfrees are also hit a lot: line 195: 
1010926 of a stackentry, and line 293: 200056 times. My $0.02 is 
that the pain is here. My knowledge of gistget and the other sources in 
access/gist is zero, but couldn't it be possible to determine the 
maximum needed size of the stack, allocate it at once, and use 
a pop/push kind of API?
Wasted some time today on a ghost chase... I thought that removing the 
millions of pallocs would help, so I wrote an alternative to the 
gistsearchstack stack, only to find out that it was not the index scanning 
itself that caused millions of pallocs, but the scan being in the inner 
loop that was called 100 times. The actual scanning time was not 
changed significantly.
The actual scanning time in my vm is for btree (actual 
time=0.006..0.008) and gist (actual time=0.071..0.074). An error in my 
searchstack alternative caused pages to be scanned twice, returning twice 
the amount of rows (6 instead of 3 each time). This resulted in a 
likewise increase of ms (actual time=0.075..0.150). Somewhere I hit 
something that causes ~= 0.070 ms twice. For a single index scan, 
0.070ms startup time for gist vs 0.006 for btree doesn't seem like a big 
problem, but yeah when calling it a million times...
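For what it's worth, the preallocated pop/push stack idea mentioned above
could look something like this minimal sketch - hypothetical names and a plain
malloc/realloc buffer, not the actual gist data structures or palloc machinery:

```c
#include <stdlib.h>

/* Hypothetical search stack that is allocated once per scan instead of
 * palloc'ing a node per visited page. Capacity grows by doubling, so
 * the millions-of-small-allocations pattern disappears. */
typedef struct
{
    unsigned   *blknos;   /* block numbers of pages still to visit */
    int         top;
    int         cap;
} SearchStack;

static void stack_init(SearchStack *s, int cap)
{
    s->blknos = malloc(cap * sizeof *s->blknos);
    s->top = 0;
    s->cap = cap;
}

static void stack_push(SearchStack *s, unsigned blkno)
{
    if (s->top == s->cap)
    {
        s->cap *= 2;
        s->blknos = realloc(s->blknos, s->cap * sizeof *s->blknos);
    }
    s->blknos[s->top++] = blkno;
}

/* Returns 0 when the stack is empty, else pops into *blkno. */
static int stack_pop(SearchStack *s, unsigned *blkno)
{
    if (s->top == 0)
        return 0;
    *blkno = s->blknos[--s->top];
    return 1;
}
```

One allocation per scan (plus the occasional doubling) instead of one palloc
per visited page; whether gistget.c's interleaving of stack maintenance and
page processing allows this is exactly what would need investigating.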


regards,
Yeb Havinga




Re: [PERFORM] GiST index performance

2010-03-17 Thread Yeb Havinga

Yeb Havinga wrote:

Matthew Wakeling wrote:

Matthew Wakeling wrote:
A second quite distinct issue is the general performance of GiST 
indexes

which is also mentioned in the old thread linked from Open Items. For
that, we have a test case at
http://archives.postgresql.org/pgsql-performance/2009-04/msg00276.php 
for
btree_gist indexes. I have a similar example with the bioseg GiST 
index. I
have completely reimplemented the same algorithms in Java for 
algorithm
investigation and instrumentation purposes, and it runs about a 
hundred
times faster than in Postgres. I think this is a problem, and I'm 
willing

to do some investigation to try and solve it.
I have not made any progress on this issue. I think Oleg and Teodor 
would be better placed working it out. All I can say is that I 
implemented the exact same indexing algorithm in Java, and it 
performed 100 times faster than Postgres. Now, Postgres has to do a 
lot of additional work, like mapping the index onto disc, locking 
pages, and abstracting to plugin user functions, so I would expect 
some difference - I'm not sure 100 times is reasonable though. I 
tried to do some profiling, but couldn't see any one section of code 
that was taking too much time. Not sure what I can further do.

Hello Matthew and list,

A lot of time is spent in the gistget.c code, and a lot of FunctionCall5's 
go to the gist consistent function, which is out of sight for gprof.
Something different but related since also gist: we noticed before 
that gist indexes that use a compressed form for index entries suffer 
from repeated compress calls on query operands (see 
http://archives.postgresql.org/pgsql-hackers/2009-05/msg00078.php).


The btree_gist int4 compress function calls the generic 
gbt_num_compress, which does a palloc. Maybe this palloc is also hit 
a lot when scanning the index, because the constants that are queried 
with are repeatedly compressed and palloced.
Looked in the code a bit more - only the index nodes are compressed at 
index creation; the consistent function does not compress queries, so 
no pallocs there. However, when running Matthew's example from 
http://archives.postgresql.org/pgsql-performance/2009-04/msg00276.php 
with the gist index, the coverage shows in gistget.c: 100 palloc0's 
of gistsearchstack at line 152 and 2010982 pallocs, also of the 
gistsearchstack, on line 342. Two pfrees are also hit a lot: line 195: 
1010926 of a stackentry, and line 293: 200056 times. My $0.02 is 
that the pain is here. My knowledge of gistget and the other sources in 
access/gist is zero, but couldn't it be possible to determine the 
maximum needed size of the stack, allocate it at once, and use a 
pop/push kind of API?


regards,
Yeb Havinga







regards,
Yeb Havinga







Re: [PERFORM] GiST index performance

2010-03-16 Thread Yeb Havinga

Matthew Wakeling wrote:

Matthew Wakeling wrote:
A second quite distinct issue is the general performance of GiST 
indexes

which is also mentioned in the old thread linked from Open Items. For
that, we have a test case at
http://archives.postgresql.org/pgsql-performance/2009-04/msg00276.php 
for
btree_gist indexes. I have a similar example with the bioseg GiST 
index. I

have completely reimplemented the same algorithms in Java for algorithm
investigation and instrumentation purposes, and it runs about a hundred
times faster than in Postgres. I think this is a problem, and I'm 
willing

to do some investigation to try and solve it.
I have not made any progress on this issue. I think Oleg and Teodor 
would be better placed working it out. All I can say is that I 
implemented the exact same indexing algorithm in Java, and it 
performed 100 times faster than Postgres. Now, Postgres has to do a 
lot of additional work, like mapping the index onto disc, locking 
pages, and abstracting to plugin user functions, so I would expect 
some difference - I'm not sure 100 times is reasonable though. I tried 
to do some profiling, but couldn't see any one section of code that 
was taking too much time. Not sure what I can further do.

Hello Matthew and list,

A lot of time is spent in the gistget.c code, and a lot of FunctionCall5's 
go to the gist consistent function, which is out of sight for gprof.
Something different but related since also gist: we noticed before that 
gist indexes that use a compressed form for index entries suffer from 
repeated compress calls on query operands (see 
http://archives.postgresql.org/pgsql-hackers/2009-05/msg00078.php).


The btree_gist int4 compress function calls the generic 
gbt_num_compress, which does a palloc. Maybe this palloc is also hit a 
lot when scanning the index, because the constants that are queried with 
are repeatedly compressed and palloced.


regards,
Yeb Havinga





Re: [PERFORM] Bad query plan inside EXISTS clause

2010-03-10 Thread Yeb Havinga

Yeb Havinga wrote:

Kenneth Marshall wrote:

EXISTS matches NULLs too and since they are not indexed a
sequential scan is needed to check for them. Try using
IN instead.
  

This is nonsense in more than one way.
Hit ctrl-return a bit too soon - EXISTS does not match nulls but a set of 
records that is either empty or not empty. Also, it is possible to index 
table columns with nulls, and the indexes can still be used. Besides 
filtering record sets with expressions, indexes are also used for 
ordering; there the effect of indexes with nulls can be seen: where to 
put them, before or after the non-nulls? So indexes can be used perfectly 
well in conjunction with nulls.

I found the original mail rather intriguing and played with an example 
myself a bit, but could not repeat the behavior (9.0 devel version); in 
my case the EXISTS used an index. Maybe it has something to do with the 
fact that the planner estimates to return 5 rows, even when the actual 
numbers list only 1 hit. In the EXISTS case, it can stop at the first 
hit; in the select-all-rows case, it must return all rows. Maybe a 
better plan emerges with better statistics?


regards,
Yeb Havinga




Re: [PERFORM] Bad query plan inside EXISTS clause

2010-03-10 Thread Yeb Havinga

Kenneth Marshall wrote:

EXISTS matches NULLs too and since they are not indexed a
sequential scan is needed to check for them. Try using
IN instead.
  

This is nonsense in more than one way.

regards
Yeb Havinga




Re: [PERFORM] Strange workaround for slow query

2010-03-10 Thread Yeb Havinga
o inner.

I do agree with this assessment. I'm sort of improving the performance 
of an existing implementation of ours, for which I'm not aware why 
they chose for LEFT OUTER. I did, however, test things with INNER as 
well, with the same results, so I decided to stick with what I 
encountered in the existing implementation. But it's on my mind as 
well ;-)


Thanks for the formatted output. The difference in speed is caused by 
the first query having to read 50k rows from an index whose only filter 
expression is eventtype_id = 71, while the second has the extra 
knowledge from the scan of events_event_type in the nestloop's outer 
loop, which returns 0 rows in all cases and is hence a lot faster, even 
though that scan is called 16 times.


But the big question is: why does the planner choose plan 1 in the 
first case, and how to fix that? My $0.02 would be to 'help' the planner 
find a better plan. Of course you did ANALYZE, but another idea is to 
increase default_statistics_target if it is still at the default value 
of 10. More info at 
http://www.postgresql.org/docs/8.3/static/runtime-config-query.html. And 
also to 'help' the planner: I'd just change the query to an inner join 
in this case, since there cannot be null tuples in the right-hand side here.


regards,
Yeb Havinga




Re: [PERFORM] Strange workaround for slow query

2010-03-10 Thread Yeb Havinga

sverha...@wps-nl.com wrote:


Hi,


EXPLAIN SELECT * FROM events_events LEFT OUTER JOIN events_event_types ON
eventType_id=events_event_types.id WHERE severity=20 AND (eventType_id
IN (71)) ORDER BY datetime DESC limit 50;
QUERY PLAN
---
Limit (cost=3.23..200.31 rows=50 width=131)
  -> Nested Loop (cost=3.23..49139.16 rows=12466 width=131)
       -> Index Scan Backward using events_events_eventtype_id_datetime_ind
          on events_events (cost=0.00..48886.61 rows=12466 width=93)
            Index Cond: (eventtype_id = 71)
       -> Materialize (cost=3.23..3.24 rows=1 width=38)
            -> Seq Scan on events_event_types (cost=0.00..3.23 rows=1 width=38)
                 Filter: ((id = 71) AND (severity = 20))


EXPLAIN SELECT * FROM events_events LEFT OUTER JOIN events_event_types ON
eventType_id=events_event_types.id WHERE severity=70 AND (eventType_id
IN (71, 999)) ORDER BY datetime DESC LIMIT 50;
QUERY PLAN

Limit (cost=27290.24..27290.37 rows=50 width=131)
  -> Sort (cost=27290.24..27303.15 rows=5164 width=131)
       Sort Key: events_events.datetime
       -> Nested Loop (cost=22.95..27118.70 rows=5164 width=131)
            -> Seq Scan on events_event_types (cost=0.00..3.02 rows=17 width=38)
                 Filter: (severity = 70)
            -> Bitmap Heap Scan on events_events (cost=22.95..1589.94 rows=408 width=93)
                 Recheck Cond: ((events_events.eventtype_id = ANY ('{71,999}'::bigint[]))
                   AND (events_events.eventtype_id = events_event_types.id))
                 -> Bitmap Index Scan on test_events_events_eventtype_id_severity_ind
                    (cost=0.00..22.85 rows=408 width=0)
                      Index Cond: ((events_events.eventtype_id = ANY ('{71,999}'::bigint[]))
                        AND (events_events.eventtype_id = events_event_types.id))


Thanks - I'm sorry that I was not more specific earlier, but what would 
be *really* helpful is the output of explain analyze, since that also 
shows actual time, # rows and # loops of the inner nestloop. I'm 
wondering though why you do a left outer join. From the \d output in the 
previous mail, events_event.eventtype_id has a not null constraint and a 
fk to events_event_types.id, so an inner join would be appropriate. 
Outer joins limits the amount of join orders the planner considers, so a 
better plan might arise when the join is changed to inner.


regards
Yeb Havinga




Re: [PERFORM] Strange workaround for slow query

2010-03-10 Thread Yeb Havinga

Hello Sander,

Can you post the explain plan output of these queries?

SELECT * FROM events_events LEFT OUTER JOIN events_event_types ON 
eventType_id=events_event_types.id WHERE severity=20 AND (eventType_id 
IN (71)) ORDER BY datetime DESC limit 50;


SELECT * FROM events_events LEFT OUTER JOIN events_event_types ON 
eventType_id=events_event_types.id WHERE severity=70 AND (eventType_id 
IN (71, 999)) ORDER BY datetime DESC LIMIT 50;



regards
Yeb Havinga



Re: [PERFORM] Testing FusionIO

2010-03-08 Thread Yeb Havinga

Devrim GÜNDÜZ wrote:

Hi,

I have a FusionIO drive

Cool!!

 to test for a few days. I already ran iozone and
bonnie++ against it. Does anyone have more suggestions for it?
  
Oracle has a tool to test drives specifically for database kinds of loads, 
called Orion - it's free software and comes with a good manual. Download 
without registration etc. at 
http://www.oracle.com/technology/software/tech/orion/index.html


Quickstart

Create a file named 'fusion.lun' containing the device name, e.g.
/dev/sda1

Invoke the orion tool with something like
 -run advanced -testname fusion -num_disks 50 -size_small 
4 -size_large 1024 -type rand -simulate concat -verbose -write 25 
-duration 15 -matrix detailed -cache_size 256


cache_size is in MBs, but not so important for random io.
num_disks doesn't have to match the number of physical disks; it's used by 
the tool to determine how large the test matrix should be. E.g. 1 disk 
gives a small matrix with a small number of concurrent io requests. So I 
set it to 50.


Another idea: pgbench?

regards,
Yeb Havinga




Re: [PERFORM] 10K vs 15k rpm for analytics

2010-03-03 Thread Yeb Havinga

Francisco Reyes wrote:

Yeb Havinga writes:

controllers. Also, I am not sure if it is wise to put the WAL on the 
same logical disk as the indexes,


If I only have two controllers would it then be better to put WAL on 
the first along with all the data and the indexes on the external? 
Specially since the external enclosure will have 15K rpm vs 10K rpm in 
the internal. 
It sounds like you're going to create a single logical unit / raid array 
on each of the controllers. Depending on the number of disks, this is a 
bad idea, because if you read/write data sequentially, all drive heads 
will be aligned to roughly the same place on the disks. If another 
process wants to read/write as well, this will interfere and bring down 
both iops and mbps. However, with three concurrent users.. hmm, but then 
again, queries will scan multiple tables/indexes, so there will be mixed 
io to several locations. What would be interesting is to see what the 
mbps maximum of a single controller is, then calculate how many disks 
are needed to feed that, which would give a figure for the number of 
disks per logical unit.


The challenge with having a few logical units / raid arrays available 
is how to divide the data over them (with tablespaces). What is good for 
your physical data depends on the schema and the queries that are most 
important. For 2 relations, 2 indexes and 4 arrays, it would be 
clear. There's not much general to say here, except: do not mix 
table or index data with the wal. In other words: if you could make a 
separate raid array for the wal (a 2-disk raid1 is probably good enough), 
that would be ok, and it doesn't matter on which controller or enclosure 
it sits, because the wal io is then not mixed with the data io.


Thanks for pointing that out.
With any luck I will actually be able to do some tests for the new 
hardware. The current one I literally did a few hours of stress testing 
on and had to put in production right away.
I've heard that before ;-) If you do get around to do some tests, I'm 
interested in the results / hard numbers.


regards,
Yeb Havinga




Re: [PERFORM] 10K vs 15k rpm for analytics

2010-03-03 Thread Yeb Havinga

Francisco Reyes wrote:


Going with a 3Ware SAS controller.


Have some external enclosures with 16 15Krpm drives. They are older  
15K rpms, but they should be good enough.
Since the 15K rpms usually have better Transanctions per second I will 
put WAL and indexes in the external enclosure.


It sounds like you have a lot of hardware around - my guess is it would be 
worthwhile to do a test setup with one server hooked up to two 3ware 
controllers. Also, I am not sure if it is wise to put the WAL on the 
same logical disk as the indexes, but that is maybe for a different 
thread (unwise to mix random and sequential io and also the wal has 
demands when it comes to write cache).


regards,
Yeb Havinga




Re: [PERFORM] 10K vs 15k rpm for analytics

2010-03-03 Thread Yeb Havinga

Scott Marlowe wrote:

On Tue, Mar 2, 2010 at 1:51 PM, Yeb Havinga  wrote:
  

With 24 drives it'll probably be the controller that is the limiting factor
of bandwidth. Our HP SAN controller with 28 15K drives delivers 170MB/s at
maximum with raid 0 and about 155MB/s with raid 1+0. So I'd go for the 10K
drives and put the saved money towards the controller (or maybe more than
one controller).



That's horrifically bad numbers for that many drives.  I can get those
numbers for write performance on a RAID-6 on our office server.  I
wonder what's making your SAN setup so slow?
  

Pre scriptum:
A few minutes ago I mailed detailed information in the same thread but 
as reply to an earlier response - it tells more about setup and gives 
results of a raid1+0 test.


I just have to react to "horrifically bad" and "slow" :-) : the HP SAN 
can also do raid5 on 28 disks at about 155MBps:


28 disks divided into 7 logical units with raid5; orion results are 
below. Please note that this time I did sequential large io. The mixed 
read/write MBps maximum here is comparable: around 155MBps.


regards,
Yeb Havinga


ORION VERSION 10.2.0.1.0

Commandline:
-run advanced -testname msa -num_disks 24 -size_small 4 -size_large 1024 
-type seq -simulate concat -verbose -write 50 -duration 15 -matrix 
detailed -cache_size 256


This maps to this test:
Test: msa
Small IO size: 4 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Sequential Streams
Number of Concurrent IOs Per Stream: 4
Force streams to separate disks: No
Simulated Array Type: CONCAT
Write: 50%
Cache Size: 256 MB
Duration for each Data Point: 15 seconds
Small Columns:,  0,  1,  2,  3,  4,  5,  
6, 12, 18, 24, 30, 36, 42, 48, 54, 
60, 66, 72, 78, 84, 90, 96,102,108,
114,120
Large Columns:,  0,  1,  2,  3,  4,  8, 
12, 16, 20, 24, 28, 32, 36, 40, 44, 48

Total Data Points: 416

Name: /dev/sda1 Size: 109256361984
Name: /dev/sdb1 Size: 109256361984
Name: /dev/sdc1 Size: 109256361984
Name: /dev/sdd1 Size: 109256361984
Name: /dev/sde1 Size: 109256361984
Name: /dev/sdf1 Size: 109256361984
Name: /dev/sdg1 Size: 109256361984
7 FILEs found.

Maximum Large MBPS=157.75 @ Small=0 and Large=1
Maximum Small IOPS=3595 @ Small=66 and Large=1
Minimum Small Latency=2.81 @ Small=1 and Large=0

MBPS matrix

Large/Small,  0,  1,  2,  3,  4,  5,  6, 
12, 18, 24, 30, 36, 42, 48, 54, 60, 
66, 72, 78, 84, 90, 96,102,108,114,120
 1, 157.75, 156.47, 153.56, 153.45, 144.87, 141.78, 140.60, 
112.45,  95.23,  72.80,  80.59,  36.91,  29.76,  42.86,  41.82,  33.87,  
34.07,  45.62,  42.97,  26.37,  42.85,  45.49,  44.47,  37.26,  45.67,  
36.18
 2, 137.58, 128.48, 125.78, 133.85, 120.12, 127.86, 127.05, 
119.26, 121.23, 115.00, 117.88, 114.35, 108.61, 106.55,  83.50,  78.61,  
92.67,  96.01,  44.02,  70.60,  62.84,  46.52,  69.18,  51.84,  57.19,  
59.62
 3, 143.10, 134.92, 139.30, 138.62, 137.85, 146.17, 140.41, 
138.48,  76.00, 138.17, 123.48, 137.45, 126.51, 137.11,  91.94,  90.33, 
129.97,  45.35, 115.92,  89.60, 137.22,  72.46,  89.95,  77.40, 119.17,  
82.09
 4, 138.47, 133.74, 129.99, 122.33, 126.75, 125.22, 132.30, 
120.41, 125.88, 132.21,  96.92, 115.70, 131.65,  66.34, 114.06, 113.62, 
116.91,  96.97,  98.69, 127.16, 116.67, 111.53, 128.97,  92.38, 118.14,  
78.31
 8, 126.59, 127.92, 115.51, 125.02, 123.29, 111.94, 124.31, 
125.71, 134.48, 126.40, 127.93, 125.36, 121.75, 121.75, 127.17, 116.51, 
121.44, 121.12, 112.32, 121.55, 127.93, 124.86, 118.04, 114.59, 121.72, 
114.79
12, 112.40, 122.58, 107.61, 125.42, 128.04, 123.80, 127.17, 
127.70, 122.37,  96.52, 115.36, 124.49, 124.07, 129.31, 124.62, 124.23, 
105.58, 123.55, 115.67, 120.59, 125.61, 123.57, 121.43, 121.45, 121.44, 
113.64
16, 108.88, 119.79, 123.80, 120.55, 120.02, 121.66, 125.71, 
122.19, 125.77, 122.27, 119.55, 118.44, 120.51, 104.66,  97.55, 115.43, 
101.45, 108.99, 122.30, 100.45, 105.82, 119.56, 121.26, 126.59, 119.54, 
115.09
20, 103.88, 122.95, 115.86, 114.59, 121.13, 108.52, 116.90, 
121.10, 113.91, 108.20, 111.51, 125.64, 117.57, 120.86, 117.66, 100.40, 
104.88, 103.15,  98.10, 104.86, 104.69, 102.99, 121.81, 107.22, 122.68, 
106.43
24, 102.64, 102.33, 112.95, 110.63, 108.00, 111.53, 124.33, 
103.17, 108.16, 112.63,  97.42, 106.22, 102.54, 117.46, 100.66,  99.01, 
104.46,  99.02, 116.02, 112.49, 119.05, 104.03, 102.40, 102.44, 111.15,  
99.51
28, 101.12, 102.76, 114.14, 109.72, 120.63, 118.09, 119.85, 
113.80, 116.58, 110.24, 101.45, 110.31, 116.06, 112.04, 121.63,  91.26,  
98.88, 101.55, 104.51, 116.43, 112.98, 119.46, 120.08, 109.46, 106.29,  
96.69
32, 103.41, 117.33, 101.33, 102.29, 102.58, 116.18,

Re: [PERFORM] 10K vs 15k rpm for analytics

2010-03-03 Thread Yeb Havinga

Greg Smith wrote:

Yeb Havinga wrote:
With 24 drives it'll probably be the controller that is the limiting 
factor of bandwidth. Our HP SAN controller with 28 15K drives 
delivers 170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0. 


You should be able to clear 1GB/s on sequential reads with 28 15K 
drives in a RAID10, given proper read-ahead adjustment.  I get over 
200MB/s out of the 3-disk RAID0 on my home server without even trying 
hard.  Can you share what HP SAN controller you're using?
Yeah I should have mentioned a bit more, to allow for a better picture 
of the apples and pears.


The controller is the built-in controller of the HP MSA1000 SAN - with 14 
disks, plus an extra 14 disks from an MSA30. It is connected through a 
2Gbit/s fibre channel adapter, which should give up to roughly 250MB/s 
bandwidth, maybe a bit less due to frame overhead and the Gbit/GB difference.

Controller has 256MB cache

It is three years old, however HP still sells it.

I performed a few dozen tests with oracle's free and standalone orion 
tool (http://www.oracle.com/technology/software/tech/orion/index.html) 
with different raid and controller settings, where I varied

- controller read/write cache ratio
- logical unit layout (like one big raidx, 3 luns with raid10 (giving 
stripe width of 4 disks and 4 hot spares), 7 luns with raid10

- stripe size set to maximum
- load type (random or sequential large io)
- linux io scheduler (deadline / cfq etc)
- fibre channel adapter queue depth
- ratio between reads and writes issued by orion - our production 
application has about 25% writes.
- I also did the short stroking that is talked about further in this 
thread by only using one partition of about 30% size on each disk.

- etc

My primary goal was large IOPS for our typical load: mostly OLTP.

The orion tool tests in a matrix, with the # of concurrent small io's on 
one axis and the # of concurrent large io's on the other. Its output 
numbers are also in a matrix, with MBps, iops and latency.


I put several of these numbers into matlab to produce 3d pictures, and 
that showed some interesting stuff - it's probably bad netiquette here to 
post one of those pictures. One of the striking things was that I could 
see something that looked like a mountain whose top was neatly cut off - 
my guess: controller maximum reached.


Below is the output data of a recent test, where a 4Gbit/s fc adapter 
was connected. From these numbers I conclude that the controller is 
maxed out at 155MB/s for raid 1+0 *with this setup*. In a test run 
constructed to see what the maximum mbps of the controller would be - 
100% reads, sequential large io - it went to 170MBps.


I'm particularly proud of the iops of this test. Please note: large load 
is random, not sequential!


So to come back to my original claim: the controller is important when you 
have 24 disks. I believe I have backed up this claim with this mail. Also 
please note that for our setup - a medium-size database (~160GB) with a 
lot of concurrent users - random IO is what we needed, and for this 
purpose the HP MSA has proved rock solid. But the setup that Francisco 
mentioned is different: a few users doing mostly sequential IO. For that 
load, our setup is far from optimal, mainly because of the (single) 
controller.


regards,
Yeb Havinga


ORION VERSION 10.2.0.1.0

Commandline:
-run advanced -testname r10-7 -num_disks 24 -size_small 4 -size_large 
1024 -type rand -simulate concat -verbose -write 25 -duration 15 -matrix 
detailed -cache_size 256


This maps to this test:
Test: r10-7
Small IO size: 4 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Random IOs
Simulated Array Type: CONCAT
Write: 25%
Cache Size: 256 MB
Duration for each Data Point: 15 seconds
Small Columns:,  0,  1,  2,  3,  4,  5,  
6, 12, 18, 24, 30, 36, 42, 48, 54, 
60, 66, 72, 78, 84, 90, 96,102,108,
114,120
Large Columns:,  0,  1,  2,  3,  4,  8, 
12, 16, 20, 24, 28, 32, 36, 40, 44, 48

Total Data Points: 416

Name: /dev/sda1 Size: 72834822144
Name: /dev/sdb1 Size: 72834822144
Name: /dev/sdc1 Size: 72834822144
Name: /dev/sdd1 Size: 72834822144
Name: /dev/sde1 Size: 72834822144
Name: /dev/sdf1 Size: 72834822144
Name: /dev/sdg1 Size: 72834822144
7 FILEs found.

Maximum Large MBPS=155.05 @ Small=2 and Large=48
Maximum Small IOPS=6261 @ Small=120 and Large=0
Minimum Small Latency=3.93 @ Small=1 and Large=0

Below the MBps matrix - hope this reads well in email clients??

Large/Small,  0,  1,  2,  3,  4,  5,  6, 
12, 18, 24, 30, 36, 42, 48, 54, 60, 
66, 72, 78, 84, 90, 96,102,108,114,120
 1,  76.60,  74.87,  73.24,  70.66,  70.45,  68.36,  67.58,  
59.63, 

Re: [PERFORM] 10K vs 15k rpm for analytics

2010-03-02 Thread Yeb Havinga

Francisco Reyes wrote:
Does anyone have any experience doing analytics with postgres? In particular 
whether 10K rpm drives are good enough vs using 15K rpm, over 24 drives. 
Price difference is $3,000.


Rarely ever have more than 2 or 3 connections to the machine.

So far from what I have seen throughput is more important than TPS for 
the queries we do. Usually we end up doing sequential scans to do 
summaries/aggregates.


With 24 drives it'll probably be the controller that is the limiting 
factor of bandwidth. Our HP SAN controller with 28 15K drives delivers 
170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0. So I'd 
go for the 10K drives and put the saved money towards the controller (or 
maybe more than one controller).
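
As rough arithmetic (the per-drive sequential throughput figures below are 
assumptions for illustration, not measurements): a handful of drives already 
saturates a ~170MB/s controller, which is why the saved money is better 
spent there.

```python
# Rough saturation estimate; per-drive MBps figures are assumptions.
def drives_to_saturate(controller_mbps, per_drive_mbps):
    """Smallest number of drives whose combined sequential
    throughput meets or exceeds the controller limit."""
    n = 1
    while n * per_drive_mbps < controller_mbps:
        n += 1
    return n

print(drives_to_saturate(170, 70))  # assumed 10K rpm class drive
print(drives_to_saturate(170, 90))  # assumed 15K rpm class drive
```

Either way the other 20+ drives sit behind a saturated controller for 
sequential scans, so the rpm difference barely matters for throughput.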


regards,
Yeb Havinga



Re: [PERFORM] Query slowing down significantly??

2010-03-01 Thread Yeb Havinga

Rainer Pruy wrote:

Thanks for the hint.
I should have been considering that in the first place.
(But the obvious is easily left unrecognised..)

The prepared statement gives:

QUERY 
PLAN
---
 Nested Loop  (cost=0.00..25.18 rows=2 width=175) (actual time=36.116..49.998 
rows=1 loops=1)
   ->  Index Scan using x_context_01 on context c  (cost=0.00..10.76 rows=2 
width=67) (actual time=0.029..6.947 rows=12706 loops=1)
 Index Cond: ((contextid)::text = $1)
   ->  Index Scan using x_fk_context_hierarchy_02 on context_hierarchy h  
(cost=0.00..7.20 rows=1 width=108) (actual time=0.003..0.003
rows=0 loops=12706)
 Index Cond: (h.contextidx = c.idx)
 Filter: (((h.hierarchyname)::text = $2) AND (h.parentidx = $3))
 Total runtime: 50.064 ms
(7 rows)


And that is quite a bad plan given the current distribution of values.
  
Another approach might be to rewrite the recursion over your hierarchy 
using the WITH RECURSIVE option for SQL queries, new in 8.4. The possible 
gains there are way beyond anything you can accomplish by optimizing 
recursive functions.
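
A minimal sketch of the WITH RECURSIVE shape, here walking up from a node 
to the root. It runs against SQLite from Python for convenience, but the 
syntax is the same in PostgreSQL 8.4; the table layout is an invented 
stand-in for context_hierarchy.

```python
# Invented stand-in for context_hierarchy: idx with a parentidx link.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE context_hierarchy (idx INTEGER PRIMARY KEY, parentidx INTEGER);
INSERT INTO context_hierarchy VALUES (1, NULL), (2, 1), (3, 2), (4, 2);
""")

# Walk from a starting node (parameter) up to the root in one query.
rows = conn.execute("""
WITH RECURSIVE ancestors(idx, parentidx) AS (
    SELECT idx, parentidx FROM context_hierarchy WHERE idx = ?
  UNION ALL
    SELECT h.idx, h.parentidx
    FROM context_hierarchy h
    JOIN ancestors a ON h.idx = a.parentidx
)
SELECT idx FROM ancestors;
""", (4,)).fetchall()

print([r[0] for r in rows])  # path from node 4 up to the root
```

The whole chain is resolved in one plan instead of one function call (and 
one plan execution) per level.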


Regards,
Yeb Havinga




Re: [PERFORM] Slow query: table iteration (8.3)

2010-02-23 Thread Yeb Havinga

Glenn Maynard wrote:

CREATE FUNCTION high_score_for_user(user_id int) RETURNS SETOF INTEGER
LANGUAGE SQL AS $$
   SELECT t.id FROM test t
   WHERE t.user_id = $1
   ORDER BY t.score DESC LIMIT 1
$$;
SELECT high_score_for_user(u.id) FROM test_users u;

runs in 100ms.
  
Though it doesn't solve your problem without changing the result format, 
what about


aap=# explain select u.id, ARRAY(select t.id from test t where 
t.user_id=u.id order by t.score desc limit 2) as high from test_users u;
  QUERY 
PLAN
-

Seq Scan on test_users u  (cost=0.00..3290.84 rows=1000 width=4)
  SubPlan 1
->  Limit  (cost=0.00..3.28 rows=2 width=8)
  ->  Index Scan using test_2 on test t  (cost=0.00..1637.92 
rows=1000 width=8)

Index Cond: (user_id = $0)
(5 rows)

  id |    high
-----+-------------
   1 | {641,896}
   2 | {1757,1167}
   3 | {2765,2168}
   4 | {3209,3674}
   5 | {4479,4993}

regards,
Yeb Havinga




Re: [PERFORM] Why primary key index are not using in joining?

2010-02-15 Thread Yeb Havinga

AI Rumman wrote:


explain analyze
select *
from vtiger_crmentity
inner JOIN vtiger_users
 ON vtiger_users.id <http://vtiger_users.id> = 
vtiger_crmentity.smownerid

where vtiger_crmentity.deleted = 0  ;
 
QUERY 
PLAN 



 Hash Join  (cost=3665.17..40019.25 rows=640439 width=1603) (actual 
time=115.613..3288.436 rows=638081 loops=1)

   Hash Cond: ("outer".smownerid = "inner".id)
   ->  Bitmap Heap Scan on vtiger_crmentity  (cost=3646.54..30394.02 
rows=640439 width=258) (actual time=114.763..986.504 rows=638318 loops=1)

 Recheck Cond: (deleted = 0)
 ->  Bitmap Index Scan on vtiger_crmentity_deleted_idx  
(cost=0.00..3646.54 rows=640439 width=0) (actual time=107.851..107.851 
rows=638318 loops=1)

   Index Cond: (deleted = 0)
   ->  Hash  (cost=18.11..18.11 rows=211 width=1345) (actual 
time=0.823..0.823 rows=211 loops=1)
 ->  Seq Scan on vtiger_users  (cost=0.00..18.11 rows=211 
width=1345) (actual time=0.005..0.496 rows=211 loops=1)

 Total runtime: 3869.022 ms

Sequential index is occuring on vtiger_users table while it has 
primary key index on id.

Could anyone please tell me why?

From the list of indexes you also supplied it seems to me you very much 
want index scanning, the reason being that 4 secs is too slow? The 
seqscan is not the reason for that - the main reason is that you process 
almost all rows of the crmentity table. I bet that if you add a LIMIT, 
or add a clause that selects only a specific vtiger_user, the 
plan looks different on the access to the crmentity table as well as the 
kind of join; however, if your application really needs to process the 
600k rows, I'm not sure it can get any faster than that. Perhaps it 
would help a bit to shorten the SELECT * to only the attributes you 
really need.


Regards,
Yeb Havinga




Re: [PERFORM] Immutable table functions

2010-02-12 Thread Yeb Havinga

Luiz Angelo Daros de Luca wrote:


I have a directed graph, or better, a tree in postgresql 8.3. One 
table holds the nodes and another the connections. Given any 
node, I need to get all nodes down from it (to the leaves) that have 
a relation with anotherTable. Also, these connections change over time, so I 
have a connectionDate and a disconnectionDate for each connection 
(which might be null to represent an open interval). This way, I wrote a 
pgsql function (I renamed the tables and columns to generic names). 
These are the tables and the function:



Hello Luiz,

If you could upgrade to 8.4, you could use WITH RECURSIVE - my 
experience is that it is several orders of magnitude faster than 
recursive functions.


http://developer.postgresql.org/pgdocs/postgres/queries-with.html
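
A hedged sketch of what such a query could look like for the descendants 
case, run against SQLite from Python (same WITH RECURSIVE syntax as 
PostgreSQL 8.4; table names and data are invented, and the date-interval 
filtering of the original problem is omitted for brevity).

```python
# Invented edge table: each row is a parent -> child connection.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE connections (parent INTEGER, child INTEGER);
INSERT INTO connections VALUES (1,2), (1,3), (2,4), (3,5), (5,6);
""")

# Collect the starting node plus all nodes below it, down to the leaves.
rows = conn.execute("""
WITH RECURSIVE descendants(node) AS (
    SELECT ?                    -- the starting node
  UNION ALL
    SELECT c.child
    FROM connections c
    JOIN descendants d ON c.parent = d.node
)
SELECT node FROM descendants;
""", (1,)).fetchall()

print(sorted(r[0] for r in rows))
```

The date interval check would become an extra predicate on the join 
(connectionDate/disconnectionDate around the reference date), still inside 
the single recursive query.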

regards,
Yeb Havinga





Re: [PERFORM] Slow query: table iteration (8.3)

2010-02-05 Thread Yeb Havinga

Glenn Maynard wrote:

The function version:

CREATE FUNCTION high_score_for_user(user_id int) RETURNS SETOF INTEGER
LANGUAGE SQL AS $$
   SELECT t.id FROM test t
   WHERE t.user_id = $1
   ORDER BY t.score DESC LIMIT 1
$$;
SELECT high_score_for_user(u.id) FROM test_users u;

runs in 100ms.
  

Hi Glenn,

About cached plans of SQL functions: from the source of function.c

00067 /*
00068  * An SQLFunctionCache record is built during the first call,
00069  * and linked to from the fn_extra field of the FmgrInfo struct.
00070  *
00071  * Note that currently this has only the lifespan of the calling 
query.
00072  * Someday we might want to consider caching the parse/plan 
results longer

00073  * than that.
00074  */

So it is planned at every call of

SELECT high_score_for_user(u.id) FROM test_users u;

and the cache is used between each row of test_users. The plan is made with 
a parameter, which means the optimizer could not make use of an actual 
value during planning. However, your test case is clever in the sense 
that there is an index on users and score and the sql function has an 
order by that matches the index, so the planner can avoid a sort by 
accessing the test table using the index. In this particular case, that 
means the plan is optimal: no unneeded tuples are processed and the 
(function) plan complexity is logarithmic in the size of the test 
relation - you can't get it any better than that. In short: the lack of 
an actual parameter in the test case did not result in an inferior plan. 
So using a dynamically constructed query string in pl/pgsql to 'force' 
replanning during iteration cannot be faster than this sql function.


It is possible to make the performance of this function worse by 
disabling indexscans:


CREATE FUNCTION high_score_for_user(user_id int) RETURNS SETOF INTEGER
LANGUAGE SQL AS $$
  SELECT t.id FROM test t
  WHERE t.user_id = $1
  ORDER BY t.score DESC LIMIT 1
$$
SET enable_indexscan = off;

Now the query time with test_users is over a second. So maybe the 
converse could also be true in your production setup using the same 
technique.


regards,
Yeb Havinga


Re: [PERFORM] Slow query: table iteration (8.3)

2010-02-02 Thread Yeb Havinga

Glenn Maynard wrote:

On Mon, Feb 1, 2010 at 6:15 AM, Yeb Havinga  wrote:
  

Stomp_steps is analyzed to 2902 rows but when you run the query the actual
rows are 0. This means that the highscore function is not called or the
number 0 is incorrect.


This SELECT returns 0 rows: it calls the function 1500 times, and each
time it returns no data, because there simply aren't any results for
these parameters.
  
Hmm.. first posting on a pg mailing list and I make this mistake.. What 
an introduction :-[
Checked the source and indeed for every plan node the number of tuples 
that result from it is counted. In most cases this is the number of 
records that match the qualifiers (where clause/join conditions), so that 
was in my head: actual rows = rows that match the where clause, and without 
a where clause I'd expected the actual rows to reflect the total number of 
rows in the table. But with set-returning functions this number is 
something completely different.

below. The truth might be that you probably got that result by explaining
the query in the function with actual parameter values. This plan differs
from the one that is made when the function is called from sql and is
planned (once) without parameters, and in that case the plan is probably
different.



Yeah.  It would help a lot if EXPLAIN could show query plans of
functions used by the statement and not just the top-level query.
  

Like subplans are, yes. Sounds like a great feature.

Squinting at the output, it definitely looks like a less optimized
plan; it's using a SEQSCAN instead of BITMAPHEAPSCAN.  (I've attached
the output.)

Does the planner not optimize functions based on context?
I believe it does for (re)binding of parameter values to prepared 
statements, but not in the case of an sql function. To test an idea, 
there might be a workaround where you write a pl/pgsql function 
that builds a string with the query and actual parameter values and 
executes that new query every time. It's not as pretty as a sql function, 
but it would give an idea of how fast things would run with each loop 
replanned. Another idea is that maybe you could 'hint' the planner at 
planning time of the sql function by giving it some extra set commands 
(like set search_path, but then set enable_seqscan = off) - I don't know 
if planning of the sql function occurs in the environment given by its 
set commands, but it's worth a try. Again, certainly not pretty.

I can't see how to apply WITH to this.  Non-recursive WITH seems like
syntax sugar that doesn't do anything a plain SELECT can't do, and I
don't think what I'm doing here can be done with a regular SELECT.
  
WITH indeed is not a solution, because the with query is executed once, 
so it cannot take a parameter. What about a window function on a join of 
stomp_steps and stomp_round, with partition by on steps_id and user_card_id, 
order by score, and with row_number() < your third parameter? From 
the docs I read that window functions cannot be part of the where 
clause: an extra subselect level is then needed to filter the correct 
row numbers.
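
As a sketch of that window-function shape (data invented for illustration; 
run here against SQLite 3.25+ from Python, though the SQL is the same in 
PostgreSQL 8.4): row_number() per (steps_id, user_card_id) partition, 
filtered in an outer subselect.

```python
# Invented stand-in for stomp_round.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stomp_round (id INTEGER PRIMARY KEY, steps_id INTEGER,
                          user_card_id INTEGER, score INTEGER);
INSERT INTO stomp_round VALUES
  (1, 280, 591, 900), (2, 280, 591, 700), (3, 280, 591, 800),
  (4, 281, 591, 650), (5, 281, 591, 950);
""")

# Top-N scores per (steps_id, user_card_id); the window function cannot
# appear in WHERE, hence the extra subselect level to filter on rn.
rows = conn.execute("""
SELECT id, steps_id, score FROM (
    SELECT id, steps_id, score,
           row_number() OVER (PARTITION BY steps_id, user_card_id
                              ORDER BY score DESC) AS rn
    FROM stomp_round
) AS ranked
WHERE rn <= ?
ORDER BY steps_id, score DESC;
""", (2,)).fetchall()
print(rows)
```

This produces all high-score lists in a single scan of stomp_round, instead 
of one function invocation per stage.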


Regards,
Yeb Havinga




Re: [PERFORM] Slow query: table iteration (8.3)

2010-02-01 Thread Yeb Havinga

Glenn Maynard wrote:

Hitting a performance issues that I'm not sure how to diagnose.

SELECT highscores_for_steps_and_card(s.id, 591, 1) FROM stomp_steps s;
Seq Scan on stomp_steps s  (cost=0.00..793.52 rows=2902 width=4)
(actual time=26509.919..26509.919 rows=0 loops=1)
Total runtime: 26509.972 ms
  

Hello Glenn,

Stomp_steps is analyzed to 2902 rows but when you run the query the
actual rows are 0. This means that the highscore function is not called
or the number 0 is incorrect.
Suppose that the number of rows is 2900; then 26 seconds means about 9 ms
per function call. This still is a lot, compared to the 0.054 ms analyze
result below. The truth might be that you got that result by
explaining the query in the function with actual parameter values. This
plan differs from the one that is made when the function is called from
sql and planned (once) without parameters, and in that case the plan
is probably different. A way to check the plan of that query is to turn
on debug_print_plan and watch the server log. It takes a bit of getting
used to. The plan starts with CONTEXT: SQL function "functionname" during
startup, and is also recognized because in the opexpr (operator
expression) one of the operands is a parameter. Important is the total
cost of the top plan node (the limit).

I know 8.3 is mentioned in the subject, but I think that a WITH query
(http://www.postgresql.org/docs/8.4/interactive/queries-with.html) could
be a good solution to your problem and may be worth trying out, if you
have the possibility to try out 8.4.

Regards,
Yeb Havinga




The inner function looks like this:

CREATE FUNCTION highscores_for_steps_and_card(steps_id int, card_id
int, count int) RETURNS SETOF INTEGER LANGUAGE SQL AS $$
SELECT r.id FROM stomp_round r
WHERE ($1 IS NULL OR r.steps_id = $1) AND ($2 IS NULL OR
r.user_card_id = $2)
ORDER BY r.score DESC LIMIT $3
$$

 Limit  (cost=13.12..13.12 rows=1 width=8) (actual time=0.054..0.054
rows=0 loops=1)
   ->  Sort  (cost=13.12..13.12 rows=1 width=8) (actual
time=0.051..0.051 rows=0 loops=1)
 Sort Key: score
 Sort Method:  quicksort  Memory: 17kB
 ->  Bitmap Heap Scan on stomp_round r  (cost=9.09..13.11
rows=1 width=8) (actual time=0.036..0.036 rows=0 loops=1)
   Recheck Cond: ((280 = steps_id) AND (user_card_id = 591))
   ->  BitmapAnd  (cost=9.09..9.09 rows=1 width=0) (actual
time=0.032..0.032 rows=0 loops=1)
 ->  Bitmap Index Scan on stomp_round_steps_id
(cost=0.00..4.40 rows=20 width=0) (actual time=0.030..0.030 rows=0
loops=1)
   Index Cond: (280 = steps_id)
 ->  Bitmap Index Scan on stomp_round_user_card_id
 (cost=0.00..4.44 rows=25 width=0) (never executed)
   Index Cond: (user_card_id = 591)
 Total runtime: 0.153 ms
(12 rows)

stomp_steps has about 1500 rows, so it finds 1500 high scores, one for
each stage.

I expected scalability issues from this on a regular drive, since
it'll be doing a ton of index seeking when not working out of cache,
so I expected to need to change to an SSD at some point (when it no
longer easily fits in cache).  However, I/O doesn't seem to be the
bottleneck yet.  If I run it several times, it consistently takes 26
seconds.  The entire database is in OS cache (find | xargs cat:
250ms).

I'm not sure why the full query (26s) is orders of magnitude slower
than 1500*0.150ms (225ms).  It's not a very complex query, and I'd
hope it's not being re-planned every iteration through the loop.  Any
thoughts?  Using SELECT to iterate over a table like this is very
useful (and I don't know any practical alternative), but it's
difficult to profile since it doesn't play nice with EXPLAIN ANALYZE.

  





