Re: [PERFORM] sniff test on some PG 8.4 numbers
On Sun, Mar 10, 2013 at 11:28 PM, Greg Smith g...@2ndquadrant.com wrote:
> On 3/10/13 9:18 PM, Jon Nelson wrote:
>> The following is with ext4, nobarrier, and noatime. As noted in the
>> original post, I have done a fair bit of system tuning. I have the
>> dirty_bytes and dirty_background_bytes set to 3GB and 2GB, respectively.
>
> That's good, but be aware those values are still essentially unlimited
> write caches. A server with 4 good but regular hard drives might do as
> little as 10MB/s of random writes on a real workload. If 2GB of data
> ends up dirty, the flushing that happens at the end of a database
> checkpoint will need to clear all of that out of RAM. When that happens,
> you're looking at a 3 minute long cache flush to push out 2GB. It's not
> unusual for pgbench tests to pause for over a minute straight when that
> happens. With your setup, where checkpoints happen every 5 minutes, this
> is only happening once per test run. The disruption isn't easily visible
> if you look at the average rate; it's outweighed by the periods where
> writes happen very fast because the cache isn't full yet. You have to
> get pgbench to plot latency over time to see them and then analyze that
> data. This problem is the main reason I put together the pgbench-tools
> set for running things, because once you get to processing the latency
> files and making graphs from them it starts to be a pain to look at the
> results.

I'll try to find time for this, but it may need to wait until the
weekend again.

>> I built 9.2, and using 9.2 and the following pgbench invocation:
>>
>> pgbench -j 8 -c 32 -M prepared -T 600
>>
>> transaction type: TPC-B (sort of)
>> scaling factor: 400
>
> I misread this completely in your message before; I thought you wrote
> 4000. A scaling factor of 400 is making a database that's 6GB in size.
> Your test is basically seeing how fast the system memory and the RAID
> cache can move things around. In that situation, your read and write
> numbers are reasonable. They aren't actually telling you anything useful
> about the disks though, because they're barely involved here. You've
> sniffed the CPU, memory, and RAID controller and they smell fine. You'll
> need at least an order of magnitude increase in scale to get a whiff of
> the disks.

LOL! Your phrasing is humorous and the information useful. I ran for 8.0
hours and got this:

transaction type: TPC-B (sort of)
scaling factor: 400
query mode: prepared
number of clients: 32
number of threads: 8
duration: 28800 s
number of transactions actually processed: 609250619
tps = 21154.058025 (including connections establishing)
tps = 21154.075922 (excluding connections establishing)

> pgbench scale numbers give approximately 16MB per scale factor. You
> don't actually stress the drives until that total number is at least 2X
> as big as RAM. We had to raise the limit on the pgbench scales recently
> because it only goes up to ~20,000 on earlier versions, and that's not a
> big enough scale to test many servers now.
>
> On the select-only tests, much of the increase from ~100K to ~200K is
> probably going from 8.4 to 9.2. There's two major and several minor
> tuning changes that make it much more efficient at that specific task.
>
>> These are the *only* changes I've made to the config file:
>>
>> shared_buffers = 32GB
>> wal_buffers = 16MB
>> checkpoint_segments = 1024
>
> Note that these are the only changes that actually impact pgbench
> results. The test doesn't stress very many parts of the system, such as
> the query optimizer. Also be aware these values may not be practical to
> use in production. You can expect bad latency issues due to having
> shared_buffers so large. All that memory has to be reconciled and
> written to disk if it's been modified at each checkpoint, and 32GB of
> such work is a lot. I have systems where we can't make shared_buffers
> any bigger than 4GB before checkpoint pauses get too bad.
>
> Similarly, setting checkpoint_segments to 1024 means that you might go
> through 16GB of writes before a checkpoint happens. That's great for
> average performance...but when that checkpoint does hit, you're facing a
> large random I/O backlog.

I thought the bgwriter mitigated most of the problems here? Often I'll
see the actual checkpoints with 'sync' times typically below a few
seconds (when there is anything to do at all). I can't say I've seen
checkpoint pauses in my workloads.

> There's not much you can do about all this on the Linux side. If you
> drop the dirty_* parameters too much, maintenance operations like
> VACUUM start to get slow. Really all you can do is avoid setting
> shared_buffers and checkpoint_segments too high, so the checkpoint
> backlog never gets gigantic. The tuning you've done is using higher
> values than we normally recommend because it's not quite practical to
> deploy like that. That and the very small database are probably why
> your numbers are so high.

Mostly I do data warehouse type of workloads with very little (if any)
data modification after
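The arithmetic behind the warnings above is easy to check. A quick sketch; the 10MB/s random-write rate is Greg's estimate for 4 ordinary drives, not a measurement, and WAL segments are 16MB each:

```python
# Back-of-the-envelope numbers from the checkpoint discussion above.

MB = 1024 * 1024
GB = 1024 * MB

# Flushing 2GB of dirty pages at ~10 MB/s of random writes:
flush_seconds = (2 * GB) / (10 * MB)
print(round(flush_seconds))        # -> 205, i.e. the "3 minute long cache flush"

# checkpoint_segments = 1024, with 16MB per WAL segment:
wal_before_checkpoint = 1024 * 16 * MB
print(wal_before_checkpoint // GB)  # -> 16, the "16GB of writes before a checkpoint"
```

Both numbers line up with the figures quoted in the message.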
Re: [PERFORM] sniff test on some PG 8.4 numbers
On Sun, Mar 10, 2013 at 10:46 AM, Greg Smith g...@2ndquadrant.com wrote:
> On 3/5/13 10:00 PM, Jon Nelson wrote:
>> On Tue, Mar 5, 2013 at 1:35 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
>>> pgbench -h BLAH -c 32 -M prepared -t 10 -S
>>> I get 95,000 to 100,000 tps.
>>> pgbench -h BLAH -c 32 -M prepared -t 10
>>> seems to hover around 6,200 tps (size 100) to 13,700 (size 400)
>>
>> Some followup: The read test goes (up to) 133K tps, and the read-write
>> test to 22k tps when performed over localhost.
>
> All your write numbers are inflated because the test is too short. This
> hardware will be lucky to sustain 7500 TPS on writes. But you're only
> writing 100,000 transactions, which means the entire test run isn't
> even hitting the database--only the WAL writes are. When your test run
> is finished, look at /proc/meminfo; I'd wager a large sum you'll find
> "Dirty:" has hundreds of megabytes, if not gigabytes, of unwritten
> information. Basically, 100,000 writes on this sort of server can all
> be cached in Linux's write cache, and pgbench won't force them out of
> there. So you're not simulating sustained database writes, only how
> fast of a burst the server can handle for a little bit.
>
> For a write test, you must run for long enough to start and complete a
> checkpoint before the numbers are of any use, and 2 checkpoints are
> even better. The minimum useful length is a 10 minute run, so -T 600
> instead of using -t. If you want something that does every trick
> possible to make it hard to cheat at this, as well as letting you graph
> size and client data, try my pgbench-tools:
> https://github.com/gregs1104/pgbench-tools
> (Note that there is a bug in that program right now: it spawns vmstat
> and iostat processes but they don't get killed at the end correctly.
> "killall vmstat iostat" after running is a good idea until I fix that.)

I (briefly!) acquired an identical machine as last time, but with an
Areca instead of an LSI (4 drives). The following is with ext4,
nobarrier, and noatime. As noted in the original post, I have done a
fair bit of system tuning. I have the dirty_bytes and
dirty_background_bytes set to 3GB and 2GB, respectively.

I built 9.2, and using 9.2 and the following pgbench invocation:

pgbench -j 8 -c 32 -M prepared -T 600

transaction type: TPC-B (sort of)
scaling factor: 400
query mode: prepared
number of clients: 32
number of threads: 8
duration: 600 s
number of transactions actually processed: 16306693
tps = 27176.566608 (including connections establishing)
tps = 27178.518841 (excluding connections establishing)

> Your read test numbers are similarly inflated, but read test errors
> aren't as large. Around 133K TPS on select-only is probably accurate.
> For a read test, use -T 30 to let it run for 30 seconds to get a more
> accurate number. The read bottleneck on your hardware is going to be
> the pgbench client itself, which on 8.4 is running as a single thread.
> On 9.0+ you can have multiple pgbench workers. It normally takes 4 to 8
> of them to saturate a larger server.

The 'select-only' test (same as above with '-S'):

starting vacuum...end.
transaction type: SELECT only
scaling factor: 400
query mode: prepared
number of clients: 32
number of threads: 8
duration: 600 s
number of transactions actually processed: 127513307
tps = 212514.337971 (including connections establishing)
tps = 212544.392278 (excluding connections establishing)

These are the *only* changes I've made to the config file:

shared_buffers = 32GB
wal_buffers = 16MB
checkpoint_segments = 1024

I can run either or both of these again with different options, but
mostly I'm looking for a sniff test. However, I'm a bit confused now. It
seems as though you say the write numbers are not believable, suggesting
a value of 7,500 (roughly 1/4 what I'm getting). If I run the read test
for 30 seconds I get - highly variable - between 300K and 400K tps. Why
are these tps so high compared to your expectations?

Note: I did get better results with HT on vs. with HT off, so I've left
HT on for now.

--
Jon

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] sniff test on some PG 8.4 numbers
I was hoping to just get a gut reaction on some pgbench numbers I have,
to see if I'm in the ballpark.

OS: ScientificLinux 6.3, x86_64
Hardware: 4x real disks (not SSD) behind an LSI 9260 in raid10, Xeon
E5-2680 with hyperthreading OFF, 128GB of RAM.
Setup: postgresql 8.4.13, ext4, barriers ON, disk write cache *off*,
write-back enabled on the LSI.

I initialized with sizes of 100, 200, and 400. I've done some tuning of
the postgresql config, but mostly I'm just trying to find out if I'm in
the right ballpark. I ran pgbench from another (similar) host:

pgbench -h BLAH -c 32 -M prepared -t 10 -S
I get 95,000 to 100,000 tps.

pgbench -h BLAH -c 32 -M prepared -t 10
seems to hover around 6,200 tps (size 100) to 13,700 (size 400)

Do these basically sniff right?

(NOTE: with barriers off, I get a slight increase - 10% - in the
read-write test, and a larger *decrease* - 15% - with the read-only test
@ 400. No change @ 100)

--
Jon
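For context on these sizes: at roughly 16MB per pgbench scale unit (the estimate Greg Smith gives elsewhere in this thread), even scale 400 is a tiny database next to 128GB of RAM. A sketch of the arithmetic:

```python
# Rough database size per pgbench scale factor, at ~16 MB per scale unit.

MB = 1024 * 1024
GB = 1024 * MB

for scale in (100, 200, 400):
    print(scale, "->", scale * 16 * MB / GB, "GB")
# 100 -> 1.5625 GB, 200 -> 3.125 GB, 400 -> 6.25 GB: all cached by 128 GB RAM

# Smallest scale whose database reaches 2x RAM, where the disks start to matter:
ram_gb = 128
min_scale = -(-2 * ram_gb * 1024 // 16)   # ceiling division
print(min_scale)                           # -> 16384
```

So all three tested sizes exercise RAM and the RAID cache far more than the drives themselves.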
Re: [PERFORM] sniff test on some PG 8.4 numbers
On Tue, Mar 5, 2013 at 7:02 PM, Josh Berkus j...@agliodbs.com wrote:
>> Do these basically sniff right?
>
> Well, the read test seems reasonable. I'm impressed by the speed of the
> write test ... how large is the raid card cache? And why 8.4? Can you
> try 9.2?

8.4 because it's what I've got, basically. I might be able to try 9.2
later, but I'm targeting 8.4 right now. 512MB of memory on the card.

>> (NOTE: with barriers off, I get a slight increase - 10% - in the
>> read-write test, and a larger *decrease* - 15% - with the read-only
>> test @ 400. No change @ 100)
>
> Oh, interesting. Can you reproduce that? I wonder what would cause
> read-only to drop without barriers ...

I'll try to test again soon. I know that if I use writethrough instead
of writeback mode the performance nosedives.

Does anybody have suggestions for stripe size? (remember: *4* disks)

--
Jon
Re: [PERFORM] sniff test on some PG 8.4 numbers
On Tue, Mar 5, 2013 at 1:35 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
> pgbench -h BLAH -c 32 -M prepared -t 10 -S
> I get 95,000 to 100,000 tps.
> pgbench -h BLAH -c 32 -M prepared -t 10
> seems to hover around 6,200 tps (size 100) to 13,700 (size 400)

Some followup: The read test goes (up to) 133K tps, and the read-write
test to 22k tps when performed over localhost.

--
Jon
Re: [PERFORM] High CPU usage / load average after upgrading to Ubuntu 12.04
On Mon, Feb 18, 2013 at 6:39 PM, Josh Berkus j...@agliodbs.com wrote:
> Scott,
>
>> So do you have generally slow IO, or is it fsync behavior etc?
>
> All tests except pgBench show this system as superfast. Bonnie++ and DD
> tests are good (200 to 300mb/s), and test_fsync shows 14K/second.
> Basically it has no issues until checkpoint kicks in, at which time the
> entire system basically halts for the duration of the checkpoint. For
> that matter, if I run a pgbench and halt it just before checkpoint
> kicks in, I get around 12000TPS, which is what I'd expect on this
> system. At this point, we've tried 3.2.0.26, 3.2.0.27, 3.4.0, and tried
> updating the RAID driver, and changing the IO scheduler. Nothing seems
> to affect the behavior. Testing using Ext4 (instead of XFS) next.

Did you try turning barriers on or off *manually* (explicitly)? With LSI
and barriers *on* and ext4 I had less-optimal performance. With Linux MD
or (some) 3Ware configurations I had no performance hit.

--
Jon
[PERFORM] temp tablespaces and SSDs, etc..
I was wondering if somebody could clear up how tablespaces are used.
Let's say I have three classes of storage:

- ramdisk (tmpfs)
- SSD
- spinning rust

Furthermore, let's say I'd like to be able to tell postgresql to prefer
them - in that order - until they each get full. IE, use tmpfs until it
reports ENOSPC and then fall back to SSD, finally falling back to
spinning rust. Is there a way to do this?

--
Jon
Re: [PERFORM] Poor performance using CTE
On Thu, Nov 22, 2012 at 7:42 AM, Jeremy Harris j...@wizmail.org wrote:
> On 22/11/2012 00:08, Craig Ringer wrote:
>> WITH FENCE foo AS (SELECT ...), bar AS (SELECT ...) SELECT * FROM bar;
>> Are we fencing just foo? Or all expressions?
>
> WITH foo AS (FENCED SELECT ...), bar AS (SELECT ...), SELECT ... ;

I would much rather see 'MATERIALIZE' instead of 'FENCED', unless by the
latter you mean to forbid *all* optimizations, whereas with the former
the meaning is pretty clear.

--
Jon
Re: SOLVED - RE: [PERFORM] Poor performance using CTE
My perspective on this is that CTEs *should* be just like creating a
temporary table and then joining to it, but without the materialization
costs. In that respect, they seem like they should be like nifty VIEWs.
If I wanted the behavior of materialization and then join, I'd do that
explicitly with temporary tables, but using CTEs as an explicit
optimization barrier feels like explaining away surprising behavior. As
can be seen by the current conversation, not everyone is convinced that
CTEs ought to be an explicit optimization barrier, and enshrining that
behavior as somehow desirable or explicit (rather than merely an
implementation detail) feels shortsighted to me. I would be delighted to
find that changed in some future version of PostgreSQL, but if that is
not to be, at the very least, the verbiage surrounding CTEs might want
to include (perhaps prominently) something along the lines of "CTEs are
currently an optimization barrier, but this is an implementation detail
and may change in future versions." Perhaps even include a small blurb
about what an optimization barrier even means (my understanding is that
it merely forces materialization of that part of the query).

That's just my perspective, coming at the use of CTEs not as a
PostgreSQL developer, but as somebody who learned about CTEs and started
using them - only to discover surprising behavior.

On Tue, Nov 20, 2012 at 1:22 PM, Merlin Moncure mmonc...@gmail.com wrote:
> On Wed, Nov 14, 2012 at 8:03 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
>> On 15 November 2012 01:46, Andrew Dunstan and...@dunslane.net wrote:
>>> It cuts both ways. I have used CTEs a LOT precisely because this
>>> behaviour lets me get better plans. Without that I'll be back to
>>> using the "offset 0" hack.
>>
>> Is the OFFSET 0 hack really so bad? We've been telling people to do
>> that for years, so it's already something that we've effectively
>> committed to.
>
> IMSNHO, 'OFFSET 0' is completely unreadable black magic. I agree with
> Andrew: CTEs allow for manual composition of queries and can be the
> best tool when the planner is outsmarting itself. In the old days, we'd
> extract data to a temp table and join against that: CTEs are
> essentially a formalization of that technique. I like things the way
> they are; if CTEs are hurting your plan, that's an indication you're
> using them inappropriately.
>
> merlin

--
Jon
Re: [PERFORM] postgres 8.4, COPY, and high concurrency
On Tue, Nov 13, 2012 at 7:10 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
> I had moved on to a different approach to importing the data which does
> not work concurrently. However, I went back and tried to re-create the
> situation and - at least a naive attempt failed. I'll give it a few
> more tries -- I was creating two tables using CREATE TABLE "unique
> name" (LIKE "some other table" INCLUDING everything). Then I would copy
> the data in, add some constraints (FK constraints but only within these
> two tables) and then finally (for each table) issue an ALTER TABLE
> "unique name" INHERIT "some other table". To be clear, however,
> everything bogged down in the COPY stage which was immediately
> following the table creation. I'll note that my naive test showed
> almost no unexpected overhead at all, so it's clearly not
> representative of the problem I encountered.

I'm still unable to replicate the problem, but I can show I wasn't
crazy, either. The average time to perform one of these COPY operations
when things are working is in the 15-45 second range. I had configured
PG to log any statement that took longer than 3 seconds, so I got a
bunch of those in the logs. I have since reconfigured to log
*everything*. Anyway, when things were going badly, COPY would take
anywhere from 814 seconds to just under 1400 seconds for the exact same
files.

UPDATE: I have been able to replicate the issue. The parent table (the
one referenced in the LIKE portion of the CREATE TABLE statement) had
three indices. Now that I've been able to replicate the issue, are there
tests that I can perform that would be useful to people? I will also try
to build a stand-alone test.

--
Jon
Re: [PERFORM] postgres 8.4, COPY, and high concurrency
On Wed, Nov 14, 2012 at 1:01 PM, Jeff Janes jeff.ja...@gmail.com wrote:
> On Wed, Nov 14, 2012 at 6:41 AM, Jon Nelson jnelson+pg...@jamponi.net wrote:
>> UPDATE: I have been able to replicate the issue. The parent table (the
>> one referenced in the LIKE portion of the CREATE TABLE statement) had
>> three indices. Now that I've been able to replicate the issue, are
>> there tests that I can perform that would be useful to people? I will
>> also try to build a stand-alone test.
>
> While the WAL is suppressed for the table inserts, it is not suppressed
> for the index inserts, and the index WAL traffic is enough to lead to
> contention.

Aha! I don't know why that is the case; it seems like the same method
that allows us to bypass WAL for the table would work for the indices as
well. Maybe it is just that no one bothered to implement it. After all,
building the index after the copy will be even more efficient than
building it before but bypassing WAL. But it does seem like the docs
could at least be clarified here.

In general, then, would it be safe to say that concurrent (parallel)
index creation may be a source of significant WAL contention? I was
planning on taking advantage of this due to modern, beefy boxes with
10's of CPUs all just sitting there bored.

--
Jon
[PERFORM] postgres 8.4, COPY, and high concurrency
I was working on a data warehousing project where a fair number of files
could be COPY'd more or less directly into tables. I have a somewhat
nice machine to work with, and I ran on 75% of the cores I have (75% of
32 is 24).

Performance was pretty bad. With 24 processes going, each backend (in
COPY) spent 98% of its time in semop (as identified by strace). I tried
larger and smaller shared buffers, all sorts of other tweaks, until I
tried reducing the number of concurrent processes from 24 to 4. Disk I/O
went up (on average) at least 10X and strace reports that the top system
calls are write (61%), recvfrom (25%), and lseek (14%) - pretty
reasonable IMO.

Given that each COPY is into its own, newly-made table with no indices
or foreign keys, etc, I would have expected the interaction among the
backends to be minimal, but that doesn't appear to be the case. What is
the likely cause of the semops?

I can't really try a newer version of postgres at this time (perhaps
soon). I'm using PG 8.4.13 on ScientificLinux 6.2 (x86_64), and the CPU
is a 32 core Xeon E5-2680 @ 2.7 GHz.

--
Jon
Re: [PERFORM] postgres 8.4, COPY, and high concurrency
On Tue, Nov 13, 2012 at 1:27 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> On 13.11.2012 21:13, Jon Nelson wrote:
>> I was working on a data warehousing project where a fair number of
>> files could be COPY'd more or less directly into tables. I have a
>> somewhat nice machine to work with, and I ran on 75% of the cores I
>> have (75% of 32 is 24). Performance was pretty bad. With 24 processes
>> going, each backend (in COPY) spent 98% of its time in semop (as
>> identified by strace). I tried larger and smaller shared buffers, all
>> sorts of other tweaks, until I tried reducing the number of concurrent
>> processes from 24 to 4. Disk I/O went up (on average) at least 10X and
>> strace reports that the top system calls are write (61%), recvfrom
>> (25%), and lseek (14%) - pretty reasonable IMO. Given that each COPY
>> is into its own, newly-made table with no indices or foreign keys,
>> etc, I would have expected the interaction among the backends to be
>> minimal, but that doesn't appear to be the case. What is the likely
>> cause of the semops?
>
> I'd guess it's lock contention on WALInsertLock. That means, the system
> is experiencing lock contention on generating WAL records for the
> insertions. If that theory is correct, you ought to get a big gain if
> you have wal_level=minimal, and you create or truncate the table in the
> same transaction with the COPY. That allows the system to skip
> WAL-logging the COPY.

wal_level doesn't exist for 8.4, but I have archive_mode = off and I am
creating the table in the same transaction as the COPY.

> Or you could upgrade to 9.2. The WAL-logging of bulk COPY was optimized
> in 9.2, it should help precisely the scenario you're facing.

Unfortunately, that's what I was expecting.

--
Jon
Re: [PERFORM] postgres 8.4, COPY, and high concurrency
On Tue, Nov 13, 2012 at 2:43 PM, Jeff Janes jeff.ja...@gmail.com wrote:
> On Tue, Nov 13, 2012 at 12:03 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
>> On Tue, Nov 13, 2012 at 1:27 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
>>> I'd guess it's lock contention on WALInsertLock. That means, the
>>> system is experiencing lock contention on generating WAL records for
>>> the insertions. If that theory is correct, you ought to get a big
>>> gain if you have wal_level=minimal, and you create or truncate the
>>> table in the same transaction with the COPY. That allows the system
>>> to skip WAL-logging the COPY.
>>
>> wal_level doesn't exist for 8.4, but I have archive_mode = off and I
>> am creating the table in the same transaction as the COPY.
>
> That should work to bypass WAL. Can you directly verify whether you are
> generating lots of WAL (look at the churn in pg_xlog) during those
> loads? Maybe your contention is someplace else. Since they must all be
> using different tables, I don't think it would be the relation
> extension lock. Maybe buffer mapping lock or freelist lock?

I had moved on to a different approach to importing the data which does
not work concurrently. However, I went back and tried to re-create the
situation and - at least a naive attempt failed. I'll give it a few more
tries -- I was creating two tables using CREATE TABLE "unique name"
(LIKE "some other table" INCLUDING everything). Then I would copy the
data in, add some constraints (FK constraints but only within these two
tables) and then finally (for each table) issue an ALTER TABLE "unique
name" INHERIT "some other table". To be clear, however, everything
bogged down in the COPY stage which was immediately following the table
creation. I'll note that my naive test showed almost no unexpected
overhead at all, so it's clearly not representative of the problem I
encountered.

--
Jon
Re: [PERFORM] set-returning calls and overhead
On Thu, Jul 19, 2012 at 11:07 AM, Jon Nelson jnelson+pg...@jamponi.net wrote:
> Recently I found myself wondering what was taking a particular query so
> long. I immediately assumed it was a lack of I/O, because lack of I/O
> is a thorn in my side. Nope, the I/O was boring. CPU? Well, the process
> was using 100% of the CPU but the query itself was really very simple.
> I turned to ltrace (horribly imprecise, I know). ltrace told me this:
>
> % time     seconds  usecs/call     calls  function
> ------ ----------- ----------- ---------  ----------------
>  46.54    6.789433          69     97766  memcpy
>  28.16    4.108324        1100      3732  strlen
>  14.45    2.107567         564      3732  malloc
>   9.16    1.336108          28     46877  memset
>   0.74    0.107935          28      3732  strcpy
>   0.73    0.107221          28      3732  free
>   0.16    0.023687         187       126  write
>   0.02    0.003587          28       126  __errno_location
>   0.02    0.003075          59        52  read
>   0.01    0.001523          29        52  memcmp
> ------ ----------- ----------- ---------  ----------------
> 100.00   14.588460                159927  total
>
> and this:
>
> strlen("SRF multi-call context")
> strcpy(0xe01d40, "SRF multi-call context")
> malloc(1024)
> memcpy(...)
> memset(...)
> ...
> memset(...)
> free(..)
>
> repeat. I was rather surprised to learn that (per-row): (1) memcpy of
> 64 bytes accounted for 46% of the time spent in library calls (2) the
> (other) costs of strlen, strcpy, malloc, and memset were so huge (in
> particular, strlen). What, if anything, can be done about this? It
> seems the overhead for setting up the memory context for the SRF is
> pretty high. I notice this overhead pretty much every time I use any of
> the array functions like unnest. Please help me to understand if I'm
> misinterpreting things here. [x86_64, Linux, PostgreSQL 9.1.4]

A followup. Recently, I imported a bunch of data. The import ran in
about 30 seconds. The data itself was represented in a way that made
more sense - from a relational database perspective - as multiple
tables. To accomplish this, I made use of string_to_array and unnest.
The initial table creation and copy run in about 30 seconds, but then
the creation of the new table (create table ... as select ...
unnest(string_to_array(...))) took over 5 minutes. 10 times as long.
What is it about the array functions (actually, all set-returning
functions that I've tried) that causes them to be so expensive? The
per-call overhead is enormous in some cases. PostgreSQL 9.1.5 on x86_64
(openSUSE 12.2 - but the effect has been observed across several
platforms and major/minor releases of PostgreSQL).

--
Jon
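As a sanity check on the reconstructed ltrace summary above, the usecs/call column should simply be total seconds divided by call count. A quick sketch, using values copied from the table:

```python
# Verify usecs/call = seconds / calls for a few rows of the ltrace summary.

rows = {
    "memcpy": (6.789433, 97766),   # dominates total library-call time
    "strlen": (4.108324, 3732),    # very expensive per call (~1.1 ms)
    "memset": (1.336108, 46877),
}
for name, (seconds, n_calls) in rows.items():
    usecs = seconds / n_calls * 1_000_000
    print(f"{name}: {usecs:.0f} usecs/call")
# memcpy ~69, strlen ~1101, memset ~29 -- matching the reported columns
# to within rounding
```

The per-call costs, not the call counts, are what make strlen and malloc stand out here.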
[PERFORM] very very slow inserts into very large table
I have a single *table* that is some 560GB in size, 6 columns, average
row width 63. There are approximately 6.1 billion rows. It has 4
indices, 131, 313, 131 and 190 GB in size, respectively. All are btree
indices.

I tried inserting new data into the table, and it's taking a *very* long
time. I pre-built the data to be inserted into a temporary table with
the exact same structure and column ordering, etc, and the temporary
table is about 8.5GB in size with about 93 million rows. The temporary
table was built in about 95 seconds. The insert has been going for 47
hours and 21 minutes, give or take. I'm not doing any correlation or
filtering, etc -- straight up insert, literally "insert into big_table
select * from the_temp_table;".

vmstat output doesn't seem that useful, with disk wait being 10-15% and
I/O speeds highly variable, from 5-20MB/s reads coupled with 0-16MB/s
writes, generally on the lower end of these. strace of the inserting
(postgresql) process shows that it's basically hammering the disk in
terms of random reads and infrequent writes. It's not verifying,
rebuilding, etc. While this process is active, streaming write I/O is
terrible - 36MB/s. With it paused (via strace) I get 72MB/s. (reads are
350MB/s).

The OS is Scientific Linux 6.2, and the version of postgresql is 9.1.4 -
x86_64. There is nothing else of note happening on the box. The box is a
quad CPU, dual-core each Xeon E5430 @ 2.66GHz with 32GB of RAM and a
3ware 9690 RAID card with a 4TB RAID10 for the storage.

What might be going on here?

--
Jon
Re: [PERFORM] very very slow inserts into very large table
On Mon, Jul 16, 2012 at 12:35 PM, Samuel Gendler sgend...@ideasculptor.com wrote:
> On Mon, Jul 16, 2012 at 7:06 AM, Mark Thornton mthorn...@optrak.com wrote:
>> Every insert updates four indexes, so at least 3 of those will be in
>> random order. The indexes don't fit in memory, so all those updates
>> will involve reading most of the relevant b-tree pages from disk (or
>> at least the leaf level). A total of 10ms of random read from disk
>> (per inserted row) wouldn't surprise me ... which adds up to more than
>> 10 days for your 93 million rows.
>
> Which is the long way of saying that you will likely benefit from
> partitioning that table into a number of smaller tables, especially if
> queries on that table tend to access only a subset of the data that can
> be defined to always fit into a smaller number of partitions than the
> total. At the very least, inserts will be faster because individual
> indexes will be smaller. But unless all queries can't be constrained to
> fit within a subset of partitions, you'll also see improved performance
> on selects.

Acknowledged. My data is actually partitioned into individual tables,
but this was an experiment to see what the performance was like. I was
expecting that effectively appending all of the individual tables into a
great big table would result in less redundant information being stored
in indices and, therefore, a bit more speed and efficiency. However, I
have to admit I was very surprised at the performance reduction.

What is the greater lesson to take away here? If you are working with
data that is larger (substantially larger) than available memory, is the
architecture and design of postgresql such that the only real approach
is some type of data partitioning? It is not my intent to insult or even
disparage my favorite software, but it took less time to *build* the
indices for 550GB of data than it would have taken to insert 1/20th as
much. That doesn't seem right.
--
Jon
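Mark's back-of-the-envelope figure in the message above is easy to make explicit. A sketch; the 10ms-per-row random-read cost is his assumption, not a measurement:

```python
# Mark Thornton's estimate: ~10 ms of random index reads per inserted row.

rows = 93_000_000            # rows in the temporary table being inserted
seconds_per_row = 0.010      # assumed random-read cost per row

total_days = rows * seconds_per_row / 86_400
print(round(total_days, 1))  # -> 10.8, i.e. "more than 10 days" as he says
```

Which also shows why the same 93 million rows took only 95 seconds to load into an unindexed temporary table: without index maintenance there is no per-row random read at all.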
[PERFORM] problems with set_config, work_mem, maintenance_work_mem, and sorting
I happened to be looking in the PostgreSQL logs (8.4.10, x86_64, ScientificLinux 6.1) and noticed that an app was doing some sorting (group by, order by, index creation) that ended up on disk rather than staying in memory. So I enabled trace_sort and restarted the app. What followed confused me. I know that the app is setting work_mem and maintenance_work_mem to 1GB at the start of the session, with the following calls:

select set_config('work_mem', '1GB', False);
select set_config('maintenance_work_mem', '1GB', False);

By timestamps, I know that these statements take place before the next log items, generated by PostgreSQL (note: I also log the PID of the backend and all of these are from the same PID):

LOG: 0: begin tuple sort: nkeys = 2, workMem = 1048576, randomAccess = f
LOG: 0: begin tuple sort: nkeys = 1, workMem = 1048576, randomAccess = f
LOG: 0: begin tuple sort: nkeys = 1, workMem = 1048576, randomAccess = f
LOG: 0: begin tuple sort: nkeys = 1, workMem = 1048576, randomAccess = f
LOG: 0: begin tuple sort: nkeys = 2, workMem = 1048576, randomAccess = f
^ these make sense

LOG: 0: begin tuple sort: nkeys = 2, workMem = 131072, randomAccess = f
LOG: 0: begin tuple sort: nkeys = 1, workMem = 131072, randomAccess = f
LOG: 0: begin tuple sort: nkeys = 1, workMem = 131072, randomAccess = f
^^ these do not (but 128MB is the globally-configured work_mem value)

LOG: 0: begin index sort: unique = t, workMem = 2097152, randomAccess = f
^ this kinda does (2GB is the globally-configured maintenance_work_mem value)

LOG: 0: begin index sort: unique = f, workMem = 131072, randomAccess = f
LOG: 0: begin tuple sort: nkeys = 2, workMem = 131072, randomAccess = f
..

The config shows 128MB for work_mem and 2GB for maintenance_work_mem. Why does PostgreSQL /sometimes/ use the globally-configured values and sometimes use the values that come from the connection? Am I misunderstanding what 'session' variables are?
I thought that session (versus transaction) config items were set for /all/ transactions in a given backend, until changed or until that backend terminates. Is that not so? If I reconfigure the app to call set_config(item, value, True) after each 'BEGIN' statement, then workMem seems to be correct (at least more of the time -- the process takes some time to run and I haven't done an exhaustive check as yet). -- Jon
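The third argument to set_config controls exactly this distinction; a small sketch of the two behaviors (the setting and value are illustrative):

```sql
-- is_local = false: the setting persists for the rest of the session,
-- across transactions, until changed or the backend exits.
SELECT set_config('work_mem', '1GB', false);

-- is_local = true: the setting reverts at the end of the current
-- transaction (equivalent to SET LOCAL), so it has to be re-issued
-- after every BEGIN.
BEGIN;
SELECT set_config('work_mem', '1GB', true);
-- ... sorts in this transaction get 1GB ...
COMMIT;  -- work_mem reverts to its previous value here
```

So the session-scoped (False) call should indeed have applied to every later transaction in the backend; the mixed workMem values in the log point at something else issuing or undoing the setting.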
Re: [PERFORM] problems with set_config, work_mem, maintenance_work_mem, and sorting
On Tue, Feb 28, 2012 at 1:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jon Nelson jnelson+pg...@jamponi.net writes: The config shows 128MB for work_mem and 2GB for maintenance_work_mem. Why does PostgreSQL /sometimes/ use the globally-configured values and sometimes use the values that come from the connection? You sure those log entries are all from the same process? If I am understanding this correctly, yes. They all share the same pid. The logline format is: log_line_prefix = '%t %d %u [%p]' and I believe %p represents the pid, and also that a pid corresponds to a backend. Therefore, same pid == same backend == same connection == same session. Many transactions within a session. -- Jon
Re: [PERFORM] problems with set_config, work_mem, maintenance_work_mem, and sorting
On Tue, Feb 28, 2012 at 2:51 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jon Nelson jnelson+pg...@jamponi.net writes: Why does PostgreSQL /sometimes/ use the globally-configured values and sometimes use the values that come from the connection? You sure those log entries are all from the same process? If I am understanding this correctly, yes. They all share the same pid. Hmph ... does seem a bit weird. Can you turn on log_statement and identify which operations aren't using the session values? I had log_min_duration_statement = 1000. An example:

LOG: 0: begin tuple sort: nkeys = 3, workMem = 131072, randomAccess = f
LOCATION: tuplesort_begin_heap, tuplesort.c:573
STATEMENT: INSERT INTO (new table) SELECT (bunch of stuff here) FROM .. ORDER BY ...

and also some CREATE TABLE ... statements:

LOG: 0: begin index sort: unique = f, workMem = 131072, randomAccess = f
LOCATION: tuplesort_begin_index_btree, tuplesort.c:642
STATEMENT: CREATE TABLE tablename (LIKE some_other_tablename)

I also see this:

LOG: 0: begin tuple sort: nkeys = 2, workMem = 131072, randomAccess = f
LOCATION: tuplesort_begin_heap, tuplesort.c:573
STATEMENT: SELECT bunch of stuff from system catalogs

which is the ORM library (SQLAlchemy) doing a reflection of the table(s) involved. The statement is from the same backend (pid) and takes place chronologically *after* the following:

LOG: 0: begin tuple sort: nkeys = 2, workMem = 1048576, randomAccess = f
LOCATION: tuplesort_begin_heap, tuplesort.c:573
STATEMENT: more reflection stuff

Is that useful? If not, I can crank the logging up. What would you like to see for log_statement? -- Jon
Re: [PERFORM] problems with set_config, work_mem, maintenance_work_mem, and sorting
On Tue, Feb 28, 2012 at 4:46 PM, Claudio Freire klaussfre...@gmail.com wrote: On Tue, Feb 28, 2012 at 6:54 PM, Tom Lane t...@sss.pgh.pa.us wrote: ... which is the ORM library (SQLAlchemy) doing a reflection of the table(s) involved. Oh, there's an ORM involved? I'll bet a nickel it's doing something surprising, like not issuing your SET until much later than you thought. I'd rather go for an auto-rollback at some point within the transaction that issued the SET work_mem. SQLAlchemy tends to do that if, for instance, an exception is raised within a transaction block (i.e., during a flush). You can issue the SET work_mem in its own transaction, and commit it, and in that way avoid that rollback. I cranked the logging /all/ the way up and isolated the server. I suspect that your theory is correct. I'll spend a bit more time investigating. -- Jon
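Claudio's suggestion, committing the SET in its own transaction so a later ORM-driven rollback can't undo it, would look roughly like this:

```sql
-- Commit the setting on its own, before any real work begins.
BEGIN;
SET work_mem = '1GB';
COMMIT;  -- the session-level setting now survives later rollbacks

-- Later, if SQLAlchemy aborts a transaction (e.g. an exception
-- raised during a flush), work_mem is unaffected:
BEGIN;
-- ... application work ...
ROLLBACK;

SHOW work_mem;  -- still 1GB
```

A plain SET inside a transaction is undone if that transaction rolls back, which would quietly produce exactly the mixed 1GB/128MB workMem values seen in the logs above.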
[PERFORM] *really* bad insert performance on table with unique index
I created a table with two columns: an id SERIAL (primary key) and a text field (not null), and then added a unique index on the text field. Then I ran the following query (with a huge work_mem - 20GB):

insert into tableA (text_field) select distinct other_text_field from some_huge_set_of_tables

After 36 hours it had only written 3 GB (determined by looking at what files it was writing to). I started over with a TRUNCATE, and then removed the index and tried again. This time it took 3807270.780 ms (a bit over an hour). Total number of records: approx 227 million, comprising 16GB of storage. Why the huge discrepancy? -- Jon
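The fast path the experiment stumbled onto is the standard bulk-load pattern: drop the index, load, then rebuild it in one pass. A sketch, reusing the (illustrative) names from the message:

```sql
-- 1. Load without the unique index in place, so each row costs
--    only a heap insert, not a random b-tree probe:
DROP INDEX IF EXISTS tablea_text_field_key;

INSERT INTO tableA (text_field)
SELECT DISTINCT other_text_field FROM some_huge_set_of_tables;

-- 2. One bulk sort/build (helped by a big maintenance_work_mem)
--    instead of ~227 million random index updates:
CREATE UNIQUE INDEX tablea_text_field_key ON tableA (text_field);
```

The discrepancy itself is the point of the index-build step: CREATE INDEX sorts the whole column sequentially, while per-row inserts into an index that doesn't fit in memory pay a random read (and eventually a write) per row.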
[PERFORM] regarding CLUSTER and HUGE work_mem / maintenance_work_mem
Let's say I have a 7GB table with 3-4 indices for a total of 10-12GB. Furthermore, let's say I have a machine with sufficient memory for me to set the work_mem and maintenance_work_mem to 20GB (just for this session). When I issue a CLUSTER using one of the indices, I see PostgreSQL (by way of strace) performing an index scan which amounts to large quantities of random I/O. In my case, that means it takes a very, very long time. PostgreSQL is largely at defaults, except for a 2GB shared_buffers and a few unrelated changes. The system itself has 32GB of physical RAM and has plenty free. Why didn't PostgreSQL just read the table into memory (and the interesting index) as a sequential scan, sort, and then write it out? It seems like there would be more than enough memory for that. The sequential I/O rate on this machine is 50-100x the random I/O rate. I'm using 8.4.10 (with the 'inet' de-toasting patch) on Scientific Linux 6.1. -- Jon
Re: [PERFORM] regarding CLUSTER and HUGE work_mem / maintenance_work_mem
On Fri, Jan 27, 2012 at 12:05 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 27.01.2012 19:43, Jon Nelson wrote: Why didn't PostgreSQL just read the table into memory (and the interesting index) as a sequential scan, sort, and then write it out? It seems like there would be more than enough memory for that. The support for doing a seqscan+sort in CLUSTER was introduced in version 9.1. Before that, CLUSTER always did an indexscan. See release notes: http://www.postgresql.org/docs/9.1/static/release-9-1.html#AEN107416 That's what I get for digging through the source (git) but working with 8.4.10, on a Friday, at the end of a long week. Thanks for pointing that out to somebody who should have known better. -- Jon
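On 8.4, the 9.1 seqscan+sort behavior can be approximated by hand. A rough sketch (names illustrative; this ignores indexes, constraints, views, and privileges on the original table, which all need recreating before the swap):

```sql
BEGIN;
-- Sequential scan + sort, the strategy CLUSTER only learned in 9.1:
CREATE TABLE mytable_sorted AS
    SELECT * FROM mytable ORDER BY cluster_col;

-- ... recreate indexes/constraints on mytable_sorted here ...

DROP TABLE mytable;
ALTER TABLE mytable_sorted RENAME TO mytable;
COMMIT;
```

With work_mem sized generously for the session, the sort runs in memory or with large sequential spill files, sidestepping the random I/O of the pre-9.1 indexscan CLUSTER.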
Re: [PERFORM] copy vs. C function
On Wed, Dec 14, 2011 at 12:18 AM, Tom Lane t...@sss.pgh.pa.us wrote: Jon Nelson jnelson+pg...@jamponi.net writes: The only thing I have left are these statements: get_call_result_type, TupleDescGetAttInMetadata, BuildTupleFromCStrings, HeapTupleGetDatum, and finally PG_RETURN_DATUM. It turns out that: get_call_result_type adds 43 seconds [total: 54], TupleDescGetAttInMetadata adds 19 seconds [total: 73], BuildTupleFromCStrings accounts for 43 seconds [total: 116]. So those three functions account for 90% of the total time spent. What alternatives exist? Do I have to call get_call_result_type /every time/ through the function? Well, if you're concerned about performance then I think you're going about this in entirely the wrong way, because as far as I can tell from this you're converting all the field values to text and back again. You should be trying to keep the values in Datum format and then invoking heap_form_tuple. And yeah, you probably could cache the type information across calls. The parsing/conversion (except BuildTupleFromCStrings) is only a small fraction of the overall time spent in the function and could probably be made slightly faster. It's the overhead that's killing me. Remember: I'm not converting multiple field values to text and back again, I'm turning a *single* TEXT into 8 columns of varying types (INET, INTEGER, and one INTEGER array, among others). I'll re-write the code to use tuples, but given that 53% of the time is spent in just two functions (the two I'd like to cache) I'm not sure how much of a gain it's likely to be. Regarding caching, I tried caching it across calls by making the TupleDesc static and only initializing it once. When I tried that, I got: ERROR: number of columns (6769856) exceeds limit (1664) I tried to find some documentation or examples that cache the information, but couldn't find any.
-- Jon
Re: [PERFORM] copy vs. C function
On Wed, Dec 14, 2011 at 9:25 AM, Tom Lane t...@sss.pgh.pa.us wrote: Jon Nelson jnelson+pg...@jamponi.net writes: Regarding caching, I tried caching it across calls by making the TupleDesc static and only initializing it once. When I tried that, I got: ERROR: number of columns (6769856) exceeds limit (1664) I tried to find some documentation or examples that cache the information, but couldn't find any. You might find reading record_in to be helpful. What it caches is not exactly what you need to, I think, but it shows the general principles. There are lots of other functions that use fn_extra to cache info, too. I will definitely look into those. I'm probably doing it wrong, but in the meantime, I allocated enough space (by way of MemoryContextAlloc) in TopMemoryContext for an AttInMetadata pointer, switched to that memory context (just the first time through), used CreateTupleDescCopy + TupleDescGetAttInMetadata to duplicate (in the new memory context) the TupleDesc, and then switched back. This approach seems to have dropped the total run time to about 54 seconds, the bulk of which is BuildTupleFromCStrings, a rather significant improvement. Looking at record_in, I think I see what I could be doing better. Again, thanks for the pointers. -- Jon
Re: [PERFORM] copy vs. C function
On Wed, Dec 14, 2011 at 9:40 AM, Jon Nelson jnelson+pg...@jamponi.net wrote: On Wed, Dec 14, 2011 at 9:25 AM, Tom Lane t...@sss.pgh.pa.us wrote: You might find reading record_in to be helpful. What it caches is not exactly what you need to, I think, but it shows the general principles. There are lots of other functions that use fn_extra to cache info, too. Looking at record_in, I think I see what I could be doing better. Indeed. I revised the code to make use of fcinfo->flinfo->fn_extra for storage and fcinfo->flinfo->fn_mcxt for the MemoryContext and everything seemed to work just fine. Assuming one *starts* with a char *some_var[8], would building the Datums myself be faster than using BuildTupleFromCStrings? -- Jon
Re: [PERFORM] copy vs. C function
On Wed, Dec 14, 2011 at 9:51 AM, Jon Nelson jnelson+pg...@jamponi.net wrote: Assuming one *starts* with a char *some_var[8], would building the Datums myself be faster than using BuildTupleFromCStrings? The answer is: yes. At least, in my case it is. The total run time is now down to about 32 seconds, versus about 54 seconds for the BuildTupleFromCStrings version. 32 seconds is more than the 10-15 seconds of the plain COPY approach, but it's livable. This experiment has been very worthwhile - thank you all for the help.
-- Jon
Re: [PERFORM] copy vs. C function
On Mon, Dec 12, 2011 at 10:38 AM, Merlin Moncure mmonc...@gmail.com wrote: On Sat, Dec 10, 2011 at 7:27 PM, Jon Nelson jnelson+pg...@jamponi.net wrote: The next approach I took was to write a C function in postgresql to parse a single TEXT datum into an array of C strings, and then use BuildTupleFromCStrings. There are 8 columns involved. Eliding the time it takes to COPY the (raw) file into a temporary table, this method took 120 seconds, give or take. The difference was /quite/ a surprise to me. What is the probability that I am doing something very, very wrong? NOTE #2: We are talking about approximately 2.6 million lines. Let me throw out an interesting third method I've been using to parse delimited text files that might be useful in your case. It is useful when parsing bad CSV where values are not escaped, lines are incomplete, records are missing, or there is a huge number of columns that you want to rotate into a more normalized structure based on column position. 1. Import the data into a single-column (containing the entire line) staging table, feeding the COPY parser a bogus delimiter. 2. 'Parse' the record with regexp_split_to_array (maybe in a plpgsql function). 3.
Either loop the array (in 9.1, use the FOREACH ... IN ARRAY construct), or, if you can work it into your problem, INSERT/SELECT, expanding the array with a trick like the one used in information_schema._pg_expandarray so you can hook logic on the array (column position). If you replace [2] with my C function (which can process all of the data, *postgresql overhead not included*, in about 1 second) then that's what I did. It returns a composite type, making [3] unnecessary. I know it's not the parsing, so I started a time-honored debugging approach: I returned early. Is the function-call overhead that high? That's outrageously high. What else could it be? Is returning a composite type outrageously expensive? So here is what I did: I modified the code so that it immediately returns NULL. Result: 2 seconds. Extracting arguments and allocating a temporary work buffer: another 0.5 seconds. Adding parsing: another 1.5 seconds [total: 4.1 seconds] and so on. Two of the items require base conversion, so calling strtol (twice) and snprintf (twice) adds 6 seconds, and formatting one of the items as an array (a strcpy and a strcat) adds 1.5 seconds, for a total of 11.5 seconds. The only thing I have left are these statements: get_call_result_type, TupleDescGetAttInMetadata, BuildTupleFromCStrings, HeapTupleGetDatum, and finally PG_RETURN_DATUM. It turns out that: get_call_result_type adds 43 seconds [total: 54], TupleDescGetAttInMetadata adds 19 seconds [total: 73], BuildTupleFromCStrings accounts for 43 seconds [total: 116]. So those three functions account for 90% of the total time spent. What alternatives exist? Do I have to call get_call_result_type /every time/ through the function? -- Jon
[PERFORM] select distinct uses index scan vs full table scan
I've got a 5GB table with about 12 million rows. Recently, I had to select the distinct values from just one column. The planner chose an index scan. The query took almost an hour. When I forced index scan off, the query took 90 seconds (full table scan). The planner estimated 70,000 unique values when, in fact, there are 12 million (the value for this column is *almost* but not quite unique). What's more, despite bumping the statistics on that column up to 1000 and re-analyzing, the planner now thinks that there are 300,000 unique values. How can I tell the planner that a given column is much more unique than, apparently, it thinks it is? The column type is INET. This is on PG 8.4.10 on Linux x86_64, with 81f4e6cd27d538bc27e9714a9173e4df353a02e5 applied. -- Jon
Re: [PERFORM] select distinct uses index scan vs full table scan
On Tue, Dec 13, 2011 at 1:57 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jon Nelson jnelson+pg...@jamponi.net writes: I've got a 5GB table with about 12 million rows. Recently, I had to select the distinct values from just one column. The planner chose an index scan. The query took almost an hour. When I forced index scan off, the query took 90 seconds (full table scan). Usually, we hear complaints about the opposite. Are you using nondefault cost settings? Cost settings had not been changed until a few minutes ago, when your response prompted me to try a few things. I ended up changing random_page_cost to 16.0 (from 4.0), partly because the H/W RAID I'm using is awfully bad at random I/O. I'll experiment and keep tabs on performance to see if this has a negative effect on other aspects. The planner estimated 70,000 unique values when, in fact, there are 12 million (the value for this column is *almost* but not quite unique). What's more, despite bumping the statistics on that column up to 1000 and re-analyzing, the planner now thinks that there are 300,000 unique values. Accurate ndistinct estimates are hard, but that wouldn't have much of anything to do with this particular choice, AFAICS. How can I tell the planner that a given column is much more unique than, apparently, it thinks it is? 9.0 and up have ALTER TABLE ... ALTER COLUMN ... SET n_distinct. D'oh! I'm on 8.4.10+patches. This may provide the necessary push. -- Jon
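For reference, the 9.0+ escape hatch Tom mentions looks like this (table and column names illustrative; a negative n_distinct means "this fraction of the row count", so -0.95 would tell the planner the column is almost unique):

```sql
-- Override the sampled ndistinct estimate for one column:
ALTER TABLE mytable ALTER COLUMN inet_col SET (n_distinct = -0.95);

-- The override takes effect at the next ANALYZE:
ANALYZE mytable;
```

A value of -1 declares the column fully unique; positive values give an absolute distinct count.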
[PERFORM] copy vs. C function
I was experimenting with a few different methods of taking a line of text, parsing it into a set of fields, and then getting that info into a table. The first method involved writing a C program to parse a file, parse the lines and output newly-formatted lines in a format that postgresql's COPY function can use. End-to-end, this takes 15 seconds for about 250MB (read 250MB, parse, output new data to new file -- 4 seconds, COPY new file -- 10 seconds). The next approach I took was to write a C function in postgresql to parse a single TEXT datum into an array of C strings, and then use BuildTupleFromCStrings. There are 8 columns involved. Eliding the time it takes to COPY the (raw) file into a temporary table, this method took 120 seconds, give or take. The difference was /quite/ a surprise to me. What is the probability that I am doing something very, very wrong? NOTE: the code that does the parsing is actually the same, line-for-line; the only difference is whether the routine is called by a postgresql function or by a C program via main, so obviously the overhead is elsewhere. NOTE #2: We are talking about approximately 2.6 million lines. I was testing:

\copy some_table from 'some_file.csv' with csv

vs.

insert into some_table select (some_func(line)).* from some_temp_table;

where some_func had been defined with (one) IN TEXT and (8) OUT params of varying types. PostgreSQL 9.1.1 on Linux, x86_64 -- Jon
Re: [PERFORM] copy vs. C function
On Sat, Dec 10, 2011 at 8:32 PM, Craig Ringer ring...@ringerc.id.au wrote: On 12/11/2011 09:27 AM, Jon Nelson wrote: The first method involved writing a C program to parse a file, parse the lines and output newly-formatted lines in a format that postgresql's COPY function can use. End-to-end, this takes 15 seconds for about 250MB (read 250MB, parse, output new data to new file -- 4 seconds, COPY new file -- 10 seconds). Why not COPY tablename FROM '/path/to/myfifo'? If I were to do this in any sort of production environment, that's exactly what I would do. I was much more concerned about the /huge/ difference -- 10 seconds for COPY and 120 seconds for (INSERT INTO / CREATE TABLE AS / whatever). The next approach I took was to write a C function in postgresql to parse a single TEXT datum into an array of C strings, and then use BuildTupleFromCStrings. There are 8 columns involved. Eliding the time it takes to COPY the (raw) file into a temporary table, this method took 120 seconds, give or take. The difference was /quite/ a surprise to me. What is the probability that I am doing something very, very wrong? Have a look at how COPY does it within the Pg sources, see if that's any help. I don't know enough about Pg's innards to answer this one beyond that suggestion, sorry. Ack. Regarding a subsequent email, I was using full transactions. -- Jon
Re: [PERFORM] external sort performance
On Sun, Nov 20, 2011 at 7:56 AM, Jeremy Harris j...@wizmail.org wrote: On 2011-11-17 17:10, Jon Nelson wrote: external sort ended, 7708696 disk blocks used: CPU 359.84s/57504.66u sec elapsed 58966.76 sec Am I to understand that the CPU portion of the sorting only took 6 minutes but the sort itself took almost 16.5 hours and used approx 60GB of disk space? I realise you've had helpful answers by now, but that reads as 16 hours of CPU time to me; mostly user-mode, but with 6 minutes of system-mode. 98% CPU usage for the 16 hours elapsed. Thank you very much! I was going to post a followup asking for help interpreting the log line, but now I don't have to. Do you happen to recall if disk I/O is counted as user or system time? If it's counted as system time, then I have more questions, namely: if using a hash table (hash aggregation) shows that the GROUPing can take place in 35 minutes, but a group aggregation takes 16 hours, how much of that is CPU and how much is waiting for I/O? -- Jon
[PERFORM] external sort performance
I have one query which does not run very often. Sometimes it may be months between runs. However, when it does get executed, it scans approximately 100 identically-structured tables (a form of partitioning), extracts and groups on a subset of the columns, and creates a new table. The individual table queries have no WHERE clauses; this is a full table scan for every table. I've tried all sorts of things to try to improve the performance, which can take a /very/ long time. We are talking about approximately 175GB of data before grouping/summarizing. This is on PG 8.4.8 on Linux, 16GB of real RAM. Most recently, I enabled trace_sort, disabled hash aggregation[1], and set a large work_mem (normally very small, in this case I tried anything from 8MB to 256MB. I even tried 1GB and 2GB). In the logs, I saw this:

external sort ended, 7708696 disk blocks used: CPU 359.84s/57504.66u sec elapsed 58966.76 sec

Am I to understand that the CPU portion of the sorting only took 6 minutes but the sort itself took almost 16.5 hours and used approx 60GB of disk space? The resulting summary table is about 5GB in size as reported by \d+ in psql (and pg_relation_size). The underlying storage is ext4 on a hardware RAID 10 with a BBU. What sorts of things should I be looking at to improve the performance of this query? Is my interpretation of that log line totally off base? [1] if I don't disable hash aggregation and the work_mem is over 8MB in size, the memory allocation explodes to the point where postgresql wants dozens of gigs of memory. I've tried setting the statistics as high as 1000 without benefit. -- Jon
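As the quoted reply elsewhere in this digest explains, the trace_sort line reports system/user CPU accounting, not a 6-minute CPU cost; the arithmetic works out as:

```
CPU 359.84s/57504.66u  ->  359.84 s system + 57504.66 s user
total CPU   = 359.84 + 57504.66 = 57864.50 s  (about 16.07 hours)
elapsed     = 58966.76 s                      (about 16.38 hours)
utilization = 57864.50 / 58966.76 =~ 98%      (CPU-bound, not I/O-bound)
```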
Re: [PERFORM] external sort performance
I'll try to compile multiple questions/answers into a single response. On Thu, Nov 17, 2011 at 11:16 AM, Claudio Freire klaussfre...@gmail.com wrote: On Thu, Nov 17, 2011 at 2:10 PM, Jon Nelson jnelson+pg...@jamponi.net wrote: What sorts of things should I be looking at to improve the performance of this query? Is my interpretation of that log line totally off base? You'll have to post some more details. Like a query and an explain/explain analyze. Please see below, however, I am also very interested to know if I'm interpreting that log line correctly. Memory consumption probably skyrockets since you'll need at least one sort per table, so if you have 100+, then that's (at least) 100+ sorts. Right, that much I had understood. On Thu, Nov 17, 2011 at 11:28 AM, Craig James craig_ja...@emolecules.com wrote: You don't give any details about how and why you are sorting. Are you actually using all of the columns in your aggregated-data table in the sort operation? Or just a few of them? You're making the sort operation work with 175 GB of data. If most of that data is only needed for the report (not the sort), then separate it into two tables - one of just the data that the sorting/grouping needs, and the other with the rest of the data. Then create a view that joins it all back together for reporting purposes. I'm not actually using any ORDER BY at all. This is purely a GROUP BY. The sort happens because of the group aggregate (vs. hash aggregate). Two of the columns are used to group, the other two are aggregates (SUM). On Thu, Nov 17, 2011 at 11:55 AM, Tom Lane t...@sss.pgh.pa.us wrote: Jon Nelson jnelson+pg...@jamponi.net writes: This is on PG 8.4.8 on Linux, 16GB of real RAM. Most recently, I enabled trace_sort, disabled hash aggregation[1], and set a large work_mem (normally very small, in this case I tried anything from 8MB to 256MB. I even tried 1GB and 2GB). FWIW, I think hash aggregation is your best shot at getting reasonable performance. 
Sorting 175GB of data is going to hurt no matter what. If the grouped table amounts to 5GB, I wouldn't have expected the hash table to be more than maybe 2-3X that size (although this does depend on what aggregates you're running...). Letting the hash aggregation have all your RAM might be the best answer.

I'm re-running the query with work_mem set to 16GB (for just that query). The query (with table and column names changed):

SELECT anon_1.columnA, sum(anon_1.columnB) AS columnB, sum(anon_1.columnC) AS columnC, anon_1.columnD
FROM (
  SELECT columnA, columnB, columnC, columnD FROM tableA
  UNION ALL
  -- same select/union all pattern but from 90-ish other tables
) AS anon_1
GROUP BY anon_1.columnA, anon_1.columnD
HAVING sum(anon_1.columnB) > 0

The explain verbose with work_mem = 16GB:

HashAggregate (cost=54692162.83..54692962.83 rows=4 width=28)
  Output: columnA, sum(columnB), sum(columnC), columnD
  Filter: (sum(columnB) > 0)
  ->  Append (cost=0.00..34547648.48 rows=1611561148 width=28)
        ->  Seq Scan on tableA (cost=0.00..407904.40 rows=19045540 width=28)
              Output: columnA, columnB, columnC, columnD
        [90-ish more tables here]

12 minutes into the query it was consuming 10.1GB of memory. 21 minutes in, 12.9GB. After just under 34 minutes it completed, with about 15GB of memory being used. That is a rather impressive improvement. Previously, I had been advised against using a large work_mem value. I had never thought to use one 3 times the size of the resulting table.
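For reference, the per-query settings discussed in this thread can be scoped to a single transaction with SET LOCAL, leaving the global configuration untouched. A minimal sketch using the anonymized names above (the 16GB value assumes a 64-bit build with RAM to spare):

```sql
BEGIN;
-- Large hash-aggregation budget for this query only;
-- the setting reverts automatically at COMMIT or ROLLBACK.
SET LOCAL work_mem = '16GB';

SELECT anon_1.columnA,
       sum(anon_1.columnB) AS columnB,
       sum(anon_1.columnC) AS columnC,
       anon_1.columnD
FROM ( SELECT columnA, columnB, columnC, columnD FROM tableA
       UNION ALL
       SELECT columnA, columnB, columnC, columnD FROM tableB
       -- ... 90-ish more tables ...
     ) AS anon_1
GROUP BY anon_1.columnA, anon_1.columnD
HAVING sum(anon_1.columnB) > 0;
COMMIT;
```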
The explain verbose with enable_hashagg = false:

GroupAggregate (cost=319560040.24..343734257.46 rows=4 width=28)
  Output: columnA, sum(columnB), sum(columnC), columnD
  Filter: (sum(columnB) > 0)
  ->  Sort (cost=319560040.24..323588943.11 rows=1611561148 width=28)
        Output: columnA, columnB, columnC, columnD
        Sort Key: columnA, columnD
        ->  Result (cost=0.00..34547648.48 rows=1611561148 width=28)
              Output: columnA, columnB, columnC, columnD
              ->  Append (cost=0.00..34547648.48 rows=1611561148 width=28)
                    ->  Seq Scan on tableA (cost=0.00..407904.40 rows=19045540 width=28)
                          Output: columnA, columnB, columnC, columnD
                    [90-ish more tables here]

-- Jon

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] external sort performance
A follow-up question. Even with both work_mem and maintenance_work_mem equal to 16GB, I see this:

LOG: 0: begin index sort: unique = f, workMem = 16777216, randomAccess = f

and shortly thereafter:

LOG: 0: switching to external sort with 59919 tapes: CPU 2.59s/13.20u sec elapsed 16.85 sec

and a while later:

LOG: 0: finished writing run 1 to tape 0: CPU 8.16s/421.45u sec elapsed 433.83 sec
LOG: 0: performsort done (except 2-way final merge): CPU 9.53s/561.56u sec elapsed 576.54 sec
LOG: 0: external sort ended, 181837 disk blocks used: CPU 12.90s/600.45u sec elapsed 625.05 sec

The first log statement is expected. The second log statement, however, isn't. The total table size is (as noted earlier) about 5GB and, in fact, fit into one nice hash table (approx. 15GB in size). Is the sorting that is necessary for index creation unable to use a hash table? (This is a standard btree index.)

-- Jon
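For anyone who wants to reproduce this kind of log output, the relevant settings look roughly like this (the table and index names here are hypothetical):

```sql
SET trace_sort = on;                -- emit the "begin index sort" / "external sort" LOG lines
SET client_min_messages = log;      -- show those LOG lines in the client session
SET maintenance_work_mem = '16GB';  -- memory budget for the index build's sort
CREATE INDEX foo_a_idx ON foo (a);  -- a standard btree index build
```

As to the question itself: a btree build needs its input in sorted order, and a hash table produces no ordering, so index creation goes through tuplesort rather than any hash-based path.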
Re: [PERFORM] Postgres INSERT performance and scalability
On Mon, Sep 19, 2011 at 7:53 PM, Stephen Frost sfr...@snowman.net wrote:

Igor,

* Igor Chudov (ichu...@gmail.com) wrote:
Would the time that it takes differ a great deal, depending on whether the table has only 100,000 or 5,000,000 records?

Yes, because PostgreSQL is going to copy the data. If you don't need or want it to be copied, just use a view. I've never heard of any relational database implementing 'copy on write' type semantics, if that's what you're asking about. Databases, unlike applications with code in memory that's constantly copied, are typically focused around minimizing duplication of data (since it all has to end up on disk at some point). Not much point in having the overhead of COW for that kind of environment, I wouldn't think.

Isn't the WAL basically COW?

-- Jon
[PERFORM] bad plan: 8.4.8, hashagg, work_mem=1MB.
PostgreSQL 8.4.8 on CentOS 5.6, x86_64. Default settings except work_mem = 1MB.

NOTE: I am using partitioned tables here, and was querying the 'master' table. Perhaps this is a Known Issue.

I ran a query recently where the result was very large. The outer-most part of the query looked like this:

HashAggregate (cost=56886512.96..56886514.96 rows=200 width=30)
  ->  Result (cost=0.00..50842760.97 rows=2417500797 width=30)

The row count for 'Result' is in the right ballpark, but why does HashAggregate think that it can turn 2 *billion* rows of strings (an average of 30 bytes long) into only 200? This is my primary concern. If I don't disable hash aggregation, postgresql quickly consumes huge quantities of memory and eventually gets killed by the OOM manager.

After manually disabling hash aggregation, I ran the same query. It's been running for over 2 days now. The disk is busy but actual data transferred is very low. Total data size is approx. 250GB, perhaps a bit less. The query scans 160 or so tables for data.

If I use a distinct + union on each table, the plan looks like this:

Unique (cost=357204094.44..357318730.75 rows=22927263 width=28)
  ->  Sort (cost=357204094.44..357261412.59 rows=22927263 width=28)

23 million rows is more like it, and the cost is much lower. What is the possibility that distinct/unique operations can be pushed down into queries during the planning stage to see if they are less expensive? In this case, postgresql believes (probably correctly, I'll let you know) that distinct(column foo from tableA + column foo from tableB + column foo from tableC ...) is more expensive than distinct(distinct column foo from tableA + distinct column foo from tableB ...).

-- Jon
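The rewrite being described - de-duplicating each partition before the final de-duplication - can be sketched like this, with hypothetical table and column names:

```sql
-- Original form: UNION de-duplicates the full appended row set at once.
SELECT foo FROM child_a
UNION
SELECT foo FROM child_b;

-- Rewrite: each branch is de-duplicated first, so the final Unique/HashAggregate
-- only sees the (much smaller) per-table distinct sets.
SELECT DISTINCT foo FROM child_a
UNION
SELECT DISTINCT foo FROM child_b;
```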
Re: [PERFORM] bad plan: 8.4.8, hashagg, work_mem=1MB.
On Mon, Jun 20, 2011 at 11:08 AM, Tom Lane t...@sss.pgh.pa.us wrote:
Jon Nelson jnelson+pg...@jamponi.net writes:
I ran a query recently where the result was very large. The outer-most part of the query looked like this:

HashAggregate (cost=56886512.96..56886514.96 rows=200 width=30)
  ->  Result (cost=0.00..50842760.97 rows=2417500797 width=30)

The row count for 'Result' is in the right ballpark, but why does HashAggregate think that it can turn 2 *billion* rows of strings (an average of 30 bytes long) into only 200?

200 is the default assumption about number of groups when it's unable to make any statistics-based estimate. You haven't shown us any details so it's hard to say more than that.

What sorts of details would you like? The row count for the Result line is approximately correct -- the stats for all tables are up to date (the tables never change after import). The statistics target is set to 100 currently.

-- Jon
[PERFORM] plan variations: join vs. exists vs. row comparison
Originally, I posted to -general but I found some time to write some samples, and realized it's probably more of a performance question. The original post is here: http://archives.postgresql.org/pgsql-general/2011-03/msg00198.php

I was hoping that somebody could help me understand the differences between three plans. All of the plans are updating a table using a second table, and should be logically equivalent. Two of the plans use joins, and one uses an exists subquery. One of the plans uses row constructors and IS NOT DISTINCT FROM. It is this plan which has really awful performance. Clearly it is due to the nested loop, but why would the planner choose that approach? I also don't understand why in the 'exists' plan the planner thinks the index scan will provide 1019978 rows, when there are only 100, but that is a lesser issue.

Here is a sample SQL file which demonstrates the issues and includes all three variations:

begin;

create temporary table t7 (
  i BIGINT NOT NULL,
  k BIGINT
);

create temporary table t8 (
  i BIGINT NOT NULL,
  j INT
);

CREATE FUNCTION populate_t8() RETURNS VOID LANGUAGE SQL AS $$
  truncate t8;
  insert into t8 SELECT i, 1 from t7 ORDER BY i LIMIT 1;
  insert into t8 SELECT i, 2 from t7 WHERE i > 1 ORDER BY i LIMIT 1;
  insert into t8 SELECT i, 3 from t7 WHERE i > 2 ORDER BY i LIMIT 2;
  analyze t8;
$$;

INSERT INTO t7 select x, x + 10 from generate_series(1,100) as x;
analyze t7;

select populate_t8();
explain analyze verbose update t7 SET k = 1 FROM t8 WHERE t7.i = t8.i AND ( t8.j = 2 OR t8.j = 1 );

select populate_t8();
explain analyze verbose update t7 SET k = 1 WHERE EXISTS ( SELECT 1 FROM t8 WHERE t8.i = t7.i AND ( t8.j = 2 OR t8.j = 1 ) );

select populate_t8();
explain update t7 SET k = 1 FROM t8 WHERE ROW(t7.i) IS NOT DISTINCT FROM ROW(t8.i) AND ( t8.j = 2 OR t8.j = 1 );
explain analyze verbose update t7 SET k = 1 FROM t8 WHERE ROW(t7.i) IS NOT DISTINCT FROM ROW(t8.i) AND ( t8.j = 2 OR t8.j = 1 );

rollback;

-- Jon
Re: [PERFORM] plan variations: join vs. exists vs. row comparison
On Mon, Mar 7, 2011 at 2:00 PM, Tom Lane t...@sss.pgh.pa.us wrote:
Jon Nelson jnelson+pg...@jamponi.net writes:
I was hoping that somebody could help me understand the differences between three plans. All of the plans are updating a table using a second table, and should be logically equivalent. Two of the plans use joins, and one uses an exists subquery. One of the plans uses row constructors and IS NOT DISTINCT FROM. It is this plan which has really awful performance. Clearly it is due to the nested loop, but why would the planner choose that approach?

IS NOT DISTINCT FROM pretty much disables all optimizations: it can't be an indexqual, merge join qual, or hash join qual. So it's not surprising that you get a sucky plan for it. Possibly somebody will work on improving that someday. As for your other questions, what PG version are you using? Because I do get pretty much the same plan (modulo a plain join versus a semijoin) for the first two queries when using 9.0 or later. And the results of ANALYZE are only approximate, so you shouldn't be surprised at all if a rowcount estimate is off by a couple percent. Most of the time, you should be happy if it's within a factor of 2 of reality.

Sorry - I had stated in the original post that I was using 8.4.5 on 64-bit openSUSE and CentOS 5.5, and had forgotten to carry that information over into the second post.

What is the difference between a plain join and a semi join?

-- Jon
Re: [HACKERS] [PERFORM] Slow count(*) again...
On Thu, Feb 3, 2011 at 7:41 AM, Kenneth Marshall k...@rice.edu wrote:
On Thu, Feb 03, 2011 at 02:11:58AM -0800, da...@lang.hm wrote:
On Thu, 3 Feb 2011, Vitalii Tymchyshyn wrote:
02.02.11 20:32, Robert Haas wrote:

Yeah. Any kind of bulk load into an empty table can be a problem, even if it's not temporary. When you load a bunch of data and then immediately plan a query against it, autoanalyze hasn't had a chance to do its thing yet, so sometimes you get a lousy plan.

Maybe introducing something like an 'AutoAnalyze' threshold will help? I mean that any insert/update/delete statement that changes more than x% of a table (and no fewer than y records) must do an analyze right after it has finished. Defaults like x=50 and y=1 should be quite good as for me.

If I am understanding things correctly, a full Analyze is going over all the data in the table to figure out patterns. If this is the case, wouldn't it make sense, in the situation where you are loading an entire table from scratch, to run the Analyze as you are processing the data? If you don't want to slow down the main thread that's inserting the data, you could copy the data to a second thread and do the analysis while it's still in RAM rather than having to read it off of disk afterwards. This doesn't make sense for updates to existing databases, but the use case of loading a bunch of data and then querying it right away isn't _that_ uncommon.

David Lang

+1 for in-flight ANALYZE. This would be great for bulk loads of real tables as well as temp tables.

Yes, please, that would be really nice.

-- Jon
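Until something like in-flight ANALYZE exists, the standard workaround is an explicit ANALYZE between the bulk load and the first query; a minimal sketch (the table and file path are made up for illustration):

```sql
BEGIN;
CREATE TEMPORARY TABLE staging (id bigint, payload text);
COPY staging FROM '/tmp/data.csv' CSV;  -- bulk load into the empty table
ANALYZE staging;                        -- collect stats now instead of waiting for autoanalyze
-- Queries from here on are planned with fresh statistics:
SELECT count(*) FROM staging WHERE id > 1000;
COMMIT;
```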
Re: [PERFORM] Any experience using shake defragmenter?
On Tue, Feb 1, 2011 at 1:24 PM, Greg Smith g...@2ndquadrant.com wrote:

Mladen Gogala wrote:
Did anyone try using shake while the cluster is active? Any problems with corruption or data loss? I ran the thing on my home directory and nothing was broken. I didn't develop any performance test, so cannot vouch for the effectiveness of the procedure. Did anyone play with that? Any positive or negative things to say about shake?

Shake works by allocating a new file the size of the original, in what is presumed to then be unfragmented space. It copies the original over to this new space and then gets rid of the original. That procedure will cause database corruption if the server happens to access the file it's moving while it's in the middle of doing so. If the database isn't running, though, it is probably fine.

On ext3 you can measure whether it was useful or not by taking the filesystem off-line and running fsck before/after using it. Look for the percentages given for non-contiguous files and directories. If those were low to begin with, there's little reason to run the utility. If they're high, running shake should bring them down afterwards if it's doing its job right.

On a PostgreSQL database system, you can get the same basic effect while leaving the server up - but with the table locked - using CLUSTER. And that will clean up a bunch of other potential messes inside the database that shake can't touch. I just do that instead if I'm worried a particular table has become fragmented on disk.

One thing to note is that, in my experiments, ext4 handles large files (such as the 1GiB files postgresql uses for large relations) in a *vastly* improved manner over ext3. This is due to the use of extents. I found that, in some cases, heavily fragmented files under ext3 could not be effectively defragmented - and yes, I tried shake and some others (including one I wrote which *does* use fallocate / posix_fallocate). There was improvement, but by far the biggest improvement was switching to ext4.

Instead of something like 'shake' (which more or less works, even though it doesn't use fallocate and friends) I frequently use either CLUSTER (which is what Greg Smith is suggesting) or a series of ALTER TABLE ... ALTER COLUMN ... which rewrites the table. With PG 9 perhaps VACUUM FULL is more appropriate. Of course, the advice regarding using 'shake' (or any other defragmenter) on a live postgresql data directory is excellent - the potential for causing damage if the database is active during that time is very high.

-- Jon
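The rewrite-based approaches mentioned above might look like this (hypothetical table, column, and index names; note that CLUSTER holds an exclusive lock on the table for the duration):

```sql
-- CLUSTER rewrites the table in index order, which also re-packs its files on disk.
CLUSTER my_table USING my_table_pkey;

-- The ALTER TABLE trick: a type change forces a full rewrite of the table.
ALTER TABLE my_table ALTER COLUMN some_col TYPE bigint USING some_col::bigint;

-- On PostgreSQL 9.0+, VACUUM FULL also rewrites the table into fresh files.
VACUUM FULL my_table;
```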
Re: [PERFORM] queries with lots of UNIONed relations
On Fri, Jan 14, 2011 at 2:11 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
On Thu, Jan 13, 2011 at 6:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
Jon Nelson jnelson+pg...@jamponi.net writes:
On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
If you have enough memory to de-dup them individually, you surely have enough to de-dup all at once.

If everything were available up-front, sure. However, and please correct me if I'm wrong, but doesn't postgresql work in a fairly linear fashion, moving from table to table performing a series of operations on each?

Doing a single sort+uniq works like that. But the alternate plan you are proposing we should consider involves building all the lower hashtables, and then reading from them to fill the upper hashtable. Max memory consumption *is* worst case here. Remember HashAggregate is incapable of swapping to disk (and if it did, you wouldn't be nearly as pleased with its performance).

That's not exactly what I'm proposing - but it is probably due to a lack of understanding of some of the underlying details of how postgresql works. I guess I had assumed that the result of a HashAggregate or any other de-duplication process was a table-like structure. And I assumed wrong, I think.

I dug into the code (nodeHash.c and others) and I think I understand now why HashAggregate works only in certain circumstances, and I think I understand your comments a bit better now. Basically, HashAggregate doesn't stream unique Nodes the way nodeUnique.c does. nodeUnique basically emits Nodes and elides subsequent, identical Nodes, which is why it relies on the input being sorted. HashAggregate works only on entire input sets at once, and nodeHash.c doesn't emit Nodes at all, really.

This makes me wonder if nodeHash.c and nodeHashjoin.c couldn't be modified to output Nodes in a streaming fashion. The memory requirements would not be any worse than now.

Does postgresql support any sort of merge sort? If it did, then if the hashtable started consuming too much memory, it could be cleared and the nodes output from the new hashtable could be directed to another temporary file, and then a merge sort could be performed on all of the temporary files (and thus Unique could be used to effect the UNION operation).

-- Jon
Re: [PERFORM] queries with lots of UNIONed relations
On Thu, Jan 13, 2011 at 6:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
Jon Nelson jnelson+pg...@jamponi.net writes:
On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
If you have enough memory to de-dup them individually, you surely have enough to de-dup all at once.

If everything were available up-front, sure. However, and please correct me if I'm wrong, but doesn't postgresql work in a fairly linear fashion, moving from table to table performing a series of operations on each?

Doing a single sort+uniq works like that. But the alternate plan you are proposing we should consider involves building all the lower hashtables, and then reading from them to fill the upper hashtable. Max memory consumption *is* worst case here. Remember HashAggregate is incapable of swapping to disk (and if it did, you wouldn't be nearly as pleased with its performance).

That's not exactly what I'm proposing - but it is probably due to a lack of understanding of some of the underlying details of how postgresql works. I guess I had assumed that the result of a HashAggregate or any other de-duplication process was a table-like structure.

Regarding being pleased with hash aggregate - I am! - except when it goes crazy and eats all of the memory in the machine. I'd trade a bit of performance loss for not using up all of the memory and crashing.

However, maybe I'm misunderstanding how SELECT DISTINCT works internally. In the case where a hashtable is used, does postgresql utilize a table-like structure or does it remain a hashtable in memory? If it's a hashtable, couldn't the hashtable be built on-the-go rather than only after all of the underlying tuples are available? I'd love a bit more explanation as to how this works.

Another example of where this might be useful: I'm currently running a SELECT DISTINCT query over some 500 million rows (120 contributory tables). I expect a de-duplicated row count of well under 10% of that 500 million, probably below 1%. The plan as it stands is to execute a series of sequential scans, appending each of the rows from each contributory table and then aggregating them. If the expected distinctness of each contributory subquery is, say, 5%, then instead of aggregating over 500 million tuples the aggregation would take place over 25 million. In this case, that is a savings of approximately 10 gigabytes. Yes, it's true, the same amount of data has to be scanned. However, the amount of data that needs to be stored (in memory or on disk) in order to provide a final de-duplication is much smaller.

-- Jon
[PERFORM] queries with lots of UNIONed relations
I was recently asked to look into why one particular set of queries was taking a long time. The queries are all of the same form. They select the UNION of a few columns from around 100 tables. The query in particular was taking some 7-8 minutes to run. On a whim, I changed the query from this form:

SELECT a, b FROM FOO_a WHERE conditions
UNION
SELECT a, b FROM FOO_b WHERE conditions
...

to:

SELECT DISTINCT a, b FROM FOO_a WHERE conditions
UNION
SELECT DISTINCT a, b FROM FOO_b WHERE conditions
...

and the query time dropped to under a minute.

In the former case, the query plan was a bitmap heap scan for each table. Then those results were Appended, Sorted, Uniqued, Sorted again, and then returned. In the latter, before Appending, each table's results were run through HashAggregate.

The total number of result rows is in the 500K range. Each table holds approximately 150K matching rows (but this can vary a bit).

What I'm asking is this: since adding DISTINCT to each participating member of the UNION query reduced the total number of appended rows, is there some sort of heuristic that postgresql could use to do this automatically? The 12x speedup was quite nice.

-- Jon
Re: [PERFORM] queries with lots of UNIONed relations
On Thu, Jan 13, 2011 at 11:13 AM, Tom Lane t...@sss.pgh.pa.us wrote:
Jon Nelson jnelson+pg...@jamponi.net writes:
In the former case, the query plan was a bitmap heap scan for each table. Then those results were Appended, Sorted, Uniqued, Sorted again, and then returned. In the latter, before Appending, each table's results were run through HashAggregate.

Probably the reason it did that is that each individual de-duplication looked like it would fit in work_mem, but a single de-duplication didn't. Consider raising work_mem, at least for this one query.

I raised work_mem to as high as 512MB (SET LOCAL work_mem = '512MB', within the transaction). Nice. Instead of 7-10 minutes the result is now about a minute (the same as with individual de-duplication). Your comment regarding "each individual de-duplication looked like it would fit in work_mem" doesn't really make sense to me, exactly. Maybe I'm misunderstanding you. What I'm asking is this: can postgresql apply a de-duplication to each member of a UNION (as I did with SELECT DISTINCT) in order to reduce the total number of rows that need to be de-duplicated when all of the rows have been Appended?

The results of the various plans/tweaks are:

Initial state (work_mem = 16MB, no DISTINCT, run time of 7-10 minutes):
Unique ( Sort ( Append ( lots of Bitmap Heap Scans here ) ) )

(work_mem = 16MB, with DISTINCT, run time of ~1 minute):
HashAggregate ( Append ( lots of HashAggregate ( Bitmap Heap Scan ) ) )

(work_mem = 64kB, DISTINCT, run time of *15+ minutes*):
Unique ( Sort ( Append ( lots of HashAggregate ( Bitmap Heap Scan ) ) ) )

So I take from this the following:

1. If the result set fits in work_mem, hash aggregate is wicked fast. About 1 jillion times faster than Unique+Sort.
2. It would be nifty if postgresql could be taught, in a UNION, to de-duplicate each contributory relation so as to reduce the total set of rows that need to be re-de-duplicated. It's extra work, true, and maybe there are some tricks here, but it seems to make a big difference. This is useful so that the total result set is small enough that hash aggregate might apply.

NOTE: I have to keep work_mem really low as a global setting on this machine because other queries involving the same tables (such as those that involve UNION ALL for SUM() or GROUP BY operations) cause the machine to run out of memory. Indeed, even with work_mem at 1MB I run the machine out of memory if I don't explicitly disable hashagg for some queries. Can anything be done about that?

-- Jon
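The per-query hashagg toggle mentioned in the NOTE can be scoped the same way as the work_mem override, to one transaction (hypothetical table names):

```sql
BEGIN;
SET LOCAL enable_hashagg = off;  -- force Sort + Unique/GroupAggregate for this query
SET LOCAL work_mem = '1MB';      -- keep the sort's memory footprint small
SELECT a, b FROM foo_a WHERE a > 0
UNION
SELECT a, b FROM foo_b WHERE a > 0;
COMMIT;
```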
Re: [PERFORM] queries with lots of UNIONed relations
On Thu, Jan 13, 2011 at 2:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
Jon Nelson jnelson+pg...@jamponi.net writes:
Your comment regarding "each individual de-duplication looked like it would fit in work_mem" doesn't really make sense, exactly. Maybe I'm misunderstanding you.

Yeah. What I was suggesting was to NOT add the DISTINCTs, but instead raise work_mem high enough so you get just one HashAggregation step at the top level. (Although I think this only works in 8.4 and up.) That should be faster than two levels of de-duplication.

Gave it a try -- performance either way doesn't seem to change, although the final set that has to undergo de-duplication is rather larger (WITHOUT DISTINCT) so I still run the risk of not getting Hash Aggregation. Since having the DISTINCT doesn't seem to hurt, and it avoids (potential) significant pain, I'll keep it.

I still think that having UNION do de-duplication of each contributory relation is a beneficial thing to consider -- especially if postgresql thinks the uniqueness is not very high.

Thanks!

-- Jon
Re: [PERFORM] queries with lots of UNIONed relations
On Thu, Jan 13, 2011 at 4:49 PM, Robert Haas robertmh...@gmail.com wrote:
On Thu, Jan 13, 2011 at 5:47 PM, Andy Colson a...@squeakycode.net wrote:

I don't believe there is any case where hashing each individual relation is a win compared to hashing them all together. If the optimizer were smart enough to be considering the situation as a whole, it would always do the latter.

You might be right, but I'm not sure. Suppose that there are 100 inheritance children, and each has 10,000 distinct values, but none of them are common between the tables. In that situation, de-duplicating each individual table requires a hash table that can hold 10,000 entries. But deduplicating everything at once requires a hash table that can hold 1,000,000 entries. Or am I all wet?

Yeah, I'm all wet, because you'd still have to re-de-duplicate at the end. But then why did the OP get a speedup? *scratches head*

Because it all fit in memory and didn't swap to disk?

Doesn't make sense. The re-de-duplication at the end should use the same amount of memory regardless of whether the individual relations have already been de-duplicated.

I don't believe that to be true. Assume 100 tables, each of which produces 10,000 rows from this query. Furthermore, let's assume that there are 3,000 duplicates per table.

Without DISTINCT: uniqify (100 * 10,000 = 1,000,000 rows)
With DISTINCT: uniqify (100 * (10,000 - 3,000) = 700,000 rows)

300,000 rows times (say) 64 bytes/row is about 18MB. Not a lot, but more than the work_mem of 16MB. Or maybe *I'm* all wet?

-- Jon
Re: [PERFORM] queries with lots of UNIONed relations
On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
Robert Haas robertmh...@gmail.com writes:
On Thu, Jan 13, 2011 at 5:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
I don't believe there is any case where hashing each individual relation is a win compared to hashing them all together. If the optimizer were smart enough to be considering the situation as a whole, it would always do the latter.

You might be right, but I'm not sure. Suppose that there are 100 inheritance children, and each has 10,000 distinct values, but none of them are common between the tables. In that situation, de-duplicating each individual table requires a hash table that can hold 10,000 entries. But deduplicating everything at once requires a hash table that can hold 1,000,000 entries. Or am I all wet?

If you have enough memory to de-dup them individually, you surely have enough to de-dup all at once. It is not possible for a single hashtable to have worse memory consumption than N hashtables followed by a union hashtable, and in fact if there are no common values then the latter eats twice as much space because every value appears twice in two different hashtables.

If everything were available up-front, sure. However, and please correct me if I'm wrong, but doesn't postgresql work in a fairly linear fashion, moving from table to table performing a series of operations on each? That seems to indicate that this is what the plan is. Compare:

for each table LOOP
  scan table for result rows, append to results
END LOOP
hash / sort + unique results

versus:

for each table LOOP
  scan table for result rows, append to table-results
  hash / sort + unique table-results, append to results
END LOOP
hash / sort + unique results

In the former case, all of the result rows from all tables are appended together before the de-duplication process can start. In the latter case, only enough memory for each table's result set is necessary for de-duplication, and it would only be necessary to allocate it for that table. Is that not how this works?

-- Jon
Re: [PERFORM] Update problem on large table
On Mon, Dec 6, 2010 at 1:46 PM, bricklen brick...@gmail.com wrote: On Sat, Dec 4, 2010 at 11:45 AM, felix crucialfe...@gmail.com wrote: Ok, I caught one : an update that is stuck in waiting. the first one blocks the second one. ns | 5902 | nssql | UPDATE fastadder_fastadderstatus SET built = false WHERE fastadder_fastadderstatus.service_id = 1 Not sure if anyone replied about killing your query, but you can do it like so: select pg_cancel_backend(5902); -- assuming 5902 is the pid of the query you want canceled. How does this differ from just killing the pid? -- Jon
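For anyone finding this in the archives: pg_cancel_backend() differs from a plain kill mainly in precision and convenience. It sends SIGINT to the backend, which cancels the current query while leaving the session connected, and it can be issued over an ordinary database connection without shell access to the server. A sketch:

```sql
-- Cancel the running query on backend 5902; the client stays connected.
-- Equivalent to running kill -INT 5902 from the server's shell.
SELECT pg_cancel_backend(5902);

-- End the whole session instead (sends SIGTERM); built in as of 8.4.
SELECT pg_terminate_backend(5902);
```

A kill -9 on the pid, by contrast, makes the postmaster assume shared memory may be corrupt and forces a restart of every backend, so it should be a last resort.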
Re: [PERFORM] Defaulting wal_sync_method to fdatasync on Linux for 9.1?
On Wed, Nov 17, 2010 at 3:24 PM, Greg Smith g...@2ndquadrant.com wrote: Scott Carey wrote: Did you recompile your test on the RHEL6 system? On both systems I showed, I checked out a fresh copy of the PostgreSQL 9.1 HEAD from the git repo, and compiled that on the server, to make sure I was pulling in the appropriate kernel headers. I wasn't aware of exactly how the kernel sync stuff was refactored though, thanks for the concise update on that. I can do similar tests on a RHEL5 system, but not on the same hardware. Can only make my laptop boot so many operating systems at a time usefully. One thing to note is that where on a disk things sit can make a /huge/ difference - depending on if Ubuntu is /here/ and RHEL is /there/ and so on can make a factor of 2 or more difference. The outside tracks of most modern SATA disks can do around 120MB/s. The inside tracks aren't even half of that. -- Jon -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] do temporary tables have hint bits?
I was doing some testing with temporary tables using this sql: begin; select pg_sleep(30); create temporary TABLE foo (a int, b int, c int, d text); insert into foo SELECT (x%1000) AS a,(x%1001) AS b, (x % 650) as c, '' as d FROM generate_series( 1, 100 ) AS x; -- create temporary TABLE foo AS SELECT (x%1000) AS a,(x%1001) AS b, (x % 650) as c, '' as d FROM generate_series( 1, 100 ) AS x; select count(1) from foo; While it was in pg_sleep, I would attach to the backend process with strace. I observed a few things that I don't yet understand, but one thing I did notice was an I/O pattern (following the count(1)) that seemed to suggest that the table was getting its hint bits set. I thought hint bits were just for the mvcc side of things? If this is a temporary table, is there any need or benefit to setting hint bits? -- Jon
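One way to check the hypothesis more directly than strace, assuming the contrib pageinspect module is installed and reading your own session's temp table is permitted (a sketch; HEAP_XMIN_COMMITTED is the 0x0100 bit of t_infomask):

```sql
-- After the count(1), inspect the first heap page of the temp table.
-- If xmin_committed_hint is true for the tuples, hint bits were set.
SELECT lp,
       (t_infomask & 256) <> 0 AS xmin_committed_hint  -- 256 = 0x0100
FROM heap_page_items(get_raw_page('foo', 0));
```

get_raw_page errors out on other sessions' temporary tables, so this only works from the session that owns foo.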
Re: [PERFORM] temporary tables, indexes, and query plans
On Sat, Nov 13, 2010 at 9:41 AM, Tom Lane t...@sss.pgh.pa.us wrote: Jon Nelson jnelson+pg...@jamponi.net writes: OK. This is a highly distilled example that shows the behavior. BEGIN; CREATE TEMPORARY TABLE foo AS SELECT x AS A, chr(x % 75 + 32) AS b, ''::text AS c from generate_series(1,500) AS x; UPDATE foo SET c = 'foo' WHERE b = 'A' ; CREATE INDEX foo_b_idx on foo (b); [ and the rest of the transaction can't use that index ] OK, this is an artifact of the HOT update optimization. Before creating the index, you did updates on the table that would have been executed differently if the index had existed. When the index does get created, its entries for those updates are incomplete, so the index can't be used in transactions that could in principle see the unmodified rows. Aha! When you indicated that HOT updates were part of the problem, I googled HOT updates for more detail and ran across this article: http://pgsql.tapoueh.org/site/html/misc/hot.html which was very useful in helping me to understand things. If I understand things correctly, after a tuple undergoes a HOT-style update, there is a chain from the original tuple to the updated tuple. If an index already exists on the relation (and involves the updated column), a *new entry* in the index is created. However, if an index does not already exist and one is created (which involves a column with tuples that underwent HOT update) then it seems as though the index doesn't see either version. Is that description inaccurate? What would the effect be of patching postgresql to allow indexes to see and follow the HOT chains during index creation? 
The reason I did the update before the index creation is that the initial update (in the actual version, not this test version) updates 2.8 million of some 7.5 million rows (or a bit under 40% of the entire table), and such a large update seems like it would have a deleterious effect on the index (although in either case the planner properly chooses a sequential scan for this update). You could avoid this effect either by creating the index before you do any updates on the table, or by not wrapping the entire process into a single transaction. I need the whole thing in a single transaction because I make /extensive/ use of temporary tables and many dozens of statements that need to either succeed or fail as one. Is this HOT update optimization interaction with indexes documented anywhere? It doesn't appear to be common knowledge as there are now 20 messages in this topic and this is the first mention of the HOT updates / index interaction. I would like to suggest that an update to the CREATE INDEX documentation might contain some caveats about creating indexes in transactions on relations that might have HOT updates. Again, I'd like to thank everybody for helping me to figure this out. It's not a huge burden to create the index before the updates, but understanding *why* it wasn't working (even if it violates the principle-of-least-surprise) helps quite a bit. -- Jon
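A minimal sketch of the workaround settled on above: create the index before the first UPDATE in the transaction, so that no HOT chain predates the index (same shape as the distilled example from this thread; the generate_series bound is illustrative):

```sql
BEGIN;
CREATE TEMPORARY TABLE foo AS
    SELECT x AS a, chr(x % 75 + 32) AS b, ''::text AS c
    FROM generate_series(1, 500000) AS x;
-- Index exists before any updates, so every subsequent UPDATE inserts
-- normal index entries and later statements in this same transaction
-- are free to use the index.
CREATE INDEX foo_b_idx ON foo (b);
UPDATE foo SET c = 'foo' WHERE b = 'A';
ANALYZE foo;
EXPLAIN UPDATE foo SET c = 'bar' WHERE b = 'C';  -- eligible to use foo_b_idx
COMMIT;
```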
Re: [PERFORM] postmaster consuming /lots/ of memory with hash aggregate. why?
On Thu, Nov 11, 2010 at 10:38 PM, Pavel Stehule pavel.steh...@gmail.com wrote: 2010/11/12 Jon Nelson jnelson+pg...@jamponi.net: On Thu, Nov 11, 2010 at 10:26 PM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello look on EXPLAIN ANALYZE command. Probably your statistics are out of date, and then the planner can be confused. EXPLAIN ANALYZE statement will show it. As I noted earlier, I did set statistics to 1000 and re-ran vacuum analyze and the plan did not change. this change can do nothing. this is default in config. did you use ALTER TABLE ALTER COLUMN SET STATISTICS = ... ? and ANALYZE No. To be clear: are you saying that changing the value for default_statistics_target, restarting postgresql, and re-running VACUUM ANALYZE does *not* change the statistics for columns created/populated *prior* to the sequence of operations, and that one /must/ use ALTER TABLE ALTER COLUMN SET STATISTICS ... and re-ANALYZE? That does not jibe with the documentation, which appears to suggest that setting a new default_statistics_target, restarting postgresql, and then re-ANALYZE'ing a table should be sufficient (provided the columns have not had a statistics target explicitly set). What other diagnostics can I provide? This still doesn't answer the 40000-row question, though. It seems absurd to me that the planner would give up and just use 40000 rows (0.02 percent of the actual result). there can be some not well supported operation, then planner use a some % from rows without statistic based estimation The strange thing is that the value 40000 keeps popping up in totally different contexts, with different tables, databases, etc... I tried digging through the code and the only thing I found was that numGroups was being set to 40000 but I couldn't see where. -- Jon
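To spell out the two mechanisms under discussion: default_statistics_target applies, at the next ANALYZE, to any column that has no explicit per-column target, while ALTER TABLE ... SET STATISTICS pins an explicit target for a single column. Either way, nothing changes until the table is re-ANALYZEd. A sketch against the tables from this thread:

```sql
-- Explicit per-column targets, then refresh the statistics:
ALTER TABLE d_2010_09_13 ALTER COLUMN i SET STATISTICS 1000;
ALTER TABLE d_2010_09_13 ALTER COLUMN n SET STATISTICS 1000;
ANALYZE d_2010_09_13;

-- See what the planner now believes about distinct values:
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'd_2010_09_13' AND attname IN ('i', 'n');
```

If n_distinct looks sane here but the aggregate estimate is still wrong, the problem is not the column statistics themselves but how the planner propagates them through the query.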
Re: [PERFORM] temporary tables, indexes, and query plans
On Thu, Oct 28, 2010 at 9:23 AM, Tom Lane t...@sss.pgh.pa.us wrote: Jon Nelson jnelson+pg...@jamponi.net writes: On Wed, Oct 27, 2010 at 5:36 PM, Tom Lane t...@sss.pgh.pa.us wrote: It thinks it's faster, or there is some reason why it *can't* use the index, like a datatype mismatch. You could tell which by trying set enable_seqscan = off to see if that will make it change to another plan; if so, the estimated costs of that plan versus the original seqscan would be valuable information. When I place the index creation and ANALYZE right after the bulk update, follow it with 'set enable_seqscan = false', the next query (also an UPDATE - should be about 7 rows) results in this plan:

Seq Scan on foo_table (cost=10000000000.00..10001004998.00 rows=24 width=236)

OK, so it thinks it can't use the index. (The cost=10000000000 bit is the effect of enable_seqscan = off: it's not possible to just never use seqscans, but we assign an artificially high cost to discourage the planner from selecting them if there's any other alternative.) So we're back to wondering why it can't use the index. I will say once more that we could probably figure this out quickly if you'd post an exact example instead of handwaving. OK. This is a highly distilled example that shows the behavior. The ANALYZE doesn't appear to change anything, nor the SET STATISTICS (followed by ANALYZE), nor disabling sequential scans. Re-writing the table with ALTER TABLE does, though. If the initial UPDATE (the one before the index creation) is commented out, then the subsequent updates don't use sequential scans.

\timing off
BEGIN;
CREATE TEMPORARY TABLE foo AS
    SELECT x AS A, chr(x % 75 + 32) AS b, ''::text AS c
    from generate_series(1,500) AS x;
UPDATE foo SET c = 'foo' WHERE b = 'A' ;
CREATE INDEX foo_b_idx on foo (b);
-- let's see what it looks like
EXPLAIN UPDATE foo SET c='bar' WHERE b = 'C';
-- does forcing a seqscan off help?
set enable_seqscan = false;
EXPLAIN UPDATE foo SET c='bar' WHERE b = 'C';
-- what about analyze?
ANALYZE VERBOSE foo;
EXPLAIN UPDATE foo SET c='bar' WHERE b = 'C';
-- what about statistics?
ALTER TABLE foo ALTER COLUMN b SET STATISTICS 1;
ANALYZE VERBOSE foo;
EXPLAIN UPDATE foo SET c='bar' WHERE b = 'C';
-- let's re-write the table
ALTER TABLE foo ALTER COLUMN a TYPE int;
EXPLAIN UPDATE foo SET c='bar' WHERE b = 'C';
ROLLBACK;

-- Jon
Re: [PERFORM] anti-join chosen even when slower than old plan
On Thu, Nov 11, 2010 at 1:41 PM, Tom Lane t...@sss.pgh.pa.us wrote: Kevin Grittner kevin.gritt...@wicourts.gov writes: Besides the fully-scanned object size relative to relation size costing adjustment idea, the only one which seemed to be likely to be useful for this sort of issue was the costing factors by user ID idea -- the interactive queries hitting the well-cached portion of the tables are run through a read-only user ID, while the weekly maintenance scripts (obviously) are not. With the settings I initially had assigned to the cluster the maintenance scripts would never have seen this issue; it was tuning to resolve end-user complaints of slowness in the interactive queries which set up the conditions for failure, and if I'd had per-user settings, I probably would have (and definitely *should* have) used them. Erm ... you can in fact do ALTER USER SET random_page_cost today. As long as the settings are GUC parameters we have quite a lot of flexibility about how to control them. This gets back to my earlier point that our current form of per-relation properties (reloptions) is considerably less flexible than a GUC. I think that if we create any strong planner dependence on such properties, we're going to end up needing to be able to set them in all the same ways you can set a GUC. In Kevin's particular case, would this mechanism not help? By that I mean he could have two users: one user for the daily, the tables-ought-to-be-in-hot-cache use case. The other user could make use of the ALTER USER SET ... mechanism to drive the weekly reporting (tables are probably not hot) use case. -- Jon
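Sketching Tom's suggestion for Kevin's two-workload case (role names and cost values invented for illustration):

```sql
-- Interactive traffic hits the hot, well-cached portion of the tables:
-- make random access look cheap for this role.
CREATE ROLE readonly_web LOGIN;
ALTER ROLE readonly_web SET random_page_cost = 2.0;

-- Weekly maintenance scans cold data: keep the disk-oriented default.
CREATE ROLE maintenance LOGIN;
ALTER ROLE maintenance SET random_page_cost = 4.0;
```

Each role picks up its settings at session start, so the very same queries get costed differently depending on who runs them, which is exactly the split Kevin describes.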
Re: [PERFORM] postmaster consuming /lots/ of memory with hash aggregate. why?
On Fri, Nov 5, 2010 at 7:26 PM, Jon Nelson jnelson+pg...@jamponi.net wrote: I have a postgres 8.4.5 instance on CentOS 5 (x86_64) which appears to go crazy with the amount of memory it consumes. When I run the query below, in a matter of a few seconds memory balloons to 5.3G (virtual), 4.6G (resident) and 1840 (shared), and eventually the oom killer is invoked, killing the entire process. Physical memory is 8GB but other processes on the box consume approximately 4GB of that. The settings changed from their defaults:

effective_cache_size = 4GB
work_mem = 16MB
maintenance_work_mem = 128MB
wal_buffers = 16MB
checkpoint_segments = 16
shared_buffers = 384MB
checkpoint_segments = 64

and default_statistics_target = 100. The query is this:

insert into d_2010_09_13_sum
select FOO.i, FOO.n, sum(FOO.cc) as cc, sum(FOO.oc) as oc
from ( select * from d_2010_09_12_sum union all select * from d_2010_09_13 ) AS FOO
group by i, n;

here is the explain:

Subquery Scan *SELECT* (cost=1200132.06..1201332.06 rows=40000 width=80)
  -> HashAggregate (cost=1200132.06..1200732.06 rows=40000 width=41)
       -> Append (cost=0.00..786531.53 rows=41360053 width=41)
            -> Seq Scan on d_2010_09_12_sum (cost=0.00..520066.48 rows=27272648 width=42)
            -> Seq Scan on d_2010_09_13 (cost=0.00..266465.05 rows=14087405 width=40)

Both source tables freshly vacuum analyze'd. The row estimates are correct for both source tables.
If I use set enable_hashagg = false I get this plan:

Subquery Scan *SELECT* (cost=8563632.73..9081838.25 rows=40000 width=80)
  -> GroupAggregate (cost=8563632.73..9081238.25 rows=40000 width=41)
       -> Sort (cost=8563632.73..8667033.84 rows=41360441 width=41)
            Sort Key: d_2010_09_12_sum.i, d_2010_09_12_sum.n
            -> Result (cost=0.00..786535.41 rows=41360441 width=41)
                 -> Append (cost=0.00..786535.41 rows=41360441 width=41)
                      -> Seq Scan on d_2010_09_12_sum (cost=0.00..520062.04 rows=27272204 width=42)
                      -> Seq Scan on d_2010_09_13 (cost=0.00..266473.37 rows=14088237 width=40)

and postmaster's memory never exceeds (roughly) 548M (virtual), 27M (resident), 5M (shared). I even set default_statistics_target to 1000 and re-ran vacuum analyze verbose on both tables - no change. If I set work_mem to 1MB (from 16MB) then the GroupAggregate variation is chosen instead. Experimentally, HashAggregate is chosen when work_mem is 16MB, 8MB, 6MB, 5MB but not 4MB and on down. Two things I don't understand: 1. Why, when hash aggregation is allowed, does memory absolutely explode (eventually invoking the wrath of the oom killer)? 16MB for work_mem does not seem outrageously high. For that matter, neither does 5MB. 2. Why do both HashAggregate and GroupAggregate say the cost estimate is 40000 rows? Unfortunately, I've found that as my database size grows, I've generally had to disable hash aggregates for fear of even simple-seeming queries running out of memory, even with work_mem = 1MB. In some cases I saw memory usage (with hashagg) grow to well over 5GB and with group aggregate it barely moves. Am *I* doing something wrong? Some of these queries are on partitioned tables (typically querying the parent) and the resulting UNION or UNION ALL really starts to hurt, and when the server runs out of memory and kills off the postmaster process a few minutes or even hours into the query it doesn't make anybody very happy.
Is there some setting I can turn on to look to see when memory is being allocated (and, apparently, not deallocated)? The latest query has a HashAggregate that looks like this:

HashAggregate (cost=19950525.30..19951025.30 rows=40000 width=37)

but there are, in reality, approximately 200 million rows (when I run the query with GroupAggregate, that's what I get). Why does it keep choosing 40,000 rows? I suppose I could use the newly-learned ALTER USER trick to disable hash aggregation for the primary user, because disabling hash aggregation system-wide sounds fairly drastic. However, if I *don't* disable it, the query quickly balloons memory usage to the point where the process is killed off. -- Jon
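For the archives, the "ALTER USER trick" referred to above would look something like this (the user name is invented); settings applied this way take effect at the start of each new session for that role, which scopes the change far more narrowly than editing postgresql.conf:

```sql
-- Disable hash aggregation only for the primary application user:
ALTER USER batch_loader SET enable_hashagg = off;

-- Or, narrower still, disable it around a single problem query:
SET enable_hashagg = off;
-- ... run the aggregate ...
RESET enable_hashagg;
```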
Re: [PERFORM] postmaster consuming /lots/ of memory with hash aggregate. why?
On Thu, Nov 11, 2010 at 10:26 PM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello look on EXPLAIN ANALYZE command. Probably your statistics are out of date, and then the planner can be confused. EXPLAIN ANALYZE statement will show it. As I noted earlier, I did set statistics to 1000 and re-ran vacuum analyze and the plan did not change. What other diagnostics can I provide? This still doesn't answer the 40000-row question, though. It seems absurd to me that the planner would give up and just use 40000 rows (0.02 percent of the actual result). -- Jon
Re: [PERFORM] postmaster consuming /lots/ of memory with hash aggregate. why?
I also found this. Perhaps it is related? http://postgresql.1045698.n5.nabble.com/Hash-Aggregate-plan-picked-for-very-large-table-out-of-memory-td1883299.html -- Jon
Re: [PERFORM] Running PostgreSQL as fast as possible no matter the consequences
On Fri, Nov 5, 2010 at 7:08 AM, Guillaume Cottenceau g...@mnc.ch wrote: Marti Raudsepp marti 'at' juffo.org writes: On Fri, Nov 5, 2010 at 13:32, A B gentosa...@gmail.com wrote: I was just thinking about the case where I will have almost 100% selects, but still needs something better than a plain key-value storage so I can do some sql queries. The server will just boot, load data, run, hopefully not crash but if it would, just start over with load and run. If you want fast read queries then changing fsync/full_page_writes/synchronous_commit won't help you. That illustrates how knowing the reasoning of this particular requests makes new suggestions worthwhile, while previous ones are now seen as useless. I disagree that they are useless - the stated mechanism was start, load data, and run. Changing the params above won't likely change much in the 'run' stage but would they help in the 'load' stage? -- Jon
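Concretely, the settings in question mostly buy speed during the 'load' stage, since bulk loading is commit- and WAL-bound rather than read-bound. A hedged postgresql.conf sketch for a throwaway database that is rebuilt from scratch after any crash (values illustrative; never use these where the data matters):

```
fsync = off                 # no waiting for WAL to reach disk; unsafe on crash
synchronous_commit = off    # commits return before the WAL flush
full_page_writes = off      # tolerable only because we reload after a crash
checkpoint_segments = 64    # fewer, larger checkpoints during the bulk load
```

None of these make SELECTs faster once the data is loaded, which is Marti's point; they only shorten the boot-and-load window.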
[PERFORM] postmaster consuming /lots/ of memory with hash aggregate. why?
I have a postgres 8.4.5 instance on CentOS 5 (x86_64) which appears to go crazy with the amount of memory it consumes. When I run the query below, in a matter of a few seconds memory balloons to 5.3G (virtual), 4.6G (resident) and 1840 (shared), and eventually the oom killer is invoked, killing the entire process. Physical memory is 8GB but other processes on the box consume approximately 4GB of that. The settings changed from their defaults:

effective_cache_size = 4GB
work_mem = 16MB
maintenance_work_mem = 128MB
wal_buffers = 16MB
checkpoint_segments = 16
shared_buffers = 384MB
checkpoint_segments = 64

and default_statistics_target = 100. The query is this:

insert into d_2010_09_13_sum
select FOO.i, FOO.n, sum(FOO.cc) as cc, sum(FOO.oc) as oc
from ( select * from d_2010_09_12_sum union all select * from d_2010_09_13 ) AS FOO
group by i, n;

here is the explain:

Subquery Scan *SELECT* (cost=1200132.06..1201332.06 rows=40000 width=80)
  -> HashAggregate (cost=1200132.06..1200732.06 rows=40000 width=41)
       -> Append (cost=0.00..786531.53 rows=41360053 width=41)
            -> Seq Scan on d_2010_09_12_sum (cost=0.00..520066.48 rows=27272648 width=42)
            -> Seq Scan on d_2010_09_13 (cost=0.00..266465.05 rows=14087405 width=40)

Both source tables freshly vacuum analyze'd. The row estimates are correct for both source tables. If I use set enable_hashagg = false I get this plan:

Subquery Scan *SELECT* (cost=8563632.73..9081838.25 rows=40000 width=80)
  -> GroupAggregate (cost=8563632.73..9081238.25 rows=40000 width=41)
       -> Sort (cost=8563632.73..8667033.84 rows=41360441 width=41)
            Sort Key: d_2010_09_12_sum.i, d_2010_09_12_sum.n
            -> Result (cost=0.00..786535.41 rows=41360441 width=41)
                 -> Append (cost=0.00..786535.41 rows=41360441 width=41)
                      -> Seq Scan on d_2010_09_12_sum (cost=0.00..520062.04 rows=27272204 width=42)
                      -> Seq Scan on d_2010_09_13 (cost=0.00..266473.37 rows=14088237 width=40)

and postmaster's memory never exceeds (roughly) 548M (virtual), 27M (resident), 5M (shared).
I even set default_statistics_target to 1000 and re-ran vacuum analyze verbose on both tables - no change. If I set work_mem to 1MB (from 16MB) then the GroupAggregate variation is chosen instead. Experimentally, HashAggregate is chosen when work_mem is 16MB, 8MB, 6MB, 5MB but not 4MB and on down. Two things I don't understand: 1. Why, when hash aggregation is allowed, does memory absolutely explode (eventually invoking the wrath of the oom killer)? 16MB for work_mem does not seem outrageously high. For that matter, neither does 5MB. 2. Why do both HashAggregate and GroupAggregate say the cost estimate is 40000 rows? -- Jon
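One plausible explanation for the stubborn group-count estimate, consistent with the plans above: when the planner cannot see distinct-value statistics for the grouping columns (as through this UNION ALL subquery), it falls back to a default of 200 distinct values per column (DEFAULT_NUM_DISTINCT in the source), and GROUP BY i, n multiplies the two defaults: 200 * 200 = 40000 groups, which is indeed about 0.02 percent of ~200 million actual rows. A way to watch the fallback happen on a freshly created, never-analyzed table (names invented):

```sql
-- No ANALYZE has run yet, so neither column has n_distinct statistics
-- and the planner guesses 200 distinct values for each:
CREATE TABLE g AS
    SELECT x % 1000 AS i, x % 997 AS n
    FROM generate_series(1, 1000000) AS x;
EXPLAIN SELECT i, n, sum(1) FROM g GROUP BY i, n;
-- the aggregate's row estimate should come out near 200 * 200 = 40000
-- (run ANALYZE g; and EXPLAIN again to see the estimate snap to reality)
```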
[PERFORM] typoed column name, but postgres didn't grump
I've been having trouble with a query. The query is a cross join between two tables. Initially, I mis-typed the query, and one of the columns specified in the query doesn't exist, however the query ran nonetheless. The actual query: select gid from t2, t3 where t2.name = t3.name and t3.scope = 'city' and t3.hierarchy = 'STANDARD' and t2.adiv = t3.adiv limit 1 ; However, there *is* no column 'name' in table 't2'. When I ran the query, it took a *really* long time to run (670 seconds). When I corrected the query to use the right column name (city_name), the query ran in 28ms. The question, then, is why didn't postgres grump about the non-existent column name? The version is 8.4.5 on x86_64, openSUSE 11.3 PostgreSQL 8.4.5 on x86_64-unknown-linux-gnu, compiled by GCC gcc (SUSE Linux) 4.5.0 20100604 [gcc-4_5-branch revision 160292], 64-bit -- Jon
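For readers hitting this in the archives, a hedged note (the answer is not quoted in this excerpt): PostgreSQL accepts x.f and f(x) as interchangeable notations, and treats a function-style call that names a type as a cast request. So t2.name, with no name column on t2, can be silently resolved as a cast of the entire t2 row to the type name, producing a value that is then compared against t3.name. A sketch of the gotcha as it behaved on servers of that era:

```sql
-- These notations are treated as equivalent:
--   t2.name   <=>   name(t2)   <=>   casting the whole t2 row to type "name"
-- With no "name" column, the cast interpretation wins instead of raising
-- an error, and the join condition compares a stringified row against
-- t3.name -- which also explains the 670-second near-cross-join runtime.
SELECT t2.name FROM t2 LIMIT 1;
```

Later releases tightened this up, but on 8.4 it was an easy trap to fall into with any column whose name matches a type (name, text, date, ...).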
Re: [PERFORM] temporary tables, indexes, and query plans
On Wed, Oct 27, 2010 at 5:36 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jon Nelson jnelson+pg...@jamponi.net writes: I'd like to zoom out a little bit and, instead of focusing on the specifics, ask more general questions: .. - is there some way for me to determine /why/ the planner chooses a sequential scan over other options? It thinks it's faster, or there is some reason why it *can't* use the index, like a datatype mismatch. You could tell which by trying set enable_seqscan = off to see if that will make it change to another plan; if so, the estimated costs of that plan versus the original seqscan would be valuable information. When I place the index creation and ANALYZE right after the bulk update, follow it with 'set enable_seqscan = false', the next query (also an UPDATE - should be about 7 rows) results in this plan:

Seq Scan on foo_table (cost=10000000000.00..10001004998.00 rows=24 width=236)

The subsequent queries all have the same first-row cost and similar last-row costs, and of course the rows value varies some as well. All of them, even the queries which update exactly 1 row, have similar cost:

Seq Scan on foo_table (cost=10000000000.00..10001289981.17 rows=1 width=158)

I cranked the logging up a bit, but I don't really know what to fiddle there, and while I got a lot of output, I didn't see much in the way of cost comparisons. -- Jon
Re: [PERFORM] temporary tables, indexes, and query plans
On Wed, Oct 27, 2010 at 12:44 PM, Mladen Gogala mladen.gog...@vmsinfo.com wrote: On 10/27/2010 1:29 PM, Jon Nelson wrote: How big is your default statistics target? The default is rather small, it doesn't produce very good or usable histograms. Currently, default_statistics_target is 50. I note that if I create the indexes earlier in the process (before the copy) then they are used. I'm now trying to create them after the first UPDATE (which updates 2.8 million of the 10 million rows). The subsequent UPDATE statements update very few (3-4 thousand for 2 of them, less than a few dozen for the others) and the ones that use the index only update *1* row. I'll also try setting a higher default_statistics_target and let you know! -- Jon
Re: [PERFORM] temporary tables, indexes, and query plans
On Wed, Oct 27, 2010 at 12:59 PM, Jon Nelson jnelson+pg...@jamponi.net wrote: On Wed, Oct 27, 2010 at 12:44 PM, Mladen Gogala mladen.gog...@vmsinfo.com wrote: On 10/27/2010 1:29 PM, Jon Nelson wrote: How big is your default statistics target? The default is rather small, it doesn't produce very good or usable histograms. Currently, default_statistics_target is 50. I set it to 500 and restarted postgres. No change in (most of) the query plans! The update statement that updates 7 rows? No change. The one that updates 242 rows? No change. 3714? No change. I killed the software before it got to the 1-row-only statements. I'm not trying creating them after the first UPDATE (which updates 2.8million of the 10million rows). I mean to say that I (tried) creating the indexes after the first UPDATE statement. This did not improve things. I am now trying to see how creating the indexes between the COPY and the first UPDATE performs. I didn't really want to do this because I know that the first UPDATE statement touches about 1/3 of the table, and this would bloat the index and slow the UPDATE (which should be a full table scan anyway). It's every subsequent UPDATE that touches (at most) 4000 rows (out of 10 million) that I'm interested in. -- Jon
Re: [PERFORM] temporary tables, indexes, and query plans
On Wed, Oct 27, 2010 at 1:32 PM, Reid Thompson reid.thomp...@ateb.com wrote: On Wed, 2010-10-27 at 13:23 -0500, Jon Nelson wrote: set it to 500 and restarted postgres. did you re-analyze? Not recently. I tried that, initially, and there was no improvement. I'll try it again now that I've set the stats to 500. The most recent experiment shows me that, unless I create whatever indexes I would like to see used *before* the large (first) update, then they just don't get used. At all. Why would I need to ANALYZE the table immediately following index creation? Isn't that part of the index creation process? Currently executing is a test where I place an ANALYZE foo after the COPY, first UPDATE, and first index, but before the other (much smaller) updates. .. Nope. The ANALYZE made no difference. This is what I just ran:

BEGIN;
CREATE TEMPORARY TABLE foo
COPY ...
UPDATE ...   -- 1/3 of table, approx
CREATE INDEX foo_rowB_idx on foo (rowB);
ANALYZE ...
-- queries from here to 'killed' use WHERE rowB = 'someval'
UPDATE ...   -- 7 rows. seq scan!
UPDATE ...   -- 242 rows, seq scan!
UPDATE ..    -- 3700 rows, seq scan!
UPDATE ..    -- 3100 rows, seq scan!
killed.

-- Jon
Re: [PERFORM] temporary tables, indexes, and query plans
On Wed, Oct 27, 2010 at 1:52 PM, Jon Nelson jnelson+pg...@jamponi.net wrote: On Wed, Oct 27, 2010 at 1:32 PM, Reid Thompson reid.thomp...@ateb.com wrote: On Wed, 2010-10-27 at 13:23 -0500, Jon Nelson wrote: set it to 500 and restarted postgres. did you re-analyze? Not recently. I tried that, initially, and there was no improvement. I'll try it again now that I've set the stats to 500. The most recent experiment shows me that, unless I create whatever indexes I would like to see used *before* the large (first) update, then they just don't get used. At all. Why would I need to ANALYZE the table immediately following index creation? Isn't that part of the index creation process? Currently executing is a test where I place an ANALYZE foo after the COPY, first UPDATE, and first index, but before the other (much smaller) updates. .. Nope. The ANALYZE made no difference. This is what I just ran:

BEGIN;
CREATE TEMPORARY TABLE foo
COPY ...
UPDATE ...   -- 1/3 of table, approx
CREATE INDEX foo_rowB_idx on foo (rowB);
ANALYZE ...
-- queries from here to 'killed' use WHERE rowB = 'someval'
UPDATE ...   -- 7 rows. seq scan!
UPDATE ...   -- 242 rows, seq scan!
UPDATE ..    -- 3700 rows, seq scan!
UPDATE ..    -- 3100 rows, seq scan!
killed.

Even generating the index beforehand (sans ANALYZE) was no help. If I generate *all* of the indexes ahead of time, before the COPY, that's the only time index usage jibes with my expectations. Here is an example of the output from auto_explain (NOTE: the WHERE clause in this statement specifies a single value in the same column that has a UNIQUE index on it):

Seq Scan on foo_table (cost=0.00..289897.04 rows=37589 width=486)

and yet the actual row count is exactly 1.
If I change the order so that the index creation *and* analyze happen *before* the first (large) update, then things appear to proceed normally and the indexes are used when expected, although in some cases the stats are still way off: Bitmap Heap Scan on foo_table (cost=40.96..7420.39 rows=1999 width=158) and yet there are only 7 rows that match. The others seem closer (only off by 2x rather than 250x). It seems as though creating an index is not enough. It seems as though ANALYZE after index creation is not enough, either. I am theorizing that I have to touch (or just scan?) some percentage of the table in order for the index to be used? If that's true, then what is ANALYZE for? I've got the stats cranked up to 500. Should I try 1000? Jason Pitts: RE: changing default_statistics_target (or via ALTER TABLE SET STATS) not taking effect until ANALYZE is performed. I did already know that, but it's probably good to put into this thread. However, you'll note that this is a temporary table created at the beginning of a transaction. -- Jon
Re: [PERFORM] temporary tables, indexes, and query plans
On Wed, Oct 27, 2010 at 2:43 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jon Nelson jnelson+pg...@jamponi.net writes: The most recent experiment shows me that, unless I create whatever indexes I would like to see used *before* the large (first) update, then they just don't get used. At all. You're making a whole lot of assertions here that don't square with usual experience. I think there is some detail about what you're doing that affects the outcome, but since you haven't shown a concrete example, it's pretty hard to guess what the critical detail is. First, let me supply all of the changed (from the default) params:

default_statistics_target = 500
maintenance_work_mem = 240MB
work_mem = 256MB
effective_cache_size = 1GB
checkpoint_segments = 128
shared_buffers = 1GB
max_connections = 30
wal_buffers = 64MB
shared_preload_libraries = 'auto_explain'

The machine is a laptop with 4GB of RAM running my desktop. Kernel is 2.6.36, filesystem is ext4 (for data) and ext2 (for WAL logs). The disk is a really real disk, not an SSD. The sequence goes exactly like this:

BEGIN;
CREATE TEMPORARY TABLE (20 columns, mostly text, a few int).
COPY (approx 8 million rows, ~900 MB)[1]
UPDATE (2.8 million of the rows)
UPDATE (7 rows)
UPDATE (250 rows)
UPDATE (3500 rows)
UPDATE (3100 rows)
a bunch of UPDATE (1 row)
...

Experimentally, I noticed that performance was not especially great. So, I added some indexes (three indexes on one column each). One index is UNIQUE. The first UPDATE can't use any of the indexes. The rest should be able to. In my experiments, I found that: If I place the index creation *before* the copy, the indexes are used. If I place the index creation *after* the copy but before first UPDATE, the indexes are used. If I place the index creation at any point after the first UPDATE, regardless of whether ANALYZE is run, the indexes are not used (at least, according to auto_explain). Does that help? [1] I've been saying 10 million. It's really more like 8 million.
-- Jon -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] temporary tables, indexes, and query plans
On Wed, Oct 27, 2010 at 4:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jon Nelson jnelson+pg...@jamponi.net writes: The sequence goes exactly like this: BEGIN; CREATE TEMPORARY TABLE (20 columns, mostly text, a few int). COPY (approx 8 million rows, ~900 MB)[1] UPDATE (2.8 million of the rows) UPDATE (7 rows) UPDATE (250 rows) UPDATE (3500 rows) UPDATE (3100 rows) a bunch of UPDATE (1 row) ... Experimentally, I noticed that performance was not especially great. So, I added some indexes (three indexes on one column each). One index is UNIQUE. The first UPDATE can't use any of the indexes. The rest should be able to. Please ... there is *nothing* exact about that. It's not even clear what the datatypes of the indexed columns are, let alone what their statistics are, or whether there's something specific about how you're declaring the table or the indexes.

The indexed data types are:
- an INT (this is a unique ID, and it is declared so)
- two TEXT fields. The initial value of one of the text fields is NULL, and it is updated to be not longer than 10 characters. The other text field is not more than 4 characters long; my guesstimate is that it holds no more than two dozen distinct values.

I am not doing anything when I define the table except using TEMPORARY. The indexes are as bog-standard as one can get: no WHERE clause, no functions, nothing special at all.

I'd like to zoom out a little bit and, instead of focusing on the specifics, ask more general questions:
- does the table being temporary affect anything? Another lister emailed me and wondered if ANALYZE on a temporary table might behave differently.
- is there some way for me to determine /why/ the planner chooses a sequential scan over other options? I'm already using auto_explain.
- in the general case, are indexes ready to use immediately after creation, or is an ANALYZE step necessary?
- do hint bits come into play here at all?
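[On the third question, speaking generally rather than for PostgreSQL specifically: an index is usable the moment CREATE INDEX returns; ANALYZE only refreshes the statistics the planner uses to cost it. That split is easy to see in miniature with SQLite via Python's sqlite3 — a different engine, so only the general shape carries over:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, "x") for i in range(1000)])

# The index is complete and usable the moment CREATE INDEX returns...
conn.execute("CREATE INDEX t_a_idx ON t (a)")
rows = conn.execute("SELECT count(*) FROM t WHERE a = 5").fetchone()[0]
print(rows)  # 1

# ...but the optimizer-statistics table does not even exist yet.
before = conn.execute(
    "SELECT count(*) FROM sqlite_master WHERE name = 'sqlite_stat1'"
).fetchone()[0]
print(before)  # 0

conn.execute("ANALYZE")  # populates sqlite_stat1, loosely analogous to ANALYZE in PG
after = conn.execute("SELECT count(*) FROM sqlite_stat1").fetchone()[0]
print(after >= 1)  # True
```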
-- Jon
Re: [PERFORM] Postgres insert performance and storage requirement compared to Oracle
On Tue, Oct 26, 2010 at 4:02 PM, Mladen Gogala mladen.gog...@vmsinfo.com wrote: On 10/26/2010 11:41 AM, Merlin Moncure wrote: yup, that's exactly what I mean -- this will give you more uniform insert performance (your temp table doesn't even need indexes). Every N records you send to permanent and truncate the temp table. Obviously, this is a more fragile approach, so weigh the pros/cons carefully. merlin Truncate temporary table? What horrible advice! All that you need is for the temporary table to delete rows on commit.

I believe Merlin was suggesting that, after doing N inserts into the temporary table, something like this might work better:

start loop:
    populate rows in temporary table
    insert from temporary table into permanent table
    truncate temporary table
end loop

I do something similar, where I COPY data to a temporary table, do lots of manipulations, and then perform a series of INSERTs from the temporary table into a permanent table.

-- Jon
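[The batch-and-truncate loop described above can be sketched like so, here against SQLite via Python's sqlite3 purely so the example is self-contained; the table names and batch size are invented, and in PostgreSQL the flush would use INSERT ... SELECT followed by TRUNCATE, which is much cheaper than DELETE:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permanent (id INTEGER, payload TEXT)")
# SQLite spells it CREATE TEMP TABLE; PostgreSQL uses CREATE TEMPORARY TABLE.
conn.execute("CREATE TEMP TABLE staging (id INTEGER, payload TEXT)")

BATCH_SIZE = 1000  # illustrative; tune for your workload

def flush(conn):
    # Move the accumulated batch into the permanent table, then empty
    # the staging table.  (PostgreSQL would TRUNCATE here; SQLite has
    # no TRUNCATE statement, so DELETE stands in.)
    conn.execute("INSERT INTO permanent SELECT * FROM staging")
    conn.execute("DELETE FROM staging")
    conn.commit()

pending = 0
for i in range(2500):
    conn.execute("INSERT INTO staging VALUES (?, ?)", (i, "row-%d" % i))
    pending += 1
    if pending >= BATCH_SIZE:
        flush(conn)
        pending = 0
if pending:
    flush(conn)

total = conn.execute("SELECT count(*) FROM permanent").fetchone()[0]
print(total)  # 2500
```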
Re: [PERFORM] odd postgresql performance (excessive lseek)
No replies? This is another situation where using pread would have saved a lot of time and sped things up a bit; failing that, keeping track of the file position ourselves and only lseek'ing when necessary would also help. PostgreSQL was spending 37% of its time in redundant lseek!

-- Jon
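[The pread-versus-lseek+read distinction being drawn here is visible from Python's os module — a toy illustration of the two idioms, not PostgreSQL's actual I/O path:]

```python
import os
import tempfile

# Create a scratch file with known contents.
fd, path = tempfile.mkstemp()
os.write(fd, b"0123456789" * 100)

# Idiom 1: two system calls -- position the file offset, then read.
os.lseek(fd, 500, os.SEEK_SET)
chunk_a = os.read(fd, 10)

# Idiom 2: one system call -- pread reads at an absolute offset and
# leaves the file position untouched.
chunk_b = os.pread(fd, 10, 500)

os.close(fd)
os.remove(path)
print(chunk_a == chunk_b)  # True: same bytes, half the syscalls
```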
Re: [PERFORM] odd postgresql performance (excessive lseek)
On Tue, Oct 19, 2010 at 8:25 AM, Merlin Moncure mmonc...@gmail.com wrote: On Tue, Oct 19, 2010 at 9:10 AM, Jon Nelson jnelson+pg...@jamponi.net wrote: No replies? This is another situation where using pread would have saved a lot of time and sped things up a bit, but failing that, keeping track of the file position ourselves and only lseek'ing when necessary would also help. PostgreSQL was spending 37% of its time in redundant lseek! 37% of cpu time? Is that according to strace -T? How did you measure it?

Per the original post, it (redundant lseek system calls) accounted for 37% of the time spent in the kernel, measured with: strace -f -p pid -c

-- Jon
Re: [PERFORM] odd postgresql performance (excessive lseek)
On Tue, Oct 19, 2010 at 9:36 AM, Tom Lane t...@sss.pgh.pa.us wrote: Jon Nelson jnelson+pg...@jamponi.net writes: This is another situation where using pread would have saved a lot of time and sped things up a bit, but failing that, keeping track of the file position ourselves and only lseek'ing when necessary would also help. No, it wouldn't; you don't have the slightest idea what's going on there. Those lseeks are for the purpose of detecting the current EOF location, ie, finding out whether some other backend has extended the file recently. We could get rid of them, but only at the cost of putting in some other communication mechanism instead.

That's a little harsh (it's not untrue, though). It's true I don't know how postgresql works WRT how it manages files, but now I've been educated (some). I'm guessing, then, that because each backend may extend files without the other backends knowing of it, using fallocate or some-such is also likely a non-starter. I ask because, especially when allocating files 8KB at a time, file fragmentation on a busy system is potentially high. I recently saw an ext3 filesystem (dedicated to postgresql) with 38% file fragmentation and, yes, it does make a huge performance difference in some cases. After manually defragmenting some files (with pg offline) I saw a read speed increase from single-MB-per-second to high-double-digit-MB-per-second. However, after asking pg to rewrite some of the worst files (by way of CLUSTER or ALTER TABLE) I saw no improvement - I'm guessing due to the 8KB-at-a-time allocation mechanism. Has any work been done on making use of shared memory for file stats, or on using fallocate (or posix_fallocate) to allocate files in larger chunks?

-- Jon
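[For what it's worth, posix_fallocate is reachable from Python too, which makes the idea of preallocating in larger chunks easy to sketch; os.posix_fallocate exists only on some platforms, hence the fallback, and the 1 MiB figure is purely illustrative:]

```python
import os
import tempfile

CHUNK = 1 << 20  # extend in 1 MiB steps instead of 8 kB steps (illustrative)

fd, path = tempfile.mkstemp()
try:
    if hasattr(os, "posix_fallocate"):
        # Ask the filesystem to actually allocate CHUNK bytes up front,
        # giving it a chance to pick contiguous blocks.
        os.posix_fallocate(fd, 0, CHUNK)
    else:
        # Portable fallback: ftruncate only sets the size; it typically
        # creates a sparse file rather than allocating real blocks.
        os.ftruncate(fd, CHUNK)
    size = os.fstat(fd).st_size
finally:
    os.close(fd)
    os.remove(path)

print(size)  # 1048576
```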
[PERFORM] unexpected query failure: ERROR: GIN indexes do not support whole-index scans
I have a table with an array column. I added a GIN index to the array:

CREATE INDEX foo_idx ON t USING GIN (alternatecodes) WHERE alternatecodes IS NOT NULL;

That's all well and good. However, some queries started failing and I was able to reproduce the behavior in psql!

SELECT * FROM t WHERE alternatecodes IS NOT NULL;

returns:

ERROR: GIN indexes do not support whole-index scans

Whaaa? Adding an *index* makes my /queries/ stop working? How can this be? This really violated my principle of least surprise. If GIN indexes don't support whole-index scans, fine, don't use them, but don't make a perfectly valid query fail because of it. This seems like a bug. Is it?

PostgreSQL version: PostgreSQL 8.4.5 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit

-- Jon
Re: [PERFORM] unexpected query failure: ERROR: GIN indexes do not support whole-index scans
On Mon, Oct 18, 2010 at 6:01 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jon Nelson jnelson+pg...@jamponi.net writes: CREATE INDEX foo_idx ON t USING GIN (alternatecodes) WHERE alternatecodes IS NOT NULL; SELECT * FROM t WHERE alternatecodes IS NOT NULL; ERROR: GIN indexes do not support whole-index scans Yep, this is a known issue. It's going to take major surgery on GIN to fix it, so don't hold your breath. In the particular case, what good do you think the WHERE clause is doing anyway? GIN won't index nulls at all ... which indeed is an aspect of the underlying issue --- see recent discussions, eg here: http://archives.postgresql.org/pgsql-hackers/2010-10/msg00521.php

OK, so GIN doesn't index NULLs. I guess the IS NOT NULL part comes about as a habit - that particular column is fairly sparse. However, I'm honestly quite surprised at two things:

1. if GIN indexes ignore NULLs, then either it should grump when one specifies WHERE ... IS NOT NULL, or that clause should be treated as a no-op

2. (and this is by far the more surprising) that the /presence/ of an index can *break* a SELECT. It's not that the engine ignores the index - that would be reasonable - but that I can't issue a SELECT with a WHERE clause that matches the same as the index.

However, I see that this also surprised Josh Berkus, and not that long ago (11 days!), so I'll just shush. Thanks!

-- Jon
Re: [PERFORM] oracle to psql migration - slow query in postgres
Just my take on this. The first thing I'd do is think real hard about whether you really really want 'numeric' instead of boolean, smallint, or integer. The second thing is that none of your indices (which specify a whole bunch of fields, by the way) have just emailok, just emailbounced, or only the pair of them. Without knowing the needs of your app, I would reconsider your index choices and go with fewer columns per index. For this particular query I would think either two indexes (depending on the cardinality of the data, one for each of emailok and emailbounced) or one index (containing both emailok and emailbounced) would make quite a bit of difference. Consider creating the indexes with a WHERE clause (a partial index), for example:

CREATE INDEX members_just_an_example_idx ON members (emailok, emailbounced) WHERE emailok = 1 AND emailbounced = 0;

Obviously that index is only useful in situations where both fields are specified with those values. Furthermore, if a very high percentage of the table satisfies those conditions, a sequential scan is going to be cheaper anyway.

-- Jon
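[A partial index like the one suggested above can be tried out cheaply; here is a small sketch with SQLite via Python's sqlite3 — SQLite also supports partial indexes, though its planner differs from PostgreSQL's, and all names and row counts here are invented:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE members (id INTEGER PRIMARY KEY, emailok INTEGER, emailbounced INTEGER)"
)
conn.executemany(
    "INSERT INTO members (emailok, emailbounced) VALUES (?, ?)",
    [(1, 0)] * 10 + [(0, 1)] * 90,  # only 10 rows match the hot predicate
)

# The partial index stores only the 10 matching rows, so it stays tiny
# even as the table grows.
conn.execute(
    "CREATE INDEX members_ok_idx ON members (emailok, emailbounced) "
    "WHERE emailok = 1 AND emailbounced = 0"
)

n = conn.execute(
    "SELECT count(*) FROM members WHERE emailok = 1 AND emailbounced = 0"
).fetchone()[0]
print(n)  # 10
```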
[PERFORM] odd postgresql performance (excessive lseek)
postgres 8.4.4 on openSUSE 11.3 (2.6.36rc7, x86_64). I was watching a fairly large query run and observed that the disk light went out. I checked 'top' and postgres was using 100% CPU so I strace'd the running process. This is what I saw:

lseek(87, 0, SEEK_END) = 585531392
lseek(87, 0, SEEK_END) = 585531392
lseek(87, 0, SEEK_END) = 585531392
[... the identical lseek(87, 0, SEEK_END) call repeated dozens more times ...]
lseek(94, 270680064, SEEK_SET) = 270680064
read(94, elided..., 8192) = 8192

and I observed that pattern quite a bit. I know lseek is cheap, but a superfluous system call is a superfluous system call, and over a short period it amounted to 37% (according to strace) of the time spent in the kernel. What's with the excess calls to lseek? The query plan was a nested loop anti-join (on purpose).

-- Jon
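[Tom Lane's reply earlier in this digest explains what these calls are for: each lseek(fd, 0, SEEK_END) asks the kernel for the current end of a file that another backend may have extended. The mechanism is easy to mimic from Python — two descriptors stand in for two backends; a toy model, not PostgreSQL code:]

```python
import os
import tempfile

# Two independent descriptors on one file stand in for two backends
# sharing a relation's data file.
fd_a, path = tempfile.mkstemp()
fd_b = os.open(path, os.O_RDONLY)

os.write(fd_a, b"x" * 8192)             # "backend A" extends the file
first = os.lseek(fd_b, 0, os.SEEK_END)  # "backend B" asks the kernel for EOF

os.write(fd_a, b"x" * 8192)             # A extends the file again
second = os.lseek(fd_b, 0, os.SEEK_END) # B must re-ask to see the change

os.close(fd_a)
os.close(fd_b)
os.remove(path)
print(first, second)  # 8192 16384
```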
Re: [PERFORM] Slow count(*) again...
On Tue, Oct 12, 2010 at 7:27 AM, Mladen Gogala mladen.gog...@vmsinfo.com wrote: So, the results weren't cached the first time around. The explanation is the fact that Oracle, as of version 10.2.0, reads the table into the private process memory, not into the shared buffers. This table alone is 35GB in size; Oracle took 2 minutes 47 seconds to read it using a full table scan. If I do the same thing with PostgreSQL and a comparable table, Postgres is, in fact, faster.

Well, I didn't quite mean that - having no familiarity with Oracle I don't know what the alter system statement does, but I was talking specifically about the Linux buffer and page cache. The easiest way to drop the Linux caches in one fell swoop is:

echo 3 > /proc/sys/vm/drop_caches

Is there a command to tell postgresql to drop/clear/reset its buffer cache? Clearing/dropping both the system (Linux) and the DB caches is important when doing benchmarks that involve I/O.

-- Jon
Re: [PERFORM] Slow count(*) again...
On Tue, Oct 12, 2010 at 8:18 AM, Greg Smith g...@2ndquadrant.com wrote: No. Usually the sequence used to remove all cached data from RAM before a benchmark is: All cached data (as cached in postgresql - *not* the Linux system caches)..., right?

-- Jon
[PERFORM] read only transactions
Are there any performance implications (benefits) to executing queries in a transaction where SET TRANSACTION READ ONLY; has been executed?

-- Jon
Re: [PERFORM] Slow count(*) again...
On Sun, Oct 10, 2010 at 12:14 PM, Mladen Gogala mladen.gog...@vmsinfo.com wrote: In other words, when I batched the sequential scan to do 128 blocks of I/O, it was 4 times faster than when I did single-block I/O. Does that provide enough evidence and, if not, why not?

These numbers tell us nothing because, unless you dropped the caches between runs, at least part of some runs was very probably cached.

-- Jon
Re: [PERFORM] [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
On Wed, Oct 6, 2010 at 5:31 PM, Ivan Voras ivo...@freebsd.org wrote: On 10/04/10 20:49, Josh Berkus wrote: The other major bottleneck they ran into was a kernel one: reading from the heap file requires a couple lseek operations, and Linux acquires a mutex on the inode to do that. The proper place to fix this is certainly in the kernel, but it may be possible to work around it in Postgres. Or we could complain to Kernel.org. They've been fairly responsive in the past. Too bad this didn't get posted earlier; I just got back from LinuxCon. So you know someone who can speak technically to this issue? I can put them in touch with the Linux geeks in charge of that part of the kernel code. Hmmm... lseek? As in the lseek() then read() or write() idiom? AFAIK it cannot be fixed, since you're modifying the global stream position variable and something has got to lock that. OTOH, pread() / pwrite() don't have to do that.

While lseek is very cheap, it is like any other system call in that when you multiply "cheap" by a jillion you end up with "notable" or even "a lot". I've personally seen notable performance improvements by switching to pread/pwrite instead of lseek+{read,write}. For platforms that don't implement pread or pwrite, wrapper calls are trivial to produce. One less system call is, in this case, 50% fewer.

-- Jon
Re: [PERFORM] Identical query slower on 8.4 vs 8.3
On Thu, Jul 15, 2010 at 9:41 AM, Patrick Donlin pdon...@oaisd.org wrote: I have two servers with equal specs, one of them running 8.3.7 and the new server running 8.4.4. The only tweak I have made from the default install (from Ubuntu repositories) is increasing shared_buffers to 768MB. Both servers are running 64-bit, but are different releases of Ubuntu.

^^^ Right there. *different releases*. I've seen fairly significant differences on identical hardware with even minor O/S point releases. After you run a full vacuum, then a reindex, then a vacuum analyze (probably not entirely necessary), if there is still a difference I'd point at the O/S.

-- Jon
Re: [PERFORM] Slow Bulk Delete
On Mon, May 17, 2010 at 5:10 AM, Pierre C li...@peufeu.com wrote: - or use a JOIN delete with a virtual VALUES table - or fill a temp table with ids and use a JOIN DELETE

What is a virtual VALUES table? Can you give me an example of using a virtual table with selects, joins, and also deletes?

-- Jon
Re: [PERFORM] Slow Bulk Delete
2010/5/17 Віталій Тимчишин tiv...@gmail.com: 2010/5/17 Jon Nelson jnelson+pg...@jamponi.net: On Mon, May 17, 2010 at 5:10 AM, Pierre C li...@peufeu.com wrote: - or use a JOIN delete with a virtual VALUES table - or fill a temp table with ids and use a JOIN DELETE What is a virtual VALUES table? Can you give me an example of using a virtual table with selects, joins, and also deletes?

delete from a using (values (1),(2),(5),(8)) b(x) where a.id=b.x

See http://www.postgresql.org/docs/8.4/static/sql-values.html

This syntax I'm familiar with. The author of the previous message (Pierre C) indicated that there is a concept of a virtual table which could be joined to. I'd like to know what this virtual table thing is, specifically in the context of joins.

-- Jon
Re: [PERFORM] Slow Bulk Delete
On Mon, May 17, 2010 at 7:28 AM, A. Kretschmer andreas.kretsch...@schollglas.com wrote: In response to Jon Nelson: On Mon, May 17, 2010 at 5:10 AM, Pierre C li...@peufeu.com wrote: - or use a JOIN delete with a virtual VALUES table - or fill a temp table with ids and use a JOIN DELETE What is a virtual VALUES table? Can you give me an example of using a virtual table with selects, joins, and also deletes? Something like this:

delete from foo using (values (1),(2) ) as bla where foo.c1=bla.column1;

Aha! Cool. That's not quite what I envisioned when you said virtual table, but it surely clarifies things. Thanks!

-- Jon
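[The DELETE ... USING (VALUES ...) form above is PostgreSQL syntax, but the shape of the trick can be demonstrated self-containedly with SQLite via Python's sqlite3, where the VALUES list feeds an IN predicate instead of a join; the table and ids are invented:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (c1 INTEGER PRIMARY KEY, note TEXT)")
conn.executemany("INSERT INTO foo VALUES (?, ?)",
                 [(i, "row %d" % i) for i in range(1, 11)])

# PostgreSQL: DELETE FROM foo USING (VALUES (1),(2),(5),(8)) AS bla(x)
#             WHERE foo.c1 = bla.x;
# SQLite has no DELETE ... USING, but a VALUES list works inside IN:
conn.execute("DELETE FROM foo WHERE c1 IN (VALUES (1), (2), (5), (8))")

remaining = [r[0] for r in conn.execute("SELECT c1 FROM foo ORDER BY c1")]
print(remaining)  # [3, 4, 6, 7, 9, 10]
```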
Re: [PERFORM] Best suiting OS
On Thu, Oct 1, 2009 at 4:46 AM, S Arvind arvindw...@gmail.com wrote: Hi everyone, What is the best Linux flavor for a server which runs postgres alone? The postgres must handle a great number of databases, around 200+. Performance and speed are the vital factors. Is it FreeBSD, CentOS, Fedora, Redhat xxx??

FreeBSD isn't Linux. I don't recommend that you run Fedora; it undergoes way too much churn. I don't find any real difference between CentOS and RedHat. I personally prefer openSUSE (or SLES/SLED if you want their commercial offering). I find it faster, more up-to-date (but no churn), and in general higher quality - it just works. I find postgresql *substantially* faster on openSUSE than CentOS, but that's purely anecdotal and I don't have any raw numbers to compare. openSUSE 11.1 has 8.3.8 and 11.2 (not out yet - a few months) will have 8.4.X.

-- Jon
Re: [PERFORM] hyperthreaded cpu still an issue in 8.4?
On Tue, Jul 28, 2009 at 4:11 PM, Scott Marlowe scott.marl...@gmail.com wrote: On Tue, Jul 28, 2009 at 2:58 PM, Merlin Moncure mmonc...@gmail.com wrote: On Mon, Jul 27, 2009 at 2:05 PM, Dave Youatt d...@meteorsolutions.com wrote: On 01/-10/-28163 11:59 AM, Greg Smith wrote: On Tue, 21 Jul 2009, Doug Hunley wrote: Just wondering if the issue referenced in http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php is still present in 8.4, or if some tunable (or other change) made the use of hyperthreading a non-issue. We're looking to upgrade our servers soon for performance reasons and are trying to determine if more cpus (no HT) or fewer cpus (with HT) are the way to go.

If you're talking about the hyperthreading in the latest Intel Nehalem processors, I've been seeing great PostgreSQL performance from those. The kind of weird behavior the old generation hyperthreading designs had seems gone. You can see at http://archives.postgresql.org/message-id/alpine.gso.2.01.0907222158050.16...@westnet.com that I've cleared 90K TPS on a 16 core system (2 quad-core hyperthreaded processors) running a small test using lots of parallel SELECTs. That would not be possible if there were HT spinlock problems still around. There have been both PostgreSQL scaling improvements and hardware improvements since the 2005 messages you saw there that have combined to clear up the issues. While true cores would still be better if everything else were equal, it rarely is, and I wouldn't hesitate to jump on Intel's bandwagon right now.

Greg, those are compelling numbers for the new Nehalem processors. Great news for postgresql. Do you think it's due to the new internal interconnect, which bears a strong resemblance to AMD's hypertransport?

I'd love to see some comparisons on the exact same hardware, same kernel and everything, but with HT enabled and disabled.

Don't forget that newer (Linux) kernels have vastly improved SMP performance.
-- Jon