Re: [PERFORM] sniff test on some PG 8.4 numbers

2013-03-11 Thread Jon Nelson
On Sun, Mar 10, 2013 at 11:28 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 3/10/13 9:18 PM, Jon Nelson wrote:

 The following is with ext4, nobarrier, and noatime. As noted in the
 original post, I have done a fair bit of system tuning. I have the
 dirty_bytes and dirty_background_bytes set to 3GB and 2GB,
 respectively.


 That's good, but be aware those values are still essentially unlimited write
 caches.  A server with 4 good but regular hard drives might do as little as
 10MB/s of random writes on a real workload.  If 2GB of data ends up dirty,
 the flushing that happens at the end of a database checkpoint will need to
 clear all of that out of RAM.  When that happens, you're looking at a 3
 minute long cache flush to push out 2GB.  It's not unusual for pgbench tests
 to pause for over a minute straight when that happens.  With your setup,
 where checkpoints happen every 5 minutes, this is only happening once per
 test run.  The disruption isn't easily visible if you look at the average
 rate; it's outweighed by the periods where writes happen very fast because
 the cache isn't full yet.  You have to get pgbench to plot latency over time
 to see them and then analyze that data.  This problem is the main reason I
 put together the pgbench-tools set for running things, because once you get
 to processing the latency files and make graphs from them it starts to be a
 pain to look at the results.

I'll try to find time for this, but it may need to wait until the weekend again.

 I built 9.2 and ran it with the following pgbench invocation:

 pgbench  -j 8  -c 32 -M prepared -T 600

 transaction type: TPC-B (sort of)
 scaling factor: 400


 I misread this completely in your message before; I thought you wrote 4000.
 A scaling factor of 400 is making a database that's 6GB in size.  Your test
 is basically seeing how fast the system memory and the RAID cache can move
 things around.  In that situation, your read and write numbers are
 reasonable.  They aren't actually telling you anything useful about the
 disks though, because they're barely involved here. You've sniffed the CPU,
 memory, and RAID controller and they smell fine.  You'll need at least an
 order of magnitude increase in scale to get a whiff of the disks.

LOL! Your phrasing is humorous and the information useful.

I ran for 8.0 hours and got this:

transaction type: TPC-B (sort of)
scaling factor: 400
query mode: prepared
number of clients: 32
number of threads: 8
duration: 28800 s
number of transactions actually processed: 609250619
tps = 21154.058025 (including connections establishing)
tps = 21154.075922 (excluding connections establishing)

 pgbench scale numbers give approximately 16MB per scale factor.  You don't
 actually stress the drives until that total number is at least 2X as big as
 RAM.  We had to raise the limit on the pgbench scales recently because it
 only goes up to ~20,000 on earlier versions, and that's not a big enough
 scale to test many servers now.

 On the select-only tests, much of the increase from ~100K to ~200K is
 probably going from 8.4 to 9.2.  There's two major and several minor tuning
 changes that make it much more efficient at that specific task.


 These are the *only* changes I've made to the config file:

 shared_buffers = 32GB
 wal_buffers = 16MB
 checkpoint_segments = 1024


 Note that these are the only changes that actually impact pgbench results.
 The test doesn't stress very many parts of the system, such as the query
 optimizer.

 Also be aware these values may not be practical to use in production. You
 can expect bad latency issues due to having shared_buffers so large.  All
 that memory has to be reconciled and written to disk if it's been modified
 at each checkpoint, and 32GB of such work is a lot.  I have systems where we
 can't make shared_buffers any bigger than 4GB before checkpoint pauses get
 too bad.

 Similarly, setting checkpoint_segments to 1024 means that you might go
 through 16GB of writes before a checkpoint happens.  That's great for
 average performance...but when that checkpoint does hit, you're facing a
 large random I/O backlog.

I thought the bgwriter mitigated most of the problems here? Often I'll
see the actual checkpoints with 'sync' times typically below a few
seconds (when there is anything to do at all). I can't say I've seen
checkpoint pauses in my workloads.
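
For what it's worth, the way I sanity-check where the write traffic is
going (checkpoints vs. bgwriter vs. backends) is just a peek at the
stats view - a rough sketch, nothing fancy:

-- cumulative counters since the last stats reset
SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, buffers_backend
FROM pg_stat_bgwriter;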

 There's not much you can do about all this on the Linux side.  If you drop
 the dirty_* parameters too much, maintenance operations like VACUUM start to
 get slow.  Really all you can do is avoid setting shared_buffers and
 checkpoint_segments too high, so the checkpoint backlog never gets gigantic.
 The tuning you've done is using higher values than we normally recommend
 because it's not quite practical to deploy like that.  That and the very
 small database are probably why your numbers are so high.

Mostly I do data warehouse type of workloads with very little (if any)
data modification after

Re: [PERFORM] sniff test on some PG 8.4 numbers

2013-03-10 Thread Jon Nelson
On Sun, Mar 10, 2013 at 10:46 AM, Greg Smith g...@2ndquadrant.com wrote:
 On 3/5/13 10:00 PM, Jon Nelson wrote:

 On Tue, Mar 5, 2013 at 1:35 PM, Jon Nelson jnelson+pg...@jamponi.net
 wrote:


 pgbench -h BLAH -c 32 -M prepared -t 10 -S
 I get 95,000 to 100,000 tps.

 pgbench -h BLAH -c 32 -M prepared -t 10
 seems to hover around 6,200 tps (size 100) to 13,700 (size 400)


 Some followup:
 The read test goes (up to) 133K tps, and the read-write test to 22k
 tps when performed over localhost.


 All your write numbers are inflated because the test is too short.  This
 hardware will be lucky to sustain 7500 TPS on writes.  But you're only
 writing 100,000 transactions, which means the entire test run isn't even
 hitting the database--only the WAL writes are.  When your test run is
 finished, look at /proc/meminfo  I'd wager a large sum you'll find Dirty:
 has hundreds of megabytes, if not gigabytes, of unwritten information.
 Basically, 100,000 writes on this sort of server can all be cached in
 Linux's write cache, and pgbench won't force them out of there.  So you're
 not simulating sustained database writes, only how fast of a burst the
 server can handle for a little bit.

 For a write test, you must run for long enough to start and complete a
 checkpoint before the numbers are of any use, and 2 checkpoints are even
 better.  The minimum useful length is a 10 minute run, so -T 600 instead
 of using -t.  If you want something that does every trick possible to make
 it hard to cheat at this, as well as letting you graph size and client data,
 try my pgbench-tools: https://github.com/gregs1104/pgbench-tools  (Note that
 there is a bug in that program right now, it spawns vmstat and iostat
 processes but they don't get killed at the end correctly.  A 'killall vmstat
 iostat' after running is a good idea until I fix that).

I (briefly!) acquired a machine identical to the last one, but this time
with an Areca instead of an LSI (4 drives).

The following is with ext4, nobarrier, and noatime. As noted in the
original post, I have done a fair bit of system tuning. I have the
dirty_bytes and dirty_background_bytes set to 3GB and 2GB,
respectively.

I built 9.2 and ran it with the following pgbench invocation:

pgbench  -j 8  -c 32 -M prepared -T 600

transaction type: TPC-B (sort of)
scaling factor: 400
query mode: prepared
number of clients: 32
number of threads: 8
duration: 600 s
number of transactions actually processed: 16306693
tps = 27176.566608 (including connections establishing)
tps = 27178.518841 (excluding connections establishing)

 Your read test numbers are similarly inflated, but read test errors aren't
 as large.  Around 133K TPS on select-only is probably accurate. For a read
 test, use -T 30 to let it run for 30 seconds to get a more accurate
 number.  The read bottleneck on your hardware is going to be the
 pgbench client itself, which on 8.4 is running as a single thread.  On 9.0+
 you can have multiple pgbench workers.  It normally takes 4 to 8 of them to
 saturate a larger server.

The 'select-only' test (same as above with '-S'):

starting vacuum...end.
transaction type: SELECT only
scaling factor: 400
query mode: prepared
number of clients: 32
number of threads: 8
duration: 600 s
number of transactions actually processed: 127513307
tps = 212514.337971 (including connections establishing)
tps = 212544.392278 (excluding connections establishing)

These are the *only* changes I've made to the config file:

shared_buffers = 32GB
wal_buffers = 16MB
checkpoint_segments = 1024

I can run either or both of these again with different options, but
mostly I'm looking for a sniff test.
However, I'm a bit confused, now.

It seems as though you're saying the write numbers are not believable,
suggesting a value of 7,500 (roughly 1/4 of what I'm getting). If I run
the read test for 30 seconds I get - highly variable - between 300K
and 400K tps. Why are these tps so high compared to your expectations?
Note: I did get better results with HT on vs. with HT off, so I've
left HT on for now.



-- 
Jon




[PERFORM] sniff test on some PG 8.4 numbers

2013-03-05 Thread Jon Nelson
I was hoping to just get a gut reaction on some pgbench numbers I have, to
see if I'm in the ballpark.

OS:  ScientificLinux 6.3, x86_64
Hardware: 4x real disks (not SSD) behind an LSI 9260 in raid10, Xeon E5-2680
with hyperthreading OFF, 128GB of RAM.
Setup: postgresql 8.4.13, ext4, barriers ON, disk write cache *off*, write-
back enabled on the LSI.
I initialized with sizes of 100, 200, and 400.

I've done some tuning of the postgresql config, but mostly I'm just trying to
find out if I'm in the right ballpark.

I ran pgbench from another (similar) host:

pgbench -h BLAH -c 32 -M prepared -t 10 -S
I get 95,000 to 100,000 tps.

pgbench -h BLAH -c 32 -M prepared -t 10
seems to hover around 6,200 tps (size 100) to 13,700 (size 400)

Do these basically sniff right?
(NOTE: with barriers off, I get a slight increase - 10% - in the
read-write test, and a larger *decrease* - 15% - with the read-only
test @ 400. No change @ 100)

-- 
Jon




Re: [PERFORM] sniff test on some PG 8.4 numbers

2013-03-05 Thread Jon Nelson
On Tue, Mar 5, 2013 at 7:02 PM, Josh Berkus j...@agliodbs.com wrote:

 Do these basically sniff right?

 Well, the read test seems reasonable.  I'm impressed by the speed of the
 write test ... how large is the raid card cache?

 And why 8.4?  Can you try 9.2?

8.4 because it's what I've got, basically. I might be able to try 9.2
later, but I'm targeting 8.4 right now.
512MB of memory on the card.

 (NOTE: with barriers off, I get a slight increase - 10% - in the
 read-write test, and a larger *decrease* - 15% - with the read-only
 test @ 400. No change @ 100)

 Oh, interesting.  Can you reproduce that? I wonder what would cause
 read-only to drop without barriers ...

I'll try to test again soon.
I know that if I use writethrough instead of writeback mode the
performance nosedives.
Does anybody have suggestions for stripe size? (remember: *4* disks)

-- 
Jon




Re: [PERFORM] sniff test on some PG 8.4 numbers

2013-03-05 Thread Jon Nelson
On Tue, Mar 5, 2013 at 1:35 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:

 pgbench -h BLAH -c 32 -M prepared -t 10 -S
 I get 95,000 to 100,000 tps.

 pgbench -h BLAH -c 32 -M prepared -t 10
 seems to hover around 6,200 tps (size 100) to 13,700 (size 400)

Some followup:
The read test goes (up to) 133K tps, and the read-write test to 22k
tps when performed over localhost.

-- 
Jon




Re: [PERFORM] High CPU usage / load average after upgrading to Ubuntu 12.04

2013-02-18 Thread Jon Nelson
On Mon, Feb 18, 2013 at 6:39 PM, Josh Berkus j...@agliodbs.com wrote:
 Scott,

 So do you have generally slow IO, or is it fsync behavior etc?

 All tests except pgBench show this system as superfast.  Bonnie++ and DD
 tests are good (200 to 300mb/s), and test_fsync shows 14K/second.
 Basically it has no issues until checkpoint kicks in, at which time the
 entire system basically halts for the duration of the checkpoint.

 For that matter, if I run a pgbench and halt it just before checkpoint
 kicks in, I get around 12000TPS, which is what I'd expect on this system.

 At this point, we've tried 3.2.0.26, 3.2.0.27, 3.4.0, and tried updating
 the RAID driver, and changing the IO scheduler.  Nothing seems to affect
 the behavior.   Testing using Ext4 (instead of XFS) next.

Did you try turning barriers on or off *manually* (explicitly)? With
LSI and barriers *on* and ext4 I had less-optimal performance. With
Linux MD or (some) 3Ware configurations I had no performance hit.

-- 
Jon




[PERFORM] temp tablespaces and SSDs, etc..

2013-02-08 Thread Jon Nelson
I was wondering if somebody could clear up how tablespaces are used.
Let's say I have three classes of storage:
- ramdisk (tmpfs)
- SSD
- spinning rust

Furthermore, let's say I'd like to be able to tell postgresql to
prefer them - in that order - until they each get full. IE, use tmpfs
until it reports ENOSPC and then fall back to SSD, finally falling
back to spinning rust.  Is there a way to do this?
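
(For reference, the closest thing I've found so far is temp_tablespaces,
which - as far as I can tell - only spreads temp files across the list
rather than preferring entries in order; names and paths below are made
up:)

CREATE TABLESPACE ts_ram LOCATION '/mnt/tmpfs/pg_temp';
CREATE TABLESPACE ts_ssd LOCATION '/mnt/ssd/pg_temp';
-- temp files get distributed across these, not used in priority order:
SET temp_tablespaces = 'ts_ram, ts_ssd';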

-- 
Jon




Re: [PERFORM] Poor performance using CTE

2012-11-22 Thread Jon Nelson
On Thu, Nov 22, 2012 at 7:42 AM, Jeremy Harris j...@wizmail.org wrote:
 On 22/11/2012 00:08, Craig Ringer wrote:

 WITH
FENCE foo AS (SELECT ...),
bar AS (SELECT ...)
 SELECT * FROM bar;

 Are we fencing just foo? Or all expressions?


 WITH foo AS (FENCED SELECT ...),
  bar AS (SELECT ...),
 SELECT ... ;

I would much rather see 'MATERIALIZE' instead of 'FENCED', unless by
the latter you mean to forbid *all* optimizations; with the former the
meaning is pretty clear.

-- 
Jon




Re: SOLVED - RE: [PERFORM] Poor performance using CTE

2012-11-20 Thread Jon Nelson
My perspective on this is that CTEs *should* be just like creating a
temporary table and then joining to it, but without the
materialization costs. In that respect, they seem like they should be
like nifty VIEWs. If I wanted the behavior of materialization and then
join, I'd do that explicitly with temporary tables, but using CTEs as
an explicit optimization barrier feels like explaining away
surprising behavior.

As can be seen by the current conversation, not everyone is convinced
that CTEs ought to be an explicit optimization barrier, and setting
that behavior as somehow desirable or explicit (rather than merely an
implementation detail) feels shortsighted to me. I would be delighted
to find that in some future version of PostgreSQL, but  if that is not
to be, at the very least, the verbiage surrounding CTEs might want to
include (perhaps prominently) something along the lines of "CTEs are
currently an optimization barrier, but this is an implementation
detail and may change in future versions."  Perhaps even including a
small blurb about what an optimization barrier even means (my
understanding is that it merely forces materialization of that part of
the query).

That's just my perspective, coming at the use of CTEs not as a
PostgreSQL developer, but as somebody who learned about CTEs and
started using them - only to discover surprising behavior.
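
(To make the surprise concrete, this is the sort of pair I mean - table
and column names made up. The subquery form gets flattened and the
predicate pushed down; the CTE form currently materializes the whole
thing first:)

-- planned as one query; the WHERE clause is pushed into the subquery
SELECT * FROM (SELECT * FROM big_table) AS t WHERE t.id = 42;

-- the CTE acts as an optimization fence: big_table is scanned and
-- materialized before the outer WHERE is applied
WITH t AS (SELECT * FROM big_table)
SELECT * FROM t WHERE t.id = 42;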

On Tue, Nov 20, 2012 at 1:22 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Wed, Nov 14, 2012 at 8:03 PM, Peter Geoghegan pe...@2ndquadrant.com 
 wrote:
 On 15 November 2012 01:46, Andrew Dunstan and...@dunslane.net wrote:
 It cuts both ways. I have used CTEs a LOT precisely because this behaviour
 lets me get better plans. Without that I'll be back to using the offset 0
 hack.

 Is the OFFSET 0 hack really so bad? We've been telling people to do
 that for years, so it's already something that we've effectively
 committed to.

 IMSNHO, 'OFFSET 0' is completely unreadable black magic.  I agree with
 Andrew: CTEs allow for manual composition of queries and can be the
 best tool when the planner is outsmarting itself.  In the old days,
 we'd extract data to a temp table and join against that: CTE are
 essentially a formalization of that technique.  I like things the way
 they are; if CTE are hurting your plan, that's an indication you're
 using them inappropriately.

 merlin





-- 
Jon




Re: [PERFORM] postgres 8.4, COPY, and high concurrency

2012-11-14 Thread Jon Nelson
On Tue, Nov 13, 2012 at 7:10 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 I had moved on to a different approach to importing the data which
 does not work concurrently. However, I went back and tried to
 re-create the situation and - at least a naive attempt failed. I'll
 give it a few more tries -- I was creating two tables using CREATE
 TABLE "unique name" LIKE ("some other table" INCLUDING everything).
 Then I would copy the data in, add some constraints (FK constraints
 but only within these two tables) and then finally (for each table)
 issue an ALTER TABLE "unique name" INHERIT "some other table".  To be
 clear, however, everything bogged down in the COPY stage which was
 immediately following the table creation.

 I'll note that my naive test showed almost no unexpected overhead at
 all, so it's clearly not representative of the problem I encountered.


I'm still unable to replicate the problem, but I can show I wasn't
crazy, either. The average time to perform one of these COPY
operations when things are working  is in the 15-45 second range.  I
had configured PG to log any statement that look longer than 3
seconds, so I got a bunch of those in the logs. I have since
reconfigured to log *everything*. Anyway, when things were going
badly, COPY would take anywhere from 814 seconds to just under 1400
seconds for the exact same files.

UPDATE: I have been able to replicate the issue. The parent table (the
one referenced in the LIKE portion of the CREATE TABLE statement) had
three indices.

Now that I've been able to replicate the issue, are there tests that I
can perform that would be useful to people?
I will also try to build a stand-alone test.


--
Jon




Re: [PERFORM] postgres 8.4, COPY, and high concurrency

2012-11-14 Thread Jon Nelson
On Wed, Nov 14, 2012 at 1:01 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Wed, Nov 14, 2012 at 6:41 AM, Jon Nelson jnelson+pg...@jamponi.net wrote:

 UPDATE: I have been able to replicate the issue. The parent table (the
 one referenced in the LIKE portion of the CREATE TABLE statement) had
 three indices.

 Now that I've been able to replicate the issue, are there tests that I
 can perform that would be useful to people?
 I will also try to build a stand-alone test.

 While the WAL is suppressed for the table inserts, it is not
 suppressed for the index inserts, and the index WAL traffic is enough
 to lead to contention.

Aha!

 I don't know why that is the case, it seems like the same method that
 allows us to bypass WAL for the table would work for the indices as
 well.  Maybe it is just that no one bothered to implement it.  After
 all, building the index after the copy will be even more efficient
 than building it before but by-passing WAL.

 But it does seem like the docs could at least be clarified here.

In general, then, would it be safe to say that concurrent (parallel)
index creation may be a source of significant WAL contention? I was
planning on taking advantage of this due to modern, beefy boxes with
10's of CPUs all just sitting there bored.


--
Jon




[PERFORM] postgres 8.4, COPY, and high concurrency

2012-11-13 Thread Jon Nelson
I was working on a data warehousing project where a fair number of files
could be COPY'd more or less directly into tables. I have a somewhat nice
machine to work with, and I ran on 75% of the cores I have (75% of 32 is
24).

Performance was pretty bad. With 24 processes going, each backend (in COPY)
spent 98% of its time in semop (as identified by strace).  I tried larger
and smaller shared buffers, all sorts of other tweaks, until I tried
reducing the number of concurrent processes from 24 to 4.

Disk I/O went up (on average) at least 10X and strace reports that the top
system calls are write (61%), recvfrom (25%), and lseek (14%) - pretty
reasonable IMO.

Given that each COPY is into its own, newly-made table with no indices or
foreign keys, etc, I would have expected the interaction among the backends
to be minimal, but that doesn't appear to be the case.  What is the likely
cause of the semops?

I can't really try a newer version of postgres at this time (perhaps soon).

I'm using PG 8.4.13 on ScientificLinux 6.2 (x86_64), and the CPU is a 32
core Xeon E5-2680 @ 2.7 GHz.

-- 
Jon


Re: [PERFORM] postgres 8.4, COPY, and high concurrency

2012-11-13 Thread Jon Nelson
On Tue, Nov 13, 2012 at 1:27 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 13.11.2012 21:13, Jon Nelson wrote:

 I was working on a data warehousing project where a fair number of files
 could be COPY'd more or less directly into tables. I have a somewhat nice
 machine to work with, and I ran on 75% of the cores I have (75% of 32 is
 24).

 Performance was pretty bad. With 24 processes going, each backend (in
 COPY)
 spent 98% of its time in semop (as identified by strace).  I tried larger
 and smaller shared buffers, all sorts of other tweaks, until I tried
 reducing the number of concurrent processes from 24 to 4.

 Disk I/O went up (on average) at least 10X and strace reports that the top
 system calls are write (61%), recvfrom (25%), and lseek (14%) - pretty
 reasonable IMO.

 Given that each COPY is into its own, newly-made table with no indices or
 foreign keys, etc, I would have expected the interaction among the
 backends
 to be minimal, but that doesn't appear to be the case.  What is the likely
 cause of the semops?


 I'd guess it's lock contention on WALInsertLock. That means, the system is
 experiencing lock contention on generating WAL records for the insertions.
 If that theory is correct, you ought to get a big gain if you have
 wal_level=minimal, and you create or truncate the table in the same
 transaction with the COPY. That allows the system to skip WAL-logging the
 COPY.


wal_level doesn't exist for 8.4, but I have archive_mode = off and I am
creating the table in the same transaction as the COPY.
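
(That is, each loader connection does roughly this - names and paths
made up:)

BEGIN;
CREATE TABLE load_chunk_001 (LIKE parent_table INCLUDING DEFAULTS);
-- table created and populated in the same transaction, with
-- archive_mode = off, so the COPY should be able to skip WAL-logging
COPY load_chunk_001 FROM '/path/to/chunk_001.csv' WITH CSV;
COMMIT;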



 Or you could upgrade to 9.2. The WAL-logging of bulk COPY was optimized in
 9.2, it should help precisely the scenario you're facing.


Unfortunately, that's what I was expecting.



-- 
Jon


Re: [PERFORM] postgres 8.4, COPY, and high concurrency

2012-11-13 Thread Jon Nelson
On Tue, Nov 13, 2012 at 2:43 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Tue, Nov 13, 2012 at 12:03 PM, Jon Nelson jnelson+pg...@jamponi.net 
 wrote:
 On Tue, Nov 13, 2012 at 1:27 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:

 On 13.11.2012 21:13, Jon Nelson wrote:


 I'd guess it's lock contention on WALInsertLock. That means, the system is
 experiencing lock contention on generating WAL records for the insertions.
 If that theory is correct, you ought to get a big gain if you have
 wal_level=minimal, and you create or truncate the table in the same
 transaction with the COPY. That allows the system to skip WAL-logging the
 COPY.


 wal_level doesn't exist for 8.4, but I have archive_mode = off and I am
 creating the table in the same transaction as the COPY.


 That should work to bypass WAL.  Can you directly verify whether you
 are generating lots of WAL (look at the churn in pg_xlog) during those
 loads?

 Maybe your contention is someplace else.  Since they must all be using
 different tables, I don't think it would be the relation extension
 lock.  Maybe buffer mapping lock or freelist lock?

I had moved on to a different approach to importing the data which
does not work concurrently. However, I went back and tried to
re-create the situation and - at least a naive attempt failed. I'll
give it a few more tries -- I was creating two tables using CREATE
TABLE "unique name" LIKE ("some other table" INCLUDING everything).
Then I would copy the data in, add some constraints (FK constraints
but only within these two tables) and then finally (for each table)
issue an ALTER TABLE "unique name" INHERIT "some other table".  To be
clear, however, everything bogged down in the COPY stage which was
immediately following the table creation.

I'll note that my naive test showed almost no unexpected overhead at
all, so it's clearly not representative of the problem I encountered.


--
Jon




Re: [PERFORM] set-returning calls and overhead

2012-10-30 Thread Jon Nelson
On Thu, Jul 19, 2012 at 11:07 AM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 Recently I found myself wondering what was taking a particular query so long.
 I immediately assumed it was a lack of I/O, because lack of I/O is a
 thorn in my side.
 Nope, the I/O was boring. CPU? Well, the process was using 100% of the
 CPU but the query itself was really very simple.
 I turned to ltrace (horribly imprecise, I know). ltrace told me this:


 % time     seconds  usecs/call     calls function
 ------ ----------- ----------- --------- ----------------
  46.54    6.789433          69     97766 memcpy
  28.16    4.108324        1100      3732 strlen
  14.45    2.107567         564      3732 malloc
   9.16    1.336108          28     46877 memset
   0.74    0.107935          28      3732 strcpy
   0.73    0.107221          28      3732 free
   0.16    0.023687         187       126 write
   0.02    0.003587          28       126 __errno_location
   0.02    0.003075          59        52 read
   0.01    0.001523          29        52 memcmp
 ------ ----------- ----------- --------- ----------------
 100.00   14.588460                159927 total


 and this:

 strlen("SRF multi-call context")
 strcpy(0xe01d40, "SRF multi-call context")
 malloc(1024)
 memcpy(...)
 memset(...)
 ...
 memset(...)
 free(..)

 repeat.

 I was rather surprised to learn that (per-row):
 (1) memcpy of 64 bytes accounted for 46% of the time spent in library calls
 (2) the (other) costs of strlen, strcpy, malloc, and memset were so
 huge (in particular, strlen)

 What, if anything, can be done about this? It seems the overhead for
 setting up the memory context for the SRF is pretty high.
 I notice this overhead pretty much every time I use any of the array
 functions like unnest.

 Please help me to understand if I'm misinterpreting things here.

 [x86_64, Linux, PostgreSQL 9.1.4]


A followup.

Recently, I imported a bunch of data. The import ran in about 30
seconds. The data itself was represented in a way that made more sense
- from a relational database perspective - as multiple tables. To
accomplish this, I made use of string_to_array and unnest.  The
initial table creation and copy ran in about 30 seconds, but then the
creation of the new table (create table ... as select ...
unnest(string_to_array())) took over 5 minutes - 10 times as long.
What is it about the array functions (actually, all set-returning
functions that I've tried) that causes them to be so expensive? The
per-call overhead is enormous in some cases.  PostgreSQL 9.1.5 on
x86_64 (openSUSE 12.2 - but the effect has been observed across
several platforms and major/minor releases of PostgreSQL).
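
(The construct in question is essentially this - names made up:)

-- one delimited text column exploded into one row per element
CREATE TABLE detail AS
SELECT id, unnest(string_to_array(raw_list, ',')) AS item
FROM staging;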



 --
 Jon



-- 
Jon




[PERFORM] very very slow inserts into very large table

2012-07-16 Thread Jon Nelson
I have a single *table* that is some 560GB in size, 6 columns, average
row width 63.
There are approximately 6.1 billion rows.
It has 4 indices, 131, 313, 131 and 190 GB in size, respectively. All
are btree indices.

I tried inserting new data into the table, and it's taking a *very* long time.
I pre-built the data to be inserted into a temporary table with the
exact same structure and column ordering, etc, and the temporary table
is about 8.5GB in size with about 93 million rows.
The temporary table was built in about 95 seconds.
The insert has been going for 47 hours and 21 minutes, give or take.
I'm not doing any correlation or filtering, etc --  straight up
insert, literally insert into big_table select * from
the_temp_table;.

vmstat output doesn't seem that useful, with disk wait being 10-15%
and I/O speeds highly variable, from 5-20MB/s reads coupled with
0-16MB/s writes, generally on the lower end of these.
strace of the inserting process shows that it's basically hammering
the disk in terms of random reads and infrequent writes.
The RAID controller isn't doing anything else besides servicing
postgresql: it's not verifying, rebuilding, etc. While this process is
active, streaming write I/O is terrible - 36MB/s. With it paused
(via strace) I get 72MB/s (reads are 350MB/s).

The OS is Scientific Linux 6.2, and the version of postgresql is 9.1.4
- x86_64. There is nothing else of note happening on the box. The box
is a quad-CPU (dual-core each) Xeon E5430 @ 2.66GHz with 32GB of RAM
and a 3ware 9690 RAID controller with a 4TB RAID10 for the storage.

What might be going on here?


-- 
Jon



Re: [PERFORM] very very slow inserts into very large table

2012-07-16 Thread Jon Nelson
On Mon, Jul 16, 2012 at 12:35 PM, Samuel Gendler
sgend...@ideasculptor.com wrote:
 On Mon, Jul 16, 2012 at 7:06 AM, Mark Thornton mthorn...@optrak.com wrote:


 Every insert updates four indexes, so at least 3 of those will be in
 random order. The indexes don't fit in memory, so all those updates will
 involve reading most of the relevant b-tree pages from disk (or at least the
 leaf level). A total of 10ms of random read from disk (per inserted row)
 wouldn't surprise me ... which adds up to more than 10 days for your 93
 million rows.


 Which is the long way of saying that you will likely benefit from
 partitioning that table into a number of smaller tables, especially if
 queries on that table tend to access only a subset of the data that can be
 defined to always fit into a smaller number of partitions than the total.
 At the very least, inserts will be faster because individual indexes will be
 smaller.  But unless all queries can't be constrained to fit within a subset
 of partitions, you'll also see improved performance on selects.

Acknowledged. My data is actually partitioned into individual tables,
but this was an experiment to see what the performance was like. I was
expecting that effectively appending all of the individual tables into
a great big table would result in less redundant information being
stored in indices and, therefore, a bit more speed and efficiency.
However, I have to admit I was very surprised at the performance
reduction.

What is the greater lesson to take away, here? If you are working with
data that is larger (substantially larger) than available memory, is
the architecture and design of postgresql such that the only real
approach is some type of data partitioning? It is not my intent to
insult or even disparage my favorite software, but it took less time
to *build* the indices for 550GB of data than it would have to insert
1/20th as much. That doesn't seem right.
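
(For reference, by partitioning I mean the usual inheritance scheme,
something like this - names, column, and range made up:)

CREATE TABLE big_table_2012_07 (
    CHECK (event_time >= '2012-07-01' AND event_time < '2012-08-01')
) INHERITS (big_table);
CREATE INDEX big_table_2012_07_event_time_idx
    ON big_table_2012_07 (event_time);
-- with constraint_exclusion = partition, queries filtering on
-- event_time only touch the relevant children, and each child's
-- indexes stay much smaller than one giant index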

-- 
Jon



[PERFORM] problems with set_config, work_mem, maintenance_work_mem, and sorting

2012-02-28 Thread Jon Nelson
I happened to be looking in the PostgreSQL logs (8.4.10, x86_64,
ScientificLinux 6.1) and noticed that an app was doing some sorting
(group by, order by, index creation) that ended up on disk rather than
staying in memory.
So I enabled trace_sort and restarted the app.
What followed confused me.

I know that the app is setting the work_mem and maintenance_work_mem
to 1GB, at the start of the session, with the following calls:

select set_config('work_mem', '1GB', False);
select set_config('maintenance_work_mem', '1GB', False);

By timestamps, I know that these statements take place before the next
log items, generated by PostgreSQL (note: I also log the PID of the
backend and all of these are from the same PID):

LOG:  0: begin tuple sort: nkeys = 2, workMem = 1048576, randomAccess = f
LOG:  0: begin tuple sort: nkeys = 1, workMem = 1048576, randomAccess = f
LOG:  0: begin tuple sort: nkeys = 1, workMem = 1048576, randomAccess = f
LOG:  0: begin tuple sort: nkeys = 1, workMem = 1048576, randomAccess = f
LOG:  0: begin tuple sort: nkeys = 2, workMem = 1048576, randomAccess = f
^ these make sense

LOG:  0: begin tuple sort: nkeys = 2, workMem = 131072, randomAccess = f
LOG:  0: begin tuple sort: nkeys = 1, workMem = 131072, randomAccess = f
LOG:  0: begin tuple sort: nkeys = 1, workMem = 131072, randomAccess = f

^^ these do not (but 128MB is the globally-configured work_mem value)

LOG:  0: begin index sort: unique = t, workMem = 2097152, randomAccess = f
^ this kinda does (2GB is the globally-configured maintenance_work_mem value)

LOG:  0: begin index sort: unique = f, workMem = 131072, randomAccess = f
LOG:  0: begin tuple sort: nkeys = 2, workMem = 131072, randomAccess = f
..


The config shows 128MB for work_mem and 2GB for maintenance_work_mem.
Why does PostgreSQL /sometimes/ use the globally-configured values and
sometimes use the values that come from the connection?
Am I wrong in misunderstanding what 'session' variables are? I thought
that session (versus transaction) config items were set for /all/
transactions in a given backend, until changed or until that backend
terminates. Is that not so?

If I reconfigure the app to call out to set_config(item, value, True)
after each 'BEGIN' statement then workMem seems to be correct (at
least more of the time -- the process takes some time to run and I
haven't done an exhaustive check as yet).
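
(For the record, the difference between the two forms, as I understand
it:)

-- is_local = false: should stick for the rest of the session
SELECT set_config('work_mem', '1GB', false);
-- is_local = true: reverts at the end of the current transaction
SELECT set_config('work_mem', '1GB', true);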

-- 
Jon



Re: [PERFORM] problems with set_config, work_mem, maintenance_work_mem, and sorting

2012-02-28 Thread Jon Nelson
On Tue, Feb 28, 2012 at 1:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 The config shows 128MB for work_mem and 2GB for maintenance_work_mem.
 Why does PostgreSQL /sometimes/ use the globally-configured values and
 sometimes use the values that come from the connection?

 You sure those log entries are all from the same process?

If I am understanding this correctly, yes. They all share the same pid.
The logline format is:

log_line_prefix = '%t %d %u [%p]'

and I believe %p represents the pid, and also that a pid corresponds
to a backend. Therefore, same pid == same backend == same connection
== same session. Many transactions within a session.


-- 
Jon



Re: [PERFORM] problems with set_config, work_mem, maintenance_work_mem, and sorting

2012-02-28 Thread Jon Nelson
On Tue, Feb 28, 2012 at 2:51 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 On Tue, Feb 28, 2012 at 1:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 Why does PostgreSQL /sometimes/ use the globally-configured values and
 sometimes use the values that come from the connection?

 You sure those log entries are all from the same process?

 If I am understanding this correctly, yes. They all share the same pid.

 Hmph ... does seem a bit weird.  Can you turn on log_statements and
 identify which operations aren't using the session values?

I had log_min_duration_statement = 1000.

An example:

LOG:  0: begin tuple sort: nkeys = 3, workMem = 131072, randomAccess = f
LOCATION:  tuplesort_begin_heap, tuplesort.c:573
STATEMENT:  INSERT INTO (new table) SELECT (bunch of stuff here) FROM
.. ORDER BY ...

and also some CREATE TABLE ... statements:

LOG:  0: begin index sort: unique = f, workMem = 131072, randomAccess = f
LOCATION:  tuplesort_begin_index_btree, tuplesort.c:642
STATEMENT:  CREATE TABLE tablename (LIKE some_other_tablename)

I also see this:

LOG:  0: begin tuple sort: nkeys = 2, workMem = 131072, randomAccess = f
LOCATION:  tuplesort_begin_heap, tuplesort.c:573
STATEMENT: SELECT bunch of stuff from system catalogs

which is the ORM library (SQLAlchemy) doing a reflection of the
table(s) involved.
The statement is from the same backend (pid) and takes place
chronologically *after* the following:

LOG:  0: begin tuple sort: nkeys = 2, workMem = 1048576, randomAccess = f
LOCATION:  tuplesort_begin_heap, tuplesort.c:573
STATEMENT: more reflection stuff

Is that useful?

If that's not enough, I can crank the logging up.
What would you like to see for 'log_statements' (if what I've provided
you above is not enough).

-- 
Jon



Re: [PERFORM] problems with set_config, work_mem, maintenance_work_mem, and sorting

2012-02-28 Thread Jon Nelson
On Tue, Feb 28, 2012 at 4:46 PM, Claudio Freire klaussfre...@gmail.com wrote:
 On Tue, Feb 28, 2012 at 6:54 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 ... which is the ORM library (SQLAlchemy) doing a reflection of the
 table(s) involved.

 Oh, there's an ORM involved?  I'll bet a nickel it's doing something
 surprising, like not issuing your SET until much later than you thought.

 I'd rather go for an auto-rollback at some point within the
 transaction that issued the set work_mem. SQLA tends to do that if,
 for instance, an exception is raised within a transaction block (ie,
 flushing).

 You can issue the set work_mem in its own transaction, and commit it,
 and in that way avoid that rollback.

I cranked the logging /all/ the way up and isolated the server.
I suspect that your theory is correct.
I'll spend a bit more time investigating.
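
(If the ORM is indeed rolling that transaction back, then issuing the
settings in their own, committed transaction - as suggested - ought to
survive; something like:)

BEGIN;
SELECT set_config('work_mem', '1GB', false);
SELECT set_config('maintenance_work_mem', '1GB', false);
COMMIT;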


-- 
Jon



[PERFORM] *really* bad insert performance on table with unique index

2012-02-02 Thread Jon Nelson
I created a table with two columns: an id SERIAL (primary key) and a
TEXT field (not null), and then added a unique index on the text field.
Then I ran the following query (with a huge work_mem - 20GB):

insert into tableA (text_field) select distinct other_text_field from
some_huge_set_of_tables

After 36 hours it had only written 3 GB (determined by looking at what
files it was writing to).
I started over with a TRUNCATE, and then removed the index and tried again.
This time it took 3807270.780 ms (a bit over an hour).
Total number of records: approx 227 million, comprising 16GB of storage.

Why the huge discrepancy?
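
(For reference, the drop-the-index-first sequence I ended up testing was
roughly the following; the index name here is made up:)

TRUNCATE tableA;
DROP INDEX IF EXISTS tablea_text_field_key;
INSERT INTO tableA (text_field)
    SELECT DISTINCT other_text_field FROM some_huge_set_of_tables;
-- rebuild the unique index after the load instead of maintaining it
-- row-by-row during the insert
CREATE UNIQUE INDEX tablea_text_field_key ON tableA (text_field);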

-- 
Jon



[PERFORM] regarding CLUSTER and HUGE work_mem / maintenance_work_mem

2012-01-27 Thread Jon Nelson
Let's say I have a 7GB table with 3-4 indices for a total of 10-12GB.
Furthermore, let's say I have a machine with sufficient memory for me
to set the work_mem  and maintenance_work_mem to 20GB (just for this
session).
When I issue a CLUSTER using one of the indices, I see PostgreSQL (by
way of strace) performing an index scan which amounts to large
quantities of random I/O.
In my case, that means it takes a very, very long time. PostgreSQL is
largely at defaults, except for a 2GB shared_buffers and a few
unrelated changes. The system itself has 32GB of physical RAM and has
plenty free.
Why didn't PostgreSQL just read the table into memory (and the
interesting index) as a sequential scan, sort, and then write it out?
It seems like there would be more than enough memory for that. The
sequential I/O rate on this machine is 50-100x the random I/O rate.

I'm using 8.4.10 (with the 'inet' de-toasting patch) on Scientific Linux 6.1.

-- 
Jon



Re: [PERFORM] regarding CLUSTER and HUGE work_mem / maintenance_work_mem

2012-01-27 Thread Jon Nelson
On Fri, Jan 27, 2012 at 12:05 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 27.01.2012 19:43, Jon Nelson wrote:

 Let's say I have a 7GB table with 3-4 indices for a total of 10-12GB.
 Furthermore, let's say I have a machine with sufficient memory for me
 to set the work_mem  and maintenance_work_mem to 20GB (just for this
 session).
 When I issue a CLUSTER using one of the indices, I see PostgreSQL (by
 way of strace) performing an index scan which amounts to large
 quantities of random I/O.
 In my case, that means it takes a very, very long time. PostgreSQL is
 largely at defaults, except for a 2GB shared_buffers and a few
 unrelated changes. The system itself has 32GB of physical RAM and has
 plenty free.
 Why didn't PostgreSQL just read the table into memory (and the
 interesting index) as a sequential scan, sort, and then write it out?
 It seems like there would be more than enough memory for that. The
 sequential I/O rate on this machine is 50-100x the random I/O rate.

 I'm using 8.4.10 (with the 'inet' de-toasting patch) on Scientific Linux
 6.1.


 The suppport for doing a seqscan+sort in CLUSTER was introduced in version
 9.1. Before that, CLUSTER always did an indexscan. See release notes:
 http://www.postgresql.org/docs/9.1/static/release-9-1.html#AEN107416

That's what I get for digging through the source (git) but working
with 8.4.10, on a Friday, at the end of a long week.
Thanks for pointing that out to somebody that should have known better.
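
(For anybody else stuck on 8.4, the by-hand equivalent that gets the
sequential-read-plus-sort behavior is roughly this - names made up, and
indexes, constraints, and grants have to be recreated afterwards:)

BEGIN;
CREATE TABLE my_table_sorted AS
    SELECT * FROM my_table ORDER BY cluster_col;
DROP TABLE my_table;
ALTER TABLE my_table_sorted RENAME TO my_table;
COMMIT;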


-- 
Jon



Re: [PERFORM] copy vs. C function

2011-12-14 Thread Jon Nelson
On Wed, Dec 14, 2011 at 12:18 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 The only thing I have left are these statements:

 get_call_result_type
 TupleDescGetAttInMetadata
 BuildTupleFromCStrings
 HeapTupleGetDatum
 and finally PG_RETURN_DATUM

 It turns out that:
 get_call_result_type adds 43 seconds [total: 54],
 TupleDescGetAttInMetadata adds 19 seconds [total: 73],
 BuildTupleFromCStrings accounts for 43 seconds [total: 116].

 So those three functions account for 90% of the total time spent.
 What alternatives exist? Do I have to call get_call_result_type /every
 time/ through the function?

 Well, if you're concerned about performance then I think you're going
 about this in entirely the wrong way, because as far as I can tell from
 this you're converting all the field values to text and back again.
 You should be trying to keep the values in Datum format and then
 invoking heap_form_tuple.  And yeah, you probably could cache the
 type information across calls.

The parsing/conversion (except BuildTupleFromCStrings) is only a small
fraction of the overall time spent in the function and could probably
be made slightly faster. It's the overhead that's killing me.

Remember: I'm not converting multiple field values to text and back
again, I'm turning a *single* TEXT into 8 columns of varying types
(INET, INTEGER, and one INTEGER array, among others).  I'll re-write
the code to use Tuples but given that 53% of the time is spent in just
two functions (the two I'd like to cache) I'm not sure how much of a
gain it's likely to be.

Regarding caching, I tried caching it across calls by making the
TupleDesc static and only initializing it once.
When I tried that, I got:

ERROR:  number of columns (6769856) exceeds limit (1664)

I tried to find some documentation or examples that cache the
information, but couldn't find any.

-- 
Jon



Re: [PERFORM] copy vs. C function

2011-12-14 Thread Jon Nelson
On Wed, Dec 14, 2011 at 9:25 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 Regarding caching, I tried caching it across calls by making the
 TupleDesc static and only initializing it once.
 When I tried that, I got:

 ERROR:  number of columns (6769856) exceeds limit (1664)

 I tried to find some documentation or examples that cache the
 information, but couldn't find any.

 You might find reading record_in to be helpful.  What it caches is not
 exactly what you need to, I think, but it shows the general principles.
 There are lots of other functions that use fn_extra to cache info, too.

I will definitely look into those. I'm probably doing it wrong, but in
the meantime, I allocated enough space (by way of MemoryContextAlloc)
in TopMemoryContext for an AttInMetadata pointer, switched to that
memory context (just the first time through), used CreateTupleDescCopy
+ TupleDescGetAttInMetadata to duplicate (in the new memory context)
the TupleDesc, and then switched back. This approach seems to have
dropped the total run time to about 54 seconds, the bulk of which is
BuildTupleFromCStrings, a rather significant improvement.



Looking at record_in, I think I see what I could be doing better.

Again, thanks for the pointers.


-- 
Jon



Re: [PERFORM] copy vs. C function

2011-12-14 Thread Jon Nelson
On Wed, Dec 14, 2011 at 9:40 AM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 On Wed, Dec 14, 2011 at 9:25 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 Regarding caching, I tried caching it across calls by making the
 TupleDesc static and only initializing it once.
 When I tried that, I got:

 ERROR:  number of columns (6769856) exceeds limit (1664)

 I tried to find some documentation or examples that cache the
 information, but couldn't find any.

 You might find reading record_in to be helpful.  What it caches is not
 exactly what you need to, I think, but it shows the general principles.
 There are lots of other functions that use fn_extra to cache info, too.

 I will definitely look into those. I'm probably doing it wrong, but in
 the meantime, I allocated enough space (by way of MemoryContextAlloc)
 in TopMemoryContext for an AttInMetadata pointer, switched to that
 memory context (just the first time through), used CreateTupleDescCopy
 + TupleDescGetAttInMetadata to duplicate (in the new memory context)
 the TupleDesc, and then switched back. This approach seems to have
 dropped the total run time to about 54 seconds, the bulk of which is
 BuildTupleFromCStrings, a rather significant improvement.

 

 Looking at record_in, I think I see what I could be doing better.

Indeed. I revised the code to make use of fcinfo->flinfo->fn_extra for
storage and fcinfo->flinfo->fn_mcxt for the MemoryContext and
everything seemed to work just fine.

Assuming one *starts* with a char *some_var[8], would building Datum
myself be faster than using BuildTupleFromCStrings?

-- 
Jon



Re: [PERFORM] copy vs. C function

2011-12-14 Thread Jon Nelson
On Wed, Dec 14, 2011 at 9:51 AM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 On Wed, Dec 14, 2011 at 9:40 AM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 On Wed, Dec 14, 2011 at 9:25 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 Regarding caching, I tried caching it across calls by making the
 TupleDesc static and only initializing it once.
 When I tried that, I got:

 ERROR:  number of columns (6769856) exceeds limit (1664)

 I tried to find some documentation or examples that cache the
 information, but couldn't find any.

 You might find reading record_in to be helpful.  What it caches is not
 exactly what you need to, I think, but it shows the general principles.
 There are lots of other functions that use fn_extra to cache info, too.

 I will definitely look into those. I'm probably doing it wrong, but in
 the meantime, I allocated enough space (by way of MemoryContextAlloc)
 in TopMemoryContext for an AttInMetadata pointer, switched to that
 memory context (just the first time through), used CreateTupleDescCopy
 + TupleDescGetAttInMetadata to duplicate (in the new memory context)
 the TupleDesc, and then switched back. This approach seems to have
 dropped the total run time to about 54 seconds, the bulk of which is
 BuildTupleFromCStrings, a rather significant improvement.

 

 Looking at record_in, I think I see what I could be doing better.

 Indeed. I revised the code to make use of fcinfo->flinfo->fn_extra for
 storage and fcinfo->flinfo->fn_mcxt for the MemoryContext and
 everything seemed to work just fine.

 Assuming one *starts* with a char *some_var[8], would building Datum
 myself be faster than using BuildTupleFromCStrings?

The answer is: yes. At least, in my case it is.
The total run time is now down to about 32 seconds.
Versus the BuildTupleFromCStrings which takes about 54 seconds.
32 seconds is more than 10-15 seconds, but it's livable.

This experiment has been very worthwhile - thank you all for the help.

-- 
Jon



Re: [PERFORM] copy vs. C function

2011-12-13 Thread Jon Nelson
On Mon, Dec 12, 2011 at 10:38 AM, Merlin Moncure mmonc...@gmail.com wrote:
 On Sat, Dec 10, 2011 at 7:27 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 I was experimenting with a few different methods of taking a line of
 text, parsing it, into a set of fields, and then getting that info
 into a table.

 The first method involved writing a C program to parse a file, parse
 the lines and output newly-formatted lines in a format that
 postgresql's COPY function can use.
 End-to-end, this takes 15 seconds for about 250MB (read 250MB, parse,
 output new data to new file -- 4 seconds, COPY new file -- 10
 seconds).

 The next approach I took was to write a C function in postgresql to
 parse a single TEXT datum into an array of C strings, and then use
 BuildTupleFromCStrings. There are 8 columns involved.
 Eliding the time it takes to COPY the (raw) file into a temporary
 table, this method took 120 seconds, give or take.

 The difference was /quite/ a surprise to me. What is the probability
 that I am doing something very, very wrong?

 NOTE: the code that does the parsing is actually the same,
 line-for-line, the only difference is whether the routine is called by
 a postgresql function or by a C program via main, so obviously the
 overhead is elsewhere.
 NOTE #2: We are talking about approximately 2.6 million lines.


 Let me throw out an interesting third method I've been using to parse
 delimited text files that might be useful in your case.  This is
 useful when parsing text that is bad csv where values are not escaped
 or there are lines, incomplete and/or missing records, or a huge
 amount of columns that you want to rotate into a more normalized
 structure based on columns position.

 1. Import the data into a single column (containing the entire line)
 staging table, feeding the COPY parser a bogus delimiter
 2. 'Parse' the record with regexp_split_to_array (maybe in plpgsql function).
 3. Either loop the array (in 9.1 use FOR-IN-ARRAY construct), or, if
 you can work it into your problem, INSERT/SELECT, expanding the array
 with a trick like used in information_schema._pg_expandarray so you
 can hook logic on the array (column position).

If you replace [2] with my C function (which can process all of the
data, *postgresql overhead not included*, in about 1 second) then
that's what I did. It returns a composite type making [3] unnecessary.

I know it's not parsing, so I started a time-honored debugging
approach: I returned early.

Is the function-call overhead that high? That's outrageously high.
What else could it be? Is returning a composite type outrageously
expensive?
So here is what I did: I modified the code so that it immediately returns NULL.
Result: 2 seconds.
Extract arguments, allocate temporary work buffer: another 0.5 seconds.
Add parsing: another 1.5 seconds [total: 4.1 seconds]

and so on...

Two of the items require base conversion, so:
Calling strtol (twice) and snprintf (twice) -- adds *6* seconds.

and to format one of the items as an array (a strcpy and a strcat) --
add 1.5 seconds for a total of 11.5.

The only thing I have left are these statements:

get_call_result_type
TupleDescGetAttInMetadata
BuildTupleFromCStrings
HeapTupleGetDatum
and finally PG_RETURN_DATUM

It turns out that:
get_call_result_type adds 43 seconds [total: 54],
TupleDescGetAttInMetadata adds 19 seconds [total: 73],
BuildTupleFromCStrings accounts for 43 seconds [total: 116].

So those three functions account for 90% of the total time spent.
What alternatives exist? Do I have to call get_call_result_type /every
time/ through the function?

-- 
Jon



[PERFORM] select distinct uses index scan vs full table scan

2011-12-13 Thread Jon Nelson
I've got a 5GB table with about 12 million rows.
Recently, I had to select the distinct values from just one column.
The planner chose an index scan. The query took almost an hour.
When I forced index scan off, the query took 90 seconds (full table scan).

The planner estimated 70,000 unique values when, in fact, there are 12
million (the values in this column are *almost* but not quite unique).
What's more, despite bumping the statistics on that column up to 1000
and re-analyzing, the planner now thinks that there are 300,000 unique
values.
How can I tell the planner that a given column is much more unique
than, apparently, it thinks it is?
The column type is INET.
This is on PG 8.4.10 on Linux x86_64, with
81f4e6cd27d538bc27e9714a9173e4df353a02e5 applied.

-- 
Jon



Re: [PERFORM] select distinct uses index scan vs full table scan

2011-12-13 Thread Jon Nelson
On Tue, Dec 13, 2011 at 1:57 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 I've got a 5GB table with about 12 million rows.
 Recently, I had to select the distinct values from just one column.
 The planner chose an index scan. The query took almost an hour.
 When I forced index scan off, the query took 90 seconds (full table scan).

 Usually, we hear complaints about the opposite.  Are you using
 nondefault cost settings?

Cost settings had not been changed until a few minutes ago when your
response prompted me to try a few things.

I ended up changing the random_page_cost to 16.0 (from 4.0), partly
because the H/W RAID I'm using is awfully bad at random I/O. I'll
experiment and keep tabs on performance to see if this has a negative
effect on other aspects.

 The planner estimated 70,000 unique values when, in fact, there are 12
 million (the value for this row is *almost* but not quite unique).
 What's more, despite bumping the statistics on that column up to 1000
 and re-analyzing, the planner now thinks that there are 300,000 unique
 values.

 Accurate ndistinct estimates are hard, but that wouldn't have much of
 anything to do with this particular choice, AFAICS.

 How can I tell the planner that a given column is much more unique
 than, apparently, it thinks it is?

 9.0 and up have ALTER TABLE ... ALTER COLUMN ... SET n_distinct.

D'oh!  I'm on 8.4.10+patches.
This may provide the necessary push.
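
For reference, the 9.0+ per-column override looks roughly like this
(table and column names are hypothetical; a negative n_distinct means
"this fraction of the rows is distinct", so -0.95 says 95% of the rows
are distinct):

ALTER TABLE conn_log ALTER COLUMN src_addr SET (n_distinct = -0.95);
ANALYZE conn_log;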

-- 
Jon



[PERFORM] copy vs. C function

2011-12-10 Thread Jon Nelson
I was experimenting with a few different methods of taking a line of
text, parsing it into a set of fields, and then getting that info
into a table.

The first method involved writing a C program to parse a file, parse
the lines and output newly-formatted lines in a format that
postgresql's COPY function can use.
End-to-end, this takes 15 seconds for about 250MB (read 250MB, parse,
output new data to new file -- 4 seconds, COPY new file -- 10
seconds).

The next approach I took was to write a C function in postgresql to
parse a single TEXT datum into an array of C strings, and then use
BuildTupleFromCStrings. There are 8 columns involved.
Eliding the time it takes to COPY the (raw) file into a temporary
table, this method took 120 seconds, give or take.

The difference was /quite/ a surprise to me. What is the probability
that I am doing something very, very wrong?

NOTE: the code that does the parsing is actually the same, line-for-line;
the only difference is whether the routine is called by a postgresql
function or by a C program via main(), so obviously the overhead is
elsewhere.
NOTE #2: We are talking about approximately 2.6 million lines.

I was testing:

\copy some_table from 'some_file.csv' with csv
vs.
insert into some_table select (some_func(line)).* from some_temp_table;

where some_func had been defined with (one) IN TEXT and (8) OUT params
of varying types.

PostgreSQL 9.1.1 on Linux, x86_64

-- 
Jon



Re: [PERFORM] copy vs. C function

2011-12-10 Thread Jon Nelson
On Sat, Dec 10, 2011 at 8:32 PM, Craig Ringer ring...@ringerc.id.au wrote:
 On 12/11/2011 09:27 AM, Jon Nelson wrote:

 The first method involved writing a C program to parse a file, parse
 the lines and output newly-formatted lines in a format that
 postgresql's COPY function can use.
 End-to-end, this takes 15 seconds for about 250MB (read 250MB, parse,
 output new data to new file -- 4 seconds, COPY new file -- 10
 seconds).

 Why not `COPY tablename FROM /path/to/myfifo' ?

If I were to do this in any sort of production environment, that's
exactly what I would do. I was much more concerned about the /huge/
difference -- 10 seconds for COPY and 120 seconds for (INSERT INTO /
CREATE TABLE AS / whatever).
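
For anyone following along, a sketch of that FIFO arrangement (paths and
program names are hypothetical; the external parser writes its reformatted
output into the pipe while the server-side COPY reads from it):

-- shell side, before issuing the COPY:
--   mkfifo /tmp/parsed_fifo
--   ./parse_and_reformat some_file.csv > /tmp/parsed_fifo &
-- SQL side (COPY blocks until the writer closes the pipe):
COPY some_table FROM '/tmp/parsed_fifo' WITH CSV;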

 The next approach I took was to write a C function in postgresql to
 parse a single TEXT datum into an array of C strings, and then use
 BuildTupleFromCStrings. There are 8 columns involved.
 Eliding the time it takes to COPY the (raw) file into a temporary
 table, this method took 120 seconds, give or take.

 The difference was /quite/ a surprise to me. What is the probability
 that I am doing something very, very wrong?

 Have a look at how COPY does it within the Pg sources, see if that's any
 help. I don't know enough about Pg's innards to answer this one beyond that
 suggestion, sorry.

Ack.

Regarding a subsequent email, I was using full transactions.


-- 
Jon



Re: [PERFORM] external sort performance

2011-11-20 Thread Jon Nelson
On Sun, Nov 20, 2011 at 7:56 AM, Jeremy Harris j...@wizmail.org wrote:
 On 2011-11-17 17:10, Jon Nelson wrote:

 external sort ended, 7708696 disk blocks used: CPU 359.84s/57504.66u
 sec elapsed 58966.76 sec

 Am I to understand that the CPU portion of the sorting only took 6
 minutes but the sort itself took almost 16.5 hours and used approx
 60GB of disk space?


 I realise you've had helpful answers by now, but that reads
 as 16 hours of CPU time to me; mostly user-mode but with 6 minutes
 of system-mode.  98% CPU usage for the 16 hours elapsed.

Thank you very much!
I was going to post a followup asking for help interpreting the log
line, but now I don't have to. Do you happen to recall if disk I/O is
counted as user or system time? If it's counted as system time, then I
have more questions, namely:

If using a hash table (hash aggregation) shows that the GROUPing can
take place in 35 minutes, but a Group Aggregation takes 16 hours, how
much of that is CPU and how much is waiting for I/O?


-- 
Jon



[PERFORM] external sort performance

2011-11-17 Thread Jon Nelson
I have one query which does not run very often. Sometimes it may be
months between runs.
However, when it does get executed, it scans approximately 100
identically-structured tables (a form of partitioning), extracts and
groups on a subset of the columns, and creates a new table. The
individual table queries have no WHERE clauses; this is a full table
scan for every table.

I've tried all sorts of things to try to improve the performance,
which can take a /very/ long time.
We are talking about approximately 175GB of data before grouping/summarizing.

This is on PG 8.4.8 on Linux, 16GB of real RAM.
Most recently, I enabled trace_sort, disabled hash aggregation[1], and
set a large work_mem (normally very small, in this case I tried
anything from 8MB to 256MB. I even tried 1GB and 2GB).
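
A sketch of those per-session settings, for reference (the values are just
the ones being experimented with here, not recommendations):

SET trace_sort = on;        -- log sort progress lines like the one below
SET enable_hashagg = off;   -- force GroupAggregate instead of HashAggregate
SET work_mem = '256MB';     -- per-sort / per-hash memory budget
-- ... then run the grouping/summarizing query ...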

In the logs, I saw this:

external sort ended, 7708696 disk blocks used: CPU 359.84s/57504.66u
sec elapsed 58966.76 sec

Am I to understand that the CPU portion of the sorting only took 6
minutes but the sort itself took almost 16.5 hours and used approx
60GB of disk space?
The resulting summary table is about 5GB in size as reported by \d+ in
psql (and pg_relation_size).

The underlying storage is ext4 on a hardware raid 10 with a BBU.

What sorts of things should I be looking at to improve the performance
of this query? Is my interpretation of that log line totally off base?



[1] if I don't disable hash aggregation and the work_mem is over 8MB
in size, the memory allocation explodes to the point where postgresql
wants dozens of gigs of memory. I've tried setting the statistics as
high as 1000 without benefit.

-- 
Jon



Re: [PERFORM] external sort performance

2011-11-17 Thread Jon Nelson
I'll try to compile multiple questions/answers into a single response.

On Thu, Nov 17, 2011 at 11:16 AM, Claudio Freire klaussfre...@gmail.com wrote:
 On Thu, Nov 17, 2011 at 2:10 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 What sorts of things should I be looking at to improve the performance
 of this query? Is my interpretation of that log line totally off base?

 You'll have to post some more details.
 Like a query and an explain/explain analyze.

Please see below, however, I am also very interested to know if I'm
interpreting that log line correctly.

 Memory consumption probably skyrockets since you'll need at least one
 sort per table, so if you have 100+, then that's (at least) 100+
 sorts.

Right, that much I had understood.


On Thu, Nov 17, 2011 at 11:28 AM, Craig James
craig_ja...@emolecules.com wrote:
 You don't give any details about how and why you are sorting. Are you
 actually using all of the columns in your aggregated-data table in the sort
 operation?  Or just a few of them?

 You're making the sort operation work with 175 GB of data.  If most of that
 data is only needed for the report (not the sort), then separate it into two
 tables - one of just the data that the sorting/grouping needs, and the other
 with the rest of the data. Then create a view that joins it all back
 together for reporting purposes.

I'm not actually using any ORDER BY at all. This is purely a GROUP BY.
The sort happens because of the group aggregate (vs. hash aggregate).
Two of the columns are used to group, the other two are aggregates (SUM).

On Thu, Nov 17, 2011 at 11:55 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 This is on PG 8.4.8 on Linux, 16GB of real RAM.
 Most recently, I enabled trace_sort, disabled hash aggregation[1], and
 set a large work_mem (normally very small, in this case I tried
 anything from 8MB to 256MB. I even tried 1GB and 2GB).

 FWIW, I think hash aggregation is your best shot at getting reasonable
 performance.  Sorting 175GB of data is going to hurt no matter what.

 If the grouped table amounts to 5GB, I wouldn't have expected the hash
 table to be more than maybe 2-3X that size (although this does depend on
 what aggregates you're running...).  Letting the hash aggregation have
 all your RAM might be the best answer.

I'm re-running the query with work_mem set to 16GB (for just that query).
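
For reference, one way to scope a setting like that to a single query is
to set it locally inside the query's transaction, e.g.:

BEGIN;
SET LOCAL work_mem = '16GB';
-- ... run the GROUP BY query below ...
COMMIT;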

The query (with table and column names changed):

SELECT anon_1.columnA, sum(anon_1.columnB) AS columnB,
sum(anon_1.columnC) AS columnC, anon_1.columnD
FROM (
  SELECT columnA, columnB, columnC, columnD FROM tableA
  UNION ALL
  -- ... same SELECT / UNION ALL pattern for the 90-ish other tables ...
) AS anon_1
GROUP BY anon_1.columnA, anon_1.columnD
HAVING (anon_1.columnB) > 0

The explain verbose with work_mem = 16GB

 HashAggregate  (cost=54692162.83..54692962.83 rows=4 width=28)
   Output: columnA, sum(columnB), sum(columnC), columnD
   Filter: (sum(columnB) > 0)
   ->  Append  (cost=0.00..34547648.48 rows=1611561148 width=28)
         ->  Seq Scan on tableA  (cost=0.00..407904.40 rows=19045540 width=28)
               Output: columnA, columnB, columnC, columnD
         ... 90-ish more tables here ...

12 minutes into the query it is consuming 10.1GB of memory.
21 minutes into the query it is consuming 12.9GB of memory.
After just under 34 minutes it completed with about 15GB of memory being used.
That is a rather impressive improvement. Previously, I had been
advised against using a large work_mem value. I had never thought to
use one 3 times the size of the resulting table.

The explain verbose with enable_hashagg = false:

 GroupAggregate  (cost=319560040.24..343734257.46 rows=4 width=28)
   Output: columnA, sum(columnB), sum(columnC), columnD
   Filter: (sum(columnB) > 0)
   ->  Sort  (cost=319560040.24..323588943.11 rows=1611561148 width=28)
         Output: columnA, columnB, columnC, columnD
         Sort Key: columnA, columnD
         ->  Result  (cost=0.00..34547648.48 rows=1611561148 width=28)
               Output: columnA, columnB, columnC, columnD
               ->  Append  (cost=0.00..34547648.48 rows=1611561148 width=28)
                     ->  Seq Scan on tableA  (cost=0.00..407904.40 rows=19045540 width=28)
                           Output: columnA, columnB, columnC, columnD
                     ... 90-ish more tables here ...



--
Jon



Re: [PERFORM] external sort performance

2011-11-17 Thread Jon Nelson
A follow-up question.
Even with both work_mem and maintenance_work_mem equal to 16GB, I see this:

LOG:  0: begin index sort: unique = f, workMem = 16777216, randomAccess = f
and shortly thereafter:
LOG:  0: switching to external sort with 59919 tapes: CPU
2.59s/13.20u sec elapsed 16.85 sec
and a while later:
LOG:  0: finished writing run 1 to tape 0: CPU 8.16s/421.45u sec
elapsed 433.83 sec
LOG:  0: performsort done (except 2-way final merge): CPU
9.53s/561.56u sec elapsed 576.54 sec
LOG:  0: external sort ended, 181837 disk blocks used: CPU
12.90s/600.45u sec elapsed 625.05 sec


The first log statement is expected. The second log statement, however, isn't.
The total table size is (as noted earlier) about 5GB and, in fact, fit
into one nice hash table (approx 15GB in size).
Is the sorting that is necessary for index creation unable to use a
hash table? (This is a standard btree index).

-- 
Jon



Re: [PERFORM] Postgres INSERT performance and scalability

2011-09-19 Thread Jon Nelson
On Mon, Sep 19, 2011 at 7:53 PM, Stephen Frost sfr...@snowman.net wrote:
 Igor,

 * Igor Chudov (ichu...@gmail.com) wrote:
 Would the time that it takes, differ a great deal, depending on whether the
 table has only 100,000 or 5,000,000 records?

 Yes, because PostgreSQL is going to copy the data.  If you don't need or
 want it to be copied, just use a view.  I've never heard of any
 relational database implementing 'copy on write' type semantics, if
 that's what you're asking about.  Databases, unlike applications with
 code in memory that's constantly copied, are typically focused around
 minimizing duplication of data (since it all has to end up on disk at
 some point).  Not much point in having the overhead of COW for that kind
 of environment, I wouldn't think.

Isn't the WAL basically COW?

-- 
Jon



[PERFORM] bad plan: 8.4.8, hashagg, work_mem=1MB.

2011-06-20 Thread Jon Nelson
PostgreSQL 8.4.8 on CentOS 5.6, x86_64. Default settings except work_mem = 1MB.

NOTE: I am using partitioned tables here, and was querying the
'master' table. Perhaps this is a Known Issue.

I ran a query recently where the result was very large. The outer-most
part of the query looked like this:

 HashAggregate  (cost=56886512.96..56886514.96 rows=200 width=30)
   ->  Result  (cost=0.00..50842760.97 rows=2417500797 width=30)

The row count for 'Result' is in the right ballpark, but why does
HashAggregate think that it can turn 2 *billion* rows of strings (an
average of 30 bytes long) into only 200?  This is my primary concern.
If I don't disable hash aggregation, postgresql quickly consumes huge
quantities of memory and eventually gets killed by the OOM manager.



After manually disabling hash aggregation, I ran the same query. It's
been running for over 2 days now. The disk is busy but actual data
transferred is very low. Total data size is approx. 250GB, perhaps a
bit less.

The query scans 160 or so tables for data. If I use a distinct + union
on each table, the plan looks like this:

 Unique  (cost=357204094.44..357318730.75 rows=22927263 width=28)
   ->  Sort  (cost=357204094.44..357261412.59 rows=22927263 width=28)

23 million rows is more like it, and the cost is much lower. What is
the possibility that distinct/unique operations can be pushed down
into queries during the planning stage to see if they are less
expensive?

In this case, postgresql believes (probably correctly, I'll let you
know) that distinct(column foo from tableA + column foo from tableB +
column foo from tableC ...) is more expensive than distinct(distinct
column foo from tableA + distinct column foo from tableB  ).

-- 
Jon



Re: [PERFORM] bad plan: 8.4.8, hashagg, work_mem=1MB.

2011-06-20 Thread Jon Nelson
On Mon, Jun 20, 2011 at 11:08 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 I ran a query recently where the result was very large. The outer-most
 part of the query looked like this:

  HashAggregate  (cost=56886512.96..56886514.96 rows=200 width=30)
     ->  Result  (cost=0.00..50842760.97 rows=2417500797 width=30)

 The row count for 'Result' is in the right ballpark, but why does
 HashAggregate think that it can turn 2 *billion* rows of strings (an
 average of 30 bytes long) into only 200?

 200 is the default assumption about number of groups when it's unable to
 make any statistics-based estimate.  You haven't shown us any details so
 it's hard to say more than that.

What sorts of details would you like? The row count for the Result
line is approximately correct -- the stats for all tables are up to
date (the tables never change after import).  The statistics target is
set to 100 currently.


-- 
Jon



[PERFORM] plan variations: join vs. exists vs. row comparison

2011-03-07 Thread Jon Nelson
Originally, I posted to -general but I found some time to write some
samples, and realized it's probably more of a performance question.

The original post is here:
http://archives.postgresql.org/pgsql-general/2011-03/msg00198.php

I was hoping that somebody could help me understand the differences
between three plans.
All of the plans are updating a table using a second table, and should
be logically equivalent.
Two of the plans use joins, and one uses an exists subquery.
One of the plans uses row constructors and IS NOT DISTINCT FROM. It is
this plan which has really awful performance.
Clearly it is due to the nested loop, but why would the planner choose
that approach?

I also don't understand why in the 'exists' plan the planner thinks
the index scan will provide 1019978 rows, when there are only 100,
but that is a lesser issue.

Here is a sample SQL file which demonstrates the issues and includes
all three variations.

begin;
create temporary table t7 (
  i BIGINT NOT NULL,
  k BIGINT
);

create temporary table t8 (
  i BIGINT NOT NULL,
  j INT
);

CREATE FUNCTION populate_t8()
RETURNS VOID
LANGUAGE SQL
AS
$$
truncate t8;
insert into t8
SELECT i, 1 from t7
ORDER BY i LIMIT 1;

insert into t8
SELECT i, 2 from t7
WHERE i > 1
ORDER BY i LIMIT 1;

insert into t8
SELECT i, 3 from t7
WHERE i > 2
ORDER BY i LIMIT 2;

analyze t8;
$$;

INSERT INTO t7
select x, x + 10 from generate_series(1,100) as x ;
analyze t7;

select populate_t8();

explain analyze verbose
update
  t7
SET
  k = 1
FROM
  t8
WHERE
  t7.i = t8.i
  AND
  (
t8.j = 2
OR
t8.j = 1
  );

select populate_t8();

explain analyze verbose
update
  t7
SET
  k = 1
WHERE
  EXISTS (
SELECT 1 FROM t8
WHERE t8.i = t7.i
AND
(
  t8.j = 2
  OR
  t8.j = 1
)
  );

select populate_t8();

explain
update
  t7
SET
  k = 1
FROM
  t8
WHERE
  ROW(t7.i) IS NOT DISTINCT FROM ROW(t8.i)
  AND
  (
t8.j = 2
OR
t8.j = 1
  );

explain analyze verbose
update
  t7
SET
  k = 1
FROM
  t8
WHERE
  ROW(t7.i) IS NOT DISTINCT FROM ROW(t8.i)
  AND
  (
t8.j = 2
OR
t8.j = 1
  );

rollback;





-- 
Jon



Re: [PERFORM] plan variations: join vs. exists vs. row comparison

2011-03-07 Thread Jon Nelson
On Mon, Mar 7, 2011 at 2:00 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 I was hoping that somebody could help me understand the differences
 between three plans.
 All of the plans are updating a table using a second table, and should
 be logically equivalent.
 Two of the plans use joins, and one uses an exists subquery.
 One of the plans uses row constructors and IS NOT DISTINCT FROM. It is
 this plan which has really awful performance.
 Clearly it is due to the nested loop, but why would the planner choose
 that approach?

 IS NOT DISTINCT FROM pretty much disables all optimizations: it can't be
 an indexqual, merge join qual, or hash join qual.  So it's not
 surprising that you get a sucky plan for it.  Possibly somebody will
 work on improving that someday.

 As for your other questions, what PG version are you using?  Because I
 do get pretty much the same plan (modulo a plain join versus a semijoin)
 for the first two queries, when using 9.0 or later.  And the results of
 ANALYZE are only approximate, so you shouldn't be surprised at all if a
 rowcount estimate is off by a couple percent.  Most of the time, you
 should be happy if it's within a factor of 2 of reality.

Sorry - I had stated in the original post that I was using 8.4.5 on 64
bit openSUSE and CentOS 5.5, and had forgotten to carry that
information over into the second post.

What is the difference between a plain join and a semi join?

-- 
Jon



Re: [HACKERS] [PERFORM] Slow count(*) again...

2011-02-03 Thread Jon Nelson
On Thu, Feb 3, 2011 at 7:41 AM, Kenneth Marshall k...@rice.edu wrote:
 On Thu, Feb 03, 2011 at 02:11:58AM -0800, da...@lang.hm wrote:
 On Thu, 3 Feb 2011, Vitalii Tymchyshyn wrote:

 On 02.02.11 20:32, Robert Haas wrote:
 Yeah.  Any kind of bulk load into an empty table can be a problem,
 even if it's not temporary.  When you load a bunch of data and then
 immediately plan a query against it, autoanalyze hasn't had a chance
 to do its thing yet, so sometimes you get a lousy plan.

 Maybe introducing something like an 'AutoAnalyze' threshold will help? I
 mean that any insert/update/delete statement that changes more than x% of
 a table (and no fewer than y records) must do an analyze right after it
 has finished.
 Defaults like x=50, y=1 seem quite good to me.

 If I am understanding things correctly, a full Analyze is going over all
 the data in the table to figure out patterns.

 If this is the case, wouldn't it make sense in the situation where you are
 loading an entire table from scratch to run the Analyze as you are
 processing the data? If you don't want to slow down the main thread that's
 inserting the data, you could copy the data to a second thread and do the
 analysis while it's still in RAM rather than having to read it off of disk
 afterwards.

 this doesn't make sense for updates to existing databases, but the use case
 of loading a bunch of data and then querying it right away isn't _that_
 uncommon.

 David Lang


 +1 for in-flight ANALYZE. This would be great for bulk loads of
 real tables as well as temp tables.

Yes, please, that would be really nice.




-- 
Jon



Re: [PERFORM] Any experience using shake defragmenter?

2011-02-01 Thread Jon Nelson
On Tue, Feb 1, 2011 at 1:24 PM, Greg Smith g...@2ndquadrant.com wrote:
 Mladen Gogala wrote:

 Did anyone try using shake while the cluster is active? Any problems
 with corruption or data loss? I ran the thing on my home directory and
 nothing was broken. I didn't develop any performance test, so cannot vouch
 for the effectiveness of the procedure. Did anyone play with that? Any
 positive or negative things to say about shake?


 Shake works by allocating a new file the size of the original, in what is
 presumed to be then be unfragmented space.  It copies the original over to
 this new space and then gets rid of the original.  That procedure will cause
 database corruption if the server happens to access the file it's moving
 while it's in the middle of doing so.  If the database isn't running,
 though, it is probably fine.

 On ext3 you can measure whether it was useful or not by taking the
 filesystem off-line and running fsck before/after using it.  Look for
 percentages given for non-contiguous files and directories.  If those were
 low to begin with, little reason to run the utility.  If they're high,
 running shake should bring them down afterwards if it's doing its job right.

 On a PostgreSQL database system, you can get the same basic effect while
 leaving the server up--but just with the table locked--using CLUSTER.  And
 that will clean up a bunch of other potential messes inside the database
 that shake can't touch.  I just do that instead if I'm worried a particular
 table has become fragmented on disk.

One thing to note is that, in my experiments, ext4 handles large files
(such as the 1GiB files postgresql uses for large relations) in a
*vastly* improved manner over ext3.  This is due to the use of
extents.  I found that, in some cases, heavily fragmented files under
ext3 could not be effectively defragmented - and yes, I tried shake
and some others (including one I wrote which *does* use fallocate /
posix_fallocate). There was improvement, but by far the biggest
improvement was switching to ext4.

Instead of something like 'shake' (which more or less works, even
though it doesn't use fallocate and friends) I frequently use either
CLUSTER (which is what Greg Smith is suggesting) or a series of ALTER
TABLE ... ALTER COLUMN... which rewrites the table.  With PG 9 perhaps
VACUUM FULL is more appropriate.  Of course, the advice regarding
using 'shake' (or any other defragmenter) on a live postgresql data
directory is excellent - the potential for causing damage if the
database is active during that time is very high.
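
For reference, the two in-database rewrite approaches mentioned above look
like this (table, column, and index names are hypothetical; on the
8.4/9.0-era servers discussed here a same-type ALTER forces a full table
rewrite):

-- CLUSTER rewrites the table in index order (takes an exclusive lock):
CLUSTER some_table USING some_table_pkey;

-- A same-type ALTER also rewrites the table without changing the data:
ALTER TABLE some_table ALTER COLUMN id TYPE bigint;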

-- 
Jon



Re: [PERFORM] queries with lots of UNIONed relations

2011-01-15 Thread Jon Nelson
On Fri, Jan 14, 2011 at 2:11 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 On Thu, Jan 13, 2011 at 6:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 If you have enough memory to de-dup them individually, you surely have
 enough to de-dup all at once.

 If everything were available up-front, sure.
 However, and please correct me if I'm wrong, but doesn't postgresql
 work in a fairly linear fashion, moving from table to table performing
 a series of operations on each?

 Doing a single sort+uniq works like that.  But the alternate plan you
 are proposing we should consider involves building all the lower
 hashtables, and then reading from them to fill the upper hashtable.
 Max memory consumption *is* worst case here.  Remember HashAggregate
 is incapable of swapping to disk (and if it did, you wouldn't be nearly
 as pleased with its performance).

 That's not exactly what I'm proposing - but it is probably due to a
 lack of understanding some of the underlying details of how postgresql
 works. I guess I had assumed that the result of a HashAggregate or any
 other de-duplication process was a table-like structure.

And I assumed wrong, I think. I dug into the code (nodeHash.c and
others) and I think I understand now why HashAggregate works only in
certain circumstances, and I think I understand your comments a bit
better now.  Basically, HashAggregate doesn't stream unique Nodes the
way nodeUnique.c does. nodeUnique basically emits Nodes and elides
subsequent, identical Nodes, which is why it relies on the input being
sorted.  HashAggregate works only on entire input sets at once, and
nodeHash.c doesn't emit Nodes at all, really.

This makes me wonder if nodeHash.c and nodeHashjoin.c couldn't be
modified to output Nodes in a streaming fashion. The memory
requirements would not be any worse than now.

Does postgresql support any sort of merge sort?  If it did, then if
the hashtable started consuming too much memory, it could be cleared
and the nodes output from the new hashtable could be directed to
another temporary file, and then a merge sort could be performed on
all of the temporary files (and thus Unique could be used to effect
the UNION operation).

-- 
Jon



Re: [PERFORM] queries with lots of UNIONed relations

2011-01-14 Thread Jon Nelson
On Thu, Jan 13, 2011 at 6:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 If you have enough memory to de-dup them individually, you surely have
 enough to de-dup all at once.

 If everything were available up-front, sure.
 However, and please correct me if I'm wrong, but doesn't postgresql
 work in a fairly linear fashion, moving from table to table performing
 a series of operations on each?

 Doing a single sort+uniq works like that.  But the alternate plan you
 are proposing we should consider involves building all the lower
 hashtables, and then reading from them to fill the upper hashtable.
 Max memory consumption *is* worst case here.  Remember HashAggregate
 is incapable of swapping to disk (and if it did, you wouldn't be nearly
 as pleased with its performance).

That's not exactly what I'm proposing - but it is probably due to a
lack of understanding some of the underlying details of how postgresql
works. I guess I had assumed that the result of a HashAggregate or any
other de-duplication process was a table-like structure.

Regarding being pleased with hash aggregate - I am! - except when it
goes crazy and eats all of the memory in the machine. I'd trade a bit
of performance loss for not using up all of the memory and crashing.

However, maybe I'm misunderstanding how SELECT DISTINCT works internally.
In the case where a hashtable is used, does postgresql utilize a
table-like structure or does it remain a hashtable in memory?

If it's a hashtable, couldn't the hashtable be built on-the-go rather
than only after all of the underlying tuples are available?

I'd love a bit more explanation as to how this works.

Another example of where this might be useful:   I'm currently running
a SELECT DISTINCT query over some 500 million rows (120 contributory
tables). I expect a de-duplicated row count of well under 10% of that
500 million, probably below 1%. The plan as it stands is to execute a
series of sequential scans, appending each of the rows from each
contributory table and then aggregating them. If the expected
distinctness of each contributory subquery is, say, 5% then instead of
aggregating over 500 million tuples the aggregation would take place
over 25 million. In this case, that is a savings of 10 gigabytes,
approximately.

Yes, it's true, the same amount of data has to be scanned. However,
the amount of data that needs to be stored (in memory or on disk) in
order to provide a final de-duplication is much smaller.

-- 
Jon



[PERFORM] queries with lots of UNIONed relations

2011-01-13 Thread Jon Nelson
I was recently asked to look into why one particular set of queries
was taking a long time. The queries are all of the same form.  They
select the UNION of a few
columns from around 100 tables.

The query in particular was taking some 7-8 minutes to run.

On a whim, I changed the query from this form:

SELECT a, b FROM FOO_a WHERE conditions
UNION
SELECT a,b FROM FOO_b WHERE conditions


to:

SELECT DISTINCT a,b FROM FOO_a WHERE conditions
UNION
SELECT DISTINCT a,b FROM FOO_b WHERE conditions
...

and the query time dropped to under a minute.

In the former case, the query plan was a bitmap heap scan for each
table. Then those results were Appended, Sorted, Uniqued, Sorted
again, and then returned.

In the latter, before Appending, each table's results were run through
HashAggregate.

The total number of result rows is in the 500K range. Each table holds
approximately 150K matching rows (but this can vary a bit).

What I'm asking is this: since adding DISTINCT to each participating
member of the UNION query reduced the total number of appended rows,
is there some sort of heuristic that postgresql could use to do this
automatically?  The 12x speedup was quite nice.

-- 
Jon



Re: [PERFORM] queries with lots of UNIONed relations

2011-01-13 Thread Jon Nelson
On Thu, Jan 13, 2011 at 11:13 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 In the former case, the query plan was a bitmap heap scan for each
 table. Then those results were Appended, Sorted, Uniqued, Sorted
 again, and then returned.

 In the latter, before Appending, each table's results were run through
 HashAggregate.

 Probably the reason it did that is that each individual de-duplication
 looked like it would fit in work_mem, but a single de-duplication
 didn't.  Consider raising work_mem, at least for this one query.

I raised work_mem to as high as 512MB (SET LOCAL work_mem = '512MB',
within the transaction).  Nice. Instead of 7-10 minutes the result is
now about a minute (the same as with individual de-duplication).

Your comment regarding "each individual de-duplication looked like it
would fit in work_mem" doesn't really make sense to me, exactly. Maybe I'm
misunderstanding you.

What I'm asking is this: can postgresql apply a de-duplication to each
member of a UNION (as I did with SELECT DISTINCT) in order to reduce
the total number of rows that need to be de-duplicated when all of the
rows have been Appended?

The results of the various plans/tweaks are:

Initial state: (work_mem = 16MB, no DISTINCT, run time of 7-10 minutes):
Unique (Sort (Append ( Lots of Bitmap Heap Scans Here ) ) )

and (work_mem = 16MB, with DISTINCT, run time of ~ 1 minute):
HashAggregate ( Append ( Lots Of HashAggregate( Bitmap Heap Scan ) ) )

and (work_mem = 64kB, DISTINCT, run time of *15+ minutes*):
Unique (Sort ( Append ( Lots Of HashAggregate( Bitmap Heap Scan ) ) ) )

So I take from this the following:

1. if the result set fits in work_mem, hash aggregate is wicked fast.
About 1 jillion times faster than Unique+Sort.

2. it would be nifty if postgresql could be taught, in a UNION, to
de-duplicate each contributory relation so as to reduce the total
set of rows that need to be re-de-duplicated. It's extra work, true,
and maybe there are some tricks here, but it seems to make a big
difference. This is useful so that the total result set is small
enough that hash aggregate might apply.

NOTE:

I have to have work_mem really low as a global on this machine because
other queries involving the same tables (such as those that involve
UNION ALL for SUM() or GROUP BY operations) cause the machine to run
out of memory. Indeed, even with work_mem at 1MB I run the machine out
of memory if I don't explicitly disable hashagg for some queries. Can
anything be done about that?


-- 
Jon



Re: [PERFORM] queries with lots of UNIONed relations

2011-01-13 Thread Jon Nelson
On Thu, Jan 13, 2011 at 2:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 Your comment regarding each individual de-duplication looked like it
 would fit in work_mem doesn't really make sense, exactly. Maybe I'm
 misunderstanding you.

 Yeah.  What I was suggesting was to NOT add the DISTINCT's, but instead
 raise work_mem high enough so you get just one HashAggregation step at
 the top level.  (Although I think this only works in 8.4 and up.)
 That should be faster than two levels of de-duplication.

Gave it a try -- performance either way doesn't seem to change -
although the final set that has to undergo de-duplication is rather
larger (WITHOUT DISTINCT) so I still run the risk of not getting Hash
Aggregation.

Since having the DISTINCT doesn't seem to hurt, and it avoids
(potential) significant pain, I'll keep it.

I still think that having UNION do de-duplication of each contributory
relation is a beneficial thing to consider -- especially if postgresql
thinks the uniqueness is not very high.

Thanks!

-- 
Jon





Re: [PERFORM] queries with lots of UNIONed relations

2011-01-13 Thread Jon Nelson
On Thu, Jan 13, 2011 at 4:49 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jan 13, 2011 at 5:47 PM, Andy Colson a...@squeakycode.net wrote:
 I don't believe there is any case where hashing each individual relation
 is a win compared to hashing them all together.  If the optimizer were
 smart enough to be considering the situation as a whole, it would always
 do the latter.

 You might be right, but I'm not sure.  Suppose that there are 100
 inheritance children, and each has 10,000 distinct values, but none of
 them are common between the tables.  In that situation, de-duplicating
 each individual table requires a hash table that can hold 10,000
 entries.  But deduplicating everything at once requires a hash table
 that can hold 1,000,000 entries.

 Or am I all wet?

 Yeah, I'm all wet, because you'd still have to re-de-duplicate at the
 end.  But then why did the OP get a speedup?  *scratches head*

 Because it all fit in memory and didn't swap to disk?

 Doesn't make sense.  The re-de-duplication at the end should use the
 same amount of memory regardless of whether the individual relations
 have already been de-duplicated.

I don't believe that to be true.
Assume 100 tables each of which produces 10,000 rows from this query.
Furthermore, let's assume that there are 3,000 duplicates per table.

Without DISTINCT:
uniqify (100 * 10,000 = 1,000,000 rows)

With DISTINCT:
uniqify (100 * (10,000 - 3,000) = 700,000 rows)

300,000 rows times (say, 64 bytes/row) is a bit over 18MB.
Not a lot, but more than the work_mem of 16MB.

Or maybe *I'm* all wet?

-- 
Jon



Re: [PERFORM] queries with lots of UNIONed relations

2011-01-13 Thread Jon Nelson
On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Jan 13, 2011 at 5:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I don't believe there is any case where hashing each individual relation
 is a win compared to hashing them all together.  If the optimizer were
 smart enough to be considering the situation as a whole, it would always
 do the latter.

 You might be right, but I'm not sure.  Suppose that there are 100
 inheritance children, and each has 10,000 distinct values, but none of
 them are common between the tables.  In that situation, de-duplicating
 each individual table requires a hash table that can hold 10,000
 entries.  But deduplicating everything at once requires a hash table
 that can hold 1,000,000 entries.

 Or am I all wet?

 If you have enough memory to de-dup them individually, you surely have
 enough to de-dup all at once.  It is not possible for a single hashtable
 to have worse memory consumption than N hashtables followed by a union
 hashtable, and in fact if there are no common values then the latter eats
 twice as much space because every value appears twice in two different
 hashtables.

If everything were available up-front, sure.
However, and please correct me if I'm wrong, but doesn't postgresql
work in a fairly linear fashion, moving from table to table performing
a series of operations on each? That seems to indicate that is what
the plan is:

Compare:

for each table LOOP
  scan table for result rows, append to results
END LOOP
hash / sort + unique results

versus:

for each table LOOP
  scan table for result rows, append to table-results
  hash / sort+unique table-results, append to results
END LOOP
hash / sort + unique results

In the former case, all of the result rows from all tables are
appended together before the de-duplification process can start.

In the latter case, only enough memory for each table's result set is
necessary for de-duplification, and it would only be necessary to
allocate it for that table.

Is that not how this works?

-- 
Jon



Re: [PERFORM] Update problem on large table

2010-12-06 Thread Jon Nelson
On Mon, Dec 6, 2010 at 1:46 PM, bricklen brick...@gmail.com wrote:
 On Sat, Dec 4, 2010 at 11:45 AM, felix crucialfe...@gmail.com wrote:
 Ok, I caught one : an update that is stuck in waiting.
 the first one blocks the second one.
 ns      |    5902 | nssql   | UPDATE fastadder_fastadderstatus SET built
 = false WHERE fastadder_fastadderstatus.service_id = 1

 Not sure if anyone replied about killing your query, but you can do it like 
 so:

 select pg_cancel_backend(5902);  -- assuming 5902 is the pid of the
 query you want canceled.

How does this differ from just killing the pid?

-- 
Jon



Re: [PERFORM] Defaulting wal_sync_method to fdatasync on Linux for 9.1?

2010-11-17 Thread Jon Nelson
On Wed, Nov 17, 2010 at 3:24 PM, Greg Smith g...@2ndquadrant.com wrote:
 Scott Carey wrote:

 Did you recompile your test on the RHEL6 system?

 On both systems I showed, I checked out a fresh copy of the PostgreSQL 9.1
 HEAD from the git repo, and compiled that on the server, to make sure I was
 pulling in the appropriate kernel headers.  I wasn't aware of exactly how
 the kernel sync stuff was refactored though, thanks for the concise update
 on that.  I can do similar tests on a RHEL5 system, but not on the same
 hardware.  Can only make my laptop boot so many operating systems at a time
 usefully.

One thing to note is that where on a disk things sit can make a /huge/
difference - depending on whether Ubuntu is /here/ and RHEL is /there/
and so on, you can see a factor of 2 or more difference.  The outside
tracks of most modern SATA disks can do around 120MB/s; the inside
tracks aren't even half of that.

-- 
Jon



[PERFORM] do temporary tables have hint bits?

2010-11-13 Thread Jon Nelson
I was doing some testing with temporary tables using this sql:

begin;
select pg_sleep(30);
create temporary TABLE foo (a int, b int, c int, d text);
insert into foo SELECT (x%1000) AS a,(x%1001) AS b, (x % 650) as c, ''
as d  FROM generate_series( 1, 100 ) AS x;
-- create temporary TABLE foo AS SELECT (x%1000) AS a,(x%1001) AS b,
(x % 650) as c, '' as d  FROM generate_series( 1, 100 ) AS x;
select count(1) from foo;


While it was in pg_sleep, I would attach to the backend process with strace.
I observed a few things that I don't yet understand, but one thing I
did notice was an I/O pattern (following the count(1)) that seemed to
suggest that the table was getting its hint bits set. I thought hint
bits were just for the mvcc side of things?  If this is a temporary
table, is there any need or benefit to setting hint bits?

-- 
Jon



Re: [PERFORM] temporary tables, indexes, and query plans

2010-11-13 Thread Jon Nelson
On Sat, Nov 13, 2010 at 9:41 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 OK. This is a highly distilled example that shows the behavior.

 BEGIN;
 CREATE TEMPORARY TABLE foo AS SELECT x AS A, chr(x % 75 + 32) AS b,
 ''::text AS c from generate_series(1,500) AS x;
 UPDATE foo SET c = 'foo' WHERE b = 'A' ;
 CREATE INDEX foo_b_idx on foo (b);
 [ and the rest of the transaction can't use that index ]

 OK, this is an artifact of the HOT update optimization.  Before
 creating the index, you did updates on the table that would have been
 executed differently if the index had existed.  When the index does get
 created, its entries for those updates are incomplete, so the index
 can't be used in transactions that could in principle see the unmodified
 rows.

Aha!  When you indicated that HOT updates were part of the problem, I
googled HOT updates for more detail and ran across this article:
http://pgsql.tapoueh.org/site/html/misc/hot.html
which was very useful in helping me to understand things.

If I understand things correctly, after a tuple undergoes a HOT-style
update, there is a chain from the original tuple to the updated tuple.
If an index already exists on the relation (and involves the updated
column), a *new entry* in the index is created.  However, if an index
does not already exist and one is created (which involves a column
with tuples that underwent HOT update) then it seems as though the
index doesn't see either version. Is that description inaccurate?

What would the effect be of patching postgresql to allow indexes to
see and follow the HOT chains during index creation?

The reason I did the update before the index creation is that the
initial update (in the actual version, not this test version) updates
2.8 million of some 7.5 million rows (or a bit under 40% of the entire
table), and such a large update seems like it would have a deleterious
effect on the index (although in either case the planner properly
chooses a sequential scan for this update).

 You could avoid this effect either by creating the index before you do
 any updates on the table, or by not wrapping the entire process into a
 single transaction.
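
Using the distilled test case from elsewhere in this thread, the first of
those options (still a single transaction, but with the index built first)
would look like this:

BEGIN;
CREATE TEMPORARY TABLE foo AS
  SELECT x AS a, chr(x % 75 + 32) AS b, ''::text AS c
  FROM generate_series(1,500) AS x;
CREATE INDEX foo_b_idx ON foo (b);        -- create the index first ...
UPDATE foo SET c = 'foo' WHERE b = 'A';   -- ... then do the bulk update
-- subsequent statements in this transaction can now use foo_b_idx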

I need the whole thing in a single transaction because I make
/extensive/ use of temporary tables and many dozens of statements that
need to either succeed or fail as one.

Is this HOT update optimization interaction with indexes documented
anywhere? It doesn't appear to be common knowledge as there are now 20
messages in this topic and this is the first mention of the HOT
updates / index interaction. I would like to suggest that an update to
the CREATE INDEX documentation might contain some caveats about
creating indexes in transactions on relations that might have HOT
updates.

Again, I'd like to thank everybody for helping me to figure this out.
It's not a huge burden to create the index before the updates, but
understanding *why* it wasn't working (even if it violates the
principle-of-least-surprise) helps quite a bit.


-- 
Jon



Re: [PERFORM] postmaster consuming /lots/ of memory with hash aggregate. why?

2010-11-12 Thread Jon Nelson
On Thu, Nov 11, 2010 at 10:38 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
 2010/11/12 Jon Nelson jnelson+pg...@jamponi.net:
 On Thu, Nov 11, 2010 at 10:26 PM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 Hello

 look on EXPLAIN ANALYZE command. Probably your statistic are out, and
 then planner can be confused. EXPLAIN ANALYZE statement show it.

 As I noted earlier, I did set statistics to 1000 an re-ran vacuum
 analyze and the plan did not change.

 this change can do nothing; it is just the default in the config. Did you
 use ALTER TABLE ... ALTER COLUMN ... SET STATISTICS ...? And ANALYZE?

No. To be clear: are you saying that changing the value for
default_statistics_target, restarting postgresql, and re-running
VACUUM ANALYZE does *not* change the statistics for columns
created/populated *prior* to the sequence of operations, and that one
/must/ use ALTER TABLE ALTER COLUMN SET STATISTICS ... and re-ANALYZE?

That does not jive with the documentation, which appears to suggest
that setting a new default_statistics_target, restarting postgresql,
and then re-ANALYZE'ing a table should be sufficient (provided the
columns have not had a statistics target explicitly set).
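
For clarity, the per-column form being discussed is (using one of the
tables from this thread as the example; the choice of columns is
illustrative):

ALTER TABLE d_2010_09_12_sum ALTER COLUMN i SET STATISTICS 1000;
ALTER TABLE d_2010_09_12_sum ALTER COLUMN n SET STATISTICS 1000;
ANALYZE d_2010_09_12_sum;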

 What other diagnostics can I provide? This still doesn't answer the
 4 row question, though. It seems absurd to me that the planner
 would give up and just use 4 rows (0.02 percent of the actual
 result).


 there can be some not-well-supported operation; then the planner uses
 some percentage of the rows instead of a statistics-based estimate

The strange thing is that the value 4 keeps popping up in totally
different contexts, with different tables, databases, etc... I tried
digging through the code and the only thing I found was that numGroups
was being set to 4 but I couldn't see where.

-- 
Jon



Re: [PERFORM] temporary tables, indexes, and query plans

2010-11-12 Thread Jon Nelson
On Thu, Oct 28, 2010 at 9:23 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 On Wed, Oct 27, 2010 at 5:36 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 It thinks it's faster, or there is some reason why it *can't* use the
 index, like a datatype mismatch.  You could tell which by trying set
 enable_seqscan = off to see if that will make it change to another
 plan; if so, the estimated costs of that plan versus the original
 seqscan would be valuable information.

 When I place the index creation and ANALYZE right after the bulk
 update, follow it with 'set enable_seqscan = false', the next query
 (also an UPDATE - should be about 7 rows) results in this plan:

 Seq Scan on foo_table  (cost=100.00..1004998.00 rows=24 
 width=236)

 OK, so it thinks it can't use the index.  (The cost=100 bit is
 the effect of enable_seqscan = off: it's not possible to just never use
 seqscans, but we assign an artificially high cost to discourage the
 planner from selecting them if there's any other alternative.)

 So we're back to wondering why it can't use the index.  I will say
 once more that we could probably figure this out quickly if you'd
 post an exact example instead of handwaving.

OK. This is a highly distilled example that shows the behavior.
The ANALYZE doesn't appear to change anything, nor the SET STATISTICS
(followed by ANALYZE), nor disabling sequential scans. Re-writing the
table with ALTER TABLE does, though.
If the initial UPDATE (the one before the index creation) is commented
out, then the subsequent updates don't use sequential scans.

\timing off
BEGIN;
CREATE TEMPORARY TABLE foo AS SELECT x AS A, chr(x % 75 + 32) AS b,
''::text AS c from generate_series(1,500) AS x;
UPDATE foo SET c = 'foo' WHERE b = 'A' ;
CREATE INDEX foo_b_idx on foo (b);

-- let's see what it looks like
EXPLAIN UPDATE foo SET c='bar' WHERE b = 'C';

-- does forcing a seqscan off help?
set enable_seqscan = false;
EXPLAIN UPDATE foo SET c='bar' WHERE b = 'C';

-- what about analyze?
ANALYZE VERBOSE foo;
EXPLAIN UPDATE foo SET c='bar' WHERE b = 'C';

-- what about statistics?
ALTER TABLE foo ALTER COLUMN b SET STATISTICS 1;
ANALYZE VERBOSE foo;
EXPLAIN UPDATE foo SET c='bar' WHERE b = 'C';

-- let's re-write the table
ALTER TABLE foo ALTER COLUMN a TYPE int;
EXPLAIN UPDATE foo SET c='bar' WHERE b = 'C';

ROLLBACK;

-- 
Jon



Re: [PERFORM] anti-join chosen even when slower than old plan

2010-11-11 Thread Jon Nelson
On Thu, Nov 11, 2010 at 1:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Besides the fully-scanned object size relative to relation size
 costing adjustment idea, the only one which seemed to be likely to
 be useful for this sort of issue was the costing factors by user
 ID idea -- the interactive queries hitting the well-cached portion
 of the tables are run through a read-only user ID, while the weekly
 maintenance scripts (obviously) are not.  With the settings I
 initially had assigned to the cluster the maintenance scripts would
 never have seen this issue; it was tuning to resolve end-user
 complaints of slowness in the interactive queries which set up the
 conditions for failure, and if I'd had per-user settings, I probably
 would have (and definitely *should* have) used them.

 Erm ... you can in fact do ALTER USER SET random_page_cost today.
 As long as the settings are GUC parameters we have quite a lot of
 flexibility about how to control them.  This gets back to my earlier
 point that our current form of per-relation properties (reloptions) is
 considerably less flexible than a GUC.  I think that if we create any
 strong planner dependence on such properties, we're going to end up
 needing to be able to set them in all the same ways you can set a GUC.

In Kevin's particular case, would this mechanism not help? By that I
mean he could have two users: one user for the daily,
the-tables-ought-to-be-in-hot-cache use case. The other user could make
use of the ALTER USER SET ... mechanism to drive the weekly reporting
(tables are probably not hot) use case.
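
A sketch of what that would look like (role names and cost values are
hypothetical):

-- interactive sessions mostly hit well-cached tables:
ALTER USER interactive_app SET random_page_cost = 2.0;
-- the weekly maintenance scripts mostly read cold data:
ALTER USER weekly_reports SET random_page_cost = 4.0;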


-- 
Jon



Re: [PERFORM] postmaster consuming /lots/ of memory with hash aggregate. why?

2010-11-11 Thread Jon Nelson
On Fri, Nov 5, 2010 at 7:26 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 I have a postgres 8.4.5 instance on CentOS 5 (x86_64) which appears to
 go crazy with the amount of memory it consumes.
 When I run the query below, in a matter of a few seconds memory
 balloons to 5.3G (virtual), 4.6G (resident) and 1840 (shared), and
 eventually the oom killer is invoked, killing the entire process.

 Physical memory is 8GB but other processes on the box consume
 approximately 4GB of that.

 The settings changed from their defaults:

 effective_cache_size = 4GB
 work_mem = 16MB
 maintenance_work_mem = 128MB
 wal_buffers = 16MB
 checkpoint_segments = 16
 shared_buffers = 384MB
 checkpoint_segments = 64

 and

 default_statistics_target = 100

 The query is this:

 insert into d_2010_09_13_sum
        select FOO.i, FOO.n, sum(FOO.cc) as cc, sum(FOO.oc) as oc
        from (
          select * from d_2010_09_12_sum
          union all
          select * from d_2010_09_13
        ) AS FOO group by i, n;

 here is the explain:

  Subquery Scan *SELECT*  (cost=1200132.06..1201332.06 rows=40000 width=80)
    ->  HashAggregate  (cost=1200132.06..1200732.06 rows=40000 width=41)
          ->  Append  (cost=0.00..786531.53 rows=41360053 width=41)
                ->  Seq Scan on d_2010_09_12_sum  (cost=0.00..520066.48
 rows=27272648 width=42)
                ->  Seq Scan on d_2010_09_13  (cost=0.00..266465.05
 rows=14087405 width=40)

 Both source tables freshly vacuum analyze'd.
 The row estimates are correct for both source tables.

 If I use set enable_hashagg = false I get this plan:

  Subquery Scan *SELECT*  (cost=8563632.73..9081838.25 rows=40000 width=80)
    ->  GroupAggregate  (cost=8563632.73..9081238.25 rows=40000 width=41)
          ->  Sort  (cost=8563632.73..8667033.84 rows=41360441 width=41)
                Sort Key: d_2010_09_12_sum.i, d_2010_09_12_sum.n
                ->  Result  (cost=0.00..786535.41 rows=41360441 width=41)
                      ->  Append  (cost=0.00..786535.41 rows=41360441 width=41)
                            ->  Seq Scan on d_2010_09_12_sum
 (cost=0.00..520062.04 rows=27272204 width=42)
                            ->  Seq Scan on d_2010_09_13
 (cost=0.00..266473.37 rows=14088237 width=40)

 and postmaster's memory never exceeds (roughly) 548M (virtual), 27M
 (resident), 5M (shared).

 I even set default_statistics_target to 1000 and re-ran vacuum
 analyze verbose on both tables - no change.
 If I set work_mem to 1MB (from 16MB) then the GroupAggregate variation
 is chosen instead.
 Experimentally, HashAggregate is chosen when work_mem is 16MB, 8MB,
 6MB, 5MB but not 4MB and on down.

 Two things I don't understand:

 1. Why, when hash aggregation is allowed, does memory absolutely
 explode (eventually invoking the wrath of the oom killer). 16MB for
 work_mem does not seem outrageously high. For that matter, neither
 does 5MB.

 2. Why do both HashAggregate and GroupAggregate say the cost estimate
 is 40000 rows?

Unfortunately, I've found that as my database size grows, I've
generally had to disable hash aggregates for fear of even simple
seeming queries running out of memory, even with work_mem = 1MB.

In some cases I saw memory usage (with hashagg) grow to well over 5GB
while with group aggregate it barely moved.  Am *I* doing something
wrong? Some of these queries are on partitioned tables (typically
querying the parent) and the resulting UNION or UNION ALL really
starts to hurt, and when the server runs out of memory and kills off
the postmaster process a few minutes or even hours into the query, it
doesn't make anybody very happy.

Is there some setting I can turn on to see when memory is being
allocated (and, apparently, not deallocated)?

The latest query has a HashAggregate that looks like this:
HashAggregate  (cost=19950525.30..19951025.30 rows=40000 width=37)
but there are, in reality, approximately 200 million rows (when I run
the query with GroupAggregate, that's what I get).

Why does it keep choosing 40,000 rows?

I suppose I could use the newly-learned ALTER USER trick to disable
hash aggregation for the primary user, because disabling hash
aggregation system-wide sounds fairly drastic. However, if I *don't*
disable it, the query quickly balloons memory usage to the point where
the process is killed off.
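
For reference, the per-user form I have in mind is just this (the role
name is illustrative):

ALTER USER etl_user SET enable_hashagg = off;
-- applies only to sessions started after the ALTER; verify with:
SELECT rolname, rolconfig FROM pg_roles WHERE rolname = 'etl_user';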

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] postmaster consuming /lots/ of memory with hash aggregate. why?

2010-11-11 Thread Jon Nelson
On Thu, Nov 11, 2010 at 10:26 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
 Hello

 look on EXPLAIN ANALYZE command. Probably your statistic are out, and
 then planner can be confused. EXPLAIN ANALYZE statement show it.

As I noted earlier, I did set statistics to 1000 and re-ran vacuum
analyze and the plan did not change.

What other diagnostics can I provide? This still doesn't answer the
40000-row question, though. It seems absurd to me that the planner
would give up and just use 40000 rows (0.02 percent of the actual
result).

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] postmaster consuming /lots/ of memory with hash aggregate. why?

2010-11-06 Thread Jon Nelson
I also found this. Perhaps it is related?

http://postgresql.1045698.n5.nabble.com/Hash-Aggregate-plan-picked-for-very-large-table-out-of-memory-td1883299.html


-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Running PostgreSQL as fast as possible no matter the consequences

2010-11-05 Thread Jon Nelson
On Fri, Nov 5, 2010 at 7:08 AM, Guillaume Cottenceau g...@mnc.ch wrote:
 Marti Raudsepp marti 'at' juffo.org writes:

 On Fri, Nov 5, 2010 at 13:32, A B gentosa...@gmail.com wrote:
 I was just thinking about the case where I will have almost 100%
 selects, but still needs something better than a plain key-value
 storage so I can do some sql queries.
 The server will just boot, load data, run,  hopefully not crash but if
 it would, just start over with load and run.

 If you want fast read queries then changing
 fsync/full_page_writes/synchronous_commit won't help you.

 That illustrates how knowing the reasoning of this particular
 requests makes new suggestions worthwhile, while previous ones
 are now seen as useless.

I disagree that they are useless - the stated mechanism was start,
load data, and run. Changing the params above won't likely change
much in the 'run' stage but would they help in the 'load' stage?


-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] postmaster consuming /lots/ of memory with hash aggregate. why?

2010-11-05 Thread Jon Nelson
I have a postgres 8.4.5 instance on CentOS 5 (x86_64) which appears to
go crazy with the amount of memory it consumes.
When I run the query below, in a matter of a few seconds memory
balloons to 5.3G (virtual), 4.6G (resident) and 1840 (shared), and
eventually the oom killer is invoked, killing the entire process.

Physical memory is 8GB but other processes on the box consume
approximately 4GB of that.

The settings changed from their defaults:

effective_cache_size = 4GB
work_mem = 16MB
maintenance_work_mem = 128MB
wal_buffers = 16MB
checkpoint_segments = 16
shared_buffers = 384MB
checkpoint_segments = 64

and

default_statistics_target = 100

The query is this:

insert into d_2010_09_13_sum
select FOO.i, FOO.n, sum(FOO.cc) as cc, sum(FOO.oc) as oc
from (
  select * from d_2010_09_12_sum
  union all
  select * from d_2010_09_13
) AS FOO group by i, n;

here is the explain:

 Subquery Scan *SELECT*  (cost=1200132.06..1201332.06 rows=40000 width=80)
   ->  HashAggregate  (cost=1200132.06..1200732.06 rows=40000 width=41)
         ->  Append  (cost=0.00..786531.53 rows=41360053 width=41)
               ->  Seq Scan on d_2010_09_12_sum  (cost=0.00..520066.48
rows=27272648 width=42)
               ->  Seq Scan on d_2010_09_13  (cost=0.00..266465.05
rows=14087405 width=40)

Both source tables freshly vacuum analyze'd.
The row estimates are correct for both source tables.

If I use set enable_hashagg = false I get this plan:

 Subquery Scan *SELECT*  (cost=8563632.73..9081838.25 rows=40000 width=80)
   ->  GroupAggregate  (cost=8563632.73..9081238.25 rows=40000 width=41)
         ->  Sort  (cost=8563632.73..8667033.84 rows=41360441 width=41)
               Sort Key: d_2010_09_12_sum.i, d_2010_09_12_sum.n
               ->  Result  (cost=0.00..786535.41 rows=41360441 width=41)
                     ->  Append  (cost=0.00..786535.41 rows=41360441 width=41)
                           ->  Seq Scan on d_2010_09_12_sum
(cost=0.00..520062.04 rows=27272204 width=42)
                           ->  Seq Scan on d_2010_09_13
(cost=0.00..266473.37 rows=14088237 width=40)

and postmaster's memory never exceeds (roughly) 548M (virtual), 27M
(resident), 5M (shared).

I even set default_statistics_target to 1000 and re-ran vacuum
analyze verbose on both tables - no change.
If I set work_mem to 1MB (from 16MB) then the GroupAggregate variation
is chosen instead.
Experimentally, HashAggregate is chosen when work_mem is 16MB, 8MB,
6MB, 5MB but not 4MB and on down.
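
The experiment itself is just a matter of flipping work_mem in a
session and re-checking the plan (EXPLAIN without ANALYZE does not
actually run the insert):

set work_mem = '16MB';
explain insert into d_2010_09_13_sum
  select FOO.i, FOO.n, sum(FOO.cc) as cc, sum(FOO.oc) as oc
  from (select * from d_2010_09_12_sum
        union all
        select * from d_2010_09_13) AS FOO
  group by i, n;
-- HashAggregate plan

set work_mem = '4MB';
-- re-running the same EXPLAIN now shows the GroupAggregate plan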

Two things I don't understand:

1. Why, when hash aggregation is allowed, does memory absolutely
explode (eventually invoking the wrath of the oom killer). 16MB for
work_mem does not seem outrageously high. For that matter, neither
does 5MB.

2. Why do both HashAggregate and GroupAggregate say the cost estimate
is 40000 rows?

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] typoed column name, but postgres didn't grump

2010-10-29 Thread Jon Nelson
I've been having trouble with a query.
The query is a cross join between two tables.
Initially, I mis-typed the query, and one of the columns specified in
the query doesn't exist, however the query ran nonetheless.

The actual query:
select gid from t2, t3 where t2.name = t3.name and t3.scope = 'city'
and t3.hierarchy = 'STANDARD' and t2.adiv = t3.adiv limit 1 ;
However, there *is* no column 'name' in table 't2'.
When I ran the query, it took a *really* long time to run (670 seconds).
When I corrected the query to use the right column name (city_name),
the query ran in 28ms.

The question, then, is why didn't postgres grump about the
non-existent column name?

The version is 8.4.5 on x86_64, openSUSE 11.3

 PostgreSQL 8.4.5 on x86_64-unknown-linux-gnu, compiled by GCC gcc
(SUSE Linux) 4.5.0 20100604 [gcc-4_5-branch revision 160292], 64-bit


-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] temporary tables, indexes, and query plans

2010-10-28 Thread Jon Nelson
On Wed, Oct 27, 2010 at 5:36 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 I'd like to zoom out a little bit and, instead of focusing on the
 specifics, ask more general questions:
..
 - is there some way for me to determine /why/ the planner chooses a
 sequential scan over other options?

 It thinks it's faster, or there is some reason why it *can't* use the
 index, like a datatype mismatch.  You could tell which by trying set
 enable_seqscan = off to see if that will make it change to another
 plan; if so, the estimated costs of that plan versus the original
 seqscan would be valuable information.

When I place the index creation and ANALYZE right after the bulk
update, follow it with 'set enable_seqscan = false', the next query
(also an UPDATE - should be about 7 rows) results in this plan:

Seq Scan on foo_table  (cost=100.00..1004998.00 rows=24 width=236)

The subsequent queries all have the same first-row cost and similar
last-row costs, and of course the rows value varies some as well. All
of them, even the queries which update exactly 1 row, have similar
cost:

Seq Scan on foo_table  (cost=100.00..1289981.17 rows=1 width=158)

I cranked the logging up a bit, but I don't really know what to fiddle
there, and while I got a lot of output, I didn't see much in the way
of cost comparisons.

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] temporary tables, indexes, and query plans

2010-10-27 Thread Jon Nelson
On Wed, Oct 27, 2010 at 12:44 PM, Mladen Gogala
mladen.gog...@vmsinfo.com wrote:
 On 10/27/2010 1:29 PM, Jon Nelson wrote:
 How big is your default statistics target? The default is rather small, it
 doesn't produce very good or usable histograms.

Currently, default_statistics_target is 50.

I note that if I create the indexes earlier in the process (before the
copy) then they are used.
I'm not trying creating them after the first UPDATE (which updates
2.8million of the 10million rows).
The subsequent UPDATE statements update very few (3-4 thousand for 2
of them, less than a few dozen for the others) and the ones that use
the index only update *1* row.

I'll also try setting a higher default_statistics_target and let you know!

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] temporary tables, indexes, and query plans

2010-10-27 Thread Jon Nelson
On Wed, Oct 27, 2010 at 12:59 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 On Wed, Oct 27, 2010 at 12:44 PM, Mladen Gogala
 mladen.gog...@vmsinfo.com wrote:
 On 10/27/2010 1:29 PM, Jon Nelson wrote:
 How big is your default statistics target? The default is rather small, it
 doesn't produce very good or usable histograms.

 Currently, default_statistics_target is 50.

I set it to 500 and restarted postgres. No change in (most of) the query plans!
The update statement that updates 7 rows? No change.
The one that updates 242 rows? No change.
3714? No change.
I killed the software before it got to the 1-row-only statements.

 I'm not trying creating them after the first UPDATE (which updates
 2.8million of the 10million rows).

I mean to say that I (tried) creating the indexes after the first
UPDATE statement. This did not improve things.
I am now trying to see how creating the indexes between the
COPY and the UPDATE performs.
I didn't really want to do this because I know that the first UPDATE
statement touches about 1/3 of the table, and this would bloat the
index and slow the UPDATE (which should be a full table scan anyway).
It's every subsequent UPDATE that touches (at most) 4000 rows (out of
10 million) that I'm interested in.

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] temporary tables, indexes, and query plans

2010-10-27 Thread Jon Nelson
On Wed, Oct 27, 2010 at 1:32 PM, Reid Thompson reid.thomp...@ateb.com wrote:
 On Wed, 2010-10-27 at 13:23 -0500, Jon Nelson wrote:
 set it to 500 and restarted postgres.

 did you re-analyze?

Not recently. I tried that, initially, and there was no improvement.
I'll try it again now that I've set the stats to 500.
The most recent experiment shows me that, unless I create whatever
indexes I would like to see used *before* the large (first) update,
then they just don't get used. At all. Why would I need to ANALYZE the
table immediately following index creation? Isn't that part of the
index creation process?

Currently executing is a test where I place an ANALYZE foo after the
COPY, first UPDATE, and first index, but before the other (much
smaller) updates.

..

Nope. The ANALYZE made no difference. This is what I just ran:

BEGIN;
CREATE TEMPORARY TABLE foo
COPY ...
UPDATE ... -- 1/3 of table, approx
CREATE INDEX foo_rowB_idx on foo (rowB);
ANALYZE ...
-- queries from here to 'killed' use WHERE rowB = 'someval'
UPDATE ... -- 7 rows. seq scan!
UPDATE ... -- 242 rows, seq scan!
UPDATE .. -- 3700 rows, seq scan!
UPDATE .. -- 3100 rows, seq scan!
killed.


-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] temporary tables, indexes, and query plans

2010-10-27 Thread Jon Nelson
On Wed, Oct 27, 2010 at 1:52 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 On Wed, Oct 27, 2010 at 1:32 PM, Reid Thompson reid.thomp...@ateb.com wrote:
 On Wed, 2010-10-27 at 13:23 -0500, Jon Nelson wrote:
 set it to 500 and restarted postgres.

 did you re-analyze?

 Not recently. I tried that, initially, and there was no improvement.
 I'll try it again now that I've set the stats to 500.
 The most recent experiment shows me that, unless I create whatever
 indexes I would like to see used *before* the large (first) update,
 then they just don't get used. At all. Why would I need to ANALYZE the
 table immediately following index creation? Isn't that part of the
 index creation process?

 Currently executing is a test where I place an ANALYZE foo after the
 COPY, first UPDATE, and first index, but before the other (much
 smaller) updates.

 ..

 Nope. The ANALYZE made no difference. This is what I just ran:

 BEGIN;
 CREATE TEMPORARY TABLE foo
 COPY ...
 UPDATE ... -- 1/3 of table, approx
 CREATE INDEX foo_rowB_idx on foo (rowB);
 ANALYZE ...
 -- queries from here to 'killed' use WHERE rowB = 'someval'
 UPDATE ... -- 7 rows. seq scan!
 UPDATE ... -- 242 rows, seq scan!
 UPDATE .. -- 3700 rows, seq scan!
 UPDATE .. -- 3100 rows, seq scan!
 killed.


Even generating the index beforehand (sans ANALYZE) was no help.
If I generate *all* of the indexes ahead of time, before the COPY,
that's the only time index usage jibes with my expectations.
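
For concreteness, the ordering that behaves as I expect looks roughly
like this (a sketch with illustrative names; the real table has ~20
columns):

BEGIN;
CREATE TEMPORARY TABLE foo_table (id int, rowb text, rowc text);
CREATE UNIQUE INDEX foo_table_id_idx ON foo_table (id);
CREATE INDEX foo_table_rowb_idx ON foo_table (rowb);
-- bulk load (~8 million rows) goes here, e.g. COPY foo_table FROM STDIN
UPDATE foo_table SET rowc = 'x' WHERE rowb IS NULL;   -- the big first update
ANALYZE foo_table;
UPDATE foo_table SET rowc = 'y' WHERE rowb = 'ABCD';  -- small, uses the index
COMMIT;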

Here is an example of the output from auto_explain (NOTE: the WHERE
clause in this statement specifies a single value in the same column
that has a UNIQUE index on it):

Seq Scan on foo_table  (cost=0.00..289897.04 rows=37589 width=486)

and yet the actual row count is exactly 1.

If I change the order so that the index creation *and* analyze happen
*before* the first (large) update, then things appear to proceed
normally and the indexes are used when expected, although in some
cases the stats are still way off:

Bitmap Heap Scan on foo_table  (cost=40.96..7420.39 rows=1999 width=158)

and yet there are only 7 rows that match. The others seem closer (only
off by 2x rather than 250x).

It seems as though creating an index is not enough. It seems as though
ANALYZE after index creation is not enough, either. I am theorizing
that I have to touch (or just scan?) some percentage of the table in
order for the index to be used?  If that's true, then what is ANALYZE
for?  I've got the stats cranked up to 500. Should I try 1000?


Jason Pitts:
RE: changing default_statistics_target (or via ALTER TABLE SET STATS)
not taking effect until ANALYZE is performed.

I did already know that, but it's probably good to put into this
thread. However, you'll note that this is a temporary table created at
the beginning of a transaction.
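
As a sanity check, this is roughly how I am verifying that ANALYZE
actually produced per-column stats for the temp table (names are
illustrative):

ALTER TABLE foo_table ALTER COLUMN rowb SET STATISTICS 500;
ANALYZE foo_table;
-- temp tables do show up in pg_stats, even inside the transaction
SELECT attname, n_distinct, null_frac
  FROM pg_stats
 WHERE tablename = 'foo_table';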


-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] temporary tables, indexes, and query plans

2010-10-27 Thread Jon Nelson
On Wed, Oct 27, 2010 at 2:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 The most recent experiment shows me that, unless I create whatever
 indexes I would like to see used *before* the large (first) update,
 then they just don't get used. At all.

 You're making a whole lot of assertions here that don't square with
 usual experience.  I think there is some detail about what you're
 doing that affects the outcome, but since you haven't shown a concrete
 example, it's pretty hard to guess what the critical detail is.

First, let me supply all of the changed (from the default) params:

default_statistics_target = 500
maintenance_work_mem = 240MB
work_mem = 256MB
effective_cache_size = 1GB
checkpoint_segments = 128
shared_buffers = 1GB
max_connections = 30
wal_buffers = 64MB
shared_preload_libraries = 'auto_explain'

The machine is a laptop with 4GB of RAM, which is also running my
desktop environment. Kernel is 2.6.36, filesystem is ext4 (for data)
and ext2 (for WAL logs). The disk is a real spinning disk, not an SSD.

The sequence goes exactly like this:

BEGIN;
CREATE TEMPORARY TABLE (20 columns, mostly text, a few int).
COPY (approx 8 million rows, ~900 MB)[1]
UPDATE (2.8 million of the rows)
UPDATE (7 rows)
UPDATE (250 rows)
UPDATE (3500 rows)
UPDATE (3100 rows)
a bunch of UPDATE (1 row)
...

Experimentally, I noticed that performance was not especially great.
So, I added some indexes (three indexes on one column each). One index
is UNIQUE.
The first UPDATE can't use any of the indexes. The rest should be able to.

In my experiments, I found that:

If I place the index creation *before* the copy, the indexes are used.
If I place the index creation *after* the copy but before first
UPDATE, the indexes are used.
If I place the index creation at any point after the first UPDATE,
regardless of whether ANALYZE is run, the indexes are not used (at
least, according to auto_explain).

Does that help?


[1] I've been saying 10 million. It's really more like 8 million.
-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] temporary tables, indexes, and query plans

2010-10-27 Thread Jon Nelson
On Wed, Oct 27, 2010 at 4:45 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 The sequence goes exactly like this:

 BEGIN;
 CREATE TEMPORARY TABLE (20 columns, mostly text, a few int).
 COPY (approx 8 million rows, ~900 MB)[1]
 UPDATE (2.8 million of the rows)
 UPDATE (7 rows)
 UPDATE (250 rows)
 UPDATE (3500 rows)
 UPDATE (3100 rows)
 a bunch of UPDATE (1 row)
 ...

 Experimentally, I noticed that performance was not especially great.
 So, I added some indexes (three indexes on one column each). One index
 is UNIQUE.
 The first UPDATE can't use any of the indexes. The rest should be able to.

 Please ... there is *nothing* exact about that.  It's not even clear
 what the datatypes of the indexed columns are, let alone what their
 statistics are, or whether there's something specific about how you're
 declaring the table or the indexes.

The indexed data types are:
- an INT (this is a unique ID, and it is declared so)
- two TEXT fields. The initial value of one of the text fields is
NULL, and it is updated to a value no longer than 10 characters. The
other text field is never more than 4 characters long; my guesstimate
is that it has no more than two dozen distinct values.

I am not doing anything when I define the table except using TEMPORARY.
The indexes are as bog-standard as one can get. No where clause, no
functions, nothing special at all.

I'd like to zoom out a little bit and, instead of focusing on the
specifics, ask more general questions:

- does the table being temporary affect anything? Another lister
emailed me and wondered if ANALYZE on a temporary table might behave
differently.
- is there some way for me to determine /why/ the planner chooses a
sequential scan over other options? I'm already using auto explain.
- in the general case, are indexes totally ready to use after creation
or is an analyze step necessary?
- do hint bits come into play here at all?



-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres insert performance and storage requirement compared to Oracle

2010-10-26 Thread Jon Nelson
On Tue, Oct 26, 2010 at 4:02 PM, Mladen Gogala
mladen.gog...@vmsinfo.com wrote:
 On 10/26/2010 11:41 AM, Merlin Moncure wrote:

 yup, that's exactly what I mean -- this will give you more uniform
 insert performance (your temp table doesn't even need indexes).  Every
 N records (say 10000) you send to permanent and truncate the temp
 table.  Obviously, this is more fragile approach so weigh the
 pros/cons carefully.

 merlin

 Truncate temporary table? What a horrible advice! All that you need is the
 temporary table to delete rows on commit.

I believe Merlin was suggesting that, after doing N inserts into
the temporary table, something like this might work better:

start loop:
  populate rows in temporary table
  insert from temporary table into permanent table
  truncate temporary table
  loop

I do something similar, where I COPY data to a temporary table, do
lots of manipulations, and then perform a series of INSERTS from the
temporary table into a permanent table.
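
Roughly, the batched pattern looks like this (a sketch; the table
names are illustrative):

-- staging table mirrors the permanent table's layout
CREATE TEMPORARY TABLE staging (LIKE permanent_table INCLUDING DEFAULTS);

-- repeated for each batch:
COPY staging FROM STDIN;                       -- or a burst of INSERTs
-- ... clean up / manipulate the batch here ...
INSERT INTO permanent_table SELECT * FROM staging;
TRUNCATE staging;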

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] odd postgresql performance (excessive lseek)

2010-10-19 Thread Jon Nelson
No replies?

This is another situation where using pread would have saved a lot of
time and sped things up a bit, but failing that, keeping track of the
file position ourselves and only lseek'ing when necessary would also
help. Postgresql was spending 37% of its time in redundant lseek!

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] odd postgresql performance (excessive lseek)

2010-10-19 Thread Jon Nelson
On Tue, Oct 19, 2010 at 8:25 AM, Merlin Moncure mmonc...@gmail.com wrote:
 On Tue, Oct 19, 2010 at 9:10 AM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 No replies?

 This is another situation where using pread would have saved a lot of
 time and sped things up a bit, but failing that, keeping track of the
 file position ourselves and only lseek'ing when necessary would also
 help. Postgresql was spending 37% of its time in redundant lseek!

 37% of cpu time?  Is that according to strace -T? how did you measure it?

Per the original post, it (redundant lseek system calls) accounted for
37% of the time spent in the kernel.

strace -f -p pid -c


-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] odd postgresql performance (excessive lseek)

2010-10-19 Thread Jon Nelson
On Tue, Oct 19, 2010 at 9:36 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 This is another situation where using pread would have saved a lot of
 time and sped things up a bit, but failing that, keeping track of the
 file position ourselves and only lseek'ing when necessary would also
 help.

 No, it wouldn't; you don't have the slightest idea what's going on
 there.  Those lseeks are for the purpose of detecting the current EOF
 location, ie, finding out whether some other backend has extended the
 file recently.  We could get rid of them, but only at the cost of
 putting in some other communication mechanism instead.

That's a little harsh (it's not untrue, though).

It's true I don't know how postgresql works WRT how it manages files,
but now I've been educated (some). I'm guessing, then, that due to how
each backend may extend files without the other backends knowing of
it, that using fallocate or some-such is also likely a non-starter. I
ask because, especially when allocating files 8KB at a time, file
fragmentation on a busy system is potentially high. I recently saw an
ext3 filesystem (dedicated to postgresql) with 38% file fragmentation
and, yes, it does make a huge performance difference in some cases.
After manually defragmenting some files (with pg offline) I saw a read
speed increase from single-digit-MB-per-second to
high-double-digit-MB-per-second.  However, after asking pg to rewrite
some of the worst files (by way of CLUSTER or ALTER TABLE) I saw no
improvement - I'm guessing due to the 8KB-at-a-time allocation
mechanism.

Has any work been done on making use of shared memory for file stats
or using fallocate (or posix_fallocate) to allocate files in larger
chunks?

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] unexpected query failure: ERROR: GIN indexes do not support whole-index scans

2010-10-18 Thread Jon Nelson
I have a table with an array column.
I added a GIN index to the array:

CREATE INDEX foo_idx ON t USING GIN (alternatecodes) WHERE
alternatecodes IS NOT NULL;

That's all well and good.
However, some queries started failing and I was able to reproduce the
behavior in psql!

SELECT * FROM t WHERE alternatecodes IS NOT NULL;
returns:
ERROR:  GIN indexes do not support whole-index scans

Whaaa?  Adding an *index* makes my /queries/ stop working? How can this be?
This really violated my principle of least surprise. If GIN indexes
don't support whole-index scans, fine, don't use them, but don't make
a perfectly valid query fail because of it.

This seems like a bug. Is it?

PostgreSQL version:

 PostgreSQL 8.4.5 on x86_64-redhat-linux-gnu, compiled by GCC gcc
(GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit


-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] unexpected query failure: ERROR: GIN indexes do not support whole-index scans

2010-10-18 Thread Jon Nelson
On Mon, Oct 18, 2010 at 6:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 CREATE INDEX foo_idx ON t USING GIN (alternatecodes) WHERE
 alternatecodes IS NOT NULL;
 SELECT * FROM t WHERE alternatecodes IS NOT NULL;
 ERROR:  GIN indexes do not support whole-index scans

 Yep, this is a known issue.  It's going to take major surgery on GIN to
 fix it, so don't hold your breath.  In the particular case, what good do
 you think the WHERE clause is doing anyway?  GIN won't index nulls at
 all ... which indeed is an aspect of the underlying issue --- see recent
 discussions, eg here:
 http://archives.postgresql.org/pgsql-hackers/2010-10/msg00521.php

OK, so GIN doesn't index NULLs. I guess the IS NOT NULL part comes
about as a habit - that particular column is fairly sparse. However,
I'm honestly quite surprised at two things:

1. if GIN indexes ignore NULLs, then either it should grump when one
specifies WHERE ... IS NOT NULL or it should be treated as a no-op

2. (and this is by far the more surprising) that the /presence/ of an
INDEX can *break* a SELECT. It's not that the engine ignores the index
- that would be reasonable - but that I can't issue a SELECT whose
WHERE clause matches the index's predicate.
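
For contrast, the operator queries the GIN index is actually built for
work fine (assuming alternatecodes is a text[] column; the values here
are made up):

SELECT * FROM t WHERE alternatecodes @> ARRAY['ABC123'];           -- contains
SELECT * FROM t WHERE alternatecodes && ARRAY['ABC123','XYZ999'];  -- overlaps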

However, I see that this also surprised Josh Berkus, and not that long
ago (11 days!), so I'll just shush.

Thanks!



-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] oracle to psql migration - slow query in postgres

2010-10-14 Thread Jon Nelson
Just my take on this.

The first thing I'd do is think real hard about whether you really
really want 'numeric' instead of boolean, smallint, or integer.  The
second thing is that none of your indices (which specify a whole bunch
of fields, by the way) cover just emailok, just emailbounced, or just
the pair of them. Without knowing the needs of your app, I would
reconsider your index choices and go with fewer columns per index.

For this particular query I would think either two indexes (depending
on the cardinality of the data, one for each of emailok, emailbounced)
or one index (containing both emailok, emailbounced) would make quite
a bit of difference. Consider creating the indexes with a WHERE
clause (i.e. a partial index), for example:

CREATE INDEX members_just_an_example_idx ON members (emailok,
emailbounced) WHERE emailok = 1 AND emailbounced = 0;

Obviously that index is only useful in situations where both fields
are specified with those values. Furthermore, if a very high
percentage of the table matches those conditions, a sequential scan is
going to be cheaper anyway.
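
The two-index alternative mentioned above would simply be:

CREATE INDEX members_emailok_idx ON members (emailok);
CREATE INDEX members_emailbounced_idx ON members (emailbounced);
-- the planner can combine these with a bitmap AND when both predicates appear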

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] odd postgresql performance (excessive lseek)

2010-10-14 Thread Jon Nelson
postgres 8.4.4 on openSUSE 11.3 (2.6.36rc7, x86_64).

I was watching a fairly large query run and observed that the disk
light went out. I checked 'top' and postgres was using 100% CPU so I
strace'd the running process.
This is what I saw:

lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(87, 0, SEEK_END)  = 585531392
lseek(94, 270680064, SEEK_SET)  = 270680064
read(94, elided..., 8192) = 8192

and I observed that pattern quite a bit.

I know lseek is cheap, but a superfluous system call is a superfluous
system call, and over a short period these amounted to 37% (according to
strace) of the time spent in the system.

What's with the excess calls to lseek?

The query plan was a nested loop anti-join (on purpose).

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow count(*) again...

2010-10-12 Thread Jon Nelson
On Tue, Oct 12, 2010 at 7:27 AM, Mladen Gogala
mladen.gog...@vmsinfo.com wrote:

 So, the results weren't cached the first time around. The explanation is the
 fact that Oracle, as of the version 10.2.0, reads the table in the private
 process memory, not in the shared buffers.  This table alone is > 35GB in
 size,  Oracle took 2 minutes 47 seconds to read it using the full table
 scan. If I do the same thing with PostgreSQL and a comparable table,
 Postgres is, in fact, faster:


Well, I didn't quite mean that - having no familiarity with Oracle I
don't know what the alter system statement does, but I was talking
specifically about the linux buffer and page cache. The easiest way to
drop the linux caches in one fell swoop is:

echo 3 > /proc/sys/vm/drop_caches

Is there a command to tell postgresql to drop/clear/reset its buffer cache?

Clearing/dropping both the system (Linux) and the DB caches is
important when doing benchmarks that involve I/O.
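
For what it's worth, the closest thing I know of on the PostgreSQL
side is inspecting (not dropping) the cache via the contrib
pg_buffercache module, e.g. (a sketch, assuming the module is
installed):

SELECT count(*) AS populated_buffers
  FROM pg_buffercache
 WHERE relfilenode IS NOT NULL;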



-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow count(*) again...

2010-10-12 Thread Jon Nelson
On Tue, Oct 12, 2010 at 8:18 AM, Greg Smith g...@2ndquadrant.com wrote:
 No.  Usually the sequence used to remove all cached data from RAM before a
 benchmark is:

All cached data (as cached in postgresql - *not* the Linux system
caches)..., right?


-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] read only transactions

2010-10-12 Thread Jon Nelson
Are there any performance implications (benefits) to executing queries
in a transaction where
SET TRANSACTION READ ONLY;
has been executed?
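
Concretely, I mean something like this (a minimal sketch; the table
name is made up):

BEGIN;
SET TRANSACTION READ ONLY;
SELECT count(*) FROM some_table;
COMMIT;

-- or, for every transaction in a session:
SET default_transaction_read_only = on;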


-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow count(*) again...

2010-10-10 Thread Jon Nelson
On Sun, Oct 10, 2010 at 12:14 PM, Mladen Gogala
mladen.gog...@vmsinfo.com wrote:



 In other words, when I batched the sequential scan to do 128 blocks I/O, it
 was 4 times faster than when I did the single block I/O.
 Does that provide enough of an evidence and, if not, why not?


These numbers tell us nothing because, unless you dropped the caches
between runs, at least part of some runs was very probably served
from cache.

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance

2010-10-06 Thread Jon Nelson
On Wed, Oct 6, 2010 at 5:31 PM, Ivan Voras ivo...@freebsd.org wrote:
 On 10/04/10 20:49, Josh Berkus wrote:

 The other major bottleneck they ran into was a kernel one: reading from
 the heap file requires a couple lseek operations, and Linux acquires a
 mutex on the inode to do that. The proper place to fix this is
 certainly in the kernel but it may be possible to work around in
 Postgres.

 Or we could complain to Kernel.org.  They've been fairly responsive in
 the past.  Too bad this didn't get posted earlier; I just got back from
 LinuxCon.

 So you know someone who can speak technically to this issue? I can put
 them in touch with the Linux geeks in charge of that part of the kernel
 code.

 Hmmm... lseek? As in lseek() then read() or write() idiom? It AFAIK
 cannot be fixed since you're modifying the global stream position
 variable and something has got to lock that.

 OTOH, pread() / pwrite() don't have to do that.

While lseek is very cheap, it is like any other system call in that
when you multiply cheap by a jillion you end up with notable
or even a lot. I've personally seen notable performance improvements
by switching to pread/pwrite instead of lseek+{read,write}. For
platforms that don't implement pread or pwrite, wrapper calls are
trivial to produce. One less system call is, in this case, 50% fewer.


-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Identical query slower on 8.4 vs 8.3

2010-07-15 Thread Jon Nelson
On Thu, Jul 15, 2010 at 9:41 AM, Patrick Donlin pdon...@oaisd.org wrote:
 I have two servers with equal specs, one of them running 8.3.7 and the new
 server running 8.4.4. The only tweak I have made from the default install
 (from Ubuntu repositories) is increasing shared_buffers to 768MB. Both
 servers are running 64-bit, but are different releases of Ubuntu.

^^^ Right there.  *different releases*. I've seen fairly significant
differences on identical hardware with even minor O/S point releases.

After you run a full vacuum, then a reindex, and then a vacuum analyze
(probably not entirely necessary), if there is still a difference I'd
point at the O/S.




-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow Bulk Delete

2010-05-17 Thread Jon Nelson
On Mon, May 17, 2010 at 5:10 AM, Pierre C li...@peufeu.com wrote:
 - or use a JOIN delete with a virtual VALUES table
 - or fill a temp table with ids and use a JOIN DELETE

What is a virtual VALUES table? Can you give me an example of using a
virtual table with selects, joins, and also deletes?

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow Bulk Delete

2010-05-17 Thread Jon Nelson
2010/5/17 Віталій Тимчишин tiv...@gmail.com:


 2010/5/17 Jon Nelson jnelson+pg...@jamponi.net

 On Mon, May 17, 2010 at 5:10 AM, Pierre C li...@peufeu.com wrote:
  - or use a JOIN delete with a virtual VALUES table
  - or fill a temp table with ids and use a JOIN DELETE

 What is a virtual VALUES table? Can you give me an example of using a
 virtual table with selects, joins, and also deletes?



 delete from a using (values (1),(2),(5),(8)) b(x) where a.id=b.x
 See http://www.postgresql.org/docs/8.4/static/sql-values.html

This syntax I'm familiar with. The author of the previous message
(Pierre C) indicated that there is a concept of a virtual table which
could be joined to.  I'd like to know what this virtual table thing
is, specifically in the context of joins.


-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow Bulk Delete

2010-05-17 Thread Jon Nelson
On Mon, May 17, 2010 at 7:28 AM, A. Kretschmer
andreas.kretsch...@schollglas.com wrote:
 In response to Jon Nelson :
 On Mon, May 17, 2010 at 5:10 AM, Pierre C li...@peufeu.com wrote:
  - or use a JOIN delete with a virtual VALUES table
  - or fill a temp table with ids and use a JOIN DELETE

 What is a virtual VALUES table? Can you give me an example of using a
 virtual table with selects, joins, and also deletes?

 Something like this:
...

delete from foo using (values (1),(2) ) as bla where foo.c1=bla.column1;
...

Aha! Cool. That's not quite what I envisioned when you said virtual
table, but it surely clarifies things.
Thanks!

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Best suiting OS

2009-10-03 Thread Jon Nelson
On Thu, Oct 1, 2009 at 4:46 AM, S Arvind arvindw...@gmail.com wrote:

 Is it FreeBSD, CentOS, Fedora, Redhat xxx??

FreeBSD isn't Linux.
Don't run Fedora; it undergoes way too much churn.
No real difference between CentOS and RedHat.

I personally prefer openSUSE (or SLES/SLED if you want their
commercial offering). I find it faster, more up-to-date (but no
churn), and in general higher quality. I find postgresql *substantially*
faster on openSUSE than CentOS, but that's purely anecdotal and I
don't have any raw numbers to compare.

openSUSE 11.1 has 8.3.8 and 11.2 (not out yet - a few months) will have 8.4.X.

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Best suiting OS

2009-10-02 Thread Jon Nelson
On Thu, Oct 1, 2009 at 4:46 AM, S Arvind arvindw...@gmail.com wrote:
 Hi everyone,
   What is the best Linux flavor for server which runs postgres alone.
 The postgres must handle greater number of database around 200+. Performance
 on speed is the vital factor.
 Is it FreeBSD, CentOS, Fedora, Redhat xxx??

FreeBSD isn't Linux.

I don't recommend that you run Fedora, it undergoes way too much churn.

I don't find any real difference between CentOS and RedHat.

I personally prefer openSUSE (or SLES/SLED if you want their
commercial offering). I find it faster, more up-to-date (but no
churn), and in general higher quality - it just works. I find
postgresql *substantially* faster on openSUSE than CentOS, but that's
purely anecdotal and I don't have any raw numbers to compare.

openSUSE 11.1 has 8.3.8 and 11.2 (not out yet - a few months) will have 8.4.X.

--
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] hyperthreaded cpu still an issue in 8.4?

2009-07-28 Thread Jon Nelson
On Tue, Jul 28, 2009 at 4:11 PM, Scott Marlowe scott.marl...@gmail.com wrote:
 On Tue, Jul 28, 2009 at 2:58 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Mon, Jul 27, 2009 at 2:05 PM, Dave Youatt d...@meteorsolutions.com wrote:
 On 01/-10/-28163 11:59 AM, Greg Smith wrote:
 On Tue, 21 Jul 2009, Doug Hunley wrote:

 Just wondering is the issue referenced in
 http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php
 is still present in 8.4 or if some tunable (or other) made the use of
 hyperthreading a non-issue. We're looking to upgrade our servers soon
 for performance reasons and am trying to determine if more cpus (no
 HT) or less cpus (with HT) are the way to go.

 If you're talking about the hyperthreading in the latest Intel Nehalem
 processors, I've been seeing great PostgreSQL performance from those.
 The kind of weird behavior the old generation hyperthreading designs
 had seems gone.  You can see at
 http://archives.postgresql.org/message-id/alpine.gso.2.01.0907222158050.16...@westnet.com
 that I've cleared 90K TPS on a 16 core system (2 quad-core
 hyperthreaded processors) running a small test using lots of parallel
 SELECTs.  That would not be possible if there were HT spinlock
 problems still around. There have been both PostgreSQL scaling
 improvments and hardware improvements since the 2005 messages you saw
 there that have combined to clear up the issues there.  While true
 cores would still be better if everything else were equal, it rarely
 is, and I wouldn't hestitate to jump on Intel's bandwagon right now.

 Greg, those are compelling numbers for the new Nehalem processors.
 Great news for postgresql.  Do you think it's due to the new internal
 interconnect, that bears a strong resemblance to AMD's hypertransport

I'd love to see some comparisons on the exact same hardware, same
kernel and everything but with HT enabled and disabled. Don't forget
that newer (Linux) kernels have vastly improved SMP performance.

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance