Re: [PERFORM] unexpected stable function behavior

2011-03-14 Thread Andres Freund
On Thursday, March 10, 2011 05:26:00 PM Julius Tuskenis wrote:
> 3) As a STABLE function should be executed once for every different set of 
> parameters
That's not true. That's not what any of the volatility markers (STABLE, 
IMMUTABLE, VOLATILE) guarantees.

See http://www.postgresql.org/docs/current/interactive/xfunc-volatility.html

It *does* change how often a function may be executed, though. E.g.

SELECT g.i, some_stable_func(1) FROM generate_series(1, 1000) g(i)

will call some_stable_func only once, because all of its parameters can be 
determined beforehand.
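A minimal sketch of that difference (the function and its body are illustrative, not from the thread):

```sql
-- Hypothetical STABLE function; a real one would typically read from a table.
CREATE FUNCTION lookup_rate(region int) RETURNS numeric
    LANGUAGE sql STABLE
    AS $$ SELECT 0.19::numeric $$;

-- Constant argument: the planner may evaluate the call once for the whole scan.
SELECT g.i, lookup_rate(1) FROM generate_series(1, 1000) g(i);

-- Per-row argument: the function runs once per row, STABLE or not.
SELECT g.i, lookup_rate(g.i) FROM generate_series(1, 1000) g(i);
```

STABLE only promises the planner that the result cannot change within a single statement; it does not implement memoization per distinct argument set.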


Andres

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] postgres 9 query performance

2011-01-30 Thread Andres Freund
On Sunday 30 January 2011 23:18:15 Tom Lane wrote:
> Andres Freund  writes:
> > What happens if you change the
> > 
> > left join event.origin on event.id = origin.eventid
> > 
> > into
> > 
> > join event.origin on event.id = origin.eventid
> > 
> > ?
> > 
> > The EXISTS() requires that origin is not null anyway. (Not sure why the
> > planner doesn't recognize that though).
> 
> Sloppy thinking in reduce_outer_joins() is why. 
Wow. Nice one, thanks.

Andres



Re: [PERFORM] postgres 9 query performance

2011-01-28 Thread Andres Freund
On Friday, January 28, 2011 06:30:19 PM yazan suleiman wrote:
> I am evaluating postgres 9 to migrate away from Oracle.  The following
> query runs too slow, also please find the explain plan:
First:

explain analyze
SELECT DISTINCT
EVENT.ID
,ORIGIN.ID AS ORIGINID
,EVENT.PREFERRED_ORIGIN_ID AS PREFERRED_ORIGIN
,EVENT.CONTRIBUTOR
,ORIGIN.TIME
,ORIGIN.LATITUDE
,ORIGIN.LONGITUDE
,ORIGIN.DEPTH
,ORIGIN.EVTYPE
,ORIGIN.CATALOG
,ORIGIN.AUTHOR OAUTHOR
,ORIGIN.CONTRIBUTOR OCONTRIBUTOR
,MAGNITUDE.ID AS MAGID
,MAGNITUDE.MAGNITUDE
,MAGNITUDE.TYPE AS MAGTYPE
FROM
event.event
left join event.origin on event.id = origin.eventid
left join event.magnitude on origin.id = event.magnitude.origin_id
WHERE
EXISTS(
select origin_id
from event.magnitude
where magnitude.magnitude >= 7.2 and origin.id = origin_id
)
order by
ORIGIN.TIME desc
,MAGNITUDE.MAGNITUDE desc
,EVENT.ID
,EVENT.PREFERRED_ORIGIN_ID
,ORIGIN.ID

I am honestly stumped that anybody can make sense of the original 
formatting of the query...

What happens if you change the
left join event.origin on event.id = origin.eventid
into
join event.origin on event.id = origin.eventid
?

The EXISTS() requires that origin is not null anyway. (Not sure why the 
planner doesn't recognize that though).
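Concretely, the suggested rewrite (abbreviated to the relevant clauses of the query above) would be:

```sql
-- Same query as above, but with an inner join on event.origin;
-- the EXISTS() already guarantees that a matching origin row exists.
SELECT DISTINCT
    EVENT.ID, ORIGIN.ID AS ORIGINID, MAGNITUDE.MAGNITUDE
FROM event.event
JOIN event.origin ON event.id = origin.eventid
LEFT JOIN event.magnitude ON origin.id = event.magnitude.origin_id
WHERE EXISTS (
    SELECT origin_id
    FROM event.magnitude
    WHERE magnitude.magnitude >= 7.2 AND origin.id = origin_id
)
ORDER BY ORIGIN.TIME DESC, MAGNITUDE.MAGNITUDE DESC;
```

The inner join gives the planner license to reorder the joins and treat the EXISTS() as a semi-join, which it cannot safely do with the outer join in place.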

Andres



Re: [PERFORM] High load,

2011-01-27 Thread Andres Freund
On Thursday, January 27, 2011 07:13:17 PM Scott Marlowe wrote:
> On Thu, Jan 27, 2011 at 10:54 AM, Stephen Frost  wrote:
> > * Michael Kohl (michael.k...@tupalo.com) wrote:
> >> HDD: 2x 120 GB OCZ Vertex 2 SSD; RAID 1
> > 
> > I'm amazed no one else has mentioned this yet, but you should look into
> > splitting your data and your WALs.  Obviously, having another set of
> > SSDs to put your WALs on would be ideal.
> 
> Actually spinning media would be a better choice.  A pair of fast
> 15krpm drives in a mirror will almost always outrun an SSD for
> sequential write speed.  Even meh-grade 7200RPM SATA drives will win.
Unless he is bulk loading or running with synchronous_commit=off, sequential 
write speed won't be the limit for WAL - the number of syncs per second will be.
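For reference, the bulk-loading exception would look something like this (table and file names are made up):

```sql
-- With synchronous_commit off, commits no longer wait for a WAL flush,
-- so sequential WAL write speed matters more than the achievable sync rate.
SET synchronous_commit = off;
COPY measurements FROM '/tmp/measurements.csv' WITH (FORMAT csv);
```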

Andres



Re: [PERFORM] High load,

2011-01-27 Thread Andres Freund
On Thursday, January 27, 2011 02:23:48 PM Cédric Villemain wrote:
> 2011/1/27 Andres Freund :
> > On Thursday, January 27, 2011 12:24:10 PM Cédric Villemain wrote:
> >> > maintenance_work_mem = 512MB
> >> 
> >> 128MB is usually enough
> > 
> > Uhm, I don't want to be picky, but that's not really my experience. Sorts
> > for index creation are highly dependent on a high m_w_m. Quite regularly
> > I find even the existing 1GB limit a problem here...
> 
> That is right for index creation, but not for 'pure' maintenance
> stuff. Once the database is running as usual, there is no real point
> in giving auto-vacuum or auto-analyze much more (it depends on the RAID
> card memory too ...)
Even that I cannot agree with, sorry ;-). If a database sees a lot of churn, a 
high m_w_m helps to avoid multiple index scan passes during VACUUM, which happen 
whenever the number of dead tuples doesn't fit into m_w_m.
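One way to check whether this is biting you (the table name is illustrative):

```sql
-- A higher maintenance_work_mem lets VACUUM remember more dead-tuple TIDs
-- per pass; if they don't all fit, every index gets scanned multiple times.
SET maintenance_work_mem = '512MB';
VACUUM VERBOSE some_churny_table;
-- The VERBOSE output reports the index scans; more than one pass over the
-- indexes suggests maintenance_work_mem was too small for the dead tuples.
```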

Andres



Re: [PERFORM] High load,

2011-01-27 Thread Andres Freund
On Thursday, January 27, 2011 12:24:10 PM Cédric Villemain wrote:
> > maintenance_work_mem = 512MB
> 128MB is usually enough
Uhm, I don't want to be picky, but that's not really my experience. Sorts for 
index creation are highly dependent on a high m_w_m. Quite regularly I find even 
the existing 1GB limit a problem here...

Andres



Re: [PERFORM] the XID question

2011-01-19 Thread Andres Freund
On Wednesday, January 19, 2011 07:06:58 PM Chris Browne wrote:
> kevin.gritt...@wicourts.gov ("Kevin Grittner") writes:
> > Filip Rembia*kowski wrote:
> >> 2011/1/19 Charles.Hou :
> >>> " select * from mybook" SQL command also increase the XID ?
> >> 
> >> Yes. Single SELECT is a transaction. Hence, it needs a transaction
> >> ID.
> > 
> > No, not in recent versions of PostgreSQL.  There's virtual
> > transaction ID, too; which is all that's needed unless the
> > transaction writes something.
> > 
> > Also, as a fine point, if you use explicit database transactions
> > (with BEGIN or START TRANSACTION) then you normally get one XID for
> > the entire transaction, unless you use SAVEPOINTs.
> 
> Erm, "not *necessarily* in recent versions of PostgreSQL."
> 
> A read-only transaction won't consume XIDs, but if you don't expressly
> declare it read-only, they're still liable to get eaten...
No. An XID is generally only allocated at the first point where a real XID is 
needed. See GetCurrentTransactionId() and AssignTransactionId() in xact.c, and 
the callers of the former.
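On later versions this is easy to observe from SQL (txid_current_if_assigned() exists from 9.6 on; the table is hypothetical):

```sql
CREATE TABLE xid_demo(v int);

BEGIN;
SELECT 1;                           -- read-only so far: only a virtual xid
SELECT txid_current_if_assigned();  -- NULL: no real xid allocated yet
INSERT INTO xid_demo VALUES (1);    -- first write forces xid assignment
SELECT txid_current_if_assigned();  -- now returns a real xid
COMMIT;
```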

Andres



Re: [PERFORM] Wrong docs on checkpoint_segments?

2011-01-07 Thread Andres Freund
On Friday, January 07, 2011 02:45:02 PM Florian Weimer wrote:
> * Andres Freund:
> > On Friday, January 07, 2011 01:45:25 PM Florian Weimer wrote:
> >> On 9.0, this configuration
> >> 
> >> checkpoint_segments = 512 # in logfile segments, min 1, 16MB each
> >> 
> >> results in 1034 segments, so the effective logfile segment size is 32
> >> MB.
> > 
> > Um. Is it possible that you redefined XLOG_SEG_SIZE or used --with-wal-
> > segsize=SEGSIZE?
> 
> No, the individual files are still 16 MB.  It's just that the
> checkpoint_segments limit is not a hard limit, and you end up with
> slightly more than twice the configured number of segments on disk.
That's documented:
"
There will always be at least one WAL segment file, and will normally not be 
more files than the higher of wal_keep_segments or (2 + 
checkpoint_completion_target) * checkpoint_segments + 1. Each segment file is 
normally 16 MB (though this size can be altered when building the server). You 
can use this to estimate space requirements for WAL. Ordinarily, when old log 
segment files are no longer needed, they are recycled (renamed to become the 
next segments in the numbered sequence). If, due to a short-term peak of log 
output rate, there are more than 3 * checkpoint_segments + 1 segment files, the 
unneeded segment files will be deleted instead of recycled until the system 
gets back under this limit. 
"

Andres



Re: [PERFORM] Wrong docs on checkpoint_segments?

2011-01-07 Thread Andres Freund
On Friday, January 07, 2011 01:45:25 PM Florian Weimer wrote:
> On 9.0, this configuration
> 
> checkpoint_segments = 512 # in logfile segments, min 1, 16MB each
> 
> results in 1034 segments, so the effective logfile segment size is 32 MB.
Um. Is it possible that you redefined XLOG_SEG_SIZE or used --with-wal-
segsize=SEGSIZE?

The default is still:
and...@alap2:~/src/postgresql$ grep XLOG_SEG_SIZE src/include/pg_config.h
/* XLOG_SEG_SIZE is the size of a single WAL file. This must be a power of 2
   XLOG_BLCKSZ). Changing XLOG_SEG_SIZE requires an initdb. */
#define XLOG_SEG_SIZE (16 * 1024 * 1024)

Andres



Re: [PERFORM] Create index on subfield returned by function that returns base type with sub fields

2010-12-23 Thread Andres Freund
Hi,

On Thursday 23 December 2010 17:53:24 Desmond Coertzen wrote:
> Is is possible to create an index on a field on a function that returns a
> data type that contains subfields?
> Is this possible? How would I write the statement?
I am not sure I understood you correctly. Maybe you mean something like this:

test=# CREATE FUNCTION blub(IN int, OUT a int, OUT b int) RETURNS record 
IMMUTABLE LANGUAGE sql AS $$SELECT $1+1, $1+2$$;
CREATE FUNCTION
Time: 1.665 ms

test=# CREATE INDEX foo__blub ON foo (((blub(data)).a));
CREATE INDEX
Time: 86.393 ms
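A query can then use that expression index, provided it spells the expression the same way as the index definition:

```sql
-- The planner matches the query expression against the index definition.
EXPLAIN SELECT * FROM foo WHERE (blub(data)).a = 23;
```

Once the table is large enough for an index to pay off, this should show an index scan on foo__blub.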

Btw, this is the wrong list for this sort of question. The right place would 
be -general.

Andres



Re: [PERFORM] Compared MS SQL 2000 to Postgresql 9.0 on Windows

2010-12-07 Thread Andres Freund
On Tuesday 07 December 2010 18:34:25 Tom Polak wrote:
> Then I did the same test via Postgresql and it took 8.85 seconds!  I tried
> it again as I thought I did something wrong.  I did a few tweaks such as
> increasing the shared buffers.  Still the best I could get it to was 7.5
> seconds.  This is insanely slow compared to MSSQL 2000.  What am I missing.
> Here is my SQL statement for postgresql:
> select name,address,city,state,statename,stateid,other from pgtemp1 left
> join pgtemp2 on state=stateid
I think you should at least provide the exact schema and possibly some example 
data (pg_dump) to get us anywhere.

I would suggest you post the output of EXPLAIN ANALYZE $yourquery - that gives 
us information about how that query was executed.

Greetings,

Andres



Re: [PERFORM] Which gives good performance? separate database vs separate schema

2010-11-25 Thread Andres Freund
On Thursday 25 November 2010 13:02:08 t...@fuzzy.cz wrote:
> I don't think you'll get performance improvement from running two
> PostgreSQL clusters (one for DB1, one for DB2). And when running two
> databases within the same cluster, there's no measurable performance
> difference AFAIK.
That one is definitely not true in many circumstances. As soon as you start to 
hit contention (shared memory, locks) you may very well be better off with two 
separate clusters.

Andres



Re: [PERFORM] Defaulting wal_sync_method to fdatasync on Linux for 9.1?

2010-11-16 Thread Andres Freund
On Wednesday 17 November 2010 02:04:28 Tom Lane wrote:
> Andres Freund  writes:
> > On Wednesday 17 November 2010 01:51:28 Tom Lane wrote:
> >> Well, there's a forced fsync after writing the last page of an xlog
> >> file, but I don't believe that proves that more than 16MB of xlog
> >> buffers is useless.  Other processes could still be busy filling the
> >> buffers.
> > 
> > Maybe I am missing something, but I think the relevant
> > AdvanceXLInsertBuffer() is currently called with WALInsertLock held?
> 
> The fsync is associated with the write, which is not done with insert
> lock held.  We're not quite that dumb.
Ah, I see. The XLogWrite() in AdvanceXLInsertBuffer() only happens if the head 
of the buffer catches up with the tail - which is more likely if wal_buffers is 
small...

Andres




Re: [PERFORM] Defaulting wal_sync_method to fdatasync on Linux for 9.1?

2010-11-16 Thread Andres Freund
On Wednesday 17 November 2010 01:51:28 Tom Lane wrote:
> Andres Freund  writes:
> > On Wednesday 17 November 2010 00:31:34 Tom Lane wrote:
> >> Well, we're not going to increase the default to gigabytes
> > 
> > Especially not as I don't think it will have any effect after
> > wal_segment_size as that will force a write-out anyway. Or am I
> > misremembering the implementation?
> 
> Well, there's a forced fsync after writing the last page of an xlog
> file, but I don't believe that proves that more than 16MB of xlog
> buffers is useless.  Other processes could still be busy filling the
> buffers.
Maybe I am missing something, but I think the relevant AdvanceXLInsertBuffer() 
is currently called with WALInsertLock held?

Andres




Re: [PERFORM] Defaulting wal_sync_method to fdatasync on Linux for 9.1?

2010-11-16 Thread Andres Freund
On Wednesday 17 November 2010 00:31:34 Tom Lane wrote:
> Josh Berkus  writes:
> > On 11/16/10 12:39 PM, Greg Smith wrote:
> >> I want to next go through and replicate some of the actual database
> >> level tests before giving a full opinion on whether this data proves
> >> it's worth changing the wal_sync_method detection.  So far I'm torn
> >> between whether that's the right approach, or if we should just increase
> >> the default value for wal_buffers to something more reasonable.
> > 
> > We'd love to, but wal_buffers uses sysV shmem.
> 
> Well, we're not going to increase the default to gigabytes
Especially not, as I don't think it will have any effect beyond wal_segment_size, 
since that will force a write-out anyway. Or am I misremembering the 
implementation?

Andres



Re: [PERFORM] anti-join chosen even when slower than old plan

2010-11-11 Thread Andres Freund
On Thursday 11 November 2010 19:58:49 Tom Lane wrote:
> I wrote:
> > I do think that something based around a settable-per-table caching
> > percentage might be a reasonable way to proceed.
> 
> BTW ... on reflection it seems that this would *not* solve the use-case
> Kevin described at the start of this thread.  What he's got AIUI is some
> large tables whose recent entries are well-cached, and a lot of queries
> that tend to hit that well-cached portion, plus a few queries that hit
> the whole table and so see largely-not-cached behavior.  We can't
> represent that very well with a caching knob at the table level.  Either
> a high or a low setting will be wrong for one set of queries or the
> other.
> 
> It might work all right if he were to partition the table and then have
> a different caching value attached to the currently-latest partition,
> but that doesn't sound exactly maintenance-free either.  Also, that only
> works with the current statically-planned approach to partitioned
> tables.  I think where we're trying to go with partitioning is that
> the planner doesn't consider the individual partitions, but the executor
> just hits the right one at runtime --- so cost modifiers attached to
> individual partitions aren't going to work in that environment.
> 
> The most practical solution for his case still seems to be to twiddle
> some GUC or other locally in the maintenance scripts that do the
> full-table-scan queries.  Unfortunately we don't have an equivalent
> of per-session SET (much less SET LOCAL) for per-relation attributes.
> Not sure if we want to go there.
As discussed in another thread some time ago, another possibility is to probe 
how well the data is cached using mincore() or similar...
While that presents problems with cache ramp-up, it is quite useful for other 
use cases (like this one).

Andres



Re: [PERFORM] Defaulting wal_sync_method to fdatasync on Linux for 9.1?

2010-11-08 Thread Andres Freund
Hi,

On Monday 08 November 2010 23:12:57 Greg Smith wrote:
> This seems to be ignoring the fact that unless you either added a 
> non-volatile cache or specifically turned off all write caching on your 
> drives, the results of all power-fail testing done on earlier versions 
> of Linux was that it failed.  The default configuration of PostgreSQL on 
> Linux has been that any user who has a simple SATA drive gets unsafe 
> writes, unless they go out of their way to prevent them.
Which is about *no* argument in favor of any of the options, right?

> Whatever newer kernels do by default cannot be worse.  The open question 
> is whether it's still broken, in which case we might as well favor the 
> known buggy behavior rather than the new one, or whether everything has 
> improved enough to no longer be unsafe with the new defaults.
Either I majorly misunderstand you, or ... I don't know.

There simply *is* no new implementation relevant to this discussion. Full 
stop. What changed is that O_DSYNC is now defined differently from O_SYNC, 
and O_SYNC actually does what it should. That causes pg to move open_datasync 
to the front of the preference list, so the default now does what the option 
with the lowest preference did up to now.

That does not *at all* change the earlier fdatasync() or fsync() 
implementations/tests. It simply makes open_datasync the default, doing what 
open_sync did earlier.
Note that open_sync was previously the method of *least* preference, and that 
fdatasync() was thus the default until now - which it no longer is.

I don't argue *at all* that we have to test the change moving fdatasync ahead of 
open_datasync on the *other* operating systems. What I completely don't get is 
all the talk about data consistency on Linux. It's simply irrelevant in that 
context.

Andres






Re: [PERFORM] Defaulting wal_sync_method to fdatasync on Linux for 9.1?

2010-11-07 Thread Andres Freund
On Monday 08 November 2010 00:35:29 Greg Smith wrote:
> Marti Raudsepp wrote:
> > I will grant you that the details were wrong, but I stand by the
> > conclusion. I can state for a fact that PostgreSQL's default
> > wal_sync_method varies depending on the  header.
> 
> Yes; it's supposed to, and that logic works fine on some other
> platforms.  The question is exactly what the new Linux O_DSYNC behavior
> is doing, in regards to whether it flushes drive caches out or not.
> Until you've quantified which of the cases do that--which is required
> for reliable operation of PostgreSQL--and which don't, you don't have
> any data that can be used to draw a conclusion from.  If some setups are
> faster because they write less reliably, that doesn't automatically make
> them the better choice.
I think that's FUD. Sorry.

Can you explain to me why fsync() may/should/could be *any* less reliable than 
O_DSYNC? On *any* platform. Or fdatasync(), in the special way it's used with 
pg, namely on completely preallocated files.

I think the reasons why O_DSYNC is slow in most circumstances - especially, but 
not only, in combination with a small wal_buffers setting - are pretty clear.

Making a setting that is only supported on a small range of systems the highest 
preference is even more doubtful than the already strange choice of making 
O_DSYNC the default, given the way it works (i.e. no reordering, synchronous 
writes in the bgwriter, synchronous writes under wal_buffers pressure, etc).

Greetings,

Andres



Re: [PERFORM] Defaulting wal_sync_method to fdatasync on Linux for 9.1?

2010-11-05 Thread Andres Freund
On Friday 05 November 2010 22:10:36 Greg Smith wrote:
> Andres Freund wrote:
> > On Sunday 31 October 2010 20:59:31 Greg Smith wrote:
> >> Writes only are sync'd out when you do a commit, or the database does a
> >> checkpoint.
> > 
> > Hm?  WAL is written out to disk after the space provided by
> > wal_buffers (def 8) * XLOG_BLCKSZ (def 8192) is used. The default is 64kB,
> > which you reach pretty quickly - especially after a checkpoint.
> Fair enough; I'm so used to bumping wal_buffers up to 16MB nowadays that
> I forget sometimes that people actually run with the default where this
> becomes an important consideration.
If you have relatively frequent checkpoints (quite sensible in some 
environments, given the burstiness/response-time problems you can otherwise 
get), even a 16MB wal_buffers can cause significantly more synchronous writes 
with O_DSYNC because of the amount of WAL traffic due to full_page_writes. For 
one, the background WAL writer won't keep up, and for another, all of its writes 
will be synchronous...

It's simply a pointless setting.

> > Not having a real O_DSYNC on linux until recently makes it even more
> > dubious to have it as a default...
> If Linux is now defining O_DSYNC, and it's buggy, that's going to break
> more software than just PostgreSQL.  It wasn't defined before because it
> didn't work.  If the kernel developers have made changes to claim it's
> working now, but it doesn't really, I would think they'd consider any
> reports of actual bugs here as important to fix.  There's only so much
> the database can do in the face of incorrect information reported by the
> operating system.
I don't see it being buggy so far. It's just doing what it should - which is 
simply a terrible fit for our implementation. Generally, independent of Linux.

> Anyway, I haven't actually seen reports that proves there's any problem
> here, I was just pointing out that we haven't seen any positive reports
> about database stress testing on these kernel versions yet either.  The
> changes here are theoretically the right ones, and defaulting to safe
> writes that flush out write caches is a long-term good thing.
I have seen several databases running under 2.6.33 with moderate to high load 
for some time now, and two under 2.6.35.
Loads of problems, but none kernel-related so far ;-)

Andres



Re: [PERFORM] Major Linux performance regression; shouldn't we be worried about RHEL6?

2010-11-05 Thread Andres Freund
On Friday 05 November 2010 21:15:20 Josh Berkus wrote:
> All,
> 
> Domas (of Facebook/Wikipedia, MySQL geek) pointed me to this report:
> 
> http://www.phoronix.com/scan.php?page=article&item=linux_perf_regressions&num=1
I guess that's the O_DSYNC thing.  See the "Defaulting wal_sync_method to 
fdatasync on Linux for 9.1?" thread on -performance and "Revert default 
wal_sync_method to fdatasync on Linux 2.6.33+" on -hackers.

O_DSYNC finally got properly implemented on Linux with 2.6.33 (and thus 
2.6.32-rc1).

> http://www.phoronix.com/scan.php?page=article&item=ext4_then_now&num=6
That one looks pretty uninteresting: barriers are slower than no barriers. No 
surprise there.

Andres



Re: [PERFORM] Insert performance with composite index

2010-11-01 Thread Andres Freund
On Monday 01 November 2010 15:28:19 Divakar Singh wrote:
> Do you mean these parameters have been removed starting 9.X?
> As I see on 
> http://www.network-theory.co.uk/docs/postgresql/vol3/BackgroundWriter.html 
> ,these parameters were added starting from 8.0 right?
No, I mean that setting them to 0 is a bit of a strange value in many situations.

And you have comments like:
#max_fsm_pages = 2# min max_fsm_relations*16, 6 bytes each
#max_fsm_relations = 1000# min 100, ~70 bytes each

These reference config options that do not exist anymore. And you have 
shared_buffers = 81920, 
which indicates that you started from 8.1/8.2 or so...

Andres



Re: [PERFORM] Insert performance with composite index

2010-11-01 Thread Andres Freund
On Monday 01 November 2010 15:16:49 Divakar Singh wrote:
> I am using 9.0.1
Either that's not true, or you cargo-culted much of your config from a 
significantly older pg version.

Things like:

#bgwriter_delay = 200# 10-1 milliseconds between rounds
bgwriter_lru_percent = 0# 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5# 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333# 0-100% of all buffers scanned/round
bgwriter_all_maxpages = 0# 0-1000 buffers max written/round

make me very suspicious.

As I said, I would check the variables I referenced in my first post...

Andres



Re: [PERFORM] Insert performance with composite index

2010-11-01 Thread Andres Freund
On Monday 01 November 2010 15:08:10 Divakar Singh wrote:
> here are my parameters:
Which pg version is that?



Re: [PERFORM] Insert performance with composite index

2010-11-01 Thread Andres Freund
Hi,

On Monday 01 November 2010 13:49:14 Divakar Singh wrote:
> When I tried inserting 1M rows into a table with a Primary Key, it took
> almost 62 seconds.
> After adding a composite index of 2 columns, the performance degrades to
> 125 seconds.
> I am using COPY to insert all data in 1 transaction.
Without seeing your config it's hard to suggest anything here. Did you do basic 
tuning of your pg installation?

wal_buffers, shared_buffers, checkpoint_segments, maintenance_work_mem are 
likely most relevant for that specific case.
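A quick way to see what those are currently set to:

```sql
-- Inspect the settings most relevant to bulk-load/index-build performance.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('wal_buffers', 'shared_buffers',
               'checkpoint_segments', 'maintenance_work_mem');
```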

Andres



Re: [PERFORM] Defaulting wal_sync_method to fdatasync on Linux for 9.1?

2010-10-31 Thread Andres Freund
On Sunday 31 October 2010 20:59:31 Greg Smith wrote:
> Writes only are sync'd out when you do a commit, or the database does a 
> checkpoint.
Hm?  WAL is written out to disk after the space provided by wal_buffers (def 
8) * XLOG_BLCKSZ (def 8192) is used. The default is 64kB, which you reach 
pretty quickly - especially after a checkpoint. With O_D?SYNC that will 
synchronously get written out during a normal XLogInsert if it hits a page 
boundary.
*Additionally* it gets written out at commit if synchronous commit is not off.

Not having a real O_DSYNC on linux until recently makes it even more dubious 
to have it as a default...


Andres



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Andres Freund
On Thursday 21 October 2010 21:42:06 Kevin Grittner wrote:
> Bruce Momjian  wrote:
> > I assume we send a full 8k to the controller, and a failure during
> > that write is not registered as a write.
> 
> On what do you base that assumption?  I assume that we send a full
> 8K to the OS cache, and the file system writes disk sectors
> according to its own algorithm.  With either platters or BBU cache,
> the data is persisted on fsync; why do you see a risk with one but
> not the other?
At least on Linux, pages can certainly get written out in < 8kB batches if 
you're under memory pressure.

Andres





Re: [PERFORM] Advice configuring ServeRAID 8k for performance

2010-08-17 Thread Andres Freund
On Tuesday 17 August 2010 10:29:10 Greg Smith wrote:
> Andres Freund wrote:
> > An fsync() equals a barrier so it has the effect of stopping
> > reordering around it - especially on systems with larger multi-disk
> > arrays thats pretty expensive.
> > You can achieve surprising speedups, at least in my experience, by
> > forcing the kernel to start writing out pages *without enforcing
> > barriers* first and then later enforce a barrier to be sure its
> > actually written out.
> 
> Standard practice on high performance systems with good filesystems and
> a battery-backed controller is to turn off barriers anyway.  That's one
> of the first things to tune on XFS for example, when you have a reliable
> controller.  I don't have enough data on ext4 to comment on tuning for
> it yet.
> 
> The sole purpose for the whole Linux write barrier implementation in my
> world is to flush the drive's cache, when the database does writes onto
> cheap SATA drives that will otherwise cache dangerously.  Barriers don't
> have any place on a serious system that I can see.  The battery-backed
> RAID controller you have to use to make fsync calls fast anyway can do
> some simple write reordering, but the operating system doesn't ever have
> enough visibility into what it's doing to make intelligent decisions
> about that anyway.
Even if we're not talking about a write barrier in an "ensure it's written out 
of the cache" sense, it still stops the I/O scheduler from reordering. I 
benchmarked it (custom app) and it was very noticeable on a bunch of different 
systems (with a good BBU'd RAID).

Andres



Re: [PERFORM] Advice configuring ServeRAID 8k for performance

2010-08-16 Thread Andres Freund
On Mon, Aug 16, 2010 at 04:54:19PM -0400, Greg Smith wrote:
> Andres Freund wrote:
> >A new checkpointing logic + a new syncing logic
> >(prepare_fsync() earlier and then fsync() later) would be a nice
> >thing. Do you plan to work on that?
> The background writer already caches fsync calls into a queue, so
> the prepare step you're thinking of is already there.  The
> problem is that the actual fsync calls happen in a tight loop.  That
> we're busy fixing.
That doesn't help that much on many systems with a somewhat deep
queue. An fsync() equals a barrier, so it has the effect of stopping
reordering around it - especially on systems with larger multi-disk
arrays, that's pretty expensive.
You can achieve surprising speedups, at least in my experience, by
forcing the kernel to start writing out pages *without enforcing
barriers* first and then later enforcing a barrier to be sure it's
actually written out. In a simplified case, that turns the multiple
barriers needed earlier into a single one (in practice you want to
call fsync() anyway, but that's not a big problem if it's already
written out).
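The "start writeback early, pay for a single barrier later" pattern maps onto Linux's sync_file_range(2) followed by a plain fsync(). A minimal sketch via ctypes - Linux-specific, and an illustration of the idea rather than PostgreSQL's actual implementation; the flag constant is taken from the kernel headers:

```python
import ctypes
import ctypes.util
import os
import tempfile

# Linux-only sketch: start writeback early without a barrier, fence once later.
SYNC_FILE_RANGE_WRITE = 2  # value from <linux/fs.h>

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.sync_file_range.argtypes = [
    ctypes.c_int, ctypes.c_int64, ctypes.c_int64, ctypes.c_uint,
]

def prepare_fsync(fd):
    """Ask the kernel to begin writing dirty pages; no durability guarantee yet."""
    if libc.sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0:
        raise OSError(ctypes.get_errno(), "sync_file_range failed")

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 8192)
    prepare_fsync(fd)  # writeback starts; the scheduler may still reorder freely
    os.fsync(fd)       # the single barrier: everything is durable after this
finally:
    os.close(fd)
    os.unlink(path)
```

With many files the savings add up: one prepare call per file, then a single round of fsync() calls that mostly find the data already on disk.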

Andres



Re: [PERFORM] Advice configuring ServeRAID 8k for performance

2010-08-16 Thread Andres Freund
On Mon, Aug 16, 2010 at 04:13:22PM -0400, Greg Smith wrote:
> Andres Freund wrote:
> >Or use -o sync. Or configure a ridiculously low dirty_memory amount
> >(which has a problem on large systems because 1% can still be too
> >much. Argh.)...
>
> -o sync completely trashes performance, and trying to set the
> dirty_ratio values to even 1% doesn't really work due to things like
> the "congestion avoidance" code in the kernel.  If you sync a lot
> more often, which putting the WAL on the same disk as the database
> accidentally does for you, that works surprisingly well at avoiding
> this whole class of problem on ext3.  A really good solution is
> going to take a full rewrite of the PostgreSQL checkpoint logic
> though, which will get sorted out during 9.1 development.  (cue
> dramatic foreshadowing music here)
-o sync works well enough for the data partition (surely not the WAL) if you
make the background writer less aggressive.

But yes. A new checkpointing logic + a new syncing logic
(prepare_fsync() earlier and then fsync() later) would be a nice
thing. Do you plan to work on that?

Andres



Re: [PERFORM] Advice configuring ServeRAID 8k for performance

2010-08-16 Thread Andres Freund
On Mon, Aug 16, 2010 at 01:46:21PM -0400, Greg Smith wrote:
> Scott Carey wrote:
> >This is because an fsync on ext3 flushes _all dirty pages in the file 
> >system_ to disk, not just those for the file being fsync'd.
> >One partition for WAL, one for data.  If using ext3 this is
> >essentially a performance requirement no matter how your array is
> >set up underneath.
>
> Unless you want the opposite of course.  Some systems split out the
> WAL onto a second disk, only to discover checkpoint I/O spikes
> become a problem all of the sudden after that.  The fsync calls for
> the WAL writes keep the write cache for the data writes from ever
> getting too big.  This slows things down on average, but makes the
> worst case less stressful.  Free lunches are so hard to find
> nowadays...
Or use -o sync. Or configure a ridiculously low dirty_memory amount
(which has a problem on large systems because 1% can still be too
much. Argh.)...

Andres



Re: [PERFORM] Optimizing NOT IN plans / verify rewrite

2010-08-02 Thread Andres Freund
On Mon, Aug 02, 2010 at 01:35:13PM -0700, Maciek Sakrejda wrote:
> >> All fields involved are declared NOT NULL, but thanks for the heads up.
> >AFAIR the planner doesn't use that atm.
>
> I was referring to not having to care about the strange NULL semantics
> (as per your original comment), since I have no NULLs. Given that, I
> think the NOT EXISTS could be a good solution, even on 8.3 (we're
> planning to upgrade, but it's not a feasible solution to this
> particular problem), no?
The point is that only 8.4 will optimize that case properly. 8.3 will
generate plans which are inefficient in many (or most) cases for both
variants. I would suggest using manual antijoins...



Re: [PERFORM] Optimizing NOT IN plans / verify rewrite

2010-08-02 Thread Andres Freund
On Mon, Aug 02, 2010 at 01:06:00PM -0700, Maciek Sakrejda wrote:
> All fields involved are declared NOT NULL, but thanks for the heads up.
AFAIR the planner doesn't use that atm.

Andres



Re: [PERFORM] Optimizing NOT IN plans / verify rewrite

2010-08-02 Thread Andres Freund
Hi,

On Mon, Aug 02, 2010 at 12:12:51PM -0700, Maciek Sakrejda wrote:
> I'm running PostgreSQL 8.3 and I have a query with a couple of NOT IN
> subqueries:
With 8.3 you will have to use manual antijoins (i.e. LEFT JOIN
... WHERE ... IS NULL). If you use 8.4, NOT EXISTS() will do that
automatically in many cases (in contrast to NOT IN (), which has strange
NULL semantics).
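The manual antijoin is the classic LEFT JOIN plus IS NULL filter. A sketch using Python's sqlite3 (table and column names are invented for the example; the same SQL shape applies on 8.3), showing both forms return the same rows when no NULLs are involved:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent(id INTEGER PRIMARY KEY);
    CREATE TABLE child(parent_id INTEGER NOT NULL);
    INSERT INTO parent VALUES (1), (2), (3), (4);
    INSERT INTO child VALUES (2), (4);
""")

# NOT IN form: simple to write, but 8.3-era planners handle it poorly
not_in = conn.execute(
    "SELECT id FROM parent WHERE id NOT IN (SELECT parent_id FROM child)"
).fetchall()

# Manual antijoin: LEFT JOIN and keep only the unmatched rows
antijoin = conn.execute("""
    SELECT p.id
    FROM parent p
    LEFT JOIN child c ON c.parent_id = p.id
    WHERE c.parent_id IS NULL
""").fetchall()

assert not_in == antijoin == [(1,), (3,)]
```

Note the equivalence only holds because child.parent_id is NOT NULL; with NULLs present, NOT IN returns no rows at all, which is exactly the strange semantics mentioned above.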

Andres



Re: [PERFORM] Pooling in Core WAS: Need help in performance tuning.

2010-07-28 Thread Andres Freund
On Wed, Jul 28, 2010 at 04:10:08PM -0400, Tom Lane wrote:
> Robert Haas  writes:
> > However, if we don't support that, we can't do any sort of pooling-ish
> > thing without the ability to pass file descriptors between processes;
> > and Tom seems fairly convinced there's no portable way to do that.
>
> Well, what it would come down to is: are we prepared to not support
> pooling on platforms without such a capability?  It's certainly possible
> to do it on many modern platforms, but I don't believe we can make it
> happen everywhere.  Generally we've tried to avoid having major features
> that don't work everywhere ...
Which platforms do you have in mind here? All of the platforms I found
documented to be supported seem to support at least one of SCM_RIGHTS,
WSADuplicateSocket or STREAMS/FD_INSERT.
Most if not all besides Windows support SCM_RIGHTS. The ones I am
dubious about support FD_INSERT...

Andres



Re: [PERFORM] 158x query improvement when removing 2 (noop) WHERE conditions

2010-07-28 Thread Andres Freund
On Wednesday 28 July 2010 12:27:44 Louis-David Mitterrand wrote:
> The EXPLAIN ANALYSE output is attached with, first the fast version and
> then the slow one.
I think you forgot to attach it.

Andres



Re: [PERFORM] Questions on query planner, join types, and work_mem

2010-07-27 Thread Andres Freund
Hi,

On Tue, Jul 27, 2010 at 04:08:16PM -0700, Peter Hussey wrote:
> Now for the questions:
> 1)  If we tell the customer to set his work_mem value to 500MB or 1GB in
> postgres.config, what problems might they see?  the documentation and the
> guidelines we received from Rupinder Singh in support suggest a much lower
> value, e.g. a max work_mem of 10MB.  Other documentation such as the "Guide
> to Posting Slow Query Questions" suggest at least testing up to 1GB.  What
> is a reasonable maximum to configure for all connections?
Well. That depends on the amount of expected concurrency and available
memory. Obviously you can set it way higher in an OLAPish,
low-concurrency setting than in an OLTP environment.

That setting is significantly complex to estimate in my opinion. For
one, the actual usage depends on the complexity of the queries; for
another, to be halfway safe you have to use avail_mem/(max_connections
* max_nodes_of_most_complex_query), which is often a very pessimistic
and unusably low estimate.
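The halfway-safe bound above can be written out directly; a toy calculation with invented numbers (an illustration of the formula, not a tuning recommendation):

```python
def pessimistic_work_mem_mb(avail_mem_mb, max_connections, max_nodes):
    """Worst case: every connection concurrently runs the most complex
    query, and every work_mem-consuming node in it uses a full work_mem."""
    return avail_mem_mb / (max_connections * max_nodes)

# e.g. 16 GB available, 100 connections, 8 sort/hash nodes in the worst query
limit = pessimistic_work_mem_mb(16 * 1024, 100, 8)
print(round(limit, 2))  # ~20 MB of work_mem under this bound
```

In practice most installations can safely go well above this number, because the worst case (all connections hitting all nodes at once) rarely happens - which is exactly why the estimate is "unusably low".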

> 2) How is work_mem used by a query execution?  For example, does each hash
> table in an execution get allocated a full work_mem's worth of memory ?   Is
> this memory released when the query is finished, or does it stay attached to
> the connection or some other object?
Each node of the query can use one work_mem worth of memory (sometimes a
bit more). The memory is released after the query finishes (or
possibly earlier, depending on the structure of the query).
The specific allocation pattern and implementation details (of malloc)
influence how and when that memory is actually returned to the OS.

> 3) is there a reason why the planner doesn't seem to recognize the condition
> when the hash table won't fit in the current work_mem, and choose a
> low-memory plan instead?
Hard to say without more information. Bad estimates maybe? Best show
your query plan (EXPLAIN ANALYZE), the table definition and some
details about the hardware (e.g. whether it has 1GB of memory or
256GB).

Andres



Re: [PERFORM] Testing Sandforce SSD

2010-07-26 Thread Andres Freund
On Mon, Jul 26, 2010 at 03:23:20PM -0600, Greg Spiegelberg wrote:
> On Mon, Jul 26, 2010 at 1:45 PM, Greg Smith  wrote:
> > Yeb Havinga wrote:
> >> I did some ext3,ext4,xfs,jfs and also ext2 tests on the just-in-memory
> >> read/write test. (scale 300) No real winners or losers, though ext2 isn't
> >> really faster and the manual need for fix (y) during boot makes it
> >> impractical in its standard configuration.
> >>
> >
> > That's what happens every time I try it too.  The theoretical benefits of
> > ext2 for hosting PostgreSQL just don't translate into significant
> > performance increases on database oriented tests, certainly not ones that
> > would justify the downside of having fsck issues come back again.  Glad to
> > see that holds true on this hardware too.
> I know I'm talking development now but is there a case for a pg_xlog block
> device to remove the file system overhead and guaranteeing your data is
> written sequentially every time?
For one, I doubt that it's a relevant enough efficiency loss compared
with a significantly more complex implementation (you can't
grow/shrink, and you have to do more complex, hardware-dependent things
like rounding to hardware boundaries, page size etc. to stay
efficient); for another, my experience is that at a relatively low
point XLogInsert gets to be the bottleneck - so I don't see much point
in improving at that low level (yet at least).

Where I would like to do some hardware-dependent measuring (because I
see significant improvements there) is prefetching for seqscans,
index scans et al. using blktrace... But I currently don't have the
time. And it's another topic ;-)

Andres



Re: [PERFORM] Pooling in Core WAS: Need help in performance tuning.

2010-07-23 Thread Andres Freund
On Sat, Jul 24, 2010 at 01:23:08AM -0400, Greg Smith wrote:
> Joshua Tolley wrote:
> >Relatively minor, but it would be convenient to avoid having to query
> >$external_pooler to determine the client_addr of an incoming connection.
>
> You suggest this as a minor concern, but I consider it to be one of
> the most compelling arguments in favor of in-core pooling.  A
> constant pain with external poolers is the need to then combine two
> sources of data in order to track connections fully, which is
> something that everyone runs into eventually and finds annoying.
> It's one of the few things that doesn't go away no matter how much
> fiddling you do with pgBouncer, it's always getting in the way a
> bit.  And it seems to seriously bother systems administrators and
> developers, not just the DBAs.
But you have to admit that this problem won't vanish as people will
continue to use poolers on other machines for resource reasons.
So providing a capability to do something sensible here seems to be
useful independent of in-core pooling.

Andres



Re: [PERFORM] Pooling in Core WAS: Need help in performance tuning.

2010-07-23 Thread Andres Freund
On Fri, Jul 23, 2010 at 01:28:53PM -0400, Robert Haas wrote:
> On Fri, Jul 23, 2010 at 11:58 AM, Hannu Krosing  wrote:
> > On Thu, 2010-07-22 at 20:57 -0400, Robert Haas wrote:
> >> On Thu, Jul 22, 2010 at 3:15 PM, Hannu Krosing  wrote:
> >> > On Thu, 2010-07-22 at 14:36 -0400, Robert Haas wrote:
> >> >> On Mon, Jul 12, 2010 at 6:58 AM, Craig Ringer
> >> >>  wrote:
> >> >> > So rather than asking "should core have a connection pool" perhaps
> >> >> > what's needed is to ask "what can an in-core pool do that an external
> >> >> > pool cannot do?"
> >> >>
> >> >> Avoid sending every connection through an extra hop.
> >> >
> >> > not really. in-core != magically-in-right-backend-process
> >>
> >> Well, how about if we arrange it so it IS in the right backend
> >> process?  I don't believe magic is required.
> >
> > Do you have any design in mind, how you can make it so ?
>
> Well, if we could change the backends so that they could fully
> reinitialize themselves (disconnect from a database to which they are
> bound, etc.), I don't see why we couldn't use the Apache approach.
> There's a danger of memory leaks but that's why Apache has
> MaxRequestsPerChild, and it works pretty darn well.  Of course,
> passing file descriptors would be even nicer (you could pass the
> connection off to a child that was already bound to the correct
> database, perhaps) but as has been pointed out more than once, that's not
> portable.
It's not *that bad* though. To my knowledge it's 2 or 3 implementations that
one would need to write to support most if not all platforms.

- sendmsg/cmsg/SCM_RIGHTS based implementation (most if not all *nixes
  including solaris, linux, (free|open|net)bsd, OSX, AIX, HPUX, others)
- WSADuplicateSocket (windows)
- if needed: STREAMS-based stuff (I_SENDFD) (at least solaris, hpux, aix,
  tru64, irix, unixware allow this)


Note that I am still not convinced that it's a good idea...
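The SCM_RIGHTS mechanism itself is simple enough to demonstrate end to end; Python's socket module wraps the sendmsg/cmsg details via send_fds/recv_fds (Python 3.9+, Unix only). A sketch of the handoff a pooler would perform - the "pooler"/"backend" roles here are of course just an analogy:

```python
import os
import socket
import tempfile

# Hand an open file descriptor across a Unix-domain socket pair - the same
# SCM_RIGHTS mechanism a pooler would use to pass a client connection.
pooler, backend = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

fd, path = tempfile.mkstemp()
os.write(fd, b"hello from the pooler")
os.lseek(fd, 0, os.SEEK_SET)

socket.send_fds(pooler, [b"take this"], [fd])  # fd rides along as ancillary data
msg, fds, _flags, _addr = socket.recv_fds(backend, 1024, maxfds=1)

# The received descriptor is a fresh fd referring to the same open file.
data = os.read(fds[0], 1024)
assert msg == b"take this"
assert data == b"hello from the pooler"

os.close(fd)
os.close(fds[0])
pooler.close()
backend.close()
os.unlink(path)
```

The received fd shares the file offset and status flags with the sender's copy, which is exactly what makes handing off a live client socket possible.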

Andres



Re: [PERFORM] Pooling in Core WAS: Need help in performance tuning.

2010-07-22 Thread Andres Freund
On Thu, Jul 22, 2010 at 02:33:43PM -0400, Robert Haas wrote:
> On Mon, Jul 12, 2010 at 6:58 AM, Craig Ringer
>  wrote:
> > On 12/07/10 17:45, Matthew Wakeling wrote:
> >>
> >> I'm surprised. Doesn't apache httpd do this? Does it have to do a whole
> >> load of non-portable stuff? It seems to work on a whole load of platforms.
> >
> > A lot of what Apache HTTPd does is handled via the Apache Portable
> > Runtime (APR). It contains a lot of per-platform handlers for various
> > functionality.
>
> Apache just has all of the worker processes call accept() on the
> socket, and whichever one the OS hands it off to gets the job.
As an inconsequential detail - AFAIK they keep the OS from doing that
by protecting it with a mutex, for various reasons (speed - as some
implementations wake up and see there's nothing to do - multiple
sockets, fairness).

> The problem is harder for us because a backend can't switch identities
> once it's been assigned to a database.  I haven't heard an adequate
> explanation of why that couldn't be changed, though.
Possibly it might decrease the performance significantly enough by
reducing the cache locality (syscache, prepared plans)?

Andres



Re: [PERFORM] order by slowing down a query by 80 times

2010-06-28 Thread Andres Freund
On Monday 28 June 2010 13:39:27 Yeb Havinga wrote:
> It looks like seq_scans are disabled, since the index scan has only a 
> filter expression but not an index cond.
Or it's using it to get an ordered result...

Andres



Re: [PERFORM] How filesystems matter with PostgreSQL

2010-06-04 Thread Andres Freund
On Friday 04 June 2010 20:26:27 Jon Schewe wrote:
> ext3 barrier=1: ~15 minutes
> ext4 nobarrier: ~15 minutes
Any message in the kernel log about barriers or similar?

Andres



Re: [PERFORM] performance regression with Linux 2.6.33 and glibc 2.12

2010-06-04 Thread Andres Freund
On Friday 04 June 2010 15:59:05 Tom Lane wrote:
> Marc Cousin  writes:
> > I hope I'm not going to expose an already known problem, but I couldn't
> > find it mailing list archives (I only found
> > http://archives.postgresql.org/pgsql- hackers/2009-12/msg01543.php).
> 
> You sure this isn't the well-known "ext4 actually implements fsync
> where ext3 didn't" issue?
I doubt it. It reads to me like he is testing the two methods on the same
installation with the same kernel.

> > with wal_sync_method = open_datasync (new default)
> > marc=# INSERT INTO test SELECT generate_series(1,10);
> > INSERT 0 10
> > Time: 16083,912 ms
> > 
> > with wal_sync_method = fdatasync (old default)
> > 
> > marc=# INSERT INTO test SELECT generate_series(1,10);
> > INSERT 0 10
> > Time: 954,000 ms
It's not actually surprising that in such a case open_datasync is hugely slower
than fdatasync. With open_datasync every single write will be synchronous, very
likely not reordered/batched/whatever. In contrast, with fdatasync it
will only be synced in much bigger batches.

Or am I missing something?

I always thought the synchronous write methods to be a fallback kludge and
didn't realize it's actually the preferred method...
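The difference can be seen in miniature: an O_SYNC-style open (roughly what open_datasync amounts to) makes every write() synchronous on its own, while buffered writes plus one fdatasync() collapse the whole run into a single synchronization point. A POSIX Python sketch illustrating the two patterns - not a benchmark of PostgreSQL itself:

```python
import os
import tempfile

PAGES = 32
PAYLOAD = b"\0" * 8192

def write_sync_each(path):
    """open_datasync-style: every write() implies its own device flush."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    for _ in range(PAGES):
        os.write(fd, PAYLOAD)
    os.close(fd)

def write_fdatasync_once(path):
    """fdatasync-style: writes stay in cache, free to be batched and
    reordered, then one flush covers the whole run."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    for _ in range(PAGES):
        os.write(fd, PAYLOAD)
    os.fdatasync(fd)
    os.close(fd)

with tempfile.TemporaryDirectory() as d:
    a = os.path.join(d, "per-write-sync")
    b = os.path.join(d, "batched-sync")
    write_sync_each(a)
    write_fdatasync_once(b)
    same = os.path.getsize(a) == os.path.getsize(b) == PAGES * len(PAYLOAD)

assert same  # identical durable contents; only the number of sync points differs
```

Timing the two functions on a disk without a write cache makes the 16x-style gap from the report above quite plausible: one flush per page versus one flush per batch.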

Andres



Re: [PERFORM] How filesystems matter with PostgreSQL

2010-06-04 Thread Andres Freund
On Friday 04 June 2010 16:25:30 Tom Lane wrote:
> Andres Freund  writes:
> > On Friday 04 June 2010 14:17:35 Jon Schewe wrote:
> >> XFS (logbufs=8): ~4 hours to finish
> >> ext4: ~1 hour 50 minutes to finish
> >> ext3: 15 minutes to finish
> >> ext3 on LVM: 15 minutes to finish
> > 
> > My guess is that some of the difference comes from barrier differences.
> > ext4 uses barriers by default, ext3 does not.
> Or, to put it more clearly: the reason ext3 is fast is that it's unsafe.
Jon: To verify, you can enable it via the barrier=1 mount option...

Andres



Re: [PERFORM] How filesystems matter with PostgreSQL

2010-06-04 Thread Andres Freund
On Friday 04 June 2010 14:17:35 Jon Schewe wrote:
> Some interesting data about different filesystems I tried with
> PostgreSQL and how it came out.
> 
> I have an application that is backed in postgres using Java JDBC to
> access it. The tests were all done on an opensuse 11.2 64-bit machine,
> on the same hard drive (just ran mkfs between each test) on the same
> input with the same code base. All filesystems were created with the
> default options.
> 
> XFS (logbufs=8): ~4 hours to finish
> ext4: ~1 hour 50 minutes to finish
> ext3: 15 minutes to finish
> ext3 on LVM: 15 minutes to finish
My guess is that some of the difference comes from barrier differences. ext4 
uses barriers by default, ext3 does not.

Andres



Re: [PERFORM] performance of temporary vs. regular tables

2010-05-25 Thread Andres Freund
On Tuesday 25 May 2010 11:00:24 Joachim Worringen wrote:
> On 25.05.2010 10:49, Grzegorz Jaśkiewicz wrote:
> > temporary tables are handled pretty much like the regular table. The
> > magic happens on schema level, new schema is setup for connection, so
> > that it can access its own temporary tables.
> > Temporary tables also are not autovacuumed.
> > And that's pretty much the most of the differences.
> 
> Thanks. So, the Write-Ahead-Logging (being used or not) does not matter?
It does matter quite significantly in my experience, both from an I/O and a
CPU overhead perspective.

Andres



Re: [PERFORM] Huge table searching optimization

2010-04-05 Thread Andres Freund
Hi,

On Monday 05 April 2010 16:28:35 Oliver Kindernay wrote:
> Hi, I have table with just on column named url (it's not real url,
> just random string for testing purposes), type text. I have lots of
> entries in it (it's dynamic, i add and remove them on the fly), 100
> 000 and more. I've created index on this table to optimize
> "searching". I just want to test if some "url" is in in the table, so
> i am using this request:
> 
> select url from test2 where url ~* '^URLVALUE\\s*$';
> 
> there's \\s* because of padding. Here is the analyze:
> 
> postgres=# explain analyze select url from test2  where url ~*
> '^zyxel\\s*$'; WARNING:  nonstandard use of \\ in a string literal
> LINE 1: ...plain analyze select url from test2  where url ~* '^zyxel\\s...
>  ^
> HINT:  Use the escape string syntax for backslashes, e.g., E'\\'.
>   QUERY PLAN
> ---
>  Seq Scan on test2  (cost=0.00..1726.00 rows=10 width=9)
> (actual time=156.489..156.502 rows=1 loops=1)
>Filter: (url ~* '^zyxel\\s*$'::text)
>  Total runtime: 156.538 ms
> (3 rows)
> 
> It takes 156 ms, it's too much for my purposes, so i want to decrease
> it. So what can I use for optimizing this request? Again, I just want
> to test if "url" ("zyxel" in this examlpe) is in the table.
> 
Depending on your locale it might be sensible to create a text_pattern_ops 
index - see the following link: 
http://www.postgresql.org/docs/current/static/indexes-opclass.html

As suggested by depesz, it would be far better to remove the padding and do
exact lookups though.
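The suggestion amounts to normalizing the value once and letting an ordinary equality lookup do the work. A sketch with Python's sqlite3 (data invented; on PostgreSQL the equivalent would be a plain btree index on the trimmed column):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test2(url TEXT)")
conn.executemany("INSERT INTO test2 VALUES (?)",
                 [("zyxel   ",), ("linksys",), ("zyxelrouter",)])

# Instead of matching url ~* '^zyxel\s*$' (a regex that forces a seq scan),
# strip the padding once and compare for equality - fully indexable.
conn.execute("UPDATE test2 SET url = rtrim(url)")
conn.execute("CREATE INDEX test2_url_idx ON test2(url)")

rows = conn.execute("SELECT url FROM test2 WHERE url = ?", ("zyxel",)).fetchall()
assert rows == [("zyxel",)]
```

With the padding gone, the equality predicate matches exactly the rows the anchored regex did, and the lookup becomes an index probe instead of a 156 ms sequential scan.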

Andres



Re: [PERFORM] Building multiple indexes concurrently

2010-03-19 Thread Andres Freund
On Wednesday 17 March 2010 22:18:47 Hannu Krosing wrote:
> On Wed, 2010-03-17 at 16:49 -0400, Greg Smith wrote:
> > Alvaro Herrera wrote:
> > > Andres Freund escribió:
> > >> I find it much easier to believe such issues exist on tables in
> > >> contrast to indexes. The likelihood of getting sequential accesses on an
> > >> index is small enough on a big table to make it unlikely to matter
> > >> much.
> > > 
> > > Vacuum walks indexes sequentially, for one.
> > 
> > That and index-based range scans were the main two use-cases I was
> > concerned would be degraded by interleaving index builds, compared with
> > doing them in succession.
> 
> I guess that tweaking file systems to allocate in bigger chunks help
> here ? I know that xfs can be tuned in that regard, but how about other
> common file systems like ext3 ?
ext4 should do that now by allocating the space for the files only after some
time or upon things like fsync (xfs does the same).
ext3 has, as far as I know, neither the ability to change the allocation size
nor the ability to do delayed allocation.

Andres



Re: [PERFORM] Building multiple indexes concurrently

2010-03-17 Thread Andres Freund
On Wednesday 17 March 2010 19:44:56 Greg Smith wrote:
> Rob Wultsch wrote:
> > On Wed, Mar 17, 2010 at 7:30 AM, Tom Lane  wrote:
> >> No, it's not optimistic in the least, at least not since we implemented
> >> synchronized seqscans (in 8.3 or thereabouts).
> > 
> > Where can I find details about this in the documentation?
> 
> It's a behind the scenes optimization so it's not really documented on
> the user side very well as far as I know; easy to forget it's even there
> as I did this morning.
> http://j-davis.com/postgresql/syncscan/syncscan.pdf is a presentation
> covering it, and http://j-davis.com/postgresql/83v82_scans.html is also
> helpful.
> 
> While my pessimism on this part may have been overwrought, note the
> message interleaved on the list today with this discussion from Bob
> Lunney discussing the other issue I brought up:  "When using 8-way
> parallel restore against a six-disk RAID 10 group I found that table and
> index scan performance dropped by about 10x.  I/O performance was
> restored by either clustering the tables one at a time, or by dropping
> and restoring them one at a time.  The only reason I can come up with
> for this behavior is file fragmentation and increased seek times."  Now,
> Bob's situation may very well involve a heavy dose of table
> fragmentation from multiple active loading processes rather than index
> fragmentation, but this class of problem is common when trying to do too
> many things at the same time.  I'd hate to see you chase a short-term
> optimization (reduce total index built time) at the expense of long-term
> overhead (resulting indexes are not as efficient to scan).
I find it much easier to believe such issues exist on tables in
contrast to indexes. The likelihood of getting sequential accesses on an index
is small enough on a big table to make it unlikely to matter much.

What's your theory for why it would matter much?

Andres



Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-10 Thread Andres Freund
On Monday 08 February 2010 05:53:23 Robert Haas wrote:
> On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
> 
>  wrote:
> > Andres Freund escribió:
> >> I personally think the fsync on the directory should be added to the
> >> stable branches - other opinions?
> >> If wanted I can prepare patches for that.
> > 
> > Yeah, it seems there are two patches here -- one is the addition of
> > fsync_fname() and the other is the fsync_prepare stuff.
> 
> Andres, you want to take a crack at splitting this up?
I hope I didn't duplicate Greg's work, but I didn't hear back from him, so...

Everything <8.1 is hopeless because cp is used there... I didn't see it as
worth replacing. The patch applies cleanly for 8.1 to 8.4 and survives the
regression tests.

Given pg's heavy commit model I didn't see a point in splitting the patch for
9.0 as well...

Andres
diff --git a/src/port/copydir.c b/src/port/copydir.c
index 72fbf36..b057ffa 100644
*** a/src/port/copydir.c
--- b/src/port/copydir.c
*** copydir(char *fromdir, char *todir, bool
*** 50,55 
--- 50,56 
  {
  	DIR		   *xldir;
  	struct dirent *xlde;
+ 	int dirfd;
  	char		fromfile[MAXPGPATH];
  	char		tofile[MAXPGPATH];
  
*** copydir(char *fromdir, char *todir, bool
*** 91,96 
--- 92,116 
  	}
  
  	FreeDir(xldir);
+ 
+ 	/*
+ 	 * fsync the directory to make sure data has reached the
+ 	 * disk. While needed by most filesystems, the window got bigger
+ 	 * with newer ones like ext4.
+ 	 */
+ 	dirfd = BasicOpenFile(todir,
+ 	  O_RDONLY | PG_BINARY,
+ 	  S_IRUSR | S_IWUSR);
+ 	if(dirfd == -1)
+ 		ereport(ERROR,
+ 		(errcode_for_file_access(),
+ 		 errmsg("could not open directory for fsync \"%s\": %m", todir)));
+ 
+ 	if(pg_fsync(dirfd) == -1)
+ 		ereport(ERROR,
+ (errcode_for_file_access(),
+  errmsg("could not fsync directory \"%s\": %m", todir)));
+ 	close(dirfd);
  }
  
  /*



Re: [PERFORM] Deferred constraint and delete performance

2010-02-10 Thread Andres Freund
On Wednesday 10 February 2010 15:56:40 Tom Lane wrote:
> Franck Routier  writes:
> > I am wondering if deferring foreign key constraints (instead of
> > disableing them) would increase performance, compared to non deferred
> > constraints
> 
> No, it wouldn't make any noticeable difference AFAICS.  It would
> postpone the work from end-of-statement to end-of-transaction,
> but not make the work happen any more (or less) efficiently.
It could make a difference if the transaction is rather long and updates the
same row repeatedly, because of better cache usage. But I admit that's a bit of
a constructed scenario (where one would likely get into trigger-queue size
problems as well).

Andres



Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-08 Thread Andres Freund
On Monday 08 February 2010 19:34:01 Greg Stark wrote:
> On Mon, Feb 8, 2010 at 4:53 AM, Robert Haas  wrote:
> > On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
> > 
> >> Yeah, it seems there are two patches here -- one is the addition of
> >> fsync_fname() and the other is the fsync_prepare stuff.
> 
> Sorry, I'm just catching up on my mail from FOSDEM this past weekend.
> 
> I had come to the same conclusion as Greg that I might as well just
> commit it with Tom's "pg_flush_data()" name and we can decide later if
> and when we have pg_fsync_start()/pg_fsync_finish() whether it's worth
> keeping two apis or not.
> 
> So I was just going to commit it like that but I discovered last week
> that I don't have cvs write access set up yet. I'll commit it as soon
> as I generate a new ssh key and Dave installs it, etc. I intentionally
> picked a small simple patch that nobody was waiting on because I knew
> there was a risk of delays like this and the paperwork. I'm nearly
> there.
Do you still want me to split the patch into two or do you want to do it
yourself?
One in multiple versions for the directory fsync and another one for 9.0?

Andres



Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-07 Thread Andres Freund
On Monday 08 February 2010 05:53:23 Robert Haas wrote:
> On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
> 
>  wrote:
> > Andres Freund escribió:
> >> I personally think the fsync on the directory should be added to the
> >> stable branches - other opinions?
> >> If wanted I can prepare patches for that.
> > 
> > Yeah, it seems there are two patches here -- one is the addition of
> > fsync_fname() and the other is the fsync_prepare stuff.
> 
> Andres, you want to take a crack at splitting this up?
Will do. Later today or tomorrow morning.

Andres



Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-07 Thread Andres Freund
On Sunday 07 February 2010 19:27:02 Andres Freund wrote:
> On Sunday 07 February 2010 19:23:10 Robert Haas wrote:
> > On Sun, Feb 7, 2010 at 11:24 AM, Tom Lane  wrote:
> > > Greg Smith  writes:
> > >> This is turning into yet another one of those situations where
> > >> something simple and useful is being killed by trying to generalize
> > >> it way more than it needs to be, given its current goals and its lack
> > >> of external interfaces.  There's no catversion bump or API breakage
> > >> to hinder future refactoring if this isn't optimally designed
> > >> internally from day one.
> > > 
> > > I agree that it's too late in the cycle for any major redesign of the
> > > patch.  But is it too much to ask to use a less confusing name for the
> > > function?
> > 
> > +1.  Let's just rename the thing, add some comments, and call it good.
> 
> Will post an updated patch in the next hours unless somebody beats me to
> it.
Here we go.

I left the name at my suggestion pg_fsync_prepare instead of Tom's 
prepare_for_fsync because it seemed more consistent with the naming in the 
rest of the file. Obviously feel free to adjust.

I personally think the fsync on the directory should be added to the stable 
branches - other opinions?
If wanted I can prepare patches for that.

Andres
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 7ffa2eb..bc5753a 100644
*** a/src/backend/storage/file/fd.c
--- b/src/backend/storage/file/fd.c
*** pg_fdatasync(int fd)
*** 320,325 
--- 320,361 
  }
  
  /*
+  * pg_fsync_prepare --- try to make a later fsync on the same file faster
+  *
+  * A call to this function does not guarantee anything!
+  *
+  * The idea is to tell the kernel to write out its cache so that a
+  * fsync later on has less to write out synchronously. This allows
+  * that write requests get reordered more freely.
+  *
+  * In the current implementation this has the additional effect of
+  * dropping the cache in that region and thus can be used to avoid
+  * cache poisoning. This may or may not be wanted.
+  *
+  * XXX: Ideally this API would use sync_file_range (or similar on
+  * platforms other than linux) and a separate one for cache
+  * control. We are not there yet.
+  *
+  * Look at the thread below 200912282354.51892.and...@anarazel.de in
+  * pgsql-hackers for a longer discussion.
+  */
+ int
+ pg_fsync_prepare(int fd, off_t offset, off_t amount)
+ {
+ 	if (enableFsync)
+ 	{
+ #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+ 		return posix_fadvise(fd, offset, amount, POSIX_FADV_DONTNEED);
+ #else
+ 		return 0;
+ #endif
+ 	}
+ 	else
+ 		return 0;
+ }
+ 
+ 
+ /*
   * InitFileAccess --- initialize this module during backend startup
   *
   * This is called during either normal or standalone backend start.
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 21cb024..b1a4b49 100644
*** a/src/include/storage/fd.h
--- b/src/include/storage/fd.h
*** extern int	pg_fsync_no_writethrough(int 
*** 99,104 
--- 99,106 
  extern int	pg_fsync_writethrough(int fd);
  extern int	pg_fdatasync(int fd);
  
+ extern int	pg_fsync_prepare(int fd, off_t offset, off_t amount);
+ 
  /* Filename components for OpenTemporaryFile */
  #define PG_TEMP_FILES_DIR "pgsql_tmp"
  #define PG_TEMP_FILE_PREFIX "pgsql_tmp"
diff --git a/src/port/copydir.c b/src/port/copydir.c
index 72fbf36..eef3cfb 100644
*** a/src/port/copydir.c
--- b/src/port/copydir.c
***
*** 37,42 
--- 37,43 
  
  
  static void copy_file(char *fromfile, char *tofile);
+ static void fsync_fname(char *fname);
  
  
  /*
*** copydir(char *fromdir, char *todir, bool
*** 64,69 
--- 65,73 
  (errcode_for_file_access(),
   errmsg("could not open directory \"%s\": %m", fromdir)));
  
+ 	/*
+ 	 * Copy all the files
+ 	 */
  	while ((xlde = ReadDir(xldir, fromdir)) != NULL)
  	{
  		struct stat fst;
*** copydir(char *fromdir, char *todir, bool
*** 89,96 
--- 93,127 
  		else if (S_ISREG(fst.st_mode))
  			copy_file(fromfile, tofile);
  	}
+ 	FreeDir(xldir);
+ 
+ 	/*
+ 	 * Be paranoid here and fsync all files to ensure we catch problems.
+ 	 */
+ 	xldir = AllocateDir(fromdir);
+ 	if (xldir == NULL)
+ 		ereport(ERROR,
+ (errcode_for_file_access(),
+  errmsg("could not open directory \"%s\": %m", fromdir)));
+ 
+ 	while ((xlde = ReadDir(xldir, fromdir)) != NULL)
+ 	{
+ 		if (strcmp(xlde->d_name, ".") == 0 ||
+ 			strcmp(xlde->d_name, "..") == 0)
+ 			continue;
  
+ 		snprintf(tofile, MAXPGPATH, "%s/%s", todir, xlde->d_name);
+ 		fsync_fname(tofile);
+ 	}
  	FreeDir(xldir);
+ 
+ 	/* It's important to fsync the desti

Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-07 Thread Andres Freund
On Sunday 07 February 2010 19:23:10 Robert Haas wrote:
> On Sun, Feb 7, 2010 at 11:24 AM, Tom Lane  wrote:
> > Greg Smith  writes:
> >> This is turning into yet another one of those situations where something
> >> simple and useful is being killed by trying to generalize it way more
> >> than it needs to be, given its current goals and its lack of external
> >> interfaces.  There's no catversion bump or API breakage to hinder future
> >> refactoring if this isn't optimally designed internally from day one.
> > 
> > I agree that it's too late in the cycle for any major redesign of the
> > patch.  But is it too much to ask to use a less confusing name for the
> > function?
> 
> +1.  Let's just rename the thing, add some comments, and call it good.
Will post an updated patch in the next hours unless somebody beats me to it.

Andres



Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-06 Thread Andres Freund
On Saturday 06 February 2010 06:03:30 Greg Smith wrote:
> Andres Freund wrote:
> > On 02/03/10 14:42, Robert Haas wrote:
> >> Well, maybe we should start with a discussion of what kernel calls
> >> you're aware of on different platforms and then we could try to put an
> >> API around it.
> > 
> > In linux there is sync_file_range. On newer Posixish systems one can
> > emulate that with mmap() and msync() (in batches obviously).
> > 
> > No idea about windows.
> The effective_io_concurrency feature had proof of concept test programs
> that worked using AIO, but actually following through on that
> implementation would require a major restructuring of how the database
> interacts with the OS in terms of reads and writes of blocks.  It looks
> to me like doing something similar to sync_file_range on Windows would
> be similarly difficult.
Looking a bit around, it seems one could achieve something approximately 
similar to pg_prepare_fsync() by using
CreateFileMapping && MapViewOfFile && FlushViewOfFile 

If I understand it correctly that will flush, but not wait. Unfortunately you 
can't even make it wait, so it's not possible to implement sync_file_range or 
similar fully.

Andres



[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-03 Thread Andres Freund

On 02/03/10 14:42, Robert Haas wrote:

On Wed, Feb 3, 2010 at 6:53 AM, Greg Stark  wrote:

On Tue, Feb 2, 2010 at 7:45 PM, Robert Haas  wrote:

I think you're probably right, but it's not clear what the new name
should be until we have a comment explaining what the function is
responsible for.


So I wrote some comments but wasn't going to repost the patch with the
unchanged name without explanation... But I think you're right though
I was looking at it the other way around. I want to have an API for a
two-stage sync and of course if I do that I'll comment it to explain
that clearly.

The gist of the comments was that the function is preparing to fsync
to initiate the i/o early and allow the later fsync to fast -- but
also at the same time have the beneficial side-effect of avoiding
cache poisoning. It's not clear that the two are necessarily linked
though. Perhaps we need two separate apis, though it'll be hard to
keep them separate on all platforms.


Well, maybe we should start with a discussion of what kernel calls
you're aware of on different platforms and then we could try to put an
API around it.
In linux there is sync_file_range. On newer Posixish systems one can 
emulate that with mmap() and msync() (in batches obviously).


No idea about windows.
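For reference, a sync_file_range()-based flush hint on Linux could look roughly like the sketch below. This is illustration only: pg_flush_data_range() is an invented name, the flags are real Linux flags, and the non-Linux branch falls back to a no-op the same way the posted patch does.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical pg_flush_data_range(): ask the kernel to start
 * writeback for [offset, offset + nbytes) without waiting for it,
 * so that a later fsync() has less to write out synchronously.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
pg_flush_data_range(int fd, off_t offset, off_t nbytes)
{
#if defined(__linux__)
	/* initiate writeback, but do not wait for completion */
	return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
#else
	/* no known equivalent; make it a no-op, as the patch does */
	(void) fd;
	(void) offset;
	(void) nbytes;
	return 0;
#endif
}
```

A matching "fsync_finish" step would then pass SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER, or simply call fsync(fd).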

Andres



[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-03 Thread Andres Freund

On 02/03/10 12:53, Greg Stark wrote:

On Tue, Feb 2, 2010 at 7:45 PM, Robert Haas  wrote:

I think you're probably right, but it's not clear what the new name
should be until we have a comment explaining what the function is
responsible for.


So I wrote some comments but wasn't going to repost the patch with the
unchanged name without explanation... But I think you're right though
I was looking at it the other way around. I want to have an API for a
two-stage sync and of course if I do that I'll comment it to explain
that clearly.

The gist of the comments was that the function is preparing to fsync
to initiate the i/o early and allow the later fsync to fast -- but
also at the same time have the beneficial side-effect of avoiding
cache poisoning. It's not clear that the two are necessarily linked
though. Perhaps we need two separate apis, though it'll be hard to
keep them separate on all platforms.
I vote for two separate APIs - sure, there will be some unfortunate 
overlap for most unixoid platforms, but it's surely better to allow 
adding more platforms later in a centralized place than having to 
analyze every place where the API is used.


Andres



Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-02 Thread Andres Freund
On Tuesday 02 February 2010 20:06:32 Robert Haas wrote:
> On Tue, Feb 2, 2010 at 1:34 PM, Andres Freund  wrote:
> > For now it could - but it very well might be converted to sync_file_range
> > or similar, which would have different "side effects".
> > 
> > As the potential code duplication is rather small I would prefer to
> > describe the prime effect, not the side effects...
> 
> Hmm, in that case, I think the problem is that this function has no
> comment explaining its intended charter.
I agree there. Greg, do you want to update the patch with some comments or 
shall I?

Andres



Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-02 Thread Andres Freund
On Tuesday 02 February 2010 19:14:40 Robert Haas wrote:
> On Tue, Feb 2, 2010 at 12:50 PM, Tom Lane  wrote:
> > Andres Freund  writes:
> >> On Tuesday 02 February 2010 18:36:12 Robert Haas wrote:
> >>> I took a look at this patch today and I agree with Tom that
> >>> pg_fsync_start() is a very confusing name.  I don't know what the
> >>> right name is, but this doesn't fsync so I don't think it should have
> >>> fsync in the name.  Maybe something like pg_advise_abandon() or
> >>> pg_abandon_cache().  The current name is really wishful thinking:
> >>> you're hoping that it will make the kernel start the fsync, but it
> >>> might not.  I think pg_start_data_flush() is similarly optimistic.
> >> 
> >> What about: pg_fsync_prepare().
> > 
> > prepare_for_fsync()?
> 
> It still seems mis-descriptive to me.  Couldn't the same routine be
> used simply to abandon undirtied data that we no longer care about
> caching?
For now it could - but it very well might be converted to sync_file_range or 
similar, which would have different "side effects".

As the potential code duplication is rather small I would prefer to describe 
the prime effect, not the side effects...

Andres



Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-02 Thread Andres Freund
On Tuesday 02 February 2010 18:36:12 Robert Haas wrote:
> On Fri, Jan 29, 2010 at 1:56 PM, Greg Stark  wrote:
> > On Tue, Jan 19, 2010 at 3:25 PM, Tom Lane  wrote:
> >> That function *seriously* needs documentation, in particular the fact
> >> that it's a no-op on machines without the right kernel call.  The name
> >> you've chosen is very bad for those semantics.  I'd pick something
> >> else myself.  Maybe "pg_start_data_flush" or something like that?
> > 
> > I would like to make one token argument in favour of the name I
> > picked. If it doesn't convince I'll change it since we can always
> > revisit the API down the road.
> > 
> > I envision having two function calls, pg_fsync_start() and
> > pg_fsync_finish(). The latter will wait until the data synced in the
> > first call is actually synced. The fall-back if there's no
> > implementation of this would be for fsync_start() to be a noop (or
> > something unreliable like posix_fadvise) and fsync_finish() to just be
> > a regular fsync.
> > 
> > I think we can accomplish this with sync_file_range() but I need to
> > read up on how it actually works a bit more. In this case it doesn't
> > make a difference since when we call fsync_finish() it's going to be
> > for the entire file and nothing else will have been writing to these
> > files. But for wal writing and checkpointing it might have very
> > different performance characteristics.
> > 
> > The big objection to this is that then we don't really have an api for
> > FADV_DONT_NEED which is more about cache policy than about syncing to
> > disk. So for example a sequential scan might want to indicate that it
> > isn't planning on reading the buffers it's churning through but
> > doesn't want to force them to be written sooner than otherwise and is
> > never going to call fsync_finish().
> 
> I took a look at this patch today and I agree with Tom that
> pg_fsync_start() is a very confusing name.  I don't know what the
> right name is, but this doesn't fsync so I don't think it should have
> fsync in the name.  Maybe something like pg_advise_abandon() or
> pg_abandon_cache().  The current name is really wishful thinking:
> you're hoping that it will make the kernel start the fsync, but it
> might not.  I think pg_start_data_flush() is similarly optimistic.
What about: pg_fsync_prepare(). That gives the reason why we're doing it and 
doesn't promise that it is actually doing an fsync.
I really dislike having "cache" in the name, because the primary aim is not to 
discard the cache...

Andres



Re: [PERFORM] Benchmark shows very slow bulk delete

2010-01-27 Thread Andres Freund
On Wednesday 27 January 2010 15:49:06 Matthew Wakeling wrote:
> On Wed, 27 Jan 2010, Thom Brown wrote:
> > Had a quick look at a benchmark someone put together of MySQL vs
> > PostgreSQL, and while PostgreSQL is generally faster, I noticed the bulk
> > delete was very slow:
> > http://www.randombugs.com/linux/mysql-postgresql-benchmarks.html
> > 
> > Is this normal?
> 
> On the contrary, TRUNCATE TABLE is really rather fast.
> 
> Seriously, the Postgres developers, when designing the system, decided on
> a database layout that was optimised for the most common cases. Bulk
> deletion of data is not really that common an operation, unless you are
> deleting whole categories of data, where setting up partitioning and
> deleting whole partitions would be sensible.
> 
> Other complications are that the server has to maintain concurrent
> integrity - that is, another transaction must be able to see either none
> of the changes or all of them. As a consequence of this, Postgres needs to
> do a sequential scan through the table and mark the rows for deletion in
> the transaction, before flipping the transaction committed status and
> cleaning up afterwards.
> 
> I'd be interested in how mysql manages to delete a whole load of rows in
> 0.02 seconds. How many rows is that?
AFAIR mysql detects that case and converts it into some truncate equivalent.

Andres



Re: [PERFORM] splitting data into multiple tables

2010-01-25 Thread Andres Freund
On Tuesday 26 January 2010 01:39:48 nair rajiv wrote:
> On Tue, Jan 26, 2010 at 1:01 AM, Craig James wrote:
>   I am working on a project that will take out structured content
> from wikipedia
> and put it in our database. Before putting the data into the database I
> wrote a script to
> find out the number of rows every table would be having after the data is
> in and I found
> there is a table which will approximately have 50,000,000 rows after data
> harvesting.
> Is it advisable to keep so much data in one table ?
Depends on your access patterns. I.e. how many rows are you accessing at the 
same time - do those have some common locality and such.


>   I have read about 'partitioning' a table. An other idea I have is
> to break the table into
> different tables after the no of rows  in a table has reached a certain
> limit say 10,00,000.
> For example, dividing a table 'datatable' to 'datatable_a', 'datatable_b'
> each having 10,00,000 rows.
> I needed advice on whether I should go for partitioning or the approach I
> have thought of.
Your approach is pretty close to partitioning - except that partitioning makes 
that mostly invisible to the outside, so it is imho preferable.

>   We have a HP server with 32GB ram,16 processors. The storage has
> 24TB diskspace (1TB/HD).
> We have put them on RAID-5. It will be great if we could know the
> parameters that can be changed in the
> postgres configuration file so that the database makes maximum utilization
> of the server we have.
> For eg parameters that would increase the speed of inserts and selects.
Not using RAID-5 would possibly be a good start - many people (me included) 
have experienced bad write performance on it. It depends a great deal on the 
controller/implementation though.
RAID-10 is normally considered more advantageous despite its lower 
usable space ratio.
Did you create one big RAID-5 out of all the disks? That's not a good idea, 
because it's pretty likely that another disk fails while you restore a 
previously failed disk. Unfortunately, in that configuration that means you 
have lost all your data (in the most common implementations at least).

Andres

PS: Your lines are strangely wrapped...



Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-01-19 Thread Andres Freund
On Tuesday 19 January 2010 15:57:14 Greg Stark wrote:
> On Tue, Jan 19, 2010 at 2:52 PM, Greg Stark  wrote:
> > Barring any objections shall I commit it like this?
> 
> Actually before we get there could someone who demonstrated the
> speedup verify that this patch still gets that same speedup?
At least on the three machines I tested last time the result is still in the 
same ballpark.

Andres



Re: [PERFORM] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-01-19 Thread Andres Freund
Hi Greg,

On Tuesday 19 January 2010 15:52:25 Greg Stark wrote:
> On Mon, Jan 18, 2010 at 4:35 PM, Greg Stark  wrote:
> > Looking at this patch for the commitfest I have a few questions.
> 
> So I've touched this patch up a bit:
> 
> 1) moved the posix_fadvise call to a new fd.c function
> pg_fsync_start(fd,offset,nbytes) which initiates an fsync without
> waiting on it. Currently it's only implemented with
> posix_fadvise(DONT_NEED) but I want to look into using sync_file_range
> in the future -- it looks like this call might be good enough for our
> checkpoints.
Why exactly should that depend on fsync? Sure, that's where most of the pain 
comes from now, but avoiding that cache poisoning wouldn't hurt in other 
cases either.

I would rather have it called pg_flush_cache_range or such...

> 2) advised each 64k chunk as we write it which should avoid poisoning
> the cache if you do a large create database on an active system.
> 
> 3) added the promised but afaict missing fsync of the directory -- i
> think we should actually backpatch this.
I think so as well. You need it while recursing too (where I had 
added it), not only for the final directory.

> Barring any objections shall I commit it like this?
Other than the two things above it looks fine to me.

Thanks,

Andres



Re: [PERFORM] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-01-19 Thread Andres Freund
Hi Greg,

On Monday 18 January 2010 17:35:59 Greg Stark wrote: 
> 2) Why does the second pass to do the fsyncs read through fromdir to
> find all the filenames. I find that odd and counterintuitive. It would
> be much more natural to just loop through the files in the new
> directory. But I suppose it serves as an added paranoia check that the
> files are in fact still there and we're not fsyncing any files we
> didn't just copy. I think it should still work, we should have an
> exclusive lock on the template database so there really ought to be no
> differences between the directory trees.
If it weren't safe we would already have a big problem.

Andres



Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-01-19 Thread Andres Freund
On Tuesday 19 January 2010 15:52:25 Greg Stark wrote:
> On Mon, Jan 18, 2010 at 4:35 PM, Greg Stark  wrote:
> > Looking at this patch for the commitfest I have a few questions.
> 
> So I've touched this patch up a bit:
> 
> 1) moved the posix_fadvise call to a new fd.c function
> pg_fsync_start(fd,offset,nbytes) which initiates an fsync without
> waiting on it. Currently it's only implemented with
> posix_fadvise(DONT_NEED) but I want to look into using sync_file_range
> in the future -- it looks like this call might be good enough for our
> checkpoints.
> 
> 2) advised each 64k chunk as we write it which should avoid poisoning
> the cache if you do a large create database on an active system.
> 
> 3) added the promised but afaict missing fsync of the directory -- i
> think we should actually backpatch this.
Yes, that was a bit stupid of me - I added the fsync for directories which 
get recursed into (by not checking if it's a file) but not for the topmost 
level.
So right now all directories except the topmost one should get fsynced.

I will review the patch later when I finally have some time off again... 
in ~4h.

Thanks!

Andres



Re: [PERFORM] [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2009-12-29 Thread Andres Freund
On Monday 28 December 2009 23:59:43 Andres Freund wrote:
> On Monday 28 December 2009 23:54:51 Andres Freund wrote:
> > On Saturday 12 December 2009 21:38:41 Andres Freund wrote:
> > > On Saturday 12 December 2009 21:36:27 Michael Clemmons wrote:
> > > > If ppl think its worth it I'll create a ticket
> > >
> > > Thanks, no need. I will post a patch tomorrow or so.
> >
> > Well. It was a long day...
> >
> > Anyway.
> > In this patch I delay the fsync done in copy_file and simply do a second
> >  pass over the directory in copy_dir and fsync everything in that pass.
> > Including the directory - which was not done before and actually might be
> > necessary in some cases.
> > I added a posix_fadvise(..., FADV_DONTNEED) to make it more likely that
> > the copied file reaches storage before the fsync. Without the speed
> > benefits were quite a bit smaller and essentially random (which seems
> > sensible).
> >
> > This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s
> > on my laptop.  Still slower than with fsync off (~0.25) but quite a
> > worthy improvement.
> >
> > The benefits are obviously bigger if the template database includes
> >  anything added.
> 
> Obviously the patch would be helpful.
And it would also be helpful not to have annoying oversights in there. A 
FreeDir(xldir); is missing at the end of copydir().

Andres



Re: [PERFORM] [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2009-12-29 Thread Andres Freund
On Tuesday 29 December 2009 11:48:10 Greg Stark wrote:
> On Tue, Dec 29, 2009 at 2:05 AM, Andres Freund  wrote:
> >  Reads Completed:     2,      8KiB  Writes Completed:  2362,  29672KiB
> > New:
> >  Reads Completed:     0,      0KiB  Writes Completed:   550,   5960KiB
> 
> It looks like the new method is only doing 1/6th as much i/o. Do you
> know what's going on there?
While I was surprised by the amount of difference, I am not surprised at all 
that there is a significant one - currently the fsync will write out a whole 
bunch of useless stuff every time it's called (all metadata, directory 
structure and so on).

This is reproducible...

6MB sounds sensible for the operation btw - the template database is around 
5MB.


Will try to analyze later what exactly causes the additional io.


Andres



Re: [PERFORM] [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2009-12-28 Thread Andres Freund
On Tuesday 29 December 2009 04:04:06 Michael Clemmons wrote:
> Maybe not crash out but in this situation.
> N=0
> while(N>=0):
> CREATE DATABASE new_db_N;
> Since the fsync is the part which takes the memory and time but is
>  happening in the background, won't the fsyncs pile up in the background
>  faster than they can be run, filling up the memory and stack?
> This is very likely a mistake on my part about how postgres/processes
The difference should not be visible outside the "CREATE DATABASE ..." at all.
Simplified, the process currently works like:


for file in source directory:
    copy_file(source/file, target/file);
    fsync(target/file);


I changed it to:

-
for file in source directory:
    copy_file(source/file, target/file);

    /* please dear kernel, write this out, but don't block */
    posix_fadvise(target/file, FADV_DONTNEED);

for file in source directory:
    fsync(target/file);
-

If at any point in time there is not enough cache available to cache anything 
copy_file() will just have to wait for the kernel to write out the data.
fsync() does not use memory itself.
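The two passes above can be sketched as actual C. This is illustration only: the names copy_with_flush_hint() and fsync_fname_sketch() are invented, error handling is minimal, and the real copydir() additionally recurses into and fsyncs directories.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * First pass: copy src to dst, then hint the kernel to start
 * writing the copy out -- without blocking on it.
 * Returns 0 on success, -1 on failure.
 */
static int
copy_with_flush_hint(const char *src, const char *dst)
{
	char		buf[8192];
	ssize_t		n;
	int			sfd, dfd;

	if ((sfd = open(src, O_RDONLY)) < 0)
		return -1;
	if ((dfd = open(dst, O_CREAT | O_WRONLY | O_TRUNC, 0600)) < 0)
	{
		close(sfd);
		return -1;
	}
	while ((n = read(sfd, buf, sizeof(buf))) > 0)
	{
		if (write(dfd, buf, n) != n)
		{
			n = -1;			/* report the short write below */
			break;
		}
	}

	/* please dear kernel, write this out, but don't block */
	(void) posix_fadvise(dfd, 0, 0, POSIX_FADV_DONTNEED);

	close(sfd);
	close(dfd);
	return (n == 0) ? 0 : -1;
}

/*
 * Second pass, run after all files were copied: now actually
 * wait until each file has reached stable storage.
 */
static int
fsync_fname_sketch(const char *fname)
{
	int			fd = open(fname, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	return close(fd);
}
```

Because the fadvise hint is issued per file during the first pass, each fsync in the second pass usually finds most of its data already written back.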

Andres



Re: [PERFORM] [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2009-12-28 Thread Andres Freund
On Tuesday 29 December 2009 03:53:12 Michael Clemmons wrote:
> Andres,
> Great job.  Looking through the emails and thinking about why this works I
> think this patch should significantly speedup 8.4 on most any file
> system(obviously some more than others) unless the system has significantly
> reduced memory or a slow single core. On a Celeron with 256 memory I
>  suspect it'll crash out or just hit the swap  and be a worse bottleneck. 
>  Anyone have something like this to test on?
Why should it crash? The kernel should just block on writing and write out the 
dirty memory before continuing?
Pg is not caching anything here...

Andres



Re: [PERFORM] [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2009-12-28 Thread Andres Freund
On Tuesday 29 December 2009 01:46:21 Greg Smith wrote:
> Andres Freund wrote:
> > As I said the real benefit only occurred after adding posix_fadvise(..,
> > FADV_DONTNEED) which is somewhat plausible, because i.e. the directory
> > entries don't need to get scheduled for every file and because the kernel
> > can reorder a whole directory nearly sequentially. Without the advice
> > the kernel doesn't know in time that it should write that data back and
> > it won't do it for 5 seconds by default on linux or such...
> It would be interesting to graph the "Dirty" and "Writeback" figures in
> /proc/meminfo over time with and without this patch in place.  That
> should make it obvious what the kernel is doing differently in the two
> cases.
I did some analysis using blktrace (useful tool btw) and the results show that
the io pattern is *significantly* different.

For one, with the direct fsyncing nearly no hardware queuing is used, and for
another, nearly no requests are merged on the software side.

Short stats:

OLD:

Total (8,0):
 Reads Queued:        2,      8KiB  Writes Queued:     7854,  29672KiB
 Read Dispatches:     2,      8KiB  Write Dispatches:  1926,  29672KiB
 Reads Requeued:      0             Writes Requeued:      0
 Reads Completed:     2,      8KiB  Writes Completed:  2362,  29672KiB
 Read Merges:         0,      0KiB  Write Merges:      5492,  21968KiB
 PC Reads Queued:     0,      0KiB  PC Writes Queued:     0,      0KiB
 PC Read Disp.:     436,      0KiB  PC Write Disp.:       0,      0KiB
 PC Reads Req.:       0             PC Writes Req.:       0
 PC Reads Compl.:     0             PC Writes Compl.:  2362
 IO unplugs:       2395             Timer unplugs:      557


New:

Total (8,0):
 Reads Queued:        0,      0KiB  Writes Queued:     1716,   5960KiB
 Read Dispatches:     0,      0KiB  Write Dispatches:   324,   5960KiB
 Reads Requeued:      0             Writes Requeued:      0
 Reads Completed:     0,      0KiB  Writes Completed:   550,   5960KiB
 Read Merges:         0,      0KiB  Write Merges:      1166,   4664KiB
 PC Reads Queued:     0,      0KiB  PC Writes Queued:     0,      0KiB
 PC Read Disp.:     226,      0KiB  PC Write Disp.:       0,      0KiB
 PC Reads Req.:       0             PC Writes Req.:       0
 PC Reads Compl.:     0             PC Writes Compl.:   550
 IO unplugs:        503             Timer unplugs:       30


Andres

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2009-12-28 Thread Andres Freund
On Tuesday 29 December 2009 01:30:17 da...@lang.hm wrote:
> On Tue, 29 Dec 2009, Greg Stark wrote:
> > On Mon, Dec 28, 2009 at 10:54 PM, Andres Freund wrote:
> >> fsync everything in that pass.
> >> Including the directory - which was not done before and actually might
> >> be necessary in some cases.
> >
> > Er. Yes. At least on ext4 this is pretty important. I wish it weren't,
> > but it doesn't look like we're going to convince the ext4 developers
> > they're crazy any day soon and it would really suck for a database
> created from a template to have files in it go missing.
> 
> actually, as I understand it you need to do this on all filesystems except
> ext3, and on ext3 fsync is horribly slow because it writes out
> _everything_ that's pending, not just stuff related to the file you do the
> fsync on.
I don't think it's all filesystems (ext2 should not be affected...), but
generally you're right. At least jfs and xfs are affected as well.

It's btw not necessarily nearly-safe and slow on ext3 either (data=writeback).

Andres



Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2009-12-28 Thread Andres Freund
On Tuesday 29 December 2009 01:27:29 Greg Stark wrote:
> On Mon, Dec 28, 2009 at 10:54 PM, Andres Freund  wrote:
> > fsync everything in that pass.
> > Including the directory - which was not done before and actually might be
> > necessary in some cases.
> 
> Er. Yes. At least on ext4 this is pretty important. I wish it weren't,
> but it doesn't look like we're going to convince the ext4 developers
> they're crazy any day soon and it would really suck for a database
> created from a template to have files in it go missing.
Actually it was necessary on ext3 as well - the window to hit the problem just 
was much smaller, wasn't it?

Actually that part should possibly get backported.


Andres



Re: [PERFORM] [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2009-12-28 Thread Andres Freund
On Tuesday 29 December 2009 00:06:28 Tom Lane wrote:
> Andres Freund  writes:
> > This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s
> > on my laptop.  Still slower than with fsync off (~0.25) but quite a
> > worthy improvement.
> 
> I can't help wondering whether that's real or some kind of
> platform-specific artifact.  I get numbers more like 3.5s (fsync off)
> vs 4.5s (fsync on) on a machine where I believe the disks aren't lying
> about write-complete.  It makes sense that an fsync at the end would be
> a little bit faster, because it would give the kernel some additional
> freedom in scheduling the required I/O, but it isn't cutting the total
> I/O required at all.  So I find it really hard to believe a 10x speedup.
I only comfortably have access to two smaller machines without BBU from here 
(being in the Hacker Jeopardy at the ccc congress ;-)) and both show this 
behaviour. I guess it's somewhat filesystem dependent.

Andres



Re: [PERFORM] [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2009-12-28 Thread Andres Freund
On Tuesday 29 December 2009 00:06:28 Tom Lane wrote:
> Andres Freund  writes:
> > This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s
> > on my laptop.  Still slower than with fsync off (~0.25) but quite a
> > worthy improvement.
> I can't help wondering whether that's real or some kind of
> platform-specific artifact.  I get numbers more like 3.5s (fsync off)
> vs 4.5s (fsync on) on a machine where I believe the disks aren't lying
> about write-complete.  It makes sense that an fsync at the end would be
> a little bit faster, because it would give the kernel some additional
> freedom in scheduling the required I/O, but it isn't cutting the total
> I/O required at all.  So I find it really hard to believe a 10x speedup.
Well, a template database is about 5.5MB big here - that shouldn't take too
long when written near-sequentially?
As I said the real benefit only occurred after adding posix_fadvise(..,
FADV_DONTNEED), which is somewhat plausible, because e.g. the directory entries
don't need to get scheduled for every file and because the kernel can reorder a
whole directory nearly sequentially. Without the advice the kernel doesn't
know in time that it should write that data back and won't do it for 5
seconds or so by default on Linux...

I looked at the strace output - it looks sensible time-wise to me. If you're
interested I can give you the output of that.

Andres



Re: [PERFORM] [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2009-12-28 Thread Andres Freund
On Monday 28 December 2009 23:54:51 Andres Freund wrote:
> On Saturday 12 December 2009 21:38:41 Andres Freund wrote:
> > On Saturday 12 December 2009 21:36:27 Michael Clemmons wrote:
> > > If ppl think it's worth it I'll create a ticket
> >
> > Thanks, no need. I will post a patch tomorrow or so.
> 
> Well. It was a long day...
> 
> Anyway.
> In this patch I delay the fsync done in copy_file and simply do a second
>  pass over the directory in copy_dir and fsync everything in that pass.
> Including the directory - which was not done before and actually might be
> necessary in some cases.
> I added a posix_fadvise(..., FADV_DONTNEED) to make it more likely that the
> copied file reaches storage before the fsync. Without it the speed benefits
> were quite a bit smaller and essentially random (which seems sensible).
> 
> This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s on
>  my laptop.  Still slower than with fsync off (~0.25) but quite a worthy
>  improvement.
> 
> The benefits are obviously bigger if the template database includes
>  anything added.
Obviously the patch would be helpful.

Andres
From bd80748883d1328a71607a447677b0bfb1f54ab0 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 28 Dec 2009 23:43:57 +0100
Subject: [PATCH] Delay fsyncing files during copying in CREATE DATABASE - this
 dramatically speeds up CREATE DATABASE on non battery backed
 rotational storage.
 Additionally fsync() the directory to ensure all metadata reaches
 storage.

---
 src/port/copydir.c |   58 +--
 1 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/src/port/copydir.c b/src/port/copydir.c
index a70477e..cde3dc7 100644
*** a/src/port/copydir.c
--- b/src/port/copydir.c
***
*** 37,42 
--- 37,43 
  
  
  static void copy_file(char *fromfile, char *tofile);
+ static void fsync_fname(char *fname);
  
  
  /*
*** copydir(char *fromdir, char *todir, bool
*** 64,69 
--- 65,73 
  (errcode_for_file_access(),
   errmsg("could not open directory \"%s\": %m", fromdir)));
  
+ 	/*
+ 	 * Copy all the files
+ 	 */
  	while ((xlde = ReadDir(xldir, fromdir)) != NULL)
  	{
  		struct stat fst;
*** copydir(char *fromdir, char *todir, bool
*** 89,96 
  		else if (S_ISREG(fst.st_mode))
  			copy_file(fromfile, tofile);
  	}
- 
  	FreeDir(xldir);
  }
  
  /*
--- 93,120 
  		else if (S_ISREG(fst.st_mode))
  			copy_file(fromfile, tofile);
  	}
  	FreeDir(xldir);
+ 
+ 	/*
+ 	 * Be paranoid here and fsync all files to ensure we catch problems.
+ 	 */
+ 	xldir = AllocateDir(fromdir);
+ 	if (xldir == NULL)
+ 		ereport(ERROR,
+ (errcode_for_file_access(),
+  errmsg("could not open directory \"%s\": %m", fromdir)));
+ 
+ 	while ((xlde = ReadDir(xldir, fromdir)) != NULL)
+ 	{
+ 		struct stat fst;
+ 
+ 		if (strcmp(xlde->d_name, ".") == 0 ||
+ 			strcmp(xlde->d_name, "..") == 0)
+ 			continue;
+ 
+ 		snprintf(tofile, MAXPGPATH, "%s/%s", todir, xlde->d_name);
+ 		fsync_fname(tofile);
+ 	}
  }
  
  /*
*** copy_file(char *fromfile, char *tofile)
*** 150,162 
  	}
  
  	/*
! 	 * Be paranoid here to ensure we catch problems.
  	 */
! 	if (pg_fsync(dstfd) != 0)
! 		ereport(ERROR,
! (errcode_for_file_access(),
!  errmsg("could not fsync file \"%s\": %m", tofile)));
! 
  	if (close(dstfd))
  		ereport(ERROR,
  (errcode_for_file_access(),
--- 174,185 
  	}
  
  	/*
! 	 * We tell the kernel here to write the data back in order to make
! 	 * the later fsync cheaper.
  	 */
! #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
! 	posix_fadvise(dstfd, 0, 0, POSIX_FADV_DONTNEED);
! #endif
  	if (close(dstfd))
  		ereport(ERROR,
  (errcode_for_file_access(),
*** copy_file(char *fromfile, char *tofile)
*** 166,168 
--- 189,212 
  
  	pfree(buffer);
  }
+ 
+ /*
+  * fsync a file
+  */
+ static void
+ fsync_fname(char *fname)
+ {
+ 	int	fd = BasicOpenFile(fname, O_RDWR| PG_BINARY,
+ 		  S_IRUSR | S_IWUSR);
+ 
+ 	if (fd < 0)
+ 		ereport(ERROR,
+ (errcode_for_file_access(),
+  errmsg("could not create file \"%s\": %m", fname)));
+ 
+ 	if (pg_fsync(fd) != 0)
+ 		ereport(ERROR,
+ (errcode_for_file_access(),
+  errmsg("could not fsync file \"%s\": %m", fname)));
+ 	close(fd);
+ }
-- 
1.6.5.12.gd65df24




[PERFORM] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2009-12-28 Thread Andres Freund
On Saturday 12 December 2009 21:38:41 Andres Freund wrote:
> On Saturday 12 December 2009 21:36:27 Michael Clemmons wrote:
> > If ppl think it's worth it I'll create a ticket
> Thanks, no need. I will post a patch tomorrow or so.
Well. It was a long day...

Anyway.
In this patch I delay the fsync done in copy_file and simply do a second pass 
over the directory in copy_dir and fsync everything in that pass.
Including the directory - which was not done before and actually might be 
necessary in some cases.
I added a posix_fadvise(..., FADV_DONTNEED) to make it more likely that the 
copied file reaches storage before the fsync. Without it the speed benefits were 
quite a bit smaller and essentially random (which seems sensible).

This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s on my 
laptop.  Still slower than with fsync off (~0.25) but quite a worthy 
improvement.

The benefits are obviously bigger if the template database includes anything 
added.


Andres



Re: [PERFORM] 8.4.1 ubuntu karmic slow createdb

2009-12-12 Thread Andres Freund
On Saturday 12 December 2009 21:36:27 Michael Clemmons wrote:
> If ppl think it's worth it I'll create a ticket
Thanks, no need. I will post a patch tomorrow or so.

Andres



Re: [PERFORM] 8.4.1 ubuntu karmic slow createdb

2009-12-11 Thread Andres Freund
Hi,

On Saturday 12 December 2009 00:59:13 Scott Marlowe wrote:
> On Fri, Dec 11, 2009 at 3:52 PM, Michael Clemmons
> >  Createdb takes
> > 12secs on my system(9.10 pg8.4 and ext4)  which is impossibly slow for
> > running 200unittests.
> >  Fsync got it to .2secs or so which is blazing but
> > also the speed I expected being used to 8.3 and xfs.  This dev box is my
> > laptop and the data is litterally unimportant and doesn't exist longer
> > than 20sec but Im all about good practices.  Will definately try
> > synchronous commit tonight once Im done working for the day.  I've got
> > some massive copying todo later though so this will probably help in the
> > future as well.
> Yeah, I'd probably resort to fsync off in that circumstance too
> especially if syn commit off didn't help that much.
How should synchronous commit help with creating databases?

The problem with 8.4 and creating databases is that the number of files 
increased hugely because of the introduction of relation forks.
It probably wouldn't be that hard to copy all files first, then reopen and fsync
them. Actually that should be a patch doable in an hour or two.

Andres



Re: [PERFORM] 8.4.1 ubuntu karmic slow createdb

2009-12-10 Thread Andres Freund
Hi,

On Thursday 10 December 2009 23:01:08 Michael Clemmons wrote:
> I'm not sure what that means; ppl in my office with slower hd speeds using
> 8.4 can create a db in 2s vs my 8-12s.
- Possibly their config is different - they could have disabled the "fsync"
parameter, which makes the database no longer crash-safe but much faster
in some circumstances.

- Possibly you have much data in your template1 database?
You could check whether

CREATE DATABASE speedtest TEMPLATE template1; takes more time than
CREATE DATABASE speedtest TEMPLATE template0;.

You should issue both multiple times to ensure caching of the template
database doesn't play a role.

>  Could using md5 instead of ident do it?
Seems unlikely.
Is starting psql near-instantaneous? Are you using "createdb" or are you 
issuing "CREATE DATABASE ..."?

Andres



Re: [PERFORM] 8.4.1 ubuntu karmic slow createdb

2009-12-10 Thread Andres Freund
On Thursday 10 December 2009 21:41:08 Michael Clemmons wrote:
> Hey,
> I've got a computer which runs both 8.3 and 8.4.  To create a db it takes 4s
> for 8.3 and 9s for 8.4.  I have many unit tests which create databases all
> of the time and now run much slower than on 8.3, but it seems to be much
> longer than I remember - at one point I considered creating databases an
> instantaneous thing.  Does anyone on the list know why this is and if I can
> get it back to normal?
Possibly you had fsync=off at the time?

Andres



Re: [PERFORM] Checkpoint spikes

2009-12-08 Thread Andres Freund
On Wednesday 09 December 2009 03:05:40 Greg Smith wrote:
> On Linux having the WAL on a separate disk can improve things much more
> than you might expect, simply because of how brain-dead the filesystem
> fsync implementation is.  Reducing the seeks for WAL traffic can help a
> lot too.
Not using ext3's data=ordered helps massively already. 

Andres



Re: [PERFORM] performance penalty between Postgresql 8.3.8 and 8.4.1

2009-12-08 Thread Andres Freund
Hi David,

On Tuesday 08 December 2009 10:59:51 Schmitz, David wrote:
> >> With our data it is a performance difference from 1h16min
> >> (8.3.8) to 2h43min (8.4.1)
> On Postgresql 8.4.1
> Total runtime: 101.446 ms
> and on Postgresql  8.3.8:
> Total runtime: 29.366 ms
Hm. There obviously is more going on than these queries?

> Hash Join  (cost=2603.57..124518.03 rows=60772 width=61) (actual time=62.359..97.268 rows=1444 loops=1)
> Nested Loop  (cost=21.00..3790.18 rows=81 width=61) (actual time=0.210..26.098 rows=1444 loops=1)
Both misestimate the resultset quite a bit. It looks like happenstance that 
the one on 8.3 turns out to be better...

Andres



Re: [PERFORM] performance penalty between Postgresql 8.3.8 and 8.4.1

2009-12-07 Thread Andres Freund
Hi David,

On Monday 07 December 2009 23:05:14 Schmitz, David wrote:
> With our data it is a performance difference from 1h16min (8.3.8) to
>  2h43min (8.4.1)
Can you afford a explain analyze run overnight or so for both?

Andres



Re: [PERFORM] Large DB, several tuning questions: Index sizes, VACUUM, REINDEX, Autovacuum

2009-12-06 Thread Andres Freund
On Sunday 06 December 2009 19:20:17 Andreas Thiel wrote:
> Hi Andres,
> 
> Thanks a lot for your answers. As bottom line I think the answer is I
> have to rethink my DB structure.
Can't answer that one without knowing much more ;)

> > Could you please properly quote the email? The way you did it is quite
> > unreadable because you always have to guess who wrote what.
> I try to, is it now getting better? My apologies, still trying to adapt
> to using Office 07 :-)
Better, yes.


> Well, I know the data types of my columns sum up to 32 bytes right now
> (was about 100 before). As I only see a reduction of relpages/reltuples
> by 30% not by a factor 3, I assume that the row overhead kicks in. The
> data definition of the new table looks like this:
> bigint REFERENCES test_orig(test_id) ON DELETE CASCADE
> bigint REFERENCES part_orig(prt_id) ON DELETE CASCADE
> smallint
> bit(16)
> real
> text (usually empty in most rows)
> smallint
> I did calculate 32 Bytes per row (if text is empty), but actually
> relpages/reltuples is about ~63 bytes. This would result in a per row
> overhead of 31 bytes. Would it change anything if I remove the 2 FOREIGN
> KEY constraints?
If you remove those columns entirely, sure. If you remove only the constraint, 
no.

The row overhead in 8.3/8.4 is 28 bytes afaik. You miss two points in your
calculation - one is alignment (i.e. an integer will only start at a 4-byte
boundary) and the other is that for text you need to store the length of the
value as well.

> > It's pg_prepared_xacts (note the s), sorry, my mind played tricks on me.
> Nothing inside this table as well. (I did also - while trying to improve
> postgresql.conf a few days ago - restart the server a couple of times, I
> think that would have removed any hanging transactions or prepares,
> shouldn't it?)
No, prepared transactions do not get removed by restarting. But that's fine
then.

> > > > > max_fsm_relations = 4194304 # min 100, ~70 bytes
max_fsm_relations is the max number of relations you want to store in the fsm -
currently that means you could have 4 million tables+indexes.

> No, but it seems at least VACUUM is now running fine and no longer
> complaining about too small number for max_fsm_pages. Do you think if I
> reduce those two numbers, I'll have a better chance to run VACUUM FULL?
> Currently max_fsm_pages is slightly larger than relpages of my largest
> table. I read somewhere, max_fsm_pages should be about 1/2 of the total
> number of relpages in a DB, maybe another way to say it should be larger
> than the largest table...
The largest table does not really have any special influence on the fsm, so I
wouldn't count that rule as very good.
It's not that easy to calculate the size of the fsm correctly - that's why it's
gone in 8.4...

I know of several instances running with a larger fsm_pages - you could try to
reduce the fsm_relations setting - I don't know if there are problems lurking
with such an oversized value.

I actually doubt that that's related to the OOM you're seeing though - what's
your "maintenance_work_mem" setting, what's your
/proc/sys/vm/overcommit_ratio, and how much swap do you have?

Andres



Re: [PERFORM] Large DB, several tuning questions: Index sizes, VACUUM, REINDEX, Autovacuum

2009-12-06 Thread Andres Freund
Hi Andreas,

Could you please properly quote the email? The way you did it is quite 
unreadable because you always have to guess who wrote what.

On Sunday 06 December 2009 17:06:39 Andreas Thiel wrote:
> > I'm going to work on the table size of the largest table (result_orig)
> > itself by eliminating columns, stuffing n Booleans into bit(n)'s,
> > replacing double precision by reals, etc.. By this I should be able to
> > reduce the storage per row to ~1/3 of the bytes currently used.
> That sounds rather ambitious - did you factor in the per row overhead?
> I did now create the new table, I have now 63 instead of 94 bytes/row on
> average. So yes you're right I'm about to hit the bottom of the per row
> overhead.
How did you calculate that? Did you factor in the alignment requirements? The
DDL would be helpful...

> Btw, have you possibly left over some old prepared transactions or an
> idle in
> transaction connection? Both can lead to severe bloat.
> For the former you can check the system table pg_prepared_xact for the
> latter
> pg_stat_activity.
> Seems no the case, pg_prepared_xact doesn't even exist.
It's pg_prepared_xacts (note the s), sorry, my mind played tricks on me.

> Where would I find that postmaster output? In syslog? There's nothing
> visible...
Depends on your setup. I don't have the slightest clue about CentOS. If
necessary, start the postmaster directly.

> > max_fsm_relations = 4194304 # min 100, ~70 bytes each
Have you corrected that value?


Andres



Re: [PERFORM] Large DB, several tuning questions: Index sizes, VACUUM, REINDEX, Autovacuum

2009-12-05 Thread Andres Freund
Hi,

On Saturday 05 December 2009 00:03:12 Andreas Thiel wrote:
> I'm running PostgreSQL 8.3.6 on a 32-Bit Centos 4 machine (which I
> probably should update to 64 Bit soon)
How much memory?


> I'm going to work on the table size of the largest table (result_orig)
> itself by eliminating columns, stuffing n Booleans into bit(n)'s,
> replacing double precision by reals, etc.. By this I should be able to
> reduce the storage per row to ~1/3 of the bytes currently used.
That sounds rather ambitious - did you factor in the per row overhead?

> I have the same information stored in an Oracle 10g DB which consumes
> only 70G data and 2G for indexes. The schema may be better optimized,
> but for sure there is a table with 4 billion rows inside as well. So
> it's about 10x smaller in disk space than PgSQL. I wonder why.
That's hard to say without seeing the table definitions for both. Could you
post them?

2GB for indexes sounds rather small - those are btrees?

It might also be interesting to look into the freespacemap to see how much 
empty space there is - there is a contrib module pg_freespacemap for that.

You can also check how many dead tuples an 'ANALYZE VERBOSE tablename' sees.

> Is such disk usage for indexes expected? What can I do to optimize? I
> could not run yet a VACUUM on result_orig, as I hit into max_fsm_pages
> limit (still trying to adjust that one). I tried REINDEX, it didn't
> change anything.
So it's quite possible that your relations are heavily bloated - although if
you reindex that shouldn't matter that much.

Btw, have you possibly left over some old prepared transactions or an
idle-in-transaction connection? Both can lead to severe bloat.
For the former you can check the system table pg_prepared_xact, for the
latter pg_stat_activity.

> ### My Issue No. 2: relpages and VACUUM
> I have another table "test" which is - as starting point - created by
> INSERTs and then UPDATE'd. It has the same columns and roughly the same
> number of rows as table test_orig,  but consumes 160 times the number of
> pages. I tried VACUUM on this table but it did not change anything on
> its relpages count. Maybe this is just because VACUUM without FULL does
> not re-claim disk space, i.e. relpages stays as it is? I did observe
> that after VACUUM, a REINDEX on this table did considerably shrink down
> the size of its indexes (test_test_id, test_lt_id).
A normal VACUUM does not move tuples around - it only marks space as free so
it can later be reused.

(If the free space is at the end of the table it tries to truncate it, if
there are no locks preventing that.)

> ### My Issue No 3: VACCUM FULL out of memory
> I tried to do a VACCUM FULL on the two tables (test, result_orig)
> mentioned above. In both cases it fails with a very low number on out of
> memory like this:
> 
> ERROR:  out of memory
> DETAIL:  Failed on request of size 224.
Well, that's the amount of memory it's trying to allocate, not the amount it
has allocated. Normally the postmaster should output some sort of memory map
when that happens. Did you get anything like that?

> I use these kernel settings:
> kernel.shmmni = 4096
> kernel.shmall = 2097152
> kernel.shmmax = 2147483648
> vm.overcommit_memory = 2

> max_stack_depth = 8MB   # min 100kB
That sounds a bit too high if you consider that libc and friends may use some
stack space as well - although that should be unrelated to the current issue.

> max_fsm_pages = 7000  # min max_fsm_relations*16, 6 bytes each
As a very rough guide you can start with the sum of relpages in pg_class for 
that one.

> max_fsm_relations = 4194304 # min 100, ~70 bytes each
That seems kinda high. Do you have many millions of relations? It might be
related to the OOM situation during VACUUM FULL, although that seems rather
unlikely.

> ###My Issue No. 4: Autovacuum
> I have the feeling that Autovacuum is not really running, else why are
> tables and indexes growing that much, especially "test" table?
You should see notes about autovacuum in the logs. With an
autovacuum_vacuum_scale_factor of 0.2 you need 0.2 times the size of a table
in changed tuples before autovacuum starts.
For a billion rows that's quite a bit. I find that this setting is often too high.

> How would I check it is running correctly? I don't see any error
> messages in syslog from autovacuum.
You should see messages about it starting in the syslog.


Andres



Re: [PERFORM] Analyse without locking?

2009-11-26 Thread Andres Freund
On Thursday 26 November 2009 17:20:35 Richard Neill wrote:
> Dear All,
> 
> I'm wondering whether Vacuum/analyse (notably by the autovacuum daemon)
> is responsible for some deadlocks/dropouts I'm seeing.
> 
> One particular table gets hit about 5 times a second (for single row
> updates and inserts) + associated index changes. This is a very light
> load for the hardware; we have 7 CPU cores idling, and very little disk
> activity. The query normally runs in about 20 ms.
> 
> However, the query must always respond within 200ms, or userspace gets
> nasty errors.  [we're routing books on a sorter machine, and the book
> misses its exit opportunity]. Although this is a low load, it's a bit
> like a heartbeat.
> 
> The question is, could the autovacuum daemon (running either in vacuum
> or in analyse mode) be taking out locks on this table that sometimes
> cause the query response time to go way up (exceeding 10 seconds)?
> 
> I think I've set up autovacuum to do "little and often", using
>    autovacuum_vacuum_cost_delay = 20ms
>    autovacuum_vacuum_cost_limit = 20
> but I'm not sure this is doing exactly what I think it is. In
> particular, the system-wide I/O (and CPU) limit of autovacuum is
> negligible, but it's possible that queries may be waiting on locks.
> 
> In particular, I want to make sure that the autovacuum daemon never
> holds any lock for more than about 50ms at a time. (or will release it
> immediately if something else wants it)
> 
> Or am I barking up the wrong tree entirely?
I would suggest enabling log_lock_waits and setting deadlock_timeout to a low
value - that should give you more information.
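Concretely, that would be something like the following in postgresql.conf (the 100ms value is just an illustration - anything comfortably below the 200ms budget):

```ini
# log a message whenever a lock is waited on longer than deadlock_timeout
log_lock_waits = on
# deadlock_timeout doubles as the lock-wait logging threshold (default 1s)
deadlock_timeout = 100ms
```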

Andres



Re: [PERFORM] Repeated Query is much slower in PostgreSQL8.2.4 than DB2 9.1

2009-07-16 Thread Andres Freund
On Thursday 16 July 2009 11:46:18 ning wrote:
> I'm sorry, they are in milliseconds, not seconds.
> The time used is about the same as the result of the "explain analyze
> select ..." I pasted above,
> which was "Total runtime: 0.479 ms".
Yea. Unfortunately that time does not include planning time. If you work 
locally on the server using psql you can use '\timing' to make psql output 
timing information.

If I interpret those findings correctly the execution is approx. as fast as DB2, 
only DB2 is doing automated plan caching while pg is not.

If it _really_ is necessary that this is that fast, you can prepare the query 
like I showed.

Andres



Re: [PERFORM] Repeated Query is much slower in PostgreSQL8.2.4 than DB2 9.1

2009-07-16 Thread Andres Freund
On Thursday 16 July 2009 11:30:00 ning wrote:
> Hi Andres,
>
> By executing
> #explain analyze execute test_query;
>
> the first execution cost 0.389 seconds
> the second cost 0.285 seconds
Seconds or milliseconds?

If seconds, that would be far slower than the plain SELECT, right?

Andres



Re: [PERFORM] Repeated Query is much slower in PostgreSQL8.2.4 than DB2 9.1

2009-07-16 Thread Andres Freund
On Thursday 16 July 2009 03:11:29 ning wrote:
> Hi Andres,
>
> The log for the test you suggested is as follows in PostgreSQL8.2.4,
> but I cannot find a clue to prove or disprove that PostgreSQL is doing
> plan caching.
Well. How long are the PREPARE and the EXECUTEs taking?


Andres



Re: [PERFORM] Repeated Query is much slower in PostgreSQL8.2.4 than DB2 9.1

2009-07-15 Thread Andres Freund
On Wednesday 15 July 2009 10:27:50 Craig Ringer wrote:
> On Wed, 2009-07-15 at 12:10 +0900, ning wrote:
> > First execution: PostgreSQL 0.006277 seconds / DB2 0.009028 seconds
> > Second execution: PostgreSQL 0.005932 seconds / DB2 0.000332 seconds
>
> Actually, on second thoughts that looks a lot like DB2 is caching the
> query results and is just returning the cached results when you repeat
> the query.
Are you sure getting the query *result* is causing the delay? If my faint 
memory serves right DB2 does plan caching - PG does not.
To test this theory you could prepare it and execute it twice.

Prepare it (PostgreSQL writes prepared-statement parameters as $1, $2, ... and
needs their types declared - integer is assumed here):
PREPARE test_query(integer, integer, integer) AS
SELECT void,nameId,tag FROM (SELECT void,nameId,tag
FROM Attr
WHERE attributeof IN (SELECT oid_ FROM ItemView WHERE
ItemView.ItemId=$1)) x RIGHT OUTER JOIN (SELECT oid_ FROM ItemView
WHERE ItemView.ItemId=$2 AND ItemView.assignedTo_=$3) y ON attributeof =
oid_ FOR READ ONLY;


Execute it (substituting your actual parameter values):
EXECUTE test_query(..., ..., ...);
EXECUTE test_query(..., ..., ...);

Greetings, 

Andres



Re: [PERFORM] Very big insert/join performance problem (bacula)

2009-07-13 Thread Andres Freund
Hi Marc,

I don't have really extensive comments, but I found two small things...

On Monday 13 July 2009 15:40:18 Marc Cousin wrote:
> I'm trying to solve big performance issues with PostgreSQL + bacula while
> inserting very big sets of records.
>
> I'm sorry, this email will be a long one, as I've already spent quite a lot
> of time on the issue, I don't want to waste your time speculating on things
> I may already have done, and the problem is (or seems to me) a bit complex.
> The other problem is that I don't have the explain plans to provide with
> the email right now. I'll try to use this as a way to push 8.4 in this
> setup, to dump all these plans with autoexplain (queries are on temporary
> tables, so a bit tricky to get).
>
> Let me first explain or remind how this works. Bacula is a backup solution
> and is trying to insert its metadatas at the end of backups (file name,
> directory name, size, etc ...)
> For what we are interested in, there are 3 tables :
> - file
> - filename
> - path
>
> file is the one containing most records. It's the real metadata. filename
> and path just contain an id and the real file or directory name (to save
> some space with redundant names).
>
> Before explaining the issue, just some information about sizing here :
>
> file is 1.1 billion records for 280GB (with indexes).
>
>    Column   |  Type   |                       Modifiers
> ------------+---------+--------------------------------------------------------
>  fileid     | bigint  | not null default nextval('file_fileid_seq'::regclass)
>  fileindex  | integer | not null default 0
>  jobid      | integer | not null
>  pathid     | integer | not null
>  filenameid | integer | not null
>  markid     | integer | not null default 0
>  lstat      | text    | not null
>  md5        | text    | not null
> Indexes:
> "file_pkey" UNIQUE, btree (fileid)
> "file_fp_idx" btree (filenameid, pathid)
> "file_jpfid_idx" btree (jobid, pathid, filenameid)
>
>
> path is 17 million for 6 GB
>
>  Column |  Type   |   Modifiers
> --------+---------+--------------------------------------------------------
>  pathid | integer | not null default nextval('path_pathid_seq'::regclass)
>  path   | text| not null
> Indexes:
> "path_pkey" PRIMARY KEY, btree (pathid)
> "path_name_idx" UNIQUE, btree (path)
>
> filename is 80 million for 13GB
>
>    Column   |  Type   |                          Modifiers
> ------------+---------+--------------------------------------------------------------
>  filenameid | integer | not null default nextval('filename_filenameid_seq'::regclass)
>  name       | text    | not null
> Indexes:
> "filename_pkey" PRIMARY KEY, btree (filenameid)
> "filename_name_idx" UNIQUE, btree (name)
>
>
> There are several queries for each job despooling :
>
> First we fill a temp table with the raw data (filename, pathname,
> metadata), using COPY (no problem here)
>
> Then we insert missing filenames in file, and missing pathnames in path,
> with this query (and the same for file) :
>
> INSERT INTO Path (Path)
>   SELECT a.Path FROM (SELECT DISTINCT Path FROM batch) AS a
>  WHERE NOT EXISTS (SELECT Path FROM Path WHERE Path = a.Path)
>
> These do nested loops and work very well (after a sort on batch to get rid
> from duplicates). They work reasonably fast (what one would expect when
> looping on millions of records... they do their job in a few minutes).
While this is not your question, I still noticed you seem to be on 8.3 - it 
might be a bit faster to use GROUP BY instead of DISTINCT.
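For the Path insert quoted above, that rewrite could look like this (same result, with GROUP BY doing the deduplication):

```sql
INSERT INTO Path (Path)
  SELECT a.Path
    FROM (SELECT Path FROM batch GROUP BY Path) AS a   -- instead of SELECT DISTINCT Path
   WHERE NOT EXISTS (SELECT 1 FROM Path WHERE Path.Path = a.Path);
```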

> The problem occurs with the final query, which inserts data in file,
> joining the temp table to both file and filename
>
> INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5)
>   SELECT batch.FileIndex,
>   batch.JobId,
>   Path.PathId,
>   Filename.FilenameId,
>   batch.LStat,
>   batch.MD5
>   FROM batch
>   JOIN Path ON (batch.Path = Path.Path)
>   JOIN Filename ON (batch.Name = Filename.Name)
>
> This one has two split personalities, depending on how many records are in
> batch.
> For small batch tables, it does nested loops.
> For big batch tables (more than around one million initially) it decides to
> hash join path (still ok, it's reasonably small) and then filename to batch
> before starting. And that's when the problems begin. The behaviour seems
> logical to me - it should go to hash join when batch gets bigger, but it
> seems to be much too early here, considering the size of filename.
>
> First of all, performance remains much better on nested loops, except for
> extremely big batches (i'd say over 30 million, extrapolating from the
> times I'm seeing with 10 millions records), so if I disable hash/merge
> joins, I get my performance back on these queries (they execute in around
> the same time as the searches in path and filename above). So I found a way
> to make most of my queries do nested loops.

Re: [PERFORM] random slow query

2009-06-29 Thread Andres Freund

On 06/29/2009 03:33 PM, Sean Ma wrote:

I have a table of about 50 million rows. There are a few writers that pump
data into the table at a rate of 4 rows/hour. Most of the time, the
SELECT takes less than 100 ms. However, sometimes it is very slow, from 30
seconds to 500 seconds. The database is vacuumed and analyzed regularly.

One month ago, this type of slow query happened a few times per day.
But recently, the slow queries happen more often, at a rate of
once every 10 minutes or less. There seems to be no relation to the
database load or the type of query. If I manually execute these
queries, they return in less than 1 second.

I just wonder where I should start to look?
The slow queries could be waiting for locks - so you could enable 
log_lock_waits to see if that is the issue.
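For example (log_lock_waits exists since 8.3; it logs a message whenever a lock wait exceeds deadlock_timeout):

```sql
-- per session as a superuser, or set both in postgresql.conf:
SET log_lock_waits = on;
SET deadlock_timeout = '1s';   -- threshold above which waits are logged
```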


Andres




Re: [PERFORM] How would you store read/unread topic status?

2009-06-23 Thread Andres Freund

On 06/23/2009 04:54 PM, Mathieu Nebra wrote:

On 06/23/2009 01:12 PM, Mathieu Nebra wrote:

I'm running a quite large website which has its own forums.
They are currently heavily used and I'm getting performance
issues. Most of

them

are due to repeated UPDATE queries on a "flags" table.

This "flags" table has more or less the following fields:

UserID - TopicID - LastReadAnswerID

The flags table keeps track of every topic a member has
visited and remembers the last answer which was posted at
this moment. It allows the user to come back a few days
after and immediately jump to the last answer he has not
read. My problem is that everytime a user READS a topic, it
UPDATES this flags table to remember he has read it. This
leads to multiple updates at the same time on the same table,
and an update can take a few seconds. This is not acceptable
for my users.

Have you analyzed why it takes that long? Determining that is the
first step of improving the current situation...

My first guess would be that your disks cannot keep up with the
 number of synchronous writes/second. Do you know how many
transactions with write access you have? Guessing from your
description you do at least one write for every page hit on your
 forum.


I don't know how many writes/s Pgsql can handle on my server, but I
first suspected that it was good practice to avoid unnecessary
writes.

It surely is.


I do 1 write/page for every connected user on the forums. I do the
same on another part of my website to increment the number of page
views (this was not part of my initial question but it is very
close).

That cries even more for some in-memory caching.


On which OS are you? If you are on linux you could use iostat to
 get some relevant statistics like: iostat -x
/path/to/device/the/database/resides/on 2 10

That gives you 10 statistics over periods of 2 seconds.


Depending on those results there are numerous solutions to that

problem...

Here it is:

$ iostat -x /dev/sda 2 10
Linux 2.6.18-6-amd64 (scratchy)     23.06.2009

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18,02    0,00   12,87   13,13    0,00   55,98

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0,94  328,98  29,62  103,06   736,58  6091,14    51,46     0,04   0,25   0,04   0,51

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          39,65    0,00   48,38    2,00    0,00    9,98

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0,00    0,00  10,00   78,00   516,00  1928,00    27,77     6,44  73,20   2,75  24,20

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          40,15    0,00   48,13    2,24    0,00    9,48

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0,00    0,00   6,47  100,50   585,07  2288,56    26,87    13,00 121,56   3,00  32,04

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          45,14    0,00   45,64    6,73    0,00    2,49

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        1,00    0,00  34,00  157,50  1232,00  3904,00    26,82    26,64 139,09   3,03  58,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          46,25    0,00   49,25    3,50    0,00    1,00

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0,00    0,00  27,00  173,00   884,00  4224,00    25,54    24,46 122,32   3,00  60,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          44,42    0,00   47,64    2,23    0,00    5,71

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0,00    0,00  15,42  140,30   700,50  3275,62    25,53    17,94 115,21   2,81  43,78

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          41,75    0,00   48,50    2,50    0,00    7,25

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0,50    0,00  21,11  116,08   888,44  2472,36    24,50    12,62  91,99   2,55  34,97

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          44,03    0,00   46,27    2,99    0,00    6,72

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        9,00    0,00  10,00  119,00   484,00  2728,00    24,90    15,15 117,47   2,70  34,80

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          36,91    0,00   51,37    2,49    0,00    9,23

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0,99    0,00  14,78  136,45   390,15  2825,62    21,26    21,86 144,52   2,58  39,01

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          38,75    0,00   48,75    1,00    0,00   11,50

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0,00    0,00   7,54   67,34   377,89  1764,82    28,62     5,38  71,89   2,95  22,11

You see that

Re: [PERFORM] How would you store read/unread topic status?

2009-06-23 Thread Andres Freund

On 06/23/2009 02:37 PM, Alexander Staubo wrote:

(That said, I believe PostgreSQL diffs tuple updates, so in practice
PostgreSQL might not be writing anything if you run an "update" with
the same value. I will let someone more intimate with the internal
details of updates comment on this.)

No, it does not do that by default.
You can write a trigger to do that though - and there is one packaged 
with the core version in the upcoming 8.4 version.
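That trigger function is suppress_redundant_updates_trigger(); hooked onto the flags table from this thread, it would look roughly like this (the table and trigger names are illustrative - the name is chosen so it sorts after other BEFORE triggers):

```sql
-- PostgreSQL 8.4+: silently skip UPDATEs that would not change the row
CREATE TRIGGER z_min_update
BEFORE UPDATE ON flags
FOR EACH ROW
EXECUTE PROCEDURE suppress_redundant_updates_trigger();
```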


Andres



Re: [PERFORM] How would you store read/unread topic status?

2009-06-23 Thread Andres Freund

On 06/23/2009 01:12 PM, Mathieu Nebra wrote:

I'm running a quite large website which has its own forums. They are
currently heavily used and I'm getting performance issues. Most of them
are due to repeated UPDATE queries on a "flags" table.

This "flags" table has more or less the following fields:

UserID - TopicID - LastReadAnswerID

The flags table keeps track of every topic a member has visited and
remembers the last answer which was posted at this moment. It allows the
user to come back a few days after and immediately jump to the last
answer he has not read.
My problem is that every time a user READS a topic, it UPDATES this flags
table to remember he has read it. This leads to multiple updates at the
same time on the same table, and an update can take a few seconds. This
is not acceptable for my users.
Have you analyzed why it takes that long? Determining that is the first 
step of improving the current situation...


My first guess would be that your disks cannot keep up with the number 
of synchronous writes/second. Do you know how many transactions with 
write access you have? Guessing from your description you do at least 
one write for every page hit on your forum.


With the default settings, every transaction needs to wait for IO at the 
end - to ensure transactional semantics.
Depending on your disk the number of possible writes/second is quite low 
- a normal SATA disk with 7200rpm can satisfy something around 130 
synchronous writes per second, which is then the upper limit on write 
transactions per second.
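If losing the last few moments of committed transactions on a crash is acceptable, synchronous_commit (available since 8.3) removes that per-commit wait while keeping the database consistent:

```sql
SET synchronous_commit = off;   -- commit returns before the WAL flush hits disk
```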

What disks do you have?

On which OS are you? If you are on linux you could use iostat to get 
some relevant statistics like:

iostat -x /path/to/device/the/database/resides/on 2 10

That gives you 10 statistics over periods of 2 seconds.


Depending on those results there are numerous solutions to that problem...


Question: what is the general rule of thumb here? How would you store
this information?
The problem here is that every read access writes to disk - that is not 
going to scale very well.
One possible solution is to use something like memcached to store the 
last read post in memory and periodically write it into the database.



Which pg version are you using?


Andres



Re: [PERFORM] Any better plan for this query?..

2009-05-12 Thread Andres Freund

Hi,

On 05/12/2009 12:46 AM, Dimitri wrote:

So, why I don't use prepare here: let's say I'm testing the worst
stress case :-)  Imagine you have thousands of such kind of queries -
you cannot prepare all of them! :-)  or you'll maybe prepare it once,
but as I showed previously in this thread prepare statement itself
takes 16ms, so for a single shot there is no gain! :-)
I have a hard time imagining a high-throughput OLTP workload with that 
many different queries ;-)


Naturally it would still be nice to perform well in this non-optimal workload...

Andres


