Re: [PERFORM] Performance

2011-04-29 Thread James Mansion

Greg Smith wrote:
There are also some severe query plan stability issues with this idea 
beyond this.  The idea that your plan might vary based on execution 
latency, that the system load going up can make query plans alter with 
it, is terrifying for a production server.


I thought I was clear that it should present some stats to the DBA, not that it would try to auto-tune?  This thread started with a discussion of appropriate tunings for random page cost vs sequential page cost, I believe, based on some finger-in-the-air estimate of total size vs available disk cache.  And it was observed that on systems that have very large databases but modest hot data, you can perform like a fully cached system, for much of the time.


I'm just suggesting providing statistical information to the DBA which 
will indicate whether the system has 'recently' been behaving like a 
system that runs from buffer cache and/or subsystem caches, or one that 
runs from disk platters, and what the actual observed latency difference 
is.  It may well be that this varies with time of day or day of week.  
Whether the actual latencies translate directly into the relative costs 
is another matter.





--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance

2011-04-29 Thread James Mansion

Robert Haas wrote:

The server can and does measure hit rates for the PG buffer pool, but to my 
knowledge there is no clear-cut way for PG to know whether read() is satisfied 
from the OS cache or a drive cache or the platter.

  
Does the server know which IO it thinks is sequential, and which it 
thinks is random?  Could it not time the IOs (perhaps optionally) and at 
least keep some sort of statistics of the actual observed times?


It might not be appropriate for the server to attempt auto-tuning, but 
it might be able to provide some information that can be used by a DBA 
to make informed decisions.
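
For what it's worth, a minimal sketch of the sort of per-read timing I have in mind (purely illustrative, not anything in the PostgreSQL source):

/* Illustrative only: wrap each read with a monotonic clock and bucket the
 * observed latency, so a DBA can later see whether reads looked like cache
 * hits or platter seeks. */
#include <stdint.h>
#include <time.h>
#include <unistd.h>

#define NBUCKETS 32
static uint64_t latency_hist[NBUCKETS];      /* power-of-two microsecond buckets */

static ssize_t timed_pread(int fd, void *buf, size_t len, off_t off)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = pread(fd, buf, len, off);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    int64_t us = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000
               + (t1.tv_nsec - t0.tv_nsec) / 1000;
    int bucket = 0;
    while (us > 1 && bucket < NBUCKETS - 1) { us >>= 1; bucket++; }
    latency_hist[bucket]++;                  /* sub-ms buckets look cached, multi-ms look like seeks */
    return n;
}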


James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] BBU Cache vs. spindles

2010-10-29 Thread James Mansion

Tom Lane wrote:

Uh, no, it is not.  The difference is that we can update a byte in a
shared buffer, and know that it *isn't* getting written out before we
  
Well, I don't know where you got the idea I was referring to that sort of thing - it's the same as writing to a buffer before copying to the mmap'd area.

It's true that we don't know whether write() causes an immediate or
delayed disk write, but we generally don't care that much.  What we do
  

Which is what I was referring to.

care about is being able to ensure that a WAL write happens before the
data write, and with mmap we don't have control over that.

  
I think you have just the same control either way, because you can only force ordering with an appropriate explicit sync, and in the absence of such a sync all bets are off for whether/when each disk page is written out; and if you can't ensure that the controller and disk are write-through you'd better do a hardware cache flush too, right?


A shame that so many systems have relatively poor handling of that 
hardware flush.
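
To make that concrete, here is a minimal illustrative sketch (the function and names are invented, not PostgreSQL code) of forcing the ordering with explicit syncs; the caveat from upthread is that the kernel is free to write a dirty page - mmap'd or already write()n - earlier than this of its own accord:

/* Illustrative only: enforce "WAL durable before data page" with explicit
 * syncs.  The ordering guarantee comes from the fsync/msync barriers, not
 * from which API dirtied the bytes. */
#include <sys/mman.h>
#include <unistd.h>

int commit_page(int wal_fd, const void *wal_rec, size_t wal_len,
                void *mapped_page, size_t page_len)
{
    /* 1. Make the log record durable first. */
    if (write(wal_fd, wal_rec, wal_len) != (ssize_t) wal_len)
        return -1;
    if (fsync(wal_fd) != 0)                   /* barrier: WAL on stable storage */
        return -1;

    /* 2. Only now force the data page out; with write() the equivalent
     *    barrier would be an fsync() on the data file descriptor. */
    if (msync(mapped_page, page_len, MS_SYNC) != 0)
        return -1;
    return 0;
}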



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] BBU Cache vs. spindles

2010-10-28 Thread James Mansion

Tom Lane wrote:

The other and probably worse problem is that there's no application
control over how soon changes to mmap'd pages get to disk.  An msync
will flush them out, but the kernel is free to write dirty pages sooner.
So if they're depending for consistency on writes not happening until
msync, it's broken by design.  (This is one of the big reasons we don't
use mmap'd space for Postgres disk buffers.)
  
Well, I agree that it sucks for the reason you give - but you use write and that's *exactly* the same, in terms of when it gets written, as when you update a byte on an mmap'd page.


And you're quite happy to use write.

The only difference is that it's a lot more explicit where the point of 'maybe it's written and maybe it isn't' occurs.

There need be no real difference in the architecture for one over the 
other: there does seem to be
evidence that write and read can have better forward-read and 
write-behind behaviour, because
read/write does allow you to initiate an IO with a hint to a size that 
exceeds a hardware page.
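
A small illustrative contrast (names and sizes invented, error handling omitted): a single large write() describes the whole IO to the kernel up front, whereas a dirtied mapping is only discovered page by page:

/* Illustrative only.  A 1 MB pwrite() hands the kernel the full extent in one
 * call, so write-behind can be scheduled for the whole range; dirtying an
 * mmap'd region is only visible a page at a time until msync. */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

enum { CHUNK = 1 << 20 };                    /* far larger than a hardware page */

void flush_via_write(int fd, const char *buf, off_t off)
{
    (void) pwrite(fd, buf, CHUNK, off);      /* one syscall describes the whole 1 MB IO */
}

void flush_via_mmap(int fd, const char *buf, off_t off)   /* off assumed page-aligned */
{
    char *map = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
    if (map == MAP_FAILED)
        return;
    memcpy(map, buf, CHUNK);                 /* dirties 256 x 4 kB pages, one fault at a time */
    msync(map, CHUNK, MS_ASYNC);             /* only now is the kernel told about the full range */
    munmap(map, CHUNK);
}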


And yes, after getting into the details while starting to port TC to Windows, I decided to bin it.  Especially handy that SQLite3 has WAL now.  (And one last dig - TC didn't even have a checksum that would let you tell when it had been broken: but it might all be fixed now, of course - I don't have time to check.)

James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] BBU Cache vs. spindles

2010-10-24 Thread James Mansion

Kevin Grittner wrote:

On what do you base that assumption?  I assume that we send a full
8K to the OS cache, and the file system writes disk sectors
according to its own algorithm.  With either platters or BBU cache,
the data is persisted on fsync; why do you see a risk with one but
not the other?
  
Surely 'the data is persisted sometime after our write and before the fsync returns, but may be written:
- in small chunks
- out of order
- in an unpredictable way'

When I looked at the internals of TokyoCabinet for example, the design was flawed but would be 'fairly robust' so long as mmap'd pages that were dirtied did not get persisted until msync, and were then persisted atomically.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] mysql to postgresql, performance questions

2010-03-25 Thread James Mansion

Hannu Krosing wrote:

Pulling the plug should not corrupt a postgreSQL database, unless it was
using disks which lie about write caching.
  
Didn't we recently put the old wives' 'the disks lied' tale to bed in favour of actually admitting that some well-known filesystems and software raid systems have had trouble with their write barriers?



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Dell PERC H700/H800

2010-02-11 Thread James Mansion

Matthew Wakeling wrote:


Just a heads up - apparently the more recent Dell RAID controllers 
will no longer recognise hard discs that weren't sold through Dell.


http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/ 



As one of the comments points out, that kind of makes them no longer 
SATA or SAS compatible, and they shouldn't be allowed to use those 
acronyms any more.


Matthew

I think that's potentially FUD.  It's all about 'Dell qualified drives'.  I can't see anything that suggests that Dell will OEM drives and somehow tag them so that the drive must have come from them.  Of course they are big enough that they could have special BIOS I guess, but I read it that the drive types (and presumably revisions thereof) had to be recognised by the controller from a list, which presumably can be reflashed, which is not quite saying that if some WD enterprise drive model is 'qualified' then you have to buy it from Dell.


Do you have any further detail?


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Benchmark shows very slow bulk delete

2010-01-27 Thread James Mansion

Ivan Voras wrote:
I wish that, when people got the idea to run a simplistic benchmark 
like this, they would at least have the common sense to put the 
database on a RAM drive to avoid problems with different cylinder 
speeds of rotational media and fragmentation from multiple runs.

Huh?

It's tough to benchmark anything involving rotational drives :)
But - how the database organises its IO to maximise the available bandwidth, limit avoidable seeks, and limit avoidable flushes is absolutely key to realistic performance, especially on modest everyday hardware.  Not everyone has a usage that justifies 'enterprise' kit - but plenty of people can benefit from something a step up from SQLite.

If you just want to benchmark query processor efficiency then that's one scenario where taking physical IO out of the picture might be justified, but I don't see a good reason to suggest that it is 'common sense' to do so for all testing, and while the hardware involved is pretty low end, it's still a valid data point.



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Scalability in postgres

2009-06-04 Thread James Mansion

Kevin Grittner wrote:

Sure, but the architecture of those products is based around all the
work being done by "engines" which try to establish affinity to
different CPUs, and loop through the various tasks to be done.  You
don't get a context switch storm because you normally have the number
of engines set at or below the number of CPUs.  The down side is that
they spend a lot of time spinning around queue access to see if
anything has become available to do -- which causes them not to play
nice with other processes on the same box.
  
This is just misleading at best.  I'm sorry, but (in particular) UNIX systems have routinely managed large numbers of runnable processes where the run queue lengths are long without such an issue.  This is not an issue with the number of runnable threads, but with the way that they wait and what they do.

The context switch rate reported does not indicate processes using their timeslices productively, unless the load is from a continuous stream of trivial RPCs, and that doesn't stack up with the good performance and then problematic load that the OP reported.



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Scalability in postgres

2009-06-03 Thread James Mansion

Greg Smith wrote:
3500 active connections across them.  That doesn't work, and what 
happens
is exactly the sort of context switch storm you're showing data for. 
Think about it for a minute:  how many of those can really be doing 
work at any time?  32, that's how many.  Now, you need some multiple 
of the number of cores to try to make sure everybody is always busy, 
but that multiple should be closer to 10X the number of cores rather 
than 100X. 

That's surely overly simplistic.  There is inherently nothing problematic about having a lot of compute processes waiting for their timeslice, nor of having IO- or semaphore-blocked processes waiting, and it doesn't cause a context switch storm - this is a problem with postgres scalability, not (inherently) lots of connections.  I'm sure most of us evaluating Postgres from a background in Sybase or SQLServer would regard 5000 connections as no big deal.

This has the sniff of a badly contended spin-and-yield, doesn't it?

I'd guess that if the yield were a sleep for a couple of milliseconds then the lock holder would run and free everything up.
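
Something like this is what I mean - a sketch only, not the actual s_lock code: spin briefly, then back off with a real sleep so the holder gets scheduled:

/* Illustrative spin-then-back-off lock (initialise with ATOMIC_FLAG_INIT).
 * The point is the usleep: thousands of waiters doing sched_yield just get
 * rescheduled immediately, whereas a short sleep lets the holder finish. */
#include <sched.h>
#include <stdatomic.h>
#include <unistd.h>

typedef struct { atomic_flag locked; } spinlock;

static void spin_lock_with_backoff(spinlock *l)
{
    int tries = 0;
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        if (++tries < 100)
            sched_yield();       /* classic spin-and-yield: cheap, but storms under contention */
        else
            usleep(2000);        /* ~2 ms sleep lets the holder run and free everything up */
    }
}

static void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}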


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] performance for high-volume log insertion

2009-04-22 Thread James Mansion

Stephen Frost wrote:

You're re-hashing things I've already said.  The big win is batching the
inserts, however that's done, into fewer transactions.  Sure, multi-row
inserts could be used to do that, but so could dropping begin/commits in
right now which probably takes even less effort.
  
Well, I think you are seriously underestimating the cost of the round-trip compared to all the other effects (possibly bar the commits).  When I tested the union insert technique on SQLServer and Sybase I got measurable improvements going from 100-row statements to 200-row statements, though I suspect in that case the per-statement overheads are quite high.  I expected improvements from 10 to 20 row batches, but it carried on getting better for a long time after that.  The Sybase parser runs out of workspace first.



No, as was pointed out previously already, you really just need 2.  A
  
And I'm disagreeing with that.  Single row is a given, but I think 
you'll find it pays to have one
round trip if at all possible and invoking multiple prepared statements 
can work against this.



see if there's really much of a performance difference between a
50-insert prepared statement, and 50 1-insert prepared statements.  If
they're both done in larger transactions, I don't know that there's
really alot of performance difference.
  
I think you'll be surprised, but the only way is to test it.  And also the simple 50-row single insert as text.  See if you can measure the difference between that and the prepared statement.

storage overhead?  indexing overhead?  We're talking about prepared
statements here, what additional storage requirement do you think those
would impose?  What additional indexing overhead?  I don't believe we
actually do anything differently between prepared statements and
multi-row inserts that would change either of those.
  
That's my point.  You will brickwall on the actual database operations 
for execution
early enough that the efficiency difference between parse-and-execute 
and prepared
statements will be hard to measure - at least if you have multi-row 
statements.


But this really needs testing and timing.


Ah, latency is a reasonable thing to bring up.  Of course, if you want
to talk about latency then you get to consider that multi-insert SQL
will inherently have larger packet sizes which could cause them to be
delayed in some QoS arrangements.
  
No, I mean latency from round trips from the client to the server process.  I don't know why you think I'd mean that.

As I said, most of this is a re-hash of things already said.  The
low-hanging fruit here is doing multiple inserts inside of a
transaction, rather than 1 insert per transaction.  Regardless of how
that's done, it's going to give the best bang-for-buck.  It will
complicate the client code some, regardless of how it's implemented, so
that failures are handled gracefully (if that's something you care about
anyway), but as there exists some queueing mechanisms in rsyslog
already, hopefully it won't be too bad.
  

I think you have largely missed the point. There are two things here:
1) how many rows per commit
2) how many rows per logical RPC (ie round trip) between the client
  and server processes

We are agreed that the first is a Very Big Deal, but you seem resistant to
the idea that the second of these is a big deal once you've dealt with 
the former.


My experience has been that it's much more important than any benefits of preparing statements etc, particularly if the use of a prepared statement can make it harder to do multi-row RPCs because the protocol doesn't allow pipelining (at least without things getting very hairy).

Clearly 'copy' is your friend for this too, at least potentially (even if it means streaming to a staging table).
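
For what it's worth, a minimal libpq sketch of the copy-to-a-staging-table route (table, columns and helper name are invented for illustration, error handling abbreviated) - one round trip and one transaction carry the whole batch:

/* Illustrative only: stream a batch of pre-escaped "ts\tmsg\n" lines into a
 * staging table with COPY, so one round trip covers the whole batch. */
#include <libpq-fe.h>
#include <string.h>

int copy_batch(PGconn *conn, const char *const *lines, int nlines)
{
    PGresult *res = PQexec(conn, "COPY log_staging (ts, msg) FROM STDIN");
    if (PQresultStatus(res) != PGRES_COPY_IN) { PQclear(res); return -1; }
    PQclear(res);

    for (int i = 0; i < nlines; i++)
        if (PQputCopyData(conn, lines[i], (int) strlen(lines[i])) != 1)
            return -1;

    if (PQputCopyEnd(conn, NULL) != 1)        /* finish the COPY stream */
        return -1;
    res = PQgetResult(conn);                  /* result of the COPY command itself */
    int ok = (PQresultStatus(res) == PGRES_COMMAND_OK) ? 0 : -1;
    PQclear(res);
    return ok;
}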


James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] performance for high-volume log insertion

2009-04-21 Thread James Mansion

Stephen Frost wrote:

apart again.  That's where the performance is going to be improved by
going that route, not so much in eliminating the planning.
  
Fine.  But like I said, I'd suggest measuring the fractional improvement for this when sending multi-row inserts before writing something complex.  I think the big win will be doing multi-row inserts at all.  If you are going to prepare then you'll need a collection of different prepared statements for different batch sizes (say 1, 2, 3, 4, 5, 10, 20, 50) and things will get complicated.  A multi-row insert with unions and dynamic SQL is actually rather universal.

Personally I'd implement that first (and it should be easy to do across multiple dbms types) and then return to it to have a more complex client side with prepared statements etc if (and only if) necessary AND the performance improvement were measurably worthwhile, given the indexing and storage overheads.

There is no point optimising away the CPU of the simple parse if you are just going to get hit with a lot of latency from round trips, and forming a generic multi-insert SQL string is much, much easier to get working as a first step.  Server CPU isn't a bottleneck all that often - and with something as simple as this you'll hit IO performance bottlenecks rather easily.

James



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] performance for high-volume log insertion

2009-04-21 Thread James Mansion

da...@lang.hm wrote:

2. insert into table values (),(),(),()


Using this structure would be more database agnostic, but won't perform
as well as the COPY options I don't believe.  It might be interesting to
do a large "insert into table values (),(),()" as a prepared statement,
but then you'd have to have different sizes for each different number of
items you want inserted.


on the other hand, when you have a full queue (lots of stuff to 
insert) is when you need the performance the most. if it's enough of a 
win on the database side, it could be worth more effort on the 
applicaiton side.

Are you sure preparing a simple insert is really worthwhile?

I'd check if I were you.  It shouldn't take long to plan.

Note that this structure (above) is handy but not universal.

You might want to try:

insert into table
select (...)
union
select (...)
union
select (...)
...

as well, since it's more universal.  Works on Sybase and SQLServer for example (and very quickly too - much more so than a TSQL batch with lots of inserts or execs of stored procs).


James



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Raid 10 chunksize

2009-04-02 Thread James Mansion

Greg Smith wrote:
OK, that's clearly cached writes where the drive is lying about fsync. 
The claim is that since my drive supports both the flush calls, I just 
need to turn on barrier support, right?


That's a big pointy finger you are aiming at that drive - are you sure 
it was sent the flush instruction?  Clearly *something* isn't right.


This is basically how this always works for me:  somebody claims 
barriers and/or SATA disks work now, no really this time.  I test, 
they give answers that aren't possible if fsync were working properly, 
I conclude turning off the write cache is just as necessary as it 
always was.  If you can suggest something wrong with how I'm testing 
here, I'd love to hear about it.  I'd like to believe you but I can't 
seem to produce any evidence that supports you claims here.

Try similar tests with Solaris and Vista?

(Might have to give the whole disk to ZFS with Solaris to give it confidence to enable write cache, which might not be easy with a laptop boot drive: XP and Vista should show the toggle on the drive)


James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [JDBC] [PERFORM] Query much slower when run from postgres function

2009-03-09 Thread James Mansion




The driver will use unnamed statements for all statements until it 
sees the same statement N times where N is 5 I believe, after that it 
uses a named statement.



Shame there's no syntax for it to pass a table of the parameters to the server when it creates the named statement, as planner hints.


James



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD performance

2009-01-26 Thread James Mansion

Craig Ringer wrote:

These devices would be interesting for a few uses, IMO. One is temp
table space and sort space in Pg. Another is scratch space for apps
(like Photoshop) that do their own VM management. There's also potential
  
Surely temp tables and sort space aren't subject to fsync and won't gain that much, since they should stay in the OS cache?  The device will surely help seek- or sync-bound tasks.


Doesn't that make it a good candidate for WAL and hot tables?

James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] understanding postgres issues/bottlenecks

2009-01-11 Thread James Mansion

Ron wrote:
I think the idea is that with SSDs or a RAID with a battery backed 
cache you
can leave fsync on and not have any significant performance hit since 
the seek
times are very fast for SSD. They have limited bandwidth but 
bandwidth to the

WAL is rarely an issue -- just latency.
Yes, Greg understands what I meant here.  In the case of SSDs, the 
performance hit of fsync = on is essentially zero.  In the case of 
battery backed RAM caches for RAID arrays, the efficacy is dependent 
on how the size of the cache compares with the working set of the disk 
access pattern.
Out of interest, if we take a scenario where the working set of updates exceeds the size of the RAID card cache, has anyone tested the relative performance of using the battery-backed RAID on WAL only and non-cached access to other drives?

And perhaps the similar scenario with (hot) indices and WAL on a battery-backed device and the data on uncached devices?

It seems to me that if you're going to thrash the cache from data updates (presumably courtesy of full-page-write), then you might be better to partition the cache - and a thrashed cache can be hardly any better than no cache (so why have one?).


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Need help with 8.4 Performance Testing

2008-12-11 Thread James Mansion

Scott Marlowe wrote:

involves tiny bits of data scattered throughout the database.  Our
current database is about 20-25 Gig, which means it's quickly reaching
the point where it will not fit in our 32G of ram, and it's likely to
grow too big for 64Gig before a year or two is out.
  

...

I wonder how many hard drives it would take to be CPU bound on random
access patterns?  About 40 to 60?  And probably 15k / SAS drives to
  
Well, it's not a very big database and you're seek-bound - so what's wrong with the latest generation flash drives?  They're perfect for what you want to do it seems, and you can probably get what you need using the new ARC cache on flash stuff in ZFS.



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-12 Thread James Mansion

Scott Carey wrote:
Consumer drives will often read-ahead much more than server drives 
optimized for i/o per second.

...
The Linux readahead setting is _definitely_ in the kernel, definitely 
uses and fills the page cache, and from what I can gather, simply 
issues extra I/O's to the hardware beyond the last one requested by an 
app in certain situations.  It does not make your I/O request larger, 
it just queues an extra I/O following your request.
So ... fiddling with settings in Linux is going to force read-ahead, but the read-ahead data will hit the controller cache and the system buffers.

And the drives use their caches for cylinder caching implicitly (maybe the SATA drives appear to preread more because the storage density per cylinder is higher?).


But is there any way for an OS or application to (portably) ask SATA, 
SAS or SCSI drives to read ahead more (or less) than their default and 
NOT return the data to the controller?
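
Not an answer to the drive-level question - but at the OS level POSIX does at least offer advisory hints that enlarge kernel readahead into the page cache; a minimal sketch (the drive itself never sees these):

/* Illustrative only: advisory readahead hints to the kernel, not the drive. */
#include <fcntl.h>

void hint_sequential_scan(int fd, off_t start, off_t len)
{
    /* Tell the kernel we will read this range sequentially, so it may
     * enlarge its own readahead into the page cache... */
    posix_fadvise(fd, start, len, POSIX_FADV_SEQUENTIAL);

    /* ...or ask it to start fetching the range asynchronously right now. */
    posix_fadvise(fd, start, len, POSIX_FADV_WILLNEED);
}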


I've never heard of such a thing, but I'm no expert in the command sets 
for any of this stuff.


James



On Thu, Sep 11, 2008 at 12:54 PM, James Mansion 
<[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> 
wrote:


Greg Smith wrote:

The point I was trying to make there is that even under
impossibly optimal circumstances, you'd be hard pressed to
blow out the disk's read cache with seek-dominated data even
if you read a lot at each seek point.  That idea didn't make
it from my head into writing very well though.

Isn't there a bigger danger in blowing out the cache on the
controller and causing premature pageout of its dirty pages?

If you could get the readahead to work on the drive and not return
data to the controller, that might be dandy, but I'm sceptical.

James



-- 
Sent via pgsql-performance mailing list

(pgsql-performance@postgresql.org
<mailto:pgsql-performance@postgresql.org>)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance





--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-11 Thread James Mansion

Greg Smith wrote:
The point I was trying to make there is that even under impossibly 
optimal circumstances, you'd be hard pressed to blow out the disk's 
read cache with seek-dominated data even if you read a lot at each 
seek point.  That idea didn't make it from my head into writing very 
well though.


Isn't there a bigger danger in blowing out the cache on the controller 
and causing premature pageout of its dirty pages?


If you could get the readahead to work on the drive and not return data 
to the controller, that might be dandy, but I'm sceptical.


James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Effects of setting linux block device readahead size

2008-09-10 Thread James Mansion

Greg Smith wrote:

Average seek time: 4ms
Seeks/second: 250
Data read/seek: 1MB  (read-ahead number goes here)
Total read bandwidth: 250MB/s

Most spinning disks now are nearer to 100MB/s streaming.  You've talked 
yourself into twice that, random access!


James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] select on 22 GB table causes "An I/O error occured while sending to the backend." exception

2008-08-28 Thread James Mansion

[EMAIL PROTECTED] wrote:
for example if you have a process that uses 1G of ram (say firefox) 
and it needs to start a new process (say acroread to handle a pdf 
file), what it does is it forks the firefox process (each of which 
have 1G of ram allocated), and then does an exec of the acroread 
process (releasing the 1G of ram previously held by that copy of the 
firefox process)


Indeed, which is why we have vfork.  And, OK, vfork is busted if you 
have a threaded environment, so we have posix_spawn and posix_spawnp.
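
A minimal sketch of the posix_spawn route, using the acroread example from the quoted mail (argument handling simplified):

/* Illustrative only: launch a child without notionally duplicating (or
 * reserving swap for) the parent's address space the way fork()+exec() does. */
#include <spawn.h>
#include <stddef.h>
#include <sys/wait.h>

extern char **environ;

int view_pdf(const char *path)
{
    pid_t pid;
    char *argv[] = { "acroread", (char *) path, NULL };

    int rc = posix_spawnp(&pid, "acroread", NULL, NULL, argv, environ);
    if (rc != 0)
        return -1;                 /* spawn failed; errno-style code in rc */

    int status;
    waitpid(pid, &status, 0);      /* or carry on and reap the child later */
    return status;
}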


It is also worth noting that the copy isn't really a full copy on any decent modern UNIX - it is a reservation against the total swap space available.  Most pages will be happily shared copy-on-write and never fully copied to the child before the exec.


I can't see how an OS can lie to processes about memory being allocated 
to them and not be ridiculed as a toy, but there you go.  I don't think 
Linux is the only perpetrator - doesn't AIX do this too?


The 'best strategy' for the OOM killer is not to have one, and accept that you need some swap space available (it doesn't have to be fast since it won't actually be touched) to help out when fork/exec happens in big process images.


James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] insert/update tps slow with indices on table > 1M rows

2008-06-04 Thread James Mansion

andrew klassen wrote:

I'll try adding more threads to update the table as you suggest.

You could try materially increasing the update batch size too.  As an exercise you could see what the performance of COPY is by backing out the data and reloading it from a suitable file.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] insert/update tps slow with indices on table > 1M rows

2008-06-04 Thread James Mansion

Matthew Wakeling wrote:
If you're running a "work queue" architecture, that probably means you 
only have one thread doing all the updates/inserts? It might be worth 
going multi-threaded, and issuing inserts and updates through more 
than one connection. Postgres is designed pretty well to scale 
performance by the number of simultaneous connections.
That would explain a disappointing upper limit on insert rate, but not 
any sort of cliff for the rate.  Nor, really, any material slowdown, if 
the single thread implies that we're stuck on round trip latency as a 
material limiting factor.


James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread James Mansion

Alvaro Herrera wrote:

Hint bits are used to mark tuples as created and/or deleted by
transactions that are know committed or aborted.  To determine the
visibility of a tuple without such bits set, you need to consult pg_clog
and possibly pg_subtrans, so it is an expensive check.  On the other
  
So, how come there is this outstanding work to do, which will inevitably 
be done, and it
hasn't been done until it is 'just too late' to avoid getting in the way 
of the query?


The OP didn't suggest that he had just loaded the data.

Also - is it the case that this only affects the case where updated pages were spilled during the transaction that changed them?  ie, if we commit a transaction and there are changed rows still in the cache since their pages are not evicted yet, are the hint bits set immediately so that page is written just once?  Seems this would be common in most OLTP systems.

Heikki points out that the list might get big and need to be abandoned, but then you fall back to scheduling a clog scan that can apply the bits, which does what you have now, though hopefully in a way that fills slack disk IO rather than waiting for the read.

Matthew says: 'it would be a list of changes since the last checkpoint' but I don't see why you can't start writing hints to in-memory pages as soon as the transaction ends.  You might fall behind, but I doubt it with modern CPU speeds.

I can't see why Pavan's suggestion to try to update as many of the bits as possible when a dirty page is evicted would be contentious.

I do think this is something of interest to users, not just developers, since it may influence the way updates are processed where it is reasonable to do so in 'bite-sized chunks' as a multipart workflow.



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] RAID controllers for Postgresql on large setups

2008-05-13 Thread James Mansion

PFC wrote:
PCI limits you to 133 MB/s (theoretical), actual speed being 
around 100-110 MB/s.
Many servers do have more than one bus.  You have to process that data too so it's not going to be as much of a limit as you are suggesting.  It may be possible to stream a compressed data file to the server and copy in from that after decompression, which will free LAN bandwidth.  Or even if you RPC blocks of compressed data and decompress in the proc and insert right there.


On your current setup with 15K drives if you need 1 fsync per 
INSERT you won't do more than 250 per second, which is very limiting... 
Well, that's 250 physical syncs.  But if you have multiple insert streams (for group commit), or can batch the rows in each insert or copy, it's not necessarily as much of a problem as you seem to be implying.  Particularly if you are doing the holding table trick.


James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Background writer underemphasized ...

2008-04-20 Thread James Mansion

Greg Smith wrote:
If you write a giant block of writes, those tend to be sorted by the 
OS and possibly the controller to reduce total seeks.  That's a pretty 
efficient write and it can clear relatively fast.


But if you're been trickling writes in an unstructured form and in low 
volume, there can be a stack of them that aren't sorted well blocking 
the queue from clearing.  With a series of small writes, it's not that 
difficult to end up in a situation where a controller cache is filled 
with writes containing a larger seek component than you'd have gotten 
had you written in larger blocks that took advantage of more OS-level 
elevator sorting.  There's actually a pending patch to try and improve 
this situation in regards to checkpoint writes in the queue.


Seeks are so slow compared to more sequential writes that you really 
can end up in the counterintuitive situation that you finish faster by 
avoiding early writes, even in cases when the disk is the bottleneck.

I'm sorry but I am somewhat unconvinced by this.

I accept that by early submission the disk subsystem may end up doing 
more seeks and more writes in total, but when the dam breaks at the 
start of the checkpoint, how can it help to have _more_ data write 
volume and _more_ implied seeks offered up at that point?


Are you suggesting that the disk subsystem has already decided on its 
strategy for a set of seeks and writes and will not insert new 
instructions into an existing elevator plan until it is completed and it 
looks at the new requests? This sounds a bit tenuous at best - almost to 
the point of being a bug. Do you believe this is universal?


James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Background writer underemphasized ...

2008-04-19 Thread James Mansion

Greg Smith wrote:
Using the background writer more assures that the cache on the 
controller is going to be written to aggressively, so it may be 
somewhat filled already come checkpoint time.  If you leave the writer 
off, when the checkpoint comes you're much more likely to have the 
full 2GB available to absorb a large block of writes.
But isn't it the case that while using background writer might result in *slightly* more data to write (since data that is updated several times might actually be sent several times), the total amount of data in both cases is much the same?  And if the buffer backed up in the BGW case, wouldn't it also back up (more?) if the writes are deferred?  And in fact by sending earlier, the real bottleneck (the disks) could have been getting on with it and starting their IO earlier?


Can you explain your reasoning a bit more?

James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] full_page_write and also compressed logging

2008-04-18 Thread James Mansion
Has there ever been any analysis regarding the redundant write overhead 
of full page writes?


I'm wondering if one could regard an 8k page as being 64 × 128-byte paragraphs or 32 × 256-byte paragraphs, each represented by a bit in a word.  And, when a page is dirtied by changes, some record is kept of this based on the paragraphs affected.  Then you could just incrementally dump the pre-image of newly dirtied paragraphs as you go, and the cost in terms of dirtied pages would be much lower for the case of scattered updates.
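
A sketch of the bookkeeping I have in mind (purely illustrative; nothing like this exists in the server): one 64-bit word per 8k page, one bit per 128-byte paragraph:

/* Illustrative only: mark and test dirty 128-byte paragraphs within a page. */
#include <stdint.h>

#define PAGE_SIZE 8192
#define PARA_SIZE 128                       /* 64 paragraphs -> one 64-bit word */

typedef struct {
    uint64_t dirty_paras;                   /* bit i set => paragraph i already logged */
} page_dirty_map;

/* Mark the paragraphs covered by a byte-range update within the page. */
static void mark_dirty(page_dirty_map *m, unsigned offset, unsigned len)
{
    unsigned first = offset / PARA_SIZE;
    unsigned last  = (offset + len - 1) / PARA_SIZE;
    for (unsigned i = first; i <= last && i < PAGE_SIZE / PARA_SIZE; i++)
        m->dirty_paras |= (uint64_t) 1 << i;
}

/* A paragraph's pre-image only needs dumping the first time its bit goes on. */
static int needs_preimage(const page_dirty_map *m, unsigned para)
{
    return (m->dirty_paras & ((uint64_t) 1 << para)) == 0;
}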


(I was also wondering about just doing preimages based on changed byte ranges but the approach above is probably faster, doesn't dump the same range twice, and may fit the existing flow more directly.)

Also - has any attempt been made to push log writes through a cheap compressor, such as zlib on its lowest setting or one like Jeff Bonwick's for ZFS
(http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/compress.c)?

Would work well for largely textual tables (and I suspect a lot of 
integer data too).
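
For illustration, the zlib end of that idea is tiny (buffer names invented; dst_cap should be at least compressBound(src_len)):

/* Illustrative only: run a log-sized block through zlib at its cheapest
 * setting before writing it.  Returns compressed length, or 0 meaning
 * "store it uncompressed". */
#include <zlib.h>

static unsigned long compress_log_block(const unsigned char *src, unsigned long src_len,
                                        unsigned char *dst, unsigned long dst_cap)
{
    unsigned long dst_len = dst_cap;
    /* Level 1 = fastest; the hope is that largely textual (and small-integer)
     * log data still shrinks usefully at this setting. */
    if (compress2(dst, &dst_len, src, src_len, 1) != Z_OK || dst_len >= src_len)
        return 0;
    return dst_len;
}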


James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance Implications of Using Exceptions

2008-04-06 Thread James Mansion

Robins Tharakan wrote:


I think James was talking about Sybase. Postgresql on the other hand 
has a slightly better way to do this.


SELECT ... FOR UPDATE allows you to lock a given row (based on the 
SELECT ... WHERE clause) and update it... without worrying about a 
concurrent modification. Of course, if the SELECT ... WHERE didn't 
bring up any rows, you would need to do an INSERT anyway.

How does that help?

If the matching row doesn't exist at that point - what is there to get 
locked?


The problem is that you need to effectively assert a lock on the primary 
key so that you can update
the row (if it exists) or insert a row with that key (if it doesn't) 
without checking and then inserting and
finding that some other guy you were racing performed the insert and you 
get a duplicate key error.


How does Postgresql protect against this?

James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] POSIX file updates

2008-04-02 Thread James Mansion

Greg Smith wrote:
You turn on direct I/O differently under Solaris then everywhere else, 
and nobody has bothered to write the patch (trivial) and OS-specific 
code to turn it on only when appropriate (slightly tricker) to handle 
this case. There's not a lot of pressure on PostgreSQL to handle this 
case correctly when Solaris admins are used to doing direct I/O tricks 
on filesystems already, so they don't complain about it much.
I'm not sure that this will survive use of PostgreSQL on Solaris with more users on Indiana though.  Which I'm hoping will happen.
RPM of the drive.  Seen it on UFS and ZFS, both seem to do the right 
thing here.
But ZFS *is* smart enough to manage the cache, albeit sometimes with unexpected consequences as with the 2530 here: http://milek.blogspot.com/.
You seem to feel that there is an alternative here that PostgreSQL 
could take but doesn't.  There is not.  You either wait until writes 
hit disk, which by physical limitations only happens at RPM speed and 
therefore is too slow to commit for many cases, or you cache in the 
most reliable memory you've got and hope for the best.  No software 
approach can change any of that.
Indeed I do, but the issue I have is that some popular operating systems (let's try to avoid the flame war) fail to expose control of disk caches, and so the code assumes that the onus is on the admin and the documentation rightly says so.  But this is as much a failure of the POSIX API and operating systems to expose something that's necessary, and it seems to me rather valuable that the application be able to work with such facilities as they become available.  Exposing the flush cache mechanisms isn't dangerous and can improve performance for non-dbms users of the same drives.

I think manipulation of this stuff is a major concern for a DBMS that might be used by amateur SAs, and if at all possible it should work out of the box on common hardware.  So far as I can tell, SQLServerExpress makes a pretty good attempt at it, for example.  It might be enough for initdb to whinge and fail if it thinks the disks are behaving insanely unless the would-be DBA sets a 'my_disks_really_are_that_fast' flag in the config.  At the moment anyone can apt-get themselves a DBMS which may become a liability.

At the moment:
- casual use is likely to be unreliable
- uncontrolled deferred IO can result in almost DOS-like checkpoints

These affect other systems than PostgreSQL too - but would be avoidable if the drive cache flush was better exposed and the IO was staged to use it.  There's no reason to block on anything but the final IO in a WAL commit after all, and with the deferred commit feature (which I really like for workflow engines) intermediate WAL writes of configured chunk size could let the WAL drives get on with it.  Admittedly I'm assuming a non-blocking write-through - direct IO from a background thread (process if you must) or aio.

There are plenty of cases where the so-called "lying" drives 
themselves are completely stupid on their own regardless of operating 
system.
With modern NCQ-capable drive firmware?  Or just with older PATA stuff?  There's an awful lot of FUD out there about SCSI vs IDE still.

James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] POSIX file updates

2008-04-02 Thread James Mansion

Andreas Kostyrka wrote:

takes over. The thing you worry about is if all data has made it to the
replication servers, not if some data might get lost in the hardware
cache of a controller. (Actually, talk to your local computer forensics
guru, there are a number of way to keep the current to electronics while
moving them.)
  
But it doesn't, unless you use a synchronous rep at block level - which is why we have SRDF.  Log-based reps are async and will lose committed transactions.  Even if you failed over, it's still extraordinarily useful to be able to see what the primary tried to do - it's the only place the e-comm transactions are stored, and the customer will still expect delivery.


I'm well aware that there are battery-backed caches that can be detached from controllers and moved.  But you'd better make darn sure you move all the drives and plug them in in exactly the right order and make sure they all spin up OK with the replaced cache, because it's expecting them to be exactly as they were last time they were on the bus.



3.) a controller cache is an issue if you have a filesystem in your data
path or not. If you do raw io, and the stupid hardware do cache writes,
well it's about as stupid as it would be if it would have cached
filesystem writes.
  
Only if the OS doesn't know how to tell the cache to flush.  SATA and SAS both have that facility.  But the semantics of *sync don't seem to be defined to require it being exercised, at least as far as many operating systems implement it.

You would think hard drives could have enough capacitor store to dump cache to flash or the drive - if only to a special dump zone near where the heads park.  They are spinning already after all.

On small systems in SMEs it's inevitable that large drives will be shared with filesystem use, even if the database files are on their own slice.  If you can allow the drive to run with writeback cache turned on, then the users will be a lot happier, even if dbms commits force *all* that cache to flush to the platters.



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance Implications of Using Exceptions

2008-04-02 Thread James Mansion

Stephen Denne wrote:

A third option is to update, if not found, insert.

  

I find myself having to do this in Sybase, but it sucks because there's
a race - if there's no row updated then there's no lock and you race
another thread doing the same thing. So you grab a row lock on a
sacrificial row used as a mutex, or just a table lock. Or you just
accept that sometimes you have to detect the insert fail and retry the
whole transaction. Which is sucky however you look at it.

I think the 'update or insert' or 'merge' extensions make a degree
of sense. At least in psql one can use the lightweight lock manager.
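
A sketch of that detect-and-retry approach in libpq, keyed off SQLSTATE 23505 (unique_violation); the table, columns and function name are invented, and it assumes autocommit (inside an explicit transaction the failed INSERT would abort it unless wrapped in a savepoint):

/* Illustrative only: try UPDATE, fall back to INSERT, and if we lose the
 * insert race to another session, loop and UPDATE again. */
#include <libpq-fe.h>
#include <string.h>

int upsert_counter(PGconn *conn, const char *key)
{
    const char *params[1] = { key };

    for (;;) {
        PGresult *res = PQexecParams(conn,
            "UPDATE counters SET n = n + 1 WHERE k = $1",
            1, NULL, params, NULL, NULL, 0);
        int updated = (PQresultStatus(res) == PGRES_COMMAND_OK)
                   && strcmp(PQcmdTuples(res), "0") != 0;
        PQclear(res);
        if (updated)
            return 0;                                  /* row existed: done */

        res = PQexecParams(conn,
            "INSERT INTO counters (k, n) VALUES ($1, 1)",
            1, NULL, params, NULL, NULL, 0);
        if (PQresultStatus(res) == PGRES_COMMAND_OK) { PQclear(res); return 0; }

        /* Someone else won the race to insert: loop and update instead. */
        const char *state = PQresultErrorField(res, PG_DIAG_SQLSTATE);
        int dup = (state && strcmp(state, "23505") == 0);
        PQclear(res);
        if (!dup)
            return -1;                                 /* some other error */
    }
}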



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] POSIX file updates

2008-04-02 Thread James Mansion

Greg Smith wrote:
"After a write() to a regular file has successfully returned, any 
successful read() from each byte position in the file that was 
modified by that write() will return the data that was written by the 
write()...a similar requirement applies to multiple write operations 
to the same file position"


Yes, but that doesn't say anything about simultaneous read and write 
from multiple threads from

the same or different processes with descriptors on the same file.

No matter, I was thinking about a case with direct unbuffered IO.  Too many years using Sybase on raw devices. :-(

Though, some of the performance studies relating to UFS directio suggest that there are indeed benefits to managing the write through rather than using the OS as a poor man's background thread to do it.  SQLServer allows config based on deadline scheduling for checkpoint completion I believe.  This seems to me a very desirable feature, but it does need more active scheduling of the write-back.



It's clear that such relaxation has benefits with some of Oracle's 
mechanisms as described.  But amusingly, PostgreSQL doesn't even 
support Solaris's direct I/O method right now unless you override the 
filesystem mounting options, so you end up needing to split it out and 
hack at that level regardless.

Indeed that's a shame.  Why doesn't it use directio?
PostgreSQL writes transactions to the WAL.  When they have reached 
disk, confirmed by a successful f[data]sync or a completed syncronous 
write, that transactions is now committed.  Eventually the impacted 
items in the buffer cache will be written as well.  At checkpoint 
time, things are reconciled such that all dirty buffers at that point 
have been written, and now f[data]sync is called on each touched file 
to make sure those changes have made it to disk.
Yes, but fsync and stable-on-disk aren't the same thing if there is a cache anywhere, are they?  Hence the fuss a while back about Apple's control of disk caches.  Solaris and Windows do it too.

Isn't allowing the OS to accumulate an arbitrary number of dirty blocks without control of the rate at which they spill to media just exposing a possibility of an IO storm when it comes to checkpoint time?  Does bgwriter attempt to control this with intermediate fsync (and push to media if available)?

It strikes me as odd that fsync_writethrough isn't the most preferred option where it is implemented.  The postgres approach of *requiring* that there be no cache below the OS is problematic, especially since the battery backup on internal array controllers is hardly the handiest solution when you find the mobo has died.

And especially when the inability to flush caches on modern SATA and SAS drives would appear to be more a failing in some operating systems than in the drives themselves.

The links I've been accumulating into my bibliography include:

http://www.h2database.com/html/advanced.html#transaction_isolation
http://lwn.net/Articles/270891/
http://article.gmane.org/gmane.linux.kernel/646040
http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html
http://brad.livejournal.com/2116715.html

And your handy document on wal tuning, of course.

James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] SSDs

2008-04-01 Thread James Mansion
Tried harder to find info on the write cycles: found some CFs that claim 2 million cycles, and found the Mtron SSDs which claim to have very advanced wear levelling and a suitably long lifetime as a result, even with an assumption that the underlying flash can do 100k writes only.

The 'consumer' MTrons are not shabby on the face of it and not too expensive, and the pro models even faster.

But ... the spec pdf shows really high performance for average access, stream read *and* write, random read ... and absolutely pants performance for random write.  Like 130/s, for .5k and 4k writes.

It's so pants it looks like a misprint and it doesn't seem to square with the review on tomshardware:
http://www.tomshardware.com/2007/11/21/mtron_ssd_32_gb/page7.html

Even there, the database IO rate does seem lower than you might hope,
and this *might* be because the random reads are very very fast and the
random writes ... aren't. Which is a shame, because that's exactly the
bit I'd hope was fast.

So, more work to do somewhere.



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] POSIX file updates

2008-03-31 Thread James Mansion

Mark Mielke wrote:
Is there anything in POSIX that seems to suggest this? :-) (i.e. why 
are you going under the assumption that the answer is yes - did you 
read something?)



Perhaps it was just this:

http://kevinclosson.wordpress.com/2007/01/18/yes-direct-io-means-concurrent-writes-oracle-doesnt-need-write-ordering/

Which of course isn't on Sun.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] POSIX file updates

2008-03-31 Thread James Mansion


I don't believe POSIX has any restriction such as you describe - or if 
it does, and I don't know about it, then most UNIX file systems (if 
not most file systems on any platform) are not POSIX compliant.


I suspect that indeed there are two different issues here in that the 
file mutex relates to updates
to the file, not passing the buffers through into the drive, which 
indeed might be delayed.


Been using direct io too much recently. :-(


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] POSIX file updates

2008-03-31 Thread James Mansion

Mark Mielke wrote:
Is there anything in POSIX that seems to suggest this? :-) (i.e. why 
are you going under the assumption that the answer is yes - did you 
read something?)



It was something somewhere on the Sun web site, relating to tuning Solaris filesystems.  Or databases.  Or ZFS. :-(

Needless to say I can't find a search string that finds it now.  I remember being surprised though, since I wasn't aware of it either.
I don't believe POSIX has any restriction such as you describe - or if 
it does, and I don't know about it, then most UNIX file systems (if 
not most file systems on any platform) are not POSIX compliant.

That, I can believe.



Linux itself, even without NCQ, might choose to reorder the writes. If 
you use ext2, the pressure to push pages out is based upon last used 
time rather than last write time. It can choose to push out pages at 
any time, and it's only every 5 seconds or so the the system task 
(bdflush?) tries to force out all dirty file system pages. NCQ 
exaggerates the situation, but I believe the issue pre-exists NCQ or 
the SCSI equivalent of years past.
Indeed there do seem to be issues with Linux and fsync.  It's one of the things I'm trying to get a handle on as well - the relationship between fsync and flushes of controller and/or disk caches.


The rest of your email relies on the premise that POSIX enforces such 
a thing, or that systems are POSIX compliant. :-)



True. I'm hoping someone (Jignesh?) will be prompted to remember.

It may have been something in a blog related to ZFS vs other filesystems, but so far I'm coming up empty in Google.  It doesn't feel like something I imagined though.

James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] POSIX file updates

2008-03-31 Thread James Mansion

(Declaration of interest: I'm researching for a publication
on OLTP system design)

I have a question about file writes, particularly on POSIX.
This arose while considering the extent to which cache memory
and command queueing on disk
drives can help improve performance.

Is it correct that POSIX requires that the updates to a single
file are serialised in the filesystem layer?

So, if we have a number of dirty pages to write back to a single
file in the database (whether a table or index) then we cannot
pass these through the POSIX filesystem layer into the TCQ/NCQ
system on the disk drive, so it can reorder them?

I have seen suggestions that on Solaris this can be relaxed.

I *assume* that PostgreSQL's lack of threads or AIO and the
single bgwriter means that PostgreSQL 8.x does not normally
attempt to make any use of such a relaxation but could do so if the
bgwriter fails to keep up and other backends initiate flushes.

Does anyone know (perhaps from other systems) whether it is
valuable to attempt to take advantage of such a relaxation
where it is available?

Does the serialisation for file update apply in the case
where the file contents have been memory-mapped and we
try an msync (or equivalent)?
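
As an illustration of the sort of relaxation I mean, a POSIX AIO sketch (invented names, minimal error handling) that queues several page writes against one file at once, so the OS and drive at least get the chance to reorder them rather than receiving strictly serialised writes:

/* Illustrative only: put several writes to one file in flight together,
 * then wait for all of them before (say) the fsync at checkpoint time. */
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define NPAGES 4
#define PAGE   8192

int write_dirty_pages(int fd, char pages[NPAGES][PAGE], const off_t offsets[NPAGES])
{
    struct aiocb cbs[NPAGES];
    const struct aiocb *wait_list[NPAGES];

    for (int i = 0; i < NPAGES; i++) {
        memset(&cbs[i], 0, sizeof cbs[i]);
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = pages[i];
        cbs[i].aio_nbytes = PAGE;
        cbs[i].aio_offset = offsets[i];
        if (aio_write(&cbs[i]) != 0)          /* all four are now in flight together */
            return -1;
        wait_list[i] = &cbs[i];
    }

    int remaining = NPAGES;
    while (remaining > 0) {
        aio_suspend(wait_list, NPAGES, NULL);
        for (int i = 0; i < NPAGES; i++) {
            if (wait_list[i] && aio_error(&cbs[i]) != EINPROGRESS) {
                if (aio_return(&cbs[i]) != PAGE)
                    return -1;                /* short or failed write */
                wait_list[i] = NULL;          /* completed: NULL entries are ignored */
                remaining--;
            }
        }
    }
    return 0;
}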



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planning a new server - help needed

2008-03-29 Thread James Mansion

PFC wrote:



Why do you claim that 'More platters also means slower seeks
and generally slower performance.'?


More platters -> more heads -> heavier head assembly -> slower 
seek time

Not sure I've seen a lot of evidence of that in drive specifications!

Gigabyte should revamp their i-RAM to use ECC RAM of a larger 
capacity... and longer lasting battery backup...
You would think a decent capacitor or rechargeable button battery would be enough to dump it to a flash memory.

No problem with flash wear then.
I wonder, how many write cycles those Flash drives can take before 
reliability becomes a problem...
Hard to get data, isn't it?  I believe it's hundreds of thousands to millions now.  Now each record in most OLTP tables is rewritten a few times unless it's stuff that can go into temp tables etc, which should be elsewhere.

Index pages clearly get rewritten often.

I suspect a mix of storage technologies will be handy for some time yet - WAL on disk, and temp tables on disk with no synchronous fsync requirement.

I think life is about to get interesting in DBMS storage.  All good for 
us users.


James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Planning a new server - help needed

2008-03-29 Thread James Mansion

Greg Smith wrote:
As for SCSI vs. SATA, I collected up the usual arguments on both sides 
at http://www.postgresqldocs.org/index.php/SCSI_vs._IDE/SATA_Disks



Why do you claim that 'More platters also means slower seeks
and generally slower performance.'?

On the face of it, it should mean that the number of track step
operations is reduced, even if the drive doesn't buffer a slice
of tracks across all platters (which would help if it did).

I'm not entirely sure why the extra platters should really count
as more moving parts since I think the platter assembly and
head assembly are both single parts in effect, albeit they will
be more massive with more platters. I'm not sure how much
extra bearing friction that will mean, but it is reasonable that
some extra energy is going to be needed.

Recent figures I've seen suggest that the increased storage
density per platter, plus the extra number of platters, means
that the streaming speed of good 7200rpm SATA drives is
very close to that of good 15000rpm SAS drives - and you
can choose which bit of the disk to use to reduce seek time and
maximise transfer rate with the oversize drives. You can
get about 100MB/s out of both technologies, streaming.

It may be worth considering an alternative approach. I suspect
that a good RAID1 or RAID1+0 is worthwhile for WAL, but
you might consider a RAID1 of a pair of SSDs for data. They
will use a lot of your budget, but the seek time is negligible so the
effective usable performance is higher than you get with
spinning disks - so you might trade a fancy controller with
battery-backed cache for straight SSD.

I haven't done this, so YMMV.  But the prices are getting
interesting for OLTP where most disks are massively
oversized. The latest Samsung and SanDisk are expensive
in the UK but the Transcend 16GB TS16GSSD25S-S  SATA
is about $300 equiv - it can do 'only' 'up to' 28MB/s write and
you wouldn't want to put WAL on one, but sustaining
15-20MB/s through random access to a real disk isn't
trivial. If average access is 10ms and you write 100MB/s
streaming, then you have to ask yourself whether you're going to do
80 or more seeks a second - at 10ms each, 80 seeks consume 0.8s of
every second, leaving only about 20MB/s of effective streaming.

James






--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Benchmark: Dell/Perc 6, 8 disk RAID 10

2008-03-18 Thread James Mansion

[EMAIL PROTECTED] wrote:


WAL is on a RAID 0 drive along with the OS

Isn't that just as unsafe as having the whole lot on RAID0?



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] temp tables

2008-03-13 Thread James Mansion
Do CREATE TEMP TABLE tables have any special treatment regarding eliding 
sync operations
or deferring creation of disk files in the case where memory pressure 
does not require a spill?


James


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] LISTEN / NOTIFY performance in 8.3

2008-02-26 Thread James Mansion

Tom Lane wrote:

Hardly --- how's that going to pass a notify name?  Also, a lot of
people want some payload data in a notify, not just a condition name;
any reimplementation that doesn't address that desire probably won't
get accepted.
  
Ah - forgot about the name.  At least there need be just one instance of 
a name record queued per worker if I'm reading the documentation right - 
it suggests that notifications can be folded, with the proviso that if 
the process generates a notification and at least one other process 
generates a notification then it will get at least (but possibly only) 
two events.  Not sure why the PID is there rather than a couple of flag 
bits.

You'll always have the danger of overflowing a shm area and needing to 
spill: is the signal-and-then-lookup in storage materially quicker than 
using the master process to route messages via pipes?  As you say, you 
have a lock contention issue, and often the total signal data volume 
outstanding for a single back end will be less than will fit in a 
kernel's pipe buffer.

The sending processes can track what signals they've generated in the 
current transaction so the master (or signal distributor) needn't get 
bombarded with signals from lots of rows within one transaction.


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] LISTEN / NOTIFY performance in 8.3

2008-02-26 Thread James Mansion

Tom Lane wrote:

read-modify-write type of operation it uses an exclusive lock, so only
one can clear its pg_listener entry at a time.  The 'waiting' ones you
are seeing are stacked up behind whichever one has the lock at the
moment.  They shouldn't be waiting for long.
  
I certainly hadn't expected that to be the implementation technique - 
isn't it simply that we need a single flag per worker process and can 
set/test-and-clear with atomic operations and then a signal to wake them 
up?
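
For what it's worth, a minimal sketch of that 'one flag per worker plus a
signal' idea - purely hypothetical, not how LISTEN/NOTIFY is actually
implemented, and with listener_slot and friends invented just for the
illustration - might look like this:

#include <signal.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct {
    atomic_flag pending;   /* set while a notification is outstanding */
    pid_t       pid;       /* backend process to wake */
} listener_slot;

/* Notifying backend: only signal on the clear -> set transition, so
 * repeated notifications before the listener wakes up fold into one. */
static void notify_listener(listener_slot *slot)
{
    if (!atomic_flag_test_and_set(&slot->pending))
        kill(slot->pid, SIGUSR1);
}

/* Listening backend: clear the flag, then go look up whatever
 * name/payload store it uses. */
static void consume_notification(listener_slot *slot)
{
    atomic_flag_clear(&slot->pending);
}

static void wakeup(int sig) { (void) sig; }

int main(void)
{
    signal(SIGUSR1, wakeup);
    listener_slot slot = { ATOMIC_FLAG_INIT, getpid() };

    notify_listener(&slot);   /* first notify: flag set, signal sent */
    notify_listener(&slot);   /* second notify folds into the first  */
    consume_notification(&slot);
    puts("woken once for two notifications");
    return 0;
}

The idea being that in shared memory you'd have one such slot per
listening backend, so contention is per slot rather than on a single
exclusive lock.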

Anyway - how hard would it be to install triggers on commit and 
rollback?  Then we could write our own mechanisms.

James


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Linux/PostgreSQL scalability issue - problem with 8 cores

2008-01-04 Thread James Mansion

Jakub Ouhrabka wrote:

How can we diagnose what is happening during the peaks?
Can you try forcing a core from a bunch of the busy processes?  (Hmm - 
does Linux have an equivalent to the useful Solaris pstacks?)


James

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PERFORM] Multi-threading friendliness

2007-12-17 Thread James Mansion

Craig James wrote:
Don't confuse thread-friendly with a threaded implementation of 
Postgres itself.  These are two separate questions.  Thread-friendly 
involves compile/link options that don't affect the Postgres source 
code at all.
Indeed.  I'm specifically not suggesting that Postgres should offer an 
API that can be called from anything except the initial thread of its 
process - just that library subsystems might want to use threads 
internally and that should be OK.  Or rather, it should be possible to 
build Postgres so that it's OK.  Even if there's a small slowdown, the 
benefit of running the full JVM or CLR might outweigh that quite easily 
*in some circumstances*.

I've also hinted that at some stage you might want to thread some parts 
of the implementation, but I'm not suggesting that would be an early 
target.  It seems to me that making it straightforward to take baby 
steps in that direction in future would be a reasonable thing to do.  As 
would being friendly to dynamically loaded C++ code.  If you create the 
framework (and we're talking the barest of scaffolding) then others can 
work to show the cost/benefit.
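
As a concrete (and entirely hypothetical) illustration of the pattern,
with do_backend_work() standing in for whatever server-facing API a
loadable module really uses: the library runs its own worker thread for
pure computation, and only the thread that entered the library ever
talks back to the backend.

#include <pthread.h>
#include <stdio.h>

/* Pure computation: no backend calls allowed in here. */
static void *background_compute(void *arg)
{
    long *result = arg;
    long sum = 0;
    for (long i = 0; i < 1000000; i++)
        sum += i;
    *result = sum;
    return NULL;
}

/* Hypothetical placeholder for a call back into the server API,
 * made only from the thread that entered the library. */
static void do_backend_work(long value)
{
    printf("main thread hands %ld back to the backend\n", value);
}

int main(void)
{
    pthread_t tid;
    long result = 0;

    pthread_create(&tid, NULL, background_compute, &result);
    pthread_join(tid, NULL);   /* main thread waits, then reports */
    do_backend_work(result);
    return 0;
}

All the build has to tolerate is linking libpthread and using
thread-safe libc entry points; nothing in the server's own code changes.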


I fail to see why this would be a controversial engineering approach.

James


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] libgcc double-free, backend won't die

2007-12-16 Thread James Mansion

Tom Lane wrote:

Yes.

1) It's of no value to us

2) On many platforms there is a nonzero performance penalty

  

I think you have your head in the ground, but that's your prerogative.  *You*
might not care, but anyone wanting to use thread-aware libraries (and I'm
*not* talking about threading in any Postgres code) will certainly value 
it if they can do so with some stability.

There's a clear benefit to being able to use such code.  I suggested a
build option but you reject it out of hand.  And in doing so, you also 
lock out the benefits that you *could* have as well, in future.  It 
seems religious, which is unfortunate.

Are you suggesting that the performance penalty, apart from the
malloc performance (which is easily dealt with), is *material*?
An extra indirection in access to errno will hurt so much?  Non-zero I can
accept, but clinging to 'non-zero' religiously isn't smart, especially 
if it's a build-time choice.

We'll clearly move to multiple cores, and the clock speed enhancements will
slow (at best).  In many cases, the number of available cores will 
exceed the number of instantaneously active connections.  Don't you want 
to be able to use all the horsepower?

Certainly on the sort of systems I work on in my day job (big derivative 
trading systems) it's the norm that the cache hit rate on Sybase is well 
over 99%, and such systems are typically CPU bound.  Parallelism 
matters, and will matter more and more in future.

So, an ability to start incrementally adding parallel operation of some 
actions (whether scanning or updating indices or pushing data to the 
peer) is valuable, as is the ability to use threaded libraries - and the 
(potential?) ability to use embedded languages and more advanced 
libraries in Postgres procs is one of the advantages of the system 
itself.  (I'd like to discount the use of a runtime in a separate 
process - the latency is a problem for row triggers and functions.)

James


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PERFORM] libgcc double-free, backend won't die

2007-12-16 Thread James Mansion

Gregory Stark wrote:

1) No Postgres function is guaranteed to be thread-safe so you better protect
   against concurrent calls to Postgres API functions. Also Postgres functions
   use longjmp which can restore the stack pointer to a value which may have
   been set earlier, possibly by another thread which wouldn't work.

  
That's a whole different thing to saying that you can't use a threaded 
subsystem under a Postgres process.


2) Some OSes have bugs (notably glibc for a specific narrow set of versions)
   and don't expect to have standard library functions called before
   pthread_init() then called again after pthread_init(). If they expect the
   system to be either "threaded" or "not threaded" then they may be surprised
   to see that state change.

  
Is there any particular reason not to ensure that any low-level 
threading support in libc is enabled right from the get-go, as a 
build-time option?  Does it do anything that's not well defined in a 
threaded process?  Signal handling and atfork (and posix_ exec) are 
typical areas I guess.  While this can potentially make malloc slower, 
Postgres already wraps malloc, so using a caching thread-aware malloc 
substitute such as nedmalloc should be no problem.

I don't see any issue with the setjmp usage - so long as only one thread 
uses any internal API, which can be checked rather easily at runtime 
with low cost in a debug build.

That just means you have to use a non-buggy version of your OS. Unfortunately
tracking down bugs in your OS to figure out what's causing them and whether
it's a particular known bug can be kind of tricky.
  
Is that really much of an issue on the current version of any major OS 
though?  It's reasonable to limit the use of a threaded library (in 
particular, the runtimes for most embeddable languages, or libraries for 
RPC runtimes, etc.) to 'modern' platforms that support threads 
effectively.  On many such platforms these will already implicitly link 
libpthread anyway.

James



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] RAID arrays and performance

2007-12-04 Thread James Mansion

Mark Mielke wrote:
PostgreSQL or the kernel should already have the hottest pages in 
memory, so the value of doing async I/O is very likely the cooler 
pages that are unique to the query. We don't know what the cooler 
pages are until we follow the tree down.


I'm assuming that at the time we start to search the index, we have some 
idea of the value or values that we are looking for.  Or, as you say, we 
are applying a function to 'all of it'.


Think of a 'between' query.  The subset of the index that can be a match 
can be bounded by the leaf pages that contain the end point(s).  
Similarly if we have a merge with a sorted intermediate set from a prior 
step then we also have bounds on the values.

I'm not convinced that your assertion that the index leaf pages must 
necessarily be processed in order is true - it depends what sort of 
operation is under way.  I am assuming that we try hard to keep interior 
index nodes and data in memory, and that having identified the subset of 
these that we want, we can immediately infer the set of leaves that are 
potentially of interest.
The difference between preload and handling async I/O in terms of 
performance is debatable. Greg reports that async I/O on Linux sucks, 
but posix_fadvise*() has substantial benefits. posix_fadvise*() is 
preload not async I/O (he also reported that async I/O on Solaris 
seems to work well). Being able to do work as the first page is 
available is a micro-optimization as far as I am concerned at this 
point (one that may not yet work on Linux), as the real benefit comes 
from utilizing all 12 disks in Matthew's case, not from guaranteeing 
that data is processed as soon as possible.


I see it as part of the same problem.  You can partition the data across 
all the disks and run queries in parallel against the partitions, or you 
can lay out the data in the RAID array, in which case the optimiser has 
very little idea how the data will map to physical data layout - so its 
best bet is to let the systems that DO know decide the access strategy.  
And those systems can only do that if you give them a lot of requests 
that CAN be reordered, so they can choose a good ordering.


Micro-optimization.

Well, you like to assert this - but why?  If a concern is the latency 
(and my experience suggests that latency is the
biggest issue in practice, not throughput per se) then overlapping 
processing while waiting for 'distant' data is
important - and we don't have any information about the physical layout 
of the data that allows us to assert that
forward access pre-read of data from a file is the right strategy for 
accessing it as fast as possible - we have to
allow the OS (and to an increasing extent the disks) to manage the 
elevator IO to best effect.  Its clear that the
speed of streaming read and write of modern disks is really high 
compared to that of random access, so anything
we can do to help the disks run in that mode is pretty worthwhile even 
if the physical streaming doesn't match
any obvious logical ordering of the OS files or logical data pages 
within them.  If you have a function to apply to
a set of data elements and the elements are independent, then requiring 
that the function is applied in an order
rather than conceptually in parallel is going to put a lot of constraint 
on how the hardware can optimise it.


Clearly a hint to preload is better than nothing.  But it seems to me 
that the worst case is that we wait for
the slowest page to load and then start processing hoping that the rest 
of the data stays in the buffer cache
and is 'instant', while AIO and evaluate-when-ready means that process 
is still bound by the slowest
data to arrive, but at that point there is little processing still to 
do, and the already-processed buffers can be
reused earlier.  In the case where there is significant pressure on the 
buffer cache, this can be significant.


Of course, a couple of decades bullying Sybase systems on Sun Enterprise 
boxes may have left me somewhat jaundiced - but Sybase can at least 
parallelise things.  Sometimes.  When it does, it's quite a big win.

In your hand waving, you have assumed that PostgreSQL B-Tree index 
might need to be replaced? :-)


Sure, if the intent is to make the system thread-hot or AIO-hot, then 
the change is potentially very invasive.  The strategy to evaluate 
queries based on parallel execution and async IO is not necessarily very 
like a strategy where you delegate to the OS buffer cache.

I'm not too bothered for the purpose of this discussion whether the way 
that Postgres currently navigates indexes is amenable to this.  This is 
bikeshed land, right?

I think it is foolish to disregard strategies that will allow 
overlapping IO and processing - and we want to
keep disks reading and writing rather than seeking.  To me that suggests 
AIO and disk-native queuing
are quite a big deal.  And parallel evaluation will be too as the number 
of cores goes up and there is
an expectation tha

Re: [PERFORM] RAID arrays and performance

2007-12-04 Thread James Mansion

Mark Mielke wrote:
At a minimum, this breaks your query into: 1) Preload all the index 
pages you will need
Isn't this fairly predictable - the planner has chosen the index, so it 
will be operating on a bounded subset?

, 2) Scan the index pages you needed
Yes, and AIO helps when you can scan them in arbitrary order, as they 
are returned.

, 3) Preload all the table page you will need
No - just request that they load.  You can do work as soon as any page 
is returned.

, 4) Scan the table pages you needed.

In the order that is most naturally returned by the disks.

But do you really need the whole index? 

I don't think I suggested that.
What if you only need parts of the index, and this plan now reads the 
whole index using async I/O "just in case" it is useful?

Where did you get that from?
index scan into a regular B-Tree scan, which is difficult to perform 
async I/O for, as you don't necessarily know which pages to read next.


Most B-trees are not so deep.  It would generally be a win to retain 
interior nodes of indices in shared memory, even if leaf pages are not 
present.  In such a case, it is quite quick to establish which leaf 
pages must be demand loaded.

I'm not suggesting that Postgres indices are structured in a way that 
would support this sort of thing now.

James


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] RAID arrays and performance

2007-12-04 Thread James Mansion

Mark Mielke wrote:
This assumes that you can know which pages to fetch ahead of time - 
which you do not except for sequential read of a single table.


Why doesn't it help to issue ahead-of-time IO requests when you are 
scanning an index?  You can read ahead
in index pages, and submit requests for data pages as soon as it is 
clear you'll want them.  Doing so can allow
the disks and OS to relax the order in which you receive them, which may 
allow you to process them while IO
continues, and it may also optimise away some seeking and settle time.  
Maybe.
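
A minimal sketch of the hinting variant (not PostgreSQL code; the file
name and block numbers are made up) would announce every heap block the
index scan has identified with posix_fadvise(POSIX_FADV_WILLNEED) before
reading any of them:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192

int main(void)
{
    int fd = open("heapfile", O_RDONLY);   /* assumed data file */
    if (fd < 0) { perror("open"); return 1; }

    /* Blocks we decided we need, in index order, not disk order. */
    long needed[] = { 912, 7, 4410, 33, 2048 };
    int  nblocks = sizeof needed / sizeof needed[0];
    char buf[BLCKSZ];

    /* Pass 1: tell the kernel about everything up front, so it and the
     * drive can schedule the reads in whatever order suits them. */
    for (int i = 0; i < nblocks; i++)
        posix_fadvise(fd, needed[i] * (off_t) BLCKSZ, BLCKSZ,
                      POSIX_FADV_WILLNEED);

    /* Pass 2: the actual reads; ideally most blocks are already cached. */
    for (int i = 0; i < nblocks; i++)
        if (pread(fd, buf, BLCKSZ, needed[i] * (off_t) BLCKSZ) != BLCKSZ)
            perror("pread");

    close(fd);
    return 0;
}

How much this wins depends on whether the OS honours the advice, which
is really the crux of the fadvise-versus-AIO comparison above.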



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PERFORM] RAID arrays and performance

2007-12-04 Thread James Mansion

[EMAIL PROTECTED] wrote:

So, if you hand requests one by one to the disc, it will almost always be
faster to order them. On the other hand, if you hand a huge long list of
requests to a decent SCSI or SATA-NCQ disc in one go, it will reorder the
reads itself, and it will do it much better than you.

  
Sure - but this doesn't suggest threading so much as pushing all reads 
into AIO as soon as they can be identified - and hoping that your OS has 
a decent AIO subsystem, which is sadly a tall order for many *nix 
systems.

I do think some thought should be given to threading, but I would expect 
the effect to be more noticeable for updates where you update tables 
that have multiple indices.  In the case of your scan, you need 
threading on the CPU (rather than concurrent IO through AIO) if the 
disks can feed you data faster than you can scan it.  That might be the 
case for some scans using user functions, but I wouldn't have thought it 
would be the case for a simple index scan.


At some point, hopefully, the engines will be:
a) thread safe (if not thread hot) so it can exist with threaded user 
   functions and embedded languages
b) able to incorporate C++ add-in functionality

There may not be a pressing reason to support either of these, but 
having a capability to experiment would surely be helpful and allow 
incremental enhancement - so baby steps could be made to (for example) 
move stats and logging to a background thread, move the push of results 
to clients out of the query evaluator thread, and so on.  Parallel 
threading of queries is a whole different ball game which needs thought 
in the optimiser.

James


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Performance on 8CPU's and 32GB of RAM

2007-09-06 Thread James Mansion

Carlo Stonebanks wrote:

Isn't it just easier to assume that Windows Server can't do anything right?
;-)

  

Well, avoiding the ;-) - people do, and it's remarkably foolish of them.  It's
a long-standing whinge that many people with a UNIX background seem to
just assume that Windows sucks, but you could run 40,000 sockets from a
single Win32 process for a while and some well-known UNIX systems would
still struggle to do this, libevent or no.  Admittedly, the way a Win32
app is architected would be rather different from a typical POSIX one.

Windows has been a cheap target, but it's remarkably adequate, and the
TPC results speak for themselves.






---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Performance on 8CPU's and 32GB of RAM

2007-09-06 Thread James Mansion

Scott Marlowe wrote:

Where unixes generally outperform windows is in starting up new
backends, better file systems, and handling very large shared_buffer
settings.
  


Why do you think that UNIX systems are better at handling large shared 
buffers than Windows?  32-bit Windows systems can suffer from fragmented 
address space, to be sure, but if the performance of the 
operating-system supplied mutex or semaphore isn't good enough, you can 
just use the raw atomic ops.

If what you mean is that pg has a design that's heavily oriented towards 
things that tend to be cheap on POSIX and doesn't use the core Win32 
features effectively, then let's track that as an optimisation 
opportunity for the Win32 port.


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Performance on 8CPU's and 32GB of RAM

2007-09-06 Thread James Mansion

Scott Marlowe wrote:

And there's the issue that with windows / NTFS that when one process
opens a file for read, it locks it for all other users.  This means
that things like virus scanners can cause odd, unpredictable failures
of your database.

  

Can you provide some justification for this?

James


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] LIKE search and performance

2007-06-06 Thread James Mansion

[EMAIL PROTECTED] wrote:

What is a real life example where an intelligent and researched
database application would issue a like or ilike query as their
primary condition in a situation where they expected very high
selectivity?
  
In my case the canonical example is to search against textual keys where 
the search is performed automatically if the user has typed enough data 
and paused.  In almost all cases the '%' trails, and I'm looking for 
'starts with' in effect.  Usually the search will have a specified upper 
limit on the number of returned rows, if that's an available facility.  
I realise in this case that matching against the index does not allow 
the match count unless we check MVCC as we go, but I don't see why 
another thread can't be doing that.

James


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] LIKE search and performance

2007-05-24 Thread James Mansion



If Sybase is still like SQL Server (or the other way around), it *may*
end up scanning the index *IFF* the index is a clustered index. If it's
a normal index, it will do a sequential scan on the table.

  
Are you sure it's not covered?  I'll have to check at work - but I'm off 
next week so it'll have to wait.



It's not a win on PostgreSQL, because of our MVCC implementation. We
need to scan *both* index *and* data pages if we go down that route, in
which case it's a lot faster to just scan the data pages alone.

  
Why do you need to go to all the data pages - doesn't the index 
structure contain all the keys, so you can prefilter and then check to 
see if the *matched* items are still in view?  I'll be the first to 
admit I know zip about Postgres, but it seems odd - doesn't the index 
contain copies of the key values?


I suspect that I mis-spoke with 'leaf'.  I really just mean 'all index 
pages with data', since the scan
does not even need to be in index order, just a good way to get at the 
data in a compact way.




---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] LIKE search and performance

2007-05-24 Thread James Mansion

Alexander Staubo wrote:

On 5/23/07, Andy <[EMAIL PROTECTED]> wrote:

An example would be:
SELECT * FROM table
 WHERE name like '%john%' or street like 
'%srt%'


Anyway, the query planner always does seq scan on the whole table and 
that

takes some time. How can this be optimized or made in another way to be
faster?


There's no algorithm in existence that can "index" arbitrary
substrings the way you think. The only rational way to accomplish this
is to first break the text into substrings using some algorithm (eg.,
words delimited by whitespace and punctuation), and index the
substrings individually.
That seems rather harsh.  If I'd put an index on each of these columns 
I'd certainly expect it to use the indices - and I'm pretty sure that 
Sybase would.  I'd expect it to scan the index leaf pages instead of the 
table itself - they should be much more compact and also likely to be 
hot in cache.

Why *wouldn't* the planner do this?

James


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] SCSI vs SATA

2007-04-08 Thread James Mansion
>Logic?

Foul!  That's NOT evidence.

>
>Mechanical devices have decreasing MTBF when run in hotter environments,
>often at non-linear rates.

I agree that this seems intuitive.  But I think taking it as a cast-iron
truth is dangerous.

>Server class drives are designed with a longer lifespan in mind.

Evidence?

>Server class hard drives are rated at higher temperatures than desktop
>drives.
>
>Google can supply any numbers to fill those facts in, but I found a
>dozen or so data sheets for various enterprise versus desktop drives in
>a matter of minutes.

I know what the marketing info says, that's not the point.  Bear in mind
that these are somewhat designed to justify very much higher prices.

I'm looking for statistical evidence that the difference is there, not
marketing collateral.  They may be designed to be more reliable.  And
the design targets *that the manufacturer admits to* may be more
stringent, but I'm interested to know what the actual measured difference
is.

From the sound of it, you DON'T have such evidence.  Which is not a
surprise, because I don't have it either, and I do try to keep my eyes
open for it.

James






---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] SCSI vs SATA

2007-04-05 Thread James Mansion
>Server drives are generally more tolerant of higher temperatures.  I.e.
>the failure rate for consumer and server class HDs may be about the same
>at 40 degrees C, but by the time the internal case temps get up to 60-70
>degrees C, the consumer grade drives will likely be failing at a much
>higher rate, whether they're working hard or not.

Can you cite any statistical evidence for this?

James



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] SCSI vs SATA

2007-04-04 Thread James Mansion
>Right --- the point is not the interface, but whether the drive is built
>for reliability or to hit a low price point.

Personally I take the marketing mumblings about the enterprise drives
with a pinch of salt.  The low-price drives HAVE TO be reliable too,
because a non-negligible failure rate will result in returns processing
costs that destroy a very thin margin.

Granted, there was a move to very short warranties a while back,
but the trend has been for more realistic warranties again recently.
You can bet they don't do this unless the drives are generally pretty
good.

James



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] SCSI vs SATA

2007-04-04 Thread James Mansion
>sure but for any serious usage one either wants to disable that
>cache (and rely on tagged command queuing or how that is called in the SATAII
>world) or rely on the OS/raid controller implementing some sort of
>FUA/write barrier feature (which linux for example only does in pretty
>recent kernels)

Does anyone know which other hosts have write barrier implementations?
Solaris?  FreeBSD? Windows?

The buffers should help greatly in such a case, right?  Particularly if
you have quite a wide stripe.



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] compact flash disks?

2007-04-03 Thread James Mansion
>On sequential read speed HDs outperform flash disks... only on random
>access the flash disks are better. So if your application is a DW one,
>you're very likely better off using HDs.

This looks likely to be a non-issue shortly, see here:

http://www.reghardware.co.uk/2007/03/27/sams_doubles_ssd_capacity/

I still think this sort of device will become the OLTP device
of choice before too long - even if we do have to watch the wear rate.

>WARNING:  modern TOtL flash RAMs are only good for ~1.2M writes per
>memory cell.  and that's the =good= ones.

Well, my original question was whether the physical update pattern
of the server does have hotspots that will tend to cause a problem
in normal usage if the wear levelling (such as it is) doesn't entirely
spread the load.  The sorts of application I'm interested in will not
update individual data elements very often.  There's a danger that
index nodes might be rewritten frequently, but one might want to allow
that indexes persist lazily and should be recovered from a scan after
a crash that leaves them dirty, so that they can be cached and avoid
such an access pattern.


Out of interest with respect to WAL - has anyone tested to see whether
one could tune the group commit to pick up slightly bigger blocks and
write the WAL using compression, to up the *effective* write speed of
media? Once again, most data I'm interested in is far from a random
pattern and tends to compress quite well.

If the WAL write is committed to the disk platter, is it OK for
arbitrary data blocks to have failed to commit to disk so we can
recover the updates for committed transactions?

Is there any documentation on the write barrier usage?
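
To make the ordering I'm asking about explicit, a minimal sketch (file
names purely illustrative, files assumed to exist, and it still assumes
the drive doesn't lie about the flush) would be:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int wal  = open("wal_segment", O_WRONLY | O_APPEND);
    int heap = open("heapfile", O_WRONLY);
    if (wal < 0 || heap < 0) { perror("open"); return 1; }

    const char walrec[] = "commit record for xid 1234";
    char page[8192] = {0};

    /* 1. The WAL record must reach stable storage first ... */
    if (write(wal, walrec, sizeof walrec) < 0) { perror("write wal"); return 1; }
    if (fsync(wal) != 0) { perror("fsync wal"); return 1; }

    /* 2. ... only then is the data page written; it can reach the
     *    platter whenever the OS and drive like, because the WAL can
     *    replay it after a crash. */
    if (write(heap, page, sizeof page) < 0) perror("write heap");

    close(wal);
    close(heap);
    return 0;
}

The question is whether, once that fsync has genuinely reached the
platter, the later data-page write is allowed to be lost without
breaking recovery - and what the write barrier machinery guarantees
about the fsync itself.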

James



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] compact flash disks?

2007-03-08 Thread James Mansion
Isn't it likely that a single stream (or perhaps one that can be partitioned
across spindles) will tend to be fastest, since it has a nice localised
stream that a) allows for compression of reasonable blocks and b) fits with
commit aggregation?

RAM capacity on servers is going up and up, but the size of a customer
address or row on an invoice isn't.  I'd like to see an emphasis on speed of
update with an assumption that most hot data is cached, most of the time.

My understanding also is that storing data columnwise is handy when it's
persisted because linear scans are much faster.  Saw it once with a system
modelled after APL, blew me away even on a sparc10 once the data was
organised and could be mapped.

Still, for the moment anything that helps with the existing system would be
good.  Would it help to define triggers to be deferrable to commit as well
as end of statement (and per row)?  Seems to me it should be, at least for
ones that raise 'some thing changed' events.  And/or allow specification
that events can fold and should be very cheap (I don't know if this is the
case now - it's not as well documented how this works as I'd like).

James


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] compact flash disks?

2007-03-07 Thread James Mansion
>WARNING:  modern TOtL flash RAMs are only good for ~1.2M writes per
>memory cell.  and that's the =good= ones.
>Using flash RAM for write heavy applications like OLTP, or for WAL,
>etc can be very dangerous

Well, that's why I suggested that the WAL would stream to a hard disk
array, where the large IO sequential write speed will be helpful.

Whether OLTP is a problem will presumably depend on the frequency of updates
and vacuum to each physical cluster of rows in a disk block.

Most rows in a trading application will have quite a long lifetime, and be
updated relatively few times (even where we're writing fixings info into
trades).

>Flash write speeds also stink; being ~1/2 flash's already low read speed.

Sure - but it may still be an effective tradeoff where the limiting factor
would otherwise be seek time.

>Much better to use flash RAM for read heavy applications.

Why?  I can get a 'PC' server with 128GB of RAM quite easily now,
and that will mean I can cache most if not all hot data for any trading
app I've worked on.  Settled trades that matured in prior periods can
be moved to tables on real disks - they are hardly ever accessed
anyway.


In the long run, we are going to have to seriously rethink pg's use
of WAL as the way we implement MVCC as it becomes more and more of a
performance bottleneck.
We have WAL because Stonebraker made an assumption about the future
dominance of optical media that has turned out to be false.
...and it's been one of pg's big issues ever since.


>>   2GB CF isn't so
>>pricey any more.
>Heck =16= GB Flash only costs ~$300 US and 128GB SSDs based on flash
>RAM are due out this year.

Quite.  Suppose I have a RAID with double redundancy, then I get enough
capacity for quite a lot of raw data, and can swap a card out every
weekend and let the RAID rebuild it in rotation to keep them within
conservative wear limits.

So long as the wear levelling works moderately well (and without needing FAT
on the disk or whatever) then I should be fine.

I think.  Maybe.

James



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


[PERFORM] compact flash disks?

2007-03-06 Thread James Mansion
I see that one can now get compact flash to SATA connectors.

If I were to use a filesystem with noatime etc and little non-sql traffic,
does the physical update pattern tend to have hot sectors that will tend to
wear out CF?

I'm wondering about a RAID5 with data on CF drives and RAID1 for the WAL on
a fast SATA or SAS drive pair.  I'm thinking that this would tend to have
good performance because the seek time for the data is very low, even if the
actual write speed can be slower than state of the art.  2GB CF isn't so
pricey any more.

Just wondering.

James



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly