Re: [PERFORM] Performance
Greg Smith wrote:
> There are also some severe query plan stability issues with this idea beyond this. The idea that your plan might vary based on execution latency, that the system load going up can make query plans alter with it, is terrifying for a production server.

I thought I was clear that it should present some stats to the DBA, not that it would try to auto-tune? This thread started with a discussion of appropriate tunings for random page cost vs sequential page cost, I believe, based on some finger-in-the-air estimate of total size vs available disk cache. And it was observed that on systems that have very large databases but modest hot data, you can perform like a fully cached system much of the time.

I'm just suggesting providing statistical information to the DBA which will indicate whether the system has 'recently' been behaving like a system that runs from buffer cache and/or subsystem caches, or one that runs from disk platters, and what the actual observed latency difference is. It may well be that this varies with time of day or day of week. Whether the actual latencies translate directly into the relative costs is another matter.

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Performance
Robert Haas wrote:
> The server can and does measure hit rates for the PG buffer pool, but to my knowledge there is no clear-cut way for PG to know whether read() is satisfied from the OS cache or a drive cache or the platter.

Does the server know which IO it thinks is sequential, and which it thinks is random? Could it not time the IOs (perhaps optionally) and at least keep some sort of statistics of the actual observed times? It might not be appropriate for the server to attempt auto-tuning, but it might be able to provide some information that can be used by a DBA to make informed decisions.

James
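The sort of per-read timing being suggested is cheap to sketch from user space; a minimal illustration (not anything the server does today - the file, offsets and 8K block size are just assumptions for the demo):

```python
import os
import tempfile
import time

def read_latencies(path, offsets, block=8192):
    """Time each pread; returns a list of (offset, seconds) samples.
    A cached read will come back in microseconds, a platter read in
    milliseconds - which is exactly the signal a DBA could use."""
    lat = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for off in offsets:
            t0 = time.perf_counter()
            os.pread(fd, block, off)
            lat.append((off, time.perf_counter() - t0))
    finally:
        os.close(fd)
    return lat

# Demo against a small temp file (freshly written, so it will be
# OS-cached and every latency should be tiny).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (8192 * 16))
    path = f.name
samples = read_latencies(path, range(0, 8192 * 16, 8192))
os.unlink(path)
print(len(samples))  # 16
```

Aggregating such samples into a histogram per relation, by time of day, would give the 'recently behaving like cache vs platters' statistic discussed above.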
Re: [PERFORM] BBU Cache vs. spindles
Tom Lane wrote:
> Uh, no, it is not. The difference is that we can update a byte in a shared buffer, and know that it *isn't* getting written out before we

Well, I don't know where you got the idea I was referring to that sort of thing - it's the same as writing to a buffer before copying to the mmap'd area.

> It's true that we don't know whether write() causes an immediate or delayed disk write, but we generally don't care that much. What we do

Which is what I was referring to.

> care about is being able to ensure that a WAL write happens before the data write, and with mmap we don't have control over that.

I think you have just the same control either way, because you can only force ordering with an appropriate explicit sync, and in the absence of such a sync all bets are off for whether/when each disk page is written out - and if you can't ensure that the controller and disk are write-through you'd better do a hardware cache flush too, right? A shame that so many systems have relatively poor handling of that hardware flush.
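The 'explicit sync is the only fence' point can be shown concretely. A sketch of WAL-before-data ordering using plain write()/fsync() (file names and record contents are made up for the example; the same shape would apply with mmap plus msync):

```python
import os
import tempfile

def wal_then_data(wal_path, data_path, record, page):
    """Force WAL ahead of data with explicit syncs: the record is durable
    before the data page can be, whichever API actually wrote the bytes.
    Without the first fsync, the kernel may write either file first."""
    wal = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    os.write(wal, record)
    os.fsync(wal)          # the fence: WAL reaches stable storage first
    os.close(wal)
    data = os.open(data_path, os.O_WRONLY | os.O_CREAT)
    os.write(data, page)
    os.fsync(data)         # now the data page may follow
    os.close(data)

d = tempfile.mkdtemp()
wal_then_data(os.path.join(d, "wal"), os.path.join(d, "data"),
              b"record", b"page")
```

Note the caveat from the post still applies: fsync only reaches the platter if the drive/controller honours the flush.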
Re: [PERFORM] BBU Cache vs. spindles
Tom Lane wrote:
> The other and probably worse problem is that there's no application control over how soon changes to mmap'd pages get to disk. An msync will flush them out, but the kernel is free to write dirty pages sooner. So if they're depending for consistency on writes not happening until msync, it's broken by design. (This is one of the big reasons we don't use mmap'd space for Postgres disk buffers.)

Well, I agree that it sucks for the reason you give - but you use write and that's *exactly* the same in terms of when it gets written as when you update a byte on an mmap'd page. And you're quite happy to use write. The only difference is that it's a lot more explicit where the point of 'maybe it's written and maybe it isn't' occurs. There need be no real difference in the architecture for one over the other: there does seem to be evidence that write and read can have better read-ahead and write-behind behaviour, because read/write does allow you to initiate an IO with a hint to a size that exceeds a hardware page.

And yes, after getting into the details while starting to port TC to Windows, I decided to bin it. Especially handy that SQLite3 has WAL now. (And one last dig - TC didn't even have a checksum that would let you tell when it had been broken: but it might all be fixed now of course, I don't have time to check.)

James
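The claimed equivalence is easy to demonstrate: in both routes below the kernel may write the dirty page at any time, and only the explicit sync call (fsync for write(), flush/msync for mmap) tells you it has hit disk. A toy sketch over one page:

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * mmap.PAGESIZE)   # pre-size the file to one page

# write() route: dirty the page via the fd, then fsync is the fence.
os.pwrite(fd, b"A", 0)
os.fsync(fd)

# mmap route: dirty the page in place, then flush() (msync) is the fence.
m = mmap.mmap(fd, mmap.PAGESIZE)
m[1:2] = b"B"
m.flush()       # the only point at which we *know* the byte is out
m.close()
os.close(fd)
```

Between the dirtying store and the flush, the kernel was free to persist the mmap'd byte early - exactly the 'maybe it's written and maybe it isn't' window that write() merely makes more explicit.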
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote:
> On what do you base that assumption? I assume that we send a full 8K to the OS cache, and the file system writes disk sectors according to its own algorithm. With either platters or BBU cache, the data is persisted on fsync; why do you see a risk with one but not the other?

Surely it is rather: 'the data is persisted sometime after our write and before the fsync returns, but may be written:
- in small chunks
- out of order
- in an unpredictable way'

When I looked at the internals of TokyoCabinet, for example, the design was flawed but would be 'fairly robust' so long as mmap'd pages that were dirtied did not get persisted until msync, and were then persisted atomically.
Re: [PERFORM] mysql to postgresql, performance questions
Hannu Krosing wrote:
> Pulling the plug should not corrupt a postgreSQL database, unless it was using disks which lie about write caching.

Didn't we recently put the old wives' tale of 'the disks lied' to bed, in favour of actually admitting that some well-known filesystems and software RAID systems have had trouble with their write barriers?
Re: [PERFORM] Dell PERC H700/H800
Matthew Wakeling wrote:
> Just a heads up - apparently the more recent Dell RAID controllers will no longer recognise hard discs that weren't sold through Dell. http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/ As one of the comments points out, that kind of makes them no longer SATA or SAS compatible, and they shouldn't be allowed to use those acronyms any more. Matthew

I think that's potentially FUD. It's all about 'Dell qualified drives'. I can't see anything that suggests that Dell will OEM drives and somehow tag them so that the drive must have come from them. Of course they are big enough that they could have a special BIOS, I guess, but I read it that the drive types (and presumably revisions thereof) had to be recognised by the controller from a list, which presumably can be reflashed - which is not quite saying that if some WD enterprise drive model is 'qualified' then you have to buy it from Dell. Do you have any further detail?
Re: [PERFORM] Benchmark shows very slow bulk delete
Ivan Voras wrote:
> I wish that, when people got the idea to run a simplistic benchmark like this, they would at least have the common sense to put the database on a RAM drive to avoid problems with different cylinder speeds of rotational media and fragmentation from multiple runs.

Huh? It's tough to benchmark anything involving rotational drives :) But - how the database organises its IO to maximise the available bandwidth, limit avoidable seeks, and limit avoidable flushes is absolutely key to realistic performance, especially on modest everyday hardware. Not everyone has a usage that justifies 'enterprise' kit - but plenty of people can benefit from something a step up from SQLite. If you just want to benchmark query processor efficiency then that's one scenario where taking physical IO out of the picture might be justified, but I don't see a good reason to suggest that it is 'common sense' to do so for all testing, and while the hardware involved is pretty low end, it's still a valid data point.
Re: [PERFORM] Scalability in postgres
Kevin Grittner wrote:
> Sure, but the architecture of those products is based around all the work being done by "engines" which try to establish affinity to different CPUs, and loop through the various tasks to be done. You don't get a context switch storm because you normally have the number of engines set at or below the number of CPUs. The down side is that they spend a lot of time spinning around queue access to see if anything has become available to do -- which causes them not to play nice with other processes on the same box.

This is just misleading at best. I'm sorry, but UNIX systems in particular have routinely managed large numbers of runnable processes, where the run queue lengths are long, without such an issue. This is not an issue with the number of runnable threads, but with the way that they wait and what they do. The context switch rate reported does not indicate processes using their timeslices productively, unless the load is from a continuous stream of trivial RPCs - and that doesn't stack up with the good performance and then problematic load that the OP reported.
Re: [PERFORM] Scalability in postgres
Greg Smith wrote:
> 3500 active connections across them. That doesn't work, and what happens is exactly the sort of context switch storm you're showing data for. Think about it for a minute: how many of those can really be doing work at any time? 32, that's how many. Now, you need some multiple of the number of cores to try to make sure everybody is always busy, but that multiple should be closer to 10X the number of cores rather than 100X.

That's surely overly simplistic. There is inherently nothing problematic about having a lot of compute processes waiting for their timeslice, nor of having IO- or semaphore-blocked processes waiting, and it doesn't cause a context switch storm - this is a problem with postgres scalability, not (inherently) lots of connections. I'm sure most of us evaluating Postgres from a background in Sybase or SQLServer would regard 5000 connections as no big deal.

This has the sniff of a badly contended spin-and-yield, doesn't it? I'd guess that if the yield were a sleep for a couple of milliseconds then the lock holder would run and free everything up.
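The spin-then-sleep idea can be sketched in a few lines. This is purely illustrative of the scheduling point (spin a bounded number of times, then sleep so the lock holder gets a timeslice) - it is not postgres's actual lock code, and the spin cap and sleep interval are invented numbers:

```python
import threading
import time

flag = threading.Event()   # stands in for "the lock was released"

def spin_then_sleep(max_spins=100, sleep_s=0.002):
    """Wait for `flag`: spin briefly (cheap if the holder releases fast),
    then back off to a ~2ms sleep so the holder gets CPU to finish.
    Pure yield-spinning here is what causes context switch storms."""
    spins = 0
    while not flag.is_set():
        if spins < max_spins:
            spins += 1            # busy retry; real code might cpu-pause
        else:
            time.sleep(sleep_s)   # give the timeslice to the lock holder
    return spins

t = threading.Timer(0.01, flag.set)  # "lock holder" releases after 10ms
t.start()
spins = spin_then_sleep()
t.join()
```

With thousands of waiters, the sleep branch keeps them off the run queue instead of having each one wake, fail, yield, and wake again.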
Re: [PERFORM] performance for high-volume log insertion
Stephen Frost wrote:
> You're re-hashing things I've already said. The big win is batching the inserts, however that's done, into fewer transactions. Sure, multi-row inserts could be used to do that, but so could dropping begin/commits in right now, which probably takes even less effort.

Well, I think you are seriously underestimating the cost of the round-trip compared to all the other effects (possibly bar the commits). When I tested the union insert technique on SQLServer and Sybase I got measurable improvements going from 100-row statements to 200-row statements, though I suspect in that case the per-statement overheads are quite high. I expected improvements from 10- to 20-row batches, but it carried on getting better for a long time after that. The Sybase parser runs out of workspace first.

> No, as was pointed out previously already, you really just need 2. A

And I'm disagreeing with that. Single row is a given, but I think you'll find it pays to have one round trip if at all possible, and invoking multiple prepared statements can work against this.

> see if there's really much of a performance difference between a 50-insert prepared statement, and 50 1-insert prepared statements. If they're both done in larger transactions, I don't know that there's really a lot of performance difference.

I think you'll be surprised, but the only way is to test it. And also try the simple 50-row single insert as text - see if you can measure the difference between that and the prepared statement.

> storage overhead? indexing overhead? We're talking about prepared statements here, what additional storage requirement do you think those would impose? What additional indexing overhead? I don't believe we actually do anything differently between prepared statements and multi-row inserts that would change either of those.

That's my point.
You will brickwall on the actual database operations for execution early enough that the efficiency difference between parse-and-execute and prepared statements will be hard to measure - at least if you have multi-row statements. But this really needs testing and timing.

> Ah, latency is a reasonable thing to bring up. Of course, if you want to talk about latency then you get to consider that multi-insert SQL will inherently have larger packet sizes which could cause them to be delayed in some QoS arrangements.

No, I mean latency from round trips from the client to the server process. I don't know why you think I'd mean that.

> As I said, most of this is a re-hash of things already said. The low-hanging fruit here is doing multiple inserts inside of a transaction, rather than 1 insert per transaction. Regardless of how that's done, it's going to give the best bang-for-buck. It will complicate the client code some, regardless of how it's implemented, so that failures are handled gracefully (if that's something you care about anyway), but as there exists some queueing mechanisms in rsyslog already, hopefully it won't be too bad.

I think you have largely missed the point. There are two things here:
1) how many rows per commit
2) how many rows per logical RPC (ie round trip) between the client and server processes

We are agreed that the first is a Very Big Deal, but you seem resistant to the idea that the second of these is a big deal once you've dealt with the former. My experience has been that it's much more important than any benefits of preparing statements etc, particularly if the use of a prepared statement can make it harder to do multi-row RPCs because the protocol doesn't allow pipelining (at least without things getting very hairy). Clearly 'copy' is your friend for this too, at least potentially (even if it means streaming to a staging table).
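The two knobs can be exercised independently. A sketch using sqlite3 purely as a stand-in for any DB-API driver (the table, batch size and row count are invented for the demo; against a networked server each execute() below would be one round trip):

```python
import sqlite3

# In-memory database standing in for a real server connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (ts INTEGER, msg TEXT)")

rows = [(i, "event %d" % i) for i in range(1000)]

# Knob 1: rows per commit - wrap all the inserts in one transaction.
# Knob 2: rows per statement (per round trip) - multi-row VALUES lists.
batch = 100
with conn:                                   # a single commit for everything
    for i in range(0, len(rows), batch):
        chunk = rows[i:i + batch]
        sql = "INSERT INTO log VALUES " + ",".join(["(?,?)"] * len(chunk))
        flat = [v for row in chunk for v in row]
        conn.execute(sql, flat)              # 100 rows per statement/RPC

count = conn.execute("SELECT count(*) FROM log").fetchone()[0]
print(count)  # 1000
```

Here 1000 rows cost one commit and ten statements; the one-row-per-statement version would cost a hundred times the round trips even inside the same transaction.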
James
Re: [PERFORM] performance for high-volume log insertion
Stephen Frost wrote:
> apart again. That's where the performance is going to be improved by going that route, not so much in eliminating the planning.

Fine. But like I said, I'd suggest measuring the fractional improvement for this when sending multi-row inserts before writing something complex. I think the big win will be doing multi-row inserts at all. If you are going to prepare then you'll need a collection of different prepared statements for different batch sizes (say 1, 2, 3, 4, 5, 10, 20, 50) and things will get complicated. A multi-row insert with unions and dynamic SQL is actually rather universal. Personally I'd implement that first (and it should be easy to do across multiple dbms types) and then return to it to have a more complex client side with prepared statements etc if (and only if) necessary AND the performance improvement were measurably worthwhile, given the indexing and storage overheads. There is no point optimising away the CPU of the simple parse if you are just going to get hit with a lot of latency from round trips, and forming a generic multi-insert SQL string is much, much easier to get working as a first step. Server CPU isn't a bottleneck all that often - and with something as simple as this you'll hit IO performance bottlenecks rather easily.

James
Re: [PERFORM] performance for high-volume log insertion
da...@lang.hm wrote:
> 2. insert into table values (),(),(),() Using this structure would be more database agnostic, but won't perform as well as the COPY options I don't believe. It might be interesting to do a large "insert into table values (),(),()" as a prepared statement, but then you'd have to have different sizes for each different number of items you want inserted. On the other hand, when you have a full queue (lots of stuff to insert) is when you need the performance the most. If it's enough of a win on the database side, it could be worth more effort on the application side.

Are you sure preparing a simple insert is really worthwhile? I'd check, if I were you. It shouldn't take long to plan. Note that this structure (above) is handy but not universal. You might want to try:

insert into table select (...) union select (...) union select (...) ...

as well, since it's more universal. It works on Sybase and SQLServer, for example (and very quickly too - much more so than a TSQL batch with lots of inserts or execs of stored procs).

James
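A hypothetical client-side helper for building the union form might look like this (the function name and the use of `union all` - which skips the duplicate-elimination that plain `union` implies - are my choices, not from the thread):

```python
def union_insert(table, columns, rows):
    """Build the portable 'insert ... select ... union all select ...'
    form of a multi-row insert.  NOTE: values are inlined as SQL
    literals purely for illustration; real client code must escape
    properly or use driver parameters to avoid injection."""
    def lit(v):
        if isinstance(v, str):
            return "'" + v.replace("'", "''") + "'"
        return str(v)
    selects = ["select " + ", ".join(lit(v) for v in row) for row in rows]
    return ("insert into %s (%s) %s"
            % (table, ", ".join(columns), " union all ".join(selects)))

sql = union_insert("log", ["ts", "msg"], [(1, "a"), (2, "it's b")])
print(sql)
# insert into log (ts, msg) select 1, 'a' union all select 2, 'it''s b'
```

The same string builder works unchanged against Postgres, Sybase and SQLServer, which is the portability point being made.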
Re: [PERFORM] Raid 10 chunksize
Greg Smith wrote:
> OK, that's clearly cached writes where the drive is lying about fsync. The claim is that since my drive supports both the flush calls, I just need to turn on barrier support, right?

That's a big pointy finger you are aiming at that drive - are you sure it was sent the flush instruction? Clearly *something* isn't right.

> This is basically how this always works for me: somebody claims barriers and/or SATA disks work now, no really this time. I test, they give answers that aren't possible if fsync were working properly, I conclude turning off the write cache is just as necessary as it always was. If you can suggest something wrong with how I'm testing here, I'd love to hear about it. I'd like to believe you but I can't seem to produce any evidence that supports your claims here.

Try similar tests with Solaris and Vista? (You might have to give the whole disk to ZFS with Solaris to give it confidence to enable write cache, which might not be easy with a laptop boot drive: XP and Vista should show the toggle on the drive.)

James
Re: [JDBC] [PERFORM] Query much slower when run from postgres function
The driver will use unnamed statements for all statements until it sees the same statement N times, where N is 5 I believe; after that it uses a named statement. Shame there's no syntax for it to pass a table of the parameters to the server when it creates the named statement, as planner hints.

James
Re: [PERFORM] SSD performance
Craig Ringer wrote:
> These devices would be interesting for a few uses, IMO. One is temp table space and sort space in Pg. Another is scratch space for apps (like Photoshop) that do their own VM management. There's also potential

Surely temp tables and sort space aren't subject to fsync and won't gain that much, since they should stay in the OS cache? The device will surely help seek- or sync-bound tasks. Doesn't that make it a good candidate for WAL and hot tables?

James
Re: [PERFORM] understanding postgres issues/bottlenecks
Ron wrote:
> I think the idea is that with SSDs or a RAID with a battery backed cache you can leave fsync on and not have any significant performance hit since the seek times are very fast for SSD. They have limited bandwidth but bandwidth to the WAL is rarely an issue -- just latency. Yes, Greg understands what I meant here. In the case of SSDs, the performance hit of fsync = on is essentially zero. In the case of battery backed RAM caches for RAID arrays, the efficacy is dependent on how the size of the cache compares with the working set of the disk access pattern.

Out of interest, if we take a scenario where the working set of updates exceeds the size of the RAID card cache, has anyone tested the relative performance of using the battery-backed RAID on WAL only and non-cached access to other drives? And perhaps the similar scenario with (hot) indices and WAL on a battery-backed device and the data on uncached devices? It seems to me that if you're going to thrash the cache from data updates (presumably courtesy of full-page writes), then you might be better off partitioning the cache - and a thrashed cache can be hardly any better than no cache (so why have one?).
Re: [PERFORM] Need help with 8.4 Performance Testing
Scott Marlowe wrote:
> involves tiny bits of data scattered throughout the database. Our current database is about 20-25 Gig, which means it's quickly reaching the point where it will not fit in our 32G of ram, and it's likely to grow too big for 64Gig before a year or two is out. ... I wonder how many hard drives it would take to be CPU bound on random access patterns? About 40 to 60? And probably 15k / SAS drives to

Well, it's not a very big database and you're seek bound - so what's wrong with the latest generation flash drives? They're perfect for what you want to do, it seems, and you can probably get what you need using the new ARC cache on flash stuff in ZFS.
Re: [PERFORM] Effects of setting linux block device readahead size
Scott Carey wrote:
> Consumer drives will often read-ahead much more than server drives optimized for i/o per second. ... The Linux readahead setting is _definitely_ in the kernel, definitely uses and fills the page cache, and from what I can gather, simply issues extra I/O's to the hardware beyond the last one requested by an app in certain situations. It does not make your I/O request larger, it just queues an extra I/O following your request.

So ... fiddling with settings in Linux is going to force read-ahead, but the read-ahead data will hit the controller cache and the system buffers. And the drives use their caches for cylinder caching implicitly (maybe the SATA drives appear to pre-read more because the storage density per cylinder is higher?). But is there any way for an OS or application to (portably) ask SATA, SAS or SCSI drives to read ahead more (or less) than their default and NOT return the data to the controller? I've never heard of such a thing, but I'm no expert in the command sets for any of this stuff.

James
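For completeness, the knob an application *does* have is a hint to the kernel's readahead (which, as noted above, fills the page cache and controller cache - there is no portable way to keep the prefetched data down at the drive). A sketch; posix_fadvise availability varies by platform, so this is guarded:

```python
import os
import tempfile

def advise_sequential(path, length=0):
    """Ask the kernel to read ahead aggressively for this file.
    This tunes *kernel* readahead into the page cache only; it cannot
    tell the drive itself to prefetch without returning the data.
    Returns True if the advice could be issued on this platform."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):              # Linux / most POSIX
            os.posix_fadvise(fd, 0, length, os.POSIX_FADV_SEQUENTIAL)
            return True
        return False                                   # e.g. Windows
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 4096)
    p = f.name
applied = advise_sequential(p)
os.unlink(p)
```

`POSIX_FADV_WILLNEED` (prefetch a specific range now) and `POSIX_FADV_RANDOM` (suppress readahead) are the other commonly useful hints.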
Re: [PERFORM] Effects of setting linux block device readahead size
Greg Smith wrote:
> The point I was trying to make there is that even under impossibly optimal circumstances, you'd be hard pressed to blow out the disk's read cache with seek-dominated data even if you read a lot at each seek point. That idea didn't make it from my head into writing very well though.

Isn't there a bigger danger in blowing out the cache on the controller and causing premature pageout of its dirty pages? If you could get the readahead to work on the drive and not return data to the controller, that might be dandy, but I'm sceptical.

James
Re: [PERFORM] Effects of setting linux block device readahead size
Greg Smith wrote:
> Average seek time: 4 ms
> Seeks/second: 250
> Data read/seek: 1 MB (read-ahead number goes here)
> Total read bandwidth: 250 MB/s

Most spinning disks now are nearer to 100 MB/s streaming. You've talked yourself into twice that, at random access!

James
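The objection is just the arithmetic of the quoted figures; spelling it out:

```python
# Reproduce the quoted back-of-envelope numbers.
seek_ms = 4.0
seeks_per_s = 1000.0 / seek_ms            # 250 seeks/second
read_per_seek_mb = 1.0                    # the assumed read-ahead per seek
bandwidth = seeks_per_s * read_per_seek_mb
print(bandwidth)  # 250.0 MB/s claimed for a *random* workload,
                  # vs ~100 MB/s that such a disk streams sequentially
```

For the model to be self-consistent, the 1 MB-per-seek figure would have to include its own transfer time (1 MB at 100 MB/s is 10 ms, dwarfing the 4 ms seek), which drops the achievable rate well below 250 MB/s.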
Re: [PERFORM] select on 22 GB table causes "An I/O error occured while sending to the backend." exception
[EMAIL PROTECTED] wrote:
> for example if you have a process that uses 1G of ram (say firefox) and it needs to start a new process (say acroread to handle a pdf file), what it does is it forks the firefox process (each of which have 1G of ram allocated), and then does an exec of the acroread process (releasing the 1G of ram previously held by that copy of the firefox process)

Indeed, which is why we have vfork. And, OK, vfork is busted if you have a threaded environment, so we have posix_spawn and posix_spawnp. It is also worth noting that the copy isn't really a full copy on any decent modern UNIX - it is a reservation against the total swap space available. Most pages will be happily shared copy-on-write and never fully copied to the child before the exec.

I can't see how an OS can lie to processes about memory being allocated to them and not be ridiculed as a toy, but there you go. I don't think Linux is the only perpetrator - doesn't AIX do this too? The 'best strategy' for the OOM killer is not to have one, and accept that you need some swap space available (it doesn't have to be fast, since it won't actually be touched) to help out when fork/exec happens in big process images.

James
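For reference, posix_spawn is directly callable these days (Python has exposed it since 3.8 on POSIX systems - an assumption this sketch relies on); it launches a child without the fork-time address-space reservation being discussed:

```python
import os
import sys

# Spawn a trivial child process; no 1G copy-on-write reservation is
# taken against the parent's image, which is the whole point of
# vfork/posix_spawn in the large-process case.
pid = os.posix_spawn(
    sys.executable,
    [sys.executable, "-c", "print('child ran')"],
    os.environ,
)
_, status = os.waitpid(pid, 0)
```

Libraries that shell out from big processes (database servers included) get the same benefit by routing process creation through this call rather than a bare fork()+exec().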
Re: [PERFORM] insert/update tps slow with indices on table > 1M rows
andrew klassen wrote:
> I'll try adding more threads to update the table as you suggest.

You could try materially increasing the update batch size too. As an exercise you could see what the performance of COPY is by backing out the data and reloading it from a suitable file.
Re: [PERFORM] insert/update tps slow with indices on table > 1M rows
Matthew Wakeling wrote:
> If you're running a "work queue" architecture, that probably means you only have one thread doing all the updates/inserts? It might be worth going multi-threaded, and issuing inserts and updates through more than one connection. Postgres is designed pretty well to scale performance by the number of simultaneous connections.

That would explain a disappointing upper limit on insert rate, but not any sort of cliff for the rate. Nor, really, any material slowdown, if the single thread implies that we're stuck on round-trip latency as a material limiting factor.

James
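The latency ceiling being alluded to is easy to quantify (the 0.5 ms round-trip time below is an illustrative assumption, not a measurement from the thread):

```python
# Upper bound on a single connection's insert rate when each statement
# costs one synchronous client/server round trip.
rtt_ms = 0.5                                # assumed LAN round-trip time
max_rows_per_s = 1 * 1000.0 / rtt_ms        # one row per statement
print(max_rows_per_s)   # 2000.0 - no matter how fast the server is

# Batching 50 rows per statement lifts the same ceiling 50x:
batched = 50 * 1000.0 / rtt_ms
print(batched)          # 100000.0
```

A hard per-connection ceiling like this flattens the curve but does not produce a cliff, which is the distinction being drawn above.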
Re: [PERFORM] I/O on select count(*)
Alvaro Herrera wrote:
> Hint bits are used to mark tuples as created and/or deleted by transactions that are known committed or aborted. To determine the visibility of a tuple without such bits set, you need to consult pg_clog and possibly pg_subtrans, so it is an expensive check. On the other

So, how come there is this outstanding work to do, which will inevitably be done, and it hasn't been done until it is 'just too late' to avoid getting in the way of the query? The OP didn't suggest that he had just loaded the data.

Also - is it the case that this only affects the case where updated pages were spilled during the transaction that changed them? ie, if we commit a transaction and there are changed rows still in the cache since their pages are not evicted yet, are the hint bits set immediately so that the page is written just once? Seems this would be common in most OLTP systems.

Heikki points out that the list might get big and need to be abandoned, but then you fall back to scheduling a clog scan that can apply the bits, which does what you have now, though hopefully in a way that fills slack disk IO rather than waiting for the read. Matthew says 'it would be a list of changes since the last checkpoint', but I don't see why you can't start writing hints to in-memory pages as soon as the transaction ends. You might fall behind, but I doubt it with modern CPU speeds. I can't see why Pavan's suggestion to try to update as many of the bits as possible when a dirty page is evicted would be contentious.

I do think this is something of interest to users, not just developers, since it may influence the way updates are processed where it is reasonable to do so in 'bite sized chunks' as a multipart workflow.
Re: [PERFORM] RAID controllers for Postgresql on large setups
PFC wrote:
> PCI limits you to 133 MB/s (theoretical), actual speed being around 100-110 MB/s.

Many servers do have more than one bus. You have to process that data too, so it's not going to be as much of a limit as you are suggesting. It may be possible to stream a compressed data file to the server and copy in from that after decompression, which will free LAN bandwidth. Or even RPC blocks of compressed data and decompress in the proc and insert right there.

> On your current setup with 15K drives if you need 1 fsync per INSERT you won't do more than 250 per second, which is very limiting...

Well, that's 250 physical syncs. But if you have multiple insert streams (for group commit), or can batch the rows in each insert or copy, it's not necessarily as much of a problem as you seem to be implying. Particularly if you are doing the holding table trick.

James
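The point about the 250/s figure is that it caps *syncs*, not rows; a quick sketch of how batching (or group commit, which shares one WAL flush across sessions) changes the row throughput that budget buys:

```python
# 15K rpm drive: roughly one fsync per rotation at worst, ~250/s as quoted.
fsyncs_per_s = 250

# Each commit costs one fsync, so rows/s scales with rows per commit:
for rows_per_commit in (1, 10, 100):
    print(rows_per_commit, fsyncs_per_s * rows_per_commit)
# 1 row/commit   ->   250 rows/s (the quoted limit)
# 10 rows/commit ->  2500 rows/s
# 100 rows/commit -> 25000 rows/s, from the same 250 physical syncs
```

Group commit gets the same multiplier without client-side batching: N concurrent committers can ride a single WAL flush.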
Re: [PERFORM] Background writer underemphasized ...
Greg Smith wrote: If you write a giant block of writes, those tend to be sorted by the OS and possibly the controller to reduce total seeks. That's a pretty efficient write and it can clear relatively fast. But if you've been trickling writes in an unstructured form and in low volume, there can be a stack of them that aren't sorted well blocking the queue from clearing. With a series of small writes, it's not that difficult to end up in a situation where a controller cache is filled with writes containing a larger seek component than you'd have gotten had you written in larger blocks that took advantage of more OS-level elevator sorting. There's actually a pending patch to try to improve this situation in regards to checkpoint writes in the queue. Seeks are so slow compared to more sequential writes that you really can end up in the counterintuitive situation that you finish faster by avoiding early writes, even in cases when the disk is the bottleneck. I'm sorry but I am somewhat unconvinced by this. I accept that by early submission the disk subsystem may end up doing more seeks and more writes in total, but when the dam breaks at the start of the checkpoint, how can it help to have _more_ data write volume and _more_ implied seeks offered up at that point? Are you suggesting that the disk subsystem has already decided on its strategy for a set of seeks and writes and will not insert new instructions into an existing elevator plan until it is completed and it looks at the new requests? This sounds a bit tenuous at best - almost to the point of being a bug. Do you believe this is universal? James
Re: [PERFORM] Background writer underemphasized ...
Greg Smith wrote: Using the background writer more assures that the cache on the controller is going to be written to aggressively, so it may be somewhat filled already come checkpoint time. If you leave the writer off, when the checkpoint comes you're much more likely to have the full 2GB available to absorb a large block of writes. But isn't it the case that while using the background writer might result in *slightly* more data to write (since data that is updated several times might actually be sent several times), the total amount of data in both cases is much the same? And if the buffer backed up in the BGW case, wouldn't it also back up (more?) if the writes are deferred? And in fact by sending earlier, the real bottleneck (the disks) could have been getting on with it and starting their IO earlier? Can you explain your reasoning a bit more? James
[PERFORM] full_page_write and also compressed logging
Has there ever been any analysis regarding the redundant write overhead of full page writes? I'm wondering if one could regard an 8k page as being 64 128-byte paragraphs or 32 256-byte paragraphs, each represented by a bit in a word. And, when a page is dirtied by changes, some record is kept of this based on the paragraphs affected. Then you could just incrementally dump the pre-image of newly dirtied paragraphs as you go, and the cost in terms of dirtied pages would be much lower for the case of scattered updates. (I was also wondering about just doing preimages based on changed byte ranges, but the approach above is probably faster, doesn't dump the same range twice, and may fit the existing flow more directly.) Also - has any attempt been made to push log writes through a cheap compressor, such as zlib on its lowest setting or one like Jeff Bonwick's for ZFS (http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/compress.c)? It would work well for largely textual tables (and I suspect a lot of integer data too). James
Re: [PERFORM] Performance Implications of Using Exceptions
Robins Tharakan wrote: I think James was talking about Sybase. PostgreSQL on the other hand has a slightly better way to do this. SELECT ... FOR UPDATE allows you to lock a given row (based on the SELECT ... WHERE clause) and update it... without worrying about a concurrent modification. Of course, if the SELECT ... WHERE didn't bring up any rows, you would need to do an INSERT anyway. How does that help? If the matching row doesn't exist at that point - what is there to get locked? The problem is that you need to effectively assert a lock on the primary key so that you can update the row (if it exists) or insert a row with that key (if it doesn't), without checking and then inserting and finding that some other guy you were racing performed the insert and you get a duplicate key error. How does PostgreSQL protect against this? James
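A toy model of the race James describes and the usual retry-loop answer (invented names, in-memory and single-threaded, not PostgreSQL code): try the UPDATE; if no row matched, try the INSERT; if a racing session inserted the key first you get the duplicate-key failure, so you loop and UPDATE the row they created.

```c
#include <assert.h>

/* Toy in-memory "table" with a unique key column. */
#define NKEYS 8
static int tbl_key[NKEYS], tbl_val[NKEYS], tbl_used[NKEYS];

static int toy_update(int key, int val)        /* 1 if a row matched */
{
    for (int i = 0; i < NKEYS; i++)
        if (tbl_used[i] && tbl_key[i] == key) { tbl_val[i] = val; return 1; }
    return 0;
}

static int toy_insert(int key, int val)        /* 0 ok, -1 duplicate key */
{
    for (int i = 0; i < NKEYS; i++)
        if (tbl_used[i] && tbl_key[i] == key) return -1;
    for (int i = 0; i < NKEYS; i++)
        if (!tbl_used[i]) { tbl_used[i] = 1; tbl_key[i] = key; tbl_val[i] = val; return 0; }
    return -1;                                 /* full: treat as failure */
}

static int toy_get(int key)                    /* value, or -1 if absent */
{
    for (int i = 0; i < NKEYS; i++)
        if (tbl_used[i] && tbl_key[i] == key) return tbl_val[i];
    return -1;
}

static void toy_upsert(int key, int val)
{
    for (;;) {
        if (toy_update(key, val)) return;      /* row existed */
        if (toy_insert(key, val) == 0) return; /* we created it */
        /* duplicate key: we lost the race; go round and UPDATE theirs */
    }
}
```

The loop is the point: neither branch alone is safe under concurrency, which is why people reach for sacrificial-row locks or a MERGE-style extension instead.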
Re: [PERFORM] POSIX file updates
Greg Smith wrote: You turn on direct I/O differently under Solaris than everywhere else, and nobody has bothered to write the patch (trivial) and OS-specific code to turn it on only when appropriate (slightly trickier) to handle this case. There's not a lot of pressure on PostgreSQL to handle this case correctly when Solaris admins are used to doing direct I/O tricks on filesystems already, so they don't complain about it much. I'm not sure that this will survive use of PostgreSQL on Solaris with more users on Indiana though. Which I'm hoping will happen. ... RPM of the drive. Seen it on UFS and ZFS, both seem to do the right thing here. But ZFS *is* smart enough to manage the cache, albeit sometimes with unexpected consequences as with the 2530 here http://milek.blogspot.com/. You seem to feel that there is an alternative here that PostgreSQL could take but doesn't. There is not. You either wait until writes hit disk, which by physical limitations only happens at RPM speed and therefore is too slow to commit for many cases, or you cache in the most reliable memory you've got and hope for the best. No software approach can change any of that. Indeed I do, but the issue I have is that some popular operating systems (let's try to avoid the flame war) fail to expose control of disk caches, and so the code assumes that the onus is on the admin, and the documentation rightly says so. But this is as much a failure of the POSIX API and operating systems to expose something that's necessary, and it seems to me rather valuable that the application be able to work with such facilities as they become available. Exposing the flush cache mechanisms isn't dangerous and can improve performance for non-dbms users of the same drives. I think manipulation of this stuff is a major concern for a DBMS that might be used by amateur SAs, and if at all possible it should work out of the box on common hardware. 
So far as I can tell, SQLServerExpress makes a pretty good attempt at it, for example. It might be enough for initdb to whinge and fail if it thinks the disks are behaving insanely, unless the would-be DBA sets a 'my_disks_really_are_that_fast' flag in the config. At the moment anyone can apt-get themselves a DBMS which may become a liability. At the moment: - casual use is likely to be unreliable - uncontrolled deferred IO can result in almost DOS-like checkpoints These affect other systems than PostgreSQL too - but would be avoidable if the drive cache flush was better exposed and the IO was staged to use it. There's no reason to block on anything but the final IO in a WAL commit after all, and with the deferred commit feature (which I really like for workflow engines) intermediate WAL writes of configured chunk size could let the WAL drives get on with it. Admittedly I'm assuming a non-blocking write-through - direct IO from a background thread (process if you must) or aio. There are plenty of cases where the so-called "lying" drives themselves are completely stupid on their own regardless of operating system. With modern NCQ capable drive firmware? Or just with older PATA stuff? There's an awful lot of FUD out there about SCSI vs IDE still. James
Re: [PERFORM] POSIX file updates
Andreas Kostyrka wrote: takes over. The thing you worry about is if all data has made it to the replication servers, not if some data might get lost in the hardware cache of a controller. (Actually, talk to your local computer forensics guru, there are a number of ways to keep the current to electronics while moving them.) But it doesn't, unless you use synchronous replication at block level - which is why we have SRDF. Log-based replication is async and will lose committed transactions. Even if you failed over, it's still extraordinarily useful to be able to see what the primary tried to do - it's the only place the e-comm transactions are stored, and the customer will still expect delivery. I'm well aware that there are battery-backed caches that can be detached from controllers and moved. But you'd better make darn sure you move all the drives and plug them in in exactly the right order and make sure they all spin up OK with the replaced cache, because it's expecting them to be exactly as they were last time they were on the bus. 3.) a controller cache is an issue if you have a filesystem in your data path or not. If you do raw io, and the stupid hardware does cache writes, well it's about as stupid as it would be if it had cached filesystem writes. Only if the OS doesn't know how to tell the cache to flush. SATA and SAS both have that facility. But the semantics of *sync don't seem to be defined to require it being exercised, at least as far as many operating systems implement it. You would think hard drives could have enough capacitor store to dump cache to flash or the drive - if only to a special dump zone near where the heads park. They are spinning already after all. On small systems in SMEs it's inevitable that large drives will be shared with filesystem use, even if the database files are on their own slice. 
If you can allow the drive to run with writeback cache turned on, then the users will be a lot happier, even if dbms commits force *all* that cache to flush to the platters.
Re: [PERFORM] Performance Implications of Using Exceptions
Stephen Denne wrote: A third option is to update, if not found, insert. I find myself having to do this in Sybase, but it sucks because there's a race - if there's no row updated then there's no lock and you race another thread doing the same thing. So you grab a row lock on a sacrificial row used as a mutex, or just a table lock. Or you just accept that sometimes you have to detect the insert fail and retry the whole transaction. Which is sucky however you look at it. I think the 'update or insert' or 'merge' extensions make a degree of sense. At least in PostgreSQL one can use the lightweight lock manager.
Re: [PERFORM] POSIX file updates
Greg Smith wrote: "After a write() to a regular file has successfully returned, any successful read() from each byte position in the file that was modified by that write() will return the data that was written by the write()...a similar requirement applies to multiple write operations to the same file position" Yes, but that doesn't say anything about simultaneous read and write from multiple threads from the same or different processes with descriptors on the same file. No matter, I was thinking about a case with direct unbuffered IO. Too many years using Sybase on raw devices. :-( Though, some of the performance studies relating to UFS directio suggest that there are indeed benefits to managing the write-through rather than using the OS as a poor man's background thread to do it. SQLServer allows config based on deadline scheduling for checkpoint completion, I believe. This seems to me a very desirable feature, but it does need more active scheduling of the write-back. It's clear that such relaxation has benefits with some of Oracle's mechanisms as described. But amusingly, PostgreSQL doesn't even support Solaris's direct I/O method right now unless you override the filesystem mounting options, so you end up needing to split it out and hack at that level regardless. Indeed that's a shame. Why doesn't it use the directio? PostgreSQL writes transactions to the WAL. When they have reached disk, confirmed by a successful f[data]sync or a completed synchronous write, that transaction is now committed. Eventually the impacted items in the buffer cache will be written as well. At checkpoint time, things are reconciled such that all dirty buffers at that point have been written, and now f[data]sync is called on each touched file to make sure those changes have made it to disk. Yes, but fsync and stable-on-disk aren't the same thing if there is a cache anywhere, are they? Hence the fuss a while back about Apple's control of disk caches. Solaris and Windows do it too. 
Isn't allowing the OS to accumulate an arbitrary number of dirty blocks without control of the rate at which they spill to media just exposing a possibility of an IO storm when it comes to checkpoint time? Does bgwriter attempt to control this with intermediate fsync (and push to media if available)? It strikes me as odd that fsync_writethrough isn't the most preferred option where it is implemented. The postgres approach of *requiring* that there be no cache below the OS is problematic, especially since the battery backup on internal array controllers is hardly the handiest solution when you find the mobo has died. And especially when the inability to flush caches on modern SATA and SAS drives would appear to be more a failing in some operating systems than in the drives themselves. The links I've been accumulating into my bibliography include: http://www.h2database.com/html/advanced.html#transaction_isolation http://lwn.net/Articles/270891/ http://article.gmane.org/gmane.linux.kernel/646040 http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html http://brad.livejournal.com/2116715.html And your handy document on WAL tuning, of course. James
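For reference on the fsync_writethrough point: on Darwin, plain fsync() does not flush the drive's write cache, which is why Apple added the F_FULLFSYNC fcntl; a hedged sketch of a flush helper (invented function name) that uses it where available and falls back to fsync elsewhere:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Flush fd as far toward the platter as the platform allows.
   Returns 0 on success. */
static int flush_to_stable_storage(int fd)
{
#ifdef F_FULLFSYNC
    if (fcntl(fd, F_FULLFSYNC) == 0)   /* Darwin: ask the drive to empty
                                          its write cache too */
        return 0;                      /* on failure, fall through */
#endif
    return fsync(fd);                  /* elsewhere: hope the OS/drive
                                          stack honours it */
}
```

On operating systems that never expose a cache-flush primitive, even this helper can only get the data as far as the drive cache, which is exactly the gap complained about above.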
[PERFORM] SSDs
Tried harder to find info on the write cycles: found some CFs that claim 2 million cycles, and found the Mtron SSDs which claim to have very advanced wear levelling and a suitably long lifetime as a result, even with an assumption that the underlying flash can do 100k writes only. The 'consumer' MTrons are not shabby on the face of it and not too expensive, and the pro models even faster. But ... the spec pdf shows really high performance for average access, stream read *and* write, random read ... and absolutely pants performance for random write. Like 130/s, for .5k and 4k writes. It's so pants it looks like a misprint and it doesn't seem to square with the review on tomshardware: http://www.tomshardware.com/2007/11/21/mtron_ssd_32_gb/page7.html Even there, the database IO rate does seem lower than you might hope, and this *might* be because the random reads are very very fast and the random writes ... aren't. Which is a shame, because that's exactly the bit I'd hope was fast. So, more work to do somewhere.
Re: [PERFORM] POSIX file updates
Mark Mielke wrote: Is there anything in POSIX that seems to suggest this? :-) (i.e. why are you going under the assumption that the answer is yes - did you read something?) Perhaps it was just this: http://kevinclosson.wordpress.com/2007/01/18/yes-direct-io-means-concurrent-writes-oracle-doesnt-need-write-ordering/ Which of course isn't on Sun.
Re: [PERFORM] POSIX file updates
I don't believe POSIX has any restriction such as you describe - or if it does, and I don't know about it, then most UNIX file systems (if not most file systems on any platform) are not POSIX compliant. I suspect that there are indeed two different issues here, in that the file mutex relates to updates to the file, not to passing the buffers through into the drive, which indeed might be delayed. Been using direct io too much recently. :-(
Re: [PERFORM] POSIX file updates
Mark Mielke wrote: Is there anything in POSIX that seems to suggest this? :-) (i.e. why are you going under the assumption that the answer is yes - did you read something?) It was something somewhere on the Sun web site, relating to tuning Solaris filesystems. Or databases. Or ZFS. :-( Needless to say I can't find a search string that finds it now. I remember being surprised though, since I wasn't aware of it either. I don't believe POSIX has any restriction such as you describe - or if it does, and I don't know about it, then most UNIX file systems (if not most file systems on any platform) are not POSIX compliant. That, I can believe. Linux itself, even without NCQ, might choose to reorder the writes. If you use ext2, the pressure to push pages out is based upon last used time rather than last write time. It can choose to push out pages at any time, and it's only every 5 seconds or so that the system task (bdflush?) tries to force out all dirty file system pages. NCQ exaggerates the situation, but I believe the issue pre-exists NCQ or the SCSI equivalent of years past. Indeed there do seem to be issues with Linux and fsync. It's one of the things I'm trying to get a handle on as well - the relationship between fsync and flushes of controller and/or disk caches. The rest of your email relies on the premise that POSIX enforces such a thing, or that systems are POSIX compliant. :-) True. I'm hoping someone (Jignesh?) will be prompted to remember. It may have been something in a blog related to ZFS vs other filesystems, but so far I'm coming up empty in google. It doesn't feel like something I imagined though. James
[PERFORM] POSIX file updates
(Declaration of interest: I'm researching for a publication on OLTP system design) I have a question about file writes, particularly on POSIX. This arose while considering the extent to which cache memory and command queueing on disk drives can help improve performance. Is it correct that POSIX requires that the updates to a single file are serialised in the filesystem layer? So, if we have a number of dirty pages to write back to a single file in the database (whether a table or index) then we cannot pass these through the POSIX filesystem layer into the TCQ/NCQ system on the disk drive, so it can reorder them? I have seen suggestions that on Solaris this can be relaxed. I *assume* that PostgreSQL's lack of threads or AIO and the single bgwriter means that PostgreSQL 8.x does not normally attempt to make any use of such a relaxation but could do so if the bgwriter fails to keep up and other backends initiate flushes. Does anyone know (perhaps from other systems) whether it is valuable to attempt to take advantage of such a relaxation where it is available? Does the serialisation for file update apply in the case where the file contents have been memory-mapped and we try an msync (or equivalent)?
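For the closing question, a minimal sketch of the msync path being asked about (invented function; it only shows the call sequence, and says nothing about whether the kernel serialises it against other writers of the same file, which is exactly the ambiguity in the post):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Dirty one page of a shared mapping and push it with msync(MS_SYNC).
   Returns 0 on success, -1 on any failure. */
static int dirty_and_msync(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, 4096) != 0) { close(fd); return -1; }
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }
    p[0] = 'x';                              /* dirty the mapped page  */
    int rc = msync(p, 4096, MS_SYNC);        /* synchronous write-back */
    munmap(p, 4096);
    close(fd);
    return rc;
}
```

Even when msync succeeds, the data may only have reached the drive cache, so the question interacts with the cache-flush discussion elsewhere in this thread.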
Re: [PERFORM] Planning a new server - help needed
PFC wrote: Why do you claim that 'More platters also means slower seeks and generally slower performance.'? More platters -> more heads -> heavier head assembly -> slower seek time Not sure I've seen a lot of evidence of that in drive specifications! Gigabyte should revamp their i-RAM to use ECC RAM of a larger capacity... and longer lasting battery backup... You would think a decent capacitor or rechargeable button battery would be enough to dump it to a flash memory. No problem with flash wear then. I wonder how many write cycles those Flash drives can take before reliability becomes a problem... Hard to get data, isn't it? I believe it's hundreds of thousands to millions now. Each record in most OLTP tables is rewritten a few times, unless it's stuff that can go into temp tables etc, which should be elsewhere. Index pages clearly get rewritten often. I suspect a mix of storage technologies will be handy for some time yet - WAL on disk, and temp tables on disk with no synchronous fsync requirement. I think life is about to get interesting in DBMS storage. All good for us users. James
Re: [PERFORM] Planning a new server - help needed
Greg Smith wrote: As for SCSI vs. SATA, I collected up the usual arguments on both sides at http://www.postgresqldocs.org/index.php/SCSI_vs._IDE/SATA_Disks Why do you claim that 'More platters also means slower seeks and generally slower performance.'? On the face of it, it should mean that the number of track step operations is reduced, even if the drive doesn't buffer a slice of tracks across all platters (which would help if it did). I'm not entirely sure why the extra platters should really count as more moving parts, since I think the platter assembly and head assembly are both single parts in effect, albeit more massive with more platters. I'm not sure how much extra bearing friction that will mean, but it is reasonable that some extra energy is going to be needed. Recent figures I've seen suggest that the increased storage density per platter, plus the extra number of platters, means that the streaming speed of good 7200rpm SATA drives is very close to that of good 15000rpm SAS drives - and you can choose which bit of the disk to use to reduce seek time and maximise transfer rate with the oversize drives. You can get about 100MB/s out of both technologies, streaming. It may be worth considering an alternative approach. I suspect that a good RAID1 or RAID1+0 is worthwhile for WAL, but you might consider a RAID1 of a pair of SSDs for data. They will use a lot of your budget, but the seek time is negligible so the effective usable performance is higher than you get with spinning disks - so you might trade a fancy controller with battery-backed cache for straight SSD. I haven't done this, so YMMV. But the prices are getting interesting for OLTP where most disks are massively oversized. 
The latest Samsung and SanDisk are expensive in the UK but the Transcend 16GB TS16GSSD25S-S SATA is about $300 equiv - it can do 'only' 'up to' 28MB/s write and you wouldn't want to put WAL on one, but sustaining 15-20MB/s through random access to a real disk isn't trivial. If average access is 10ms, and you write 100MB/s streaming, then you have to ask yourself if you're going to do 80 or more seeks a second. James
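The arithmetic behind that closing question, made explicit (a sketch with an invented helper): each seek of s milliseconds steals s/1000 of every second from streaming, so at n seeks a second a drive keeps only a (1 - n*s/1000) fraction of its streaming bandwidth. With 10 ms seeks and 100 MB/s streaming, 80 seeks a second leaves roughly 20 MB/s.

```c
#include <assert.h>

/* Effective streaming throughput left over after paying for seeks. */
static double effective_mb_per_sec(double seeks_per_sec, double seek_ms,
                                   double stream_mb_per_sec)
{
    double seek_fraction = seeks_per_sec * seek_ms / 1000.0;
    if (seek_fraction >= 1.0)
        return 0.0;                          /* time fully consumed by seeks */
    return (1.0 - seek_fraction) * stream_mb_per_sec;
}
```

So a spinning disk doing random access lands right in the 15-20 MB/s territory the SSD's 'up to' 28 MB/s is being compared against.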
Re: [PERFORM] Benchmark: Dell/Perc 6, 8 disk RAID 10
[EMAIL PROTECTED] wrote: WAL is on a RAID 0 drive along with the OS Isn't that just as unsafe as having the whole lot on RAID0?
[PERFORM] temp tables
Do CREATE TEMP TABLE tables have any special treatment regarding eliding sync operations or deferring creation of disk files in the case where memory pressure does not require a spill? James
Re: [PERFORM] LISTEN / NOTIFY performance in 8.3
Tom Lane wrote: Hardly --- how's that going to pass a notify name? Also, a lot of people want some payload data in a notify, not just a condition name; any reimplementation that doesn't address that desire probably won't get accepted. Ah - forgot about the name. At least there need be just one instance of a name record queued per worker, if I'm reading the documentation right - it suggests that notifications can be folded, with the proviso that if the process generates a notification and at least one other process generates a notification then it will get at least (but possibly only) two events. Not sure why the PID is there rather than a couple of flag bits. You'll always have the danger of overflowing a shm area and need to spill: is the signal and then lookup in storage materially quicker than using the master process to route messages via pipes? As you say, you have a lock contention issue, and often the total signal data volume outstanding for a single backend will be less than will fit in a kernel's pipe buffer. The sending processes can track what signals they've generated in the current transaction so the master (or signal distributor) needn't get bombarded with signals from lots of rows within one transaction.
Re: [PERFORM] LISTEN / NOTIFY performance in 8.3
Tom Lane wrote: read-modify-write type of operation it uses an exclusive lock, so only one can clear its pg_listener entry at a time. The 'waiting' ones you are seeing are stacked up behind whichever one has the lock at the moment. They shouldn't be waiting for long. I certainly hadn't expected that to be the implementation technique - isn't it simply that we need a single flag per worker process, which we can set/test-and-clear with atomic operations, plus a signal to wake them up? Anyway - how hard would it be to install triggers on commit and rollback? Then we could write our own mechanisms. James
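The flag-per-worker scheme James is gesturing at, sketched with C11 atomics (invented names; this is not the pg_listener implementation, and it deliberately ignores the notify-name and payload problems Tom raises): NOTIFY atomically sets each listener's flag and would then signal that backend; the listener test-and-clears its flag in one atomic operation, so no exclusive table lock is needed just to mark the event consumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_MAX_WORKERS 64
static atomic_bool toy_notify_pending[TOY_MAX_WORKERS];

static void toy_post_notify(int worker)
{
    atomic_store(&toy_notify_pending[worker], true);
    /* ...then kill(worker_pid, SIG...) in a real implementation */
}

static bool toy_take_notify(int worker)   /* true if a notify was pending */
{
    return atomic_exchange(&toy_notify_pending[worker], false);
}
```

Note the folding behaviour falls out for free: any number of posts before the worker wakes collapse into one pending flag, which matches the "at least but possibly only two events" semantics quoted above only if the name/payload requirement is dropped.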
Re: [PERFORM] Linux/PostgreSQL scalability issue - problem with 8 cores
Jakub Ouhrabka wrote: How can we diagnose what is happening during the peaks? Can you try forcing a core from a bunch of the busy processes? (Hmm - does Linux have an equivalent to the useful Solaris pstack?) James
Re: [PERFORM] Multi-threading friendliness
Craig James wrote: Don't confuse thread-friendly with a threaded implementation of Postgres itself. These are two separate questions. Thread-friendly involves compile/link options that don't affect the Postgres source code at all. Indeed. I'm specifically not suggesting that Postgres should offer an API that can be called from anything except the initial thread of its process - just that library subsystems might want to use threads internally and that should be OK. Or rather, it should be possible to build Postgres so that it's OK. Even if there's a small slowdown, the benefit of running the full JVM or CLR might outweigh that quite easily *in some circumstances*. I've also hinted that at some stage you might want to thread some parts of the implementation, but I'm not suggesting that would be an early target. Making it straightforward to take baby steps in that direction in future seems to me a reasonable thing to do. As would being friendly to dynamically loaded C++ code. If you create the framework (and we're talking the barest of scaffolding) then others can work to show the cost/benefit. I fail to see why this would be a controversial engineering approach. James
Re: [PERFORM] libgcc double-free, backend won't die
Tom Lane wrote: Yes. 1) It's of no value to us 2) On many platforms there is a nonzero performance penalty I think you have your head in the ground, but it's your prerogative. *You* might not care, but anyone wanting to use thread-aware libraries (and I'm *not* talking about threading in any Postgres code) will certainly value it if they can do so with some stability. There's a clear benefit to being able to use such code. I suggested a build option but you reject it out of hand. And in doing so, you also lock out the benefits that you *could* have as well, in future. It seems religious, which is unfortunate. Are you suggesting that the performance penalty, apart from the malloc performance (which is easily dealt with), is *material*? An extra indirection in access to errno will hurt so much? Non-zero I can accept, but clinging to 'non-zero' religiously isn't smart, especially if it's a build-time choice. We'll clearly move to multiple cores, and the clock speed enhancements will slow (at best). In many cases, the number of available cores will exceed the number of instantaneously active connections. Don't you want to be able to use all the horsepower? Certainly on the sort of systems I work on in my day job (big derivative trading systems) it's the norm that the cache hit rate on Sybase is well over 99%, and such systems are typically CPU bound. Parallelism matters, and will matter more and more in future. So an ability to start incrementally adding parallel operation of some actions (whether scanning or updating indices or pushing data to the peer) is valuable, as is the ability to use threaded libraries - and the (potential?) ability to use embedded languages and more advanced libraries in Postgres procs is one of the advantages of the system itself. 
(I'd like to discount the use of a runtime in a separate process - the latency is a problem for row triggers and functions) James
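To make the "extra indirection in access to errno" point concrete (a hedged sketch with invented names, assuming a pthreads build): in a threaded build errno is per-thread, a thread-local lookup rather than a plain global load, so a failure in a library's worker thread cannot clobber the backend thread's errno.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

/* Worker: fail an open and record the errno *this thread* sees. */
static void *record_errno_of_failed_open(void *out)
{
    errno = 0;
    FILE *f = fopen("/no/such/gr-demo-path", "r");  /* fails: ENOENT */
    if (f) fclose(f);                               /* not expected  */
    *(int *)out = errno;
    return NULL;
}

static int errno_seen_by_worker(void)
{
    int worker_errno = 0;
    pthread_t th;
    if (pthread_create(&th, NULL, record_errno_of_failed_open, &worker_errno))
        return -1;
    pthread_join(th, NULL);
    return worker_errno;
}
```

That per-thread lookup is the whole cost being argued over: one indirection on a path that is already making a system call.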
Re: [PERFORM] libgcc double-free, backend won't die
Gregory Stark wrote: 1) No Postgres function is guaranteed to be thread-safe so you better protect against concurrent calls to Postgres API functions. Also Postgres functions use longjmp which can restore the stack pointer to a value which may have been set earlier, possibly by another thread, which wouldn't work. That's a whole different thing from saying that you can't use a threaded subsystem under a Postgres process. 2) Some OSes have bugs (notably glibc for a specific narrow set of versions) and don't expect to have standard library functions called before pthread_init() then called again after pthread_init(). If they expect the system to be either "threaded" or "not threaded" then they may be surprised to see that state change. Is there any particular reason not to ensure that any low-level threading support in libc is enabled right from the get-go, as a build-time option? Does it do anything that's not well defined in a threaded process? Signal handling and atfork (and posix_ exec) are typical areas I guess. While this can potentially make malloc slower, Postgres already wraps malloc so using a caching thread-aware malloc substitute such as nedmalloc should be no problem. I don't see any issue with the setjmp usage - so long as only one thread uses any internal API. Which can be checked rather easily at runtime with low cost in a debug build. That just means you have to use a non-buggy version of your OS. Unfortunately tracking down bugs in your OS to figure out what's causing them and whether it's a particular known bug can be kind of tricky. Is that really much of an issue on the current version of any major OS though? It's reasonable to limit the use of a threaded library (in particular, the runtimes for most embeddable languages, or libraries for RPC runtimes, etc) to 'modern' platforms that support threads effectively. On many such platforms these will already implicitly link libpthread anyway. 
James ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] RAID arrays and performance
Mark Mielke wrote: PostgreSQL or the kernel should already have the hottest pages in memory, so the value of doing async I/O is very likely the cooler pages that are unique to the query. We don't know what the cooler pages are until we follow the tree down. I'm assuming that at the time we start to search the index, we have some idea of the value or values that we are looking for. Or, as you say, we are applying a function to 'all of it'. Think of a 'between' query. The subset of the index that can be a match can be bounded by the leaf pages that contain the end point(s). Similarly if we have a merge with a sorted intermediate set from a prior step then we also have bounds on the values. I'm not convinced that your assertion that the index leaf pages must necessarily be processed in-order is true - it depends what sort of operation is under way. I am assuming that we try hard to keep interior index nodes and data in memory and that, having identified the subset of these that we want, we can immediately infer the set of leaves that are potentially of interest. The difference between preload and handling async I/O in terms of performance is debatable. Greg reports that async I/O on Linux sucks, but posix_fadvise*() has substantial benefits. posix_fadvise*() is preload not async I/O (he also reported that async I/O on Solaris seems to work well). Being able to do work as the first page is available is a micro-optimization as far as I am concerned at this point (one that may not yet work on Linux), as the real benefit comes from utilizing all 12 disks in Matthew's case, not from guaranteeing that data is processed as soon as possible. I see it as part of the same problem. 
You can partition the data across all the disks and run queries in parallel against the partitions, or you can lay out the data in the RAID array, in which case the optimiser has very little idea how the data will map to the physical layout - so its best bet is to let the systems that DO know decide the access strategy. And those systems can only do that if you give them a lot of requests that CAN be reordered, so they can choose a good ordering. Micro-optimization. Well, you like to assert this - but why? If a concern is latency (and my experience suggests that latency is the biggest issue in practice, not throughput per se) then overlapping processing while waiting for 'distant' data is important - and we don't have any information about the physical layout of the data that allows us to assert that forward-access pre-read of data from a file is the right strategy for accessing it as fast as possible - we have to allow the OS (and to an increasing extent the disks) to manage the elevator IO to best effect. It's clear that the speed of streaming read and write of modern disks is really high compared to that of random access, so anything we can do to help the disks run in that mode is pretty worthwhile even if the physical streaming doesn't match any obvious logical ordering of the OS files or logical data pages within them. If you have a function to apply to a set of data elements and the elements are independent, then requiring that the function is applied in an order rather than conceptually in parallel is going to put a lot of constraint on how the hardware can optimise it. Clearly a hint to preload is better than nothing. 
But it seems to me that the worst case is that we wait for the slowest page to load and then start processing, hoping that the rest of the data stays in the buffer cache and is 'instant', while AIO and evaluate-when-ready means that the process is still bound by the slowest data to arrive, but at that point there is little processing still to do, and the already-processed buffers can be reused earlier. In the case where there is significant pressure on the buffer cache, this can be significant. Of course, a couple of decades bullying Sybase systems on Sun Enterprise boxes may have left me somewhat jaundiced - but Sybase can at least parallelise things. Sometimes. When it does, it's quite a big win. In your hand waving, you have assumed that the PostgreSQL B-Tree index might need to be replaced? :-) Sure, if the intent is to make the system thread-hot or AIO-hot, then the change is potentially very invasive. The strategy to evaluate queries based on parallel execution and async IO is not necessarily very like a strategy where you delegate to the OS buffer cache. I'm not too bothered for the purpose of this discussion whether the way that postgres currently navigates indexes is amenable to this. This is bikeshed land, right? I think it is foolish to disregard strategies that will allow overlapping IO and processing - and we want to keep disks reading and writing rather than seeking. To me that suggests AIO and disk-native queuing are quite a big deal. And parallel evaluation will be too as the number of cores goes up and there is an expectation that they will be used.
Re: [PERFORM] RAID arrays and performance
Mark Mielke wrote: At a minimum, this breaks your query into: 1) Preload all the index pages you will need Isn't this fairly predictable - the planner has chosen the index so it will be operating on a bounded subset. 2) Scan the index pages you needed Yes, and AIO helps when you can scan them in arbitrary order, as they are returned. 3) Preload all the table pages you will need No - just request that they load. You can do work as soon as any page is returned. 4) Scan the table pages you needed. In the order that is most naturally returned by the disks. But do you really need the whole index? I don't think I suggested that. What if you only need parts of the index, and this plan now reads the whole index using async I/O "just in case" it is useful? Where did you get that from? index scan into a regular B-Tree scan, which is difficult to perform async I/O for, as you don't necessarily know which pages to read next. Most B-trees are not so deep. It would generally be a win to retain interior nodes of indices in shared memory, even if leaf pages are not present. In such a case, it is quite quick to establish which leaf pages must be demand loaded. I'm not suggesting that Postgres indices are structured in a way that would support this sort of thing now. James
Re: [PERFORM] RAID arrays and performance
Mark Mielke wrote: This assumes that you can know which pages to fetch ahead of time - which you do not except for sequential read of a single table. Why doesn't it help to issue ahead-of-time IO requests when you are scanning an index? You can read ahead in index pages, and submit requests for data pages as soon as it is clear you'll want them. Doing so can allow the disks and OS to relax the order in which you receive them, which may allow you to process them while IO continues, and it may also optimise away some seeking and settle time. Maybe.
Re: [PERFORM] RAID arrays and performance
[EMAIL PROTECTED] wrote: So, if you hand requests one by one to the disc, it will almost always be faster to order them. On the other hand, if you hand a huge long list of requests to a decent SCSI or SATA-NCQ disc in one go, it will reorder the reads itself, and it will do it much better than you. Sure - but this doesn't suggest threading so much as pushing all reads into AIO as soon as they can be identified - and hoping that your OS has a decent AIO subsystem, which is sadly a tall order for many *nix systems. I do think some thought should be given to threading, but I would expect the effect to be more noticeable for updates where you update tables that have multiple indices. In the case of your scan you need threading on the CPU (rather than concurrent IO through AIO) if the disks can feed you data faster than you can scan it. Which might be the case for some scans using user functions, but I wouldn't have thought it would be the case for a simple index scan. At some point, hopefully, the engines will be: a) thread safe (if not thread hot) so they can coexist with threaded user functions and embedded languages b) able to incorporate C++ add-in functionality There may not be a pressing reason to support either of these, but having the capability to experiment would surely be helpful and allow incremental enhancement - so baby steps could be made to (for example) move stats and logging to a background thread, move the push of results to clients out of the query evaluator thread, and so on. Parallel threading of queries is a whole different ball game which needs thought in the optimiser. James
Re: [PERFORM] Performance on 8CPU's and 32GB of RAM
Carlo Stonebanks wrote: Isn't it just easier to assume that Windows Server can't do anything right? ;-) Well, avoiding the ;-) - people do, and it's remarkably foolish of them. It's a long-standing whinge that many people with a UNIX background seem to just assume that Windows sucks, but you could run 40,000 sockets from a single Win32 process for a while and some well-known UNIX systems would still struggle to do this, libevent or no. Admittedly, the way a Win32 app is architected would be rather different from a typical POSIX one. Windows has been a cheap target, but it's remarkably adequate and the TPC results speak for themselves. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Performance on 8CPU's and 32GB of RAM
Scott Marlowe wrote: Where unixes generally outperform windows is in starting up new backends, better file systems, and handling very large shared_buffer settings. Why do you think that UNIX systems are better at handling large shared buffers than Windows? 32 bit Windows systems can suffer from fragmented address space, to be sure, but if the performance of the operating-system supplied mutex or semaphore isn't good enough, you can just use the raw atomic ops. If what you mean is that pg has a design that's heavily oriented towards things that tend to be cheap on POSIX and doesn't use the core Win32 features effectively, then let's track that as an optimisation opportunity for the Win32 port. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Performance on 8CPU's and 32GB of RAM
Scott Marlowe wrote: And there's the issue that with windows / NTFS that when one process opens a file for read, it locks it for all other users. This means that things like virus scanners can cause odd, unpredictable failures of your database. Can you provide some justification for this? James
Re: [PERFORM] LIKE search and performance
[EMAIL PROTECTED] wrote: What is a real life example where an intelligent and researched database application would issue a like or ilike query as their primary condition in a situation where they expected very high selectivity? In my case the canonical example is to search against textual keys where the search is performed automatically if the user has typed enough data and paused. In almost all cases the '%' trails, and I'm looking for 'starts with' in effect. Usually the search will have a specified upper number of returned rows, if that's an available facility. I realise in this case that matching against the index does not allow the match count unless we check MVCC as we go, but I don't see why another thread can't be doing that. James ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] LIKE search and performance
If Sybase is still like SQL Server (or the other way around), it *may* end up scanning the index *IFF* the index is a clustered index. If it's a normal index, it will do a sequential scan on the table. Are you sure it's not covered? I'll have to check at work - but I'm off next week so it'll have to wait. It's not a win on PostgreSQL, because of our MVCC implementation. We need to scan *both* index *and* data pages if we go down that route, in which case it's a lot faster to just scan the data pages alone. Why do you need to go to all the data pages - doesn't the index structure contain all the keys, so you prefilter and then check to see if the *matched* items are still in view? I'll be the first to admit I know zip about Postgres, but it seems odd - doesn't the index contain copies of the key values? I suspect that I mis-spoke with 'leaf'. I really just mean 'all index pages with data', since the scan does not even need to be in index order, just a good way to get at the data in a compact way.
Re: [PERFORM] LIKE search and performance
Alexander Staubo wrote: On 5/23/07, Andy <[EMAIL PROTECTED]> wrote: An example would be: SELECT * FROM table WHERE name like '%john%' or street like '%srt%' Anyway, the query planner always does seq scan on the whole table and that takes some time. How can this be optimized or made in another way to be faster? There's no algorithm in existence that can "index" arbitrary substrings the way you think. The only rational way to accomplish this is to first break the text into substrings using some algorithm (e.g., words delimited by whitespace and punctuation), and index the substrings individually. That seems rather harsh. If I'd put an index on each of these columns I'd certainly expect it to use the indices - and I'm pretty sure that Sybase would. I'd expect it to scan the index leaf pages instead of the table itself - they should be much more compact and also likely to be hot in cache. Why *wouldn't* the planner do this? James ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] SCSI vs SATA
>Logic? Foul! That's NOT evidence. > >Mechanical devices have decreasing MTBF when run in hotter environments, >often at non-linear rates. I agree that this seems intuitive. But I think taking it as a cast-iron truth is dangerous. >Server class drives are designed with a longer lifespan in mind. Evidence? >Server class hard drives are rated at higher temperatures than desktop >drives. > >Google can supply any numbers to fill those facts in, but I found a >dozen or so data sheets for various enterprise versus desktop drives in >a matter of minutes. I know what the marketing info says, that's not the point. Bear in mind that these are somewhat designed to justify very much higher prices. I'm looking for statistical evidence that the difference is there, not marketing collateral. They may be designed to be more reliable. And the design targets *that the manufacturer admits to* may be more stringent, but I'm interested to know what the actual measured difference is. From the sound of it, you DON'T have such evidence. Which is not a surprise, because I don't have it either, and I do try to keep my eyes open for it. James -- No virus found in this outgoing message. Checked by AVG Free Edition. Version: 7.5.446 / Virus Database: 269.0.0/751 - Release Date: 07/04/2007 22:57
Re: [PERFORM] SCSI vs SATA
>Server drives are generally more tolerant of higher temperatures. I.e. >the failure rate for consumer and server class HDs may be about the same >at 40 degrees C, but by the time the internal case temps get up to 60-70 >degrees C, the consumer grade drives will likely be failing at a much >higher rate, whether they're working hard or not. Can you cite any statistical evidence for this? James
Re: [PERFORM] SCSI vs SATA
>Right --- the point is not the interface, but whether the drive is built >for reliability or to hit a low price point. Personally I take the marketing mumblings about the enterprise drives with a pinch of salt. The low-price drives HAVE TO be reliable too, because a non-negligible failure rate will result in returns processing costs that destroy a very thin margin. Granted, there was a move to very short warranties a while back, but the trend has been towards more realistic warranties again recently. You can bet they don't do this unless the drives are generally pretty good. James ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] SCSI vs SATA
>sure but for any serious usage one either wants to disable that >cache(and rely on tagged command queuing or how that is called in SATAII >world) or rely on the OS/raidcontroller implementing some sort of >FUA/write barrier feature(which linux for example only does in pretty >recent kernels) Does anyone know which other hosts have write barrier implementations? Solaris? FreeBSD? Windows? The buffers should help greatly in such a case, right? Particularly if you have quite a wide stripe.
Re: [PERFORM] compact flash disks?
>On sequential read speed HDs outperform flash disks... only on random >access the flash disks are better. So if your application is a DW one, >you're very likely better off using HDs. This looks likely to be a non-issue shortly, see here: http://www.reghardware.co.uk/2007/03/27/sams_doubles_ssd_capacity/ I still think this sort of device will become the OLTP device of choice before too long - even if we do have to watch the wear rate. >WARNING: modern TOtL flash RAMs are only good for ~1.2M writes per >memory cell. and that's the =good= ones. Well, my original question was whether the physical update pattern of the server has hotspots that will tend to cause a problem in normal usage if the wear levelling (such as it is) doesn't entirely spread the load. The sorts of application I'm interested in will not update individual data elements very often. There's a danger that index nodes might be rewritten frequently, but one might want to allow that indexes persist lazily and should be recovered from a scan after a crash that leaves them dirty, so that they can be cached and avoid such an access pattern. Out of interest with respect to WAL - has anyone tested to see whether one could tune the group commit to pick up slightly bigger blocks and write the WAL using compression, to up the *effective* write speed of the media? Once again, most data I'm interested in is far from a random pattern and tends to compress quite well. If the WAL write is committed to the disk platter, is it OK for arbitrary data blocks to have failed to commit to disk so we can recover the updates for committed transactions? Is there any documentation on the write barrier usage? James
Re: [PERFORM] compact flash disks?
Isn't it likely that a single stream (or perhaps one that can be partitioned across spindles) will tend to be fastest, since it has a nice localised stream that a) allows for compression of reasonable blocks and b) fits with commit aggregation? RAM capacity on servers is going up and up, but the size of a customer address or a row on an invoice isn't. I'd like to see an emphasis on speed of update with an assumption that most hot data is cached, most of the time. My understanding also is that storing data columnwise is handy when it's persisted because linear scans are much faster. Saw it once with a system modelled after APL; blew me away even on a sparc10 once the data was organised and could be mapped. Still, for the moment anything that helps with the existing system would be good. Would it help to define triggers to be deferrable to commit as well as end of statement (and per row)? Seems to me it should be, at least for ones that raise 'something changed' events. And/or allow specification that events can fold and should be very cheap (I don't know if this is the case now? It's not as well documented how this works as I'd like) James
Re: [PERFORM] compact flash disks?
>WARNING: modern TOtL flash RAMs are only good for ~1.2M writes per >memory cell. and that's the =good= ones. >Using flash RAM for write heavy applications like OLTP, or for WAL, >etc can be very dangerous Well, that's why I suggested that the WAL would stream to a hard disk array, where the large sequential IO write speed will be helpful. Whether OLTP is a problem will presumably depend on the frequency of updates and vacuums to each physical cluster of rows in a disk block. Most rows in a trading application will have quite a long lifetime, and be updated relatively few times (even where we write fixings info into trades). >Flash write speeds also stink; being ~1/2 flash's already low read speed. Sure - but it may still be an effective tradeoff where the limiting factor would otherwise be seek time. >Much better to use flash RAM for read heavy applications. Why? I can get a 'PC' server with 128GB of RAM quite easily now, and that will mean I can cache most if not all hot data for any trading app I've worked on. Settled trades that matured in prior periods can be moved to tables on real disks - they are hardly ever accessed anyway. In the long run, we are going to have to seriously rethink pg's use of WAL as the way we implement MVCC as it becomes more and more of a performance bottleneck. We have WAL because Stonebraker made an assumption about the future dominance of optical media that has turned out to be false. ...and it's been one of pg's big issues ever since. >> 2GB CF isn't so >>pricey any more. >Heck =16= GB Flash only costs ~$300 US and 128GB SSDs based on flash >RAM are due out this year. Quite. Suppose I have a RAID with double redundancy, then I get enough capacity for quite a lot of raw data, and can swap a card out every weekend and let the RAID rebuild it in rotation to keep them within conservative wear limits. So long as the wear levelling works moderately well (and without needing FAT on the disk or whatever) then I should be fine. I think. 
Maybe. James
[PERFORM] compact flash disks?
I see that one can now get compact flash to SATA connectors. If I were to use a filesystem with noatime etc and little non-SQL traffic, does the physical update pattern tend to have hot sectors that will tend to wear out CF? I'm wondering about a RAID5 with data on CF drives and RAID1 for the WAL on a fast SATA or SAS drive pair. I'm thinking that this would tend to have good performance because the seek time for the data is very low, even if the actual write speed can be slower than state of the art. 2GB CF isn't so pricey any more. Just wondering. James