Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
> From: Kenneth Marshall [mailto:[EMAIL PROTECTED]]
> [snip]
> The simplest idea I had was to pre-layout the WAL logs in a contiguous
> fashion on the disk. Solaris has this ability given appropriate FS
> parameters, and we should be able to get close on most other OSes. Once
> that has happened, use something like the FSM map to show the allocated
> blocks. The CPU can keep track of its current disk rotational position
> (approximate is okay); then when we need to write a WAL block, start
> writing at the next area that the disk head will be sweeping. Give it a
> little leeway for latency in the system and we should be able to get very
> low latency for the writes. Obviously there would be wasted space, but
> you could intersperse writes to whatever granularity of space overhead
> you would like to see. As far as implementation goes, I was reading an
> interesting article that used a simple theoretical model to estimate disk
> head position to avoid latency.

Ken,

That's a neat idea, but I'm not sure how much good it will do. As bad as rotational latency is, seek time is worse. Pre-allocation isn't going to do much for rotational latency if the heads also have to seek back to the WAL.

OTOH, pre-allocation could help two other performance aspects of the WAL. First, if the WAL were pre-allocated, steps could be taken (by the operator, based on their OS) to make the space allocated to the WAL contiguous; statistics on how much WAL is needed in 24 hours would help with that sizing. This would reduce the seeks involved in writing the WAL data. The other thing it would do is reduce the seeks and metadata writes involved in extending WAL files.

All of this is moot if the WAL doesn't have its own spindle(s). This almost leads back to the old-fashioned idea of using a raw partition, to avoid the overhead of the OS and file structure.

Or I could be thoroughly demonstrating my complete lack of understanding of PostgreSQL internals.
:-) Maybe I'll get a chance to try the flash-drive WAL idea in the next couple of weeks. Need to see if the hardware guys have a spare flash drive I can abuse.

Paul
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
On Wed, Nov 24, 2004 at 11:00:30AM -0500, Bort, Paul wrote:
> > From: Kenneth Marshall [mailto:[EMAIL PROTECTED]]
> > [snip]
> > The simplest idea I had was to pre-layout the WAL logs in a contiguous
> > fashion on the disk. Solaris has this ability given appropriate FS
> > parameters, and we should be able to get close on most other OSes.
> > Once that has happened, use something like the FSM map to show the
> > allocated blocks. The CPU can keep track of its current disk
> > rotational position (approximate is okay); then when we need to write
> > a WAL block, start writing at the next area that the disk head will be
> > sweeping. Give it a little leeway for latency in the system and we
> > should be able to get very low latency for the writes. Obviously there
> > would be wasted space, but you could intersperse writes to whatever
> > granularity of space overhead you would like to see. As far as
> > implementation goes, I was reading an interesting article that used a
> > simple theoretical model to estimate disk head position to avoid
> > latency.
>
> Ken,
>
> That's a neat idea, but I'm not sure how much good it will do. As bad as
> rotational latency is, seek time is worse. Pre-allocation isn't going to
> do much for rotational latency if the heads also have to seek back to
> the WAL.
>
> OTOH, pre-allocation could help two other performance aspects of the
> WAL. First, if the WAL were pre-allocated, steps could be taken (by the
> operator, based on their OS) to make the space allocated to the WAL
> contiguous; statistics on how much WAL is needed in 24 hours would help
> with that sizing. This would reduce the seeks involved in writing the
> WAL data. The other thing it would do is reduce the seeks and metadata
> writes involved in extending WAL files.
>
> All of this is moot if the WAL doesn't have its own spindle(s). This
> almost leads back to the old-fashioned idea of using a raw partition, to
> avoid the overhead of the OS and file structure.
>
> Or I could be thoroughly demonstrating my complete lack of understanding
> of PostgreSQL internals.
> :-) Maybe I'll get a chance to try the flash drive WAL idea in the next
> couple of weeks. Need to see if the hardware guys have a spare flash
> drive I can abuse.
>
> Paul

Obviously, this whole process would be much more effective on systems with separate WAL drives. But even on less busy systems, the lock-step of write-a-WAL / wait-for-heads / write-a-WAL can dramatically decrease your effective throughput to the drive.

For example, the worst case would be: write one WAL block to disk, then schedule another WAL block to be written. That block will need to wait for one full disk rotation before the write can be performed. On a 10k RPM drive, in this scenario you can log 166 TPS, assuming no piggy-backed syncs. Now look at the case where we can use the pre-allocated WAL and write immediately: assuming a 100% sequential disk layout, if we can start writing within 25% of a full rotation we can now support 664 TPS on the same hardware.

Now look at a typical hard drive on my desktop system with 150M sectors/4 heads/5 tracks - 3000 blocks/track, or 375 8K blocks. If we can write the next block within 10 8K blocks we can perform 6225 TPS; within 5 8K blocks, 12450 TPS; within 2 8K blocks, 31125 TPS. This is just on a simple disk drive. As you can see, even small improvements can make a tremendous difference in throughput.

My analysis is very simplistic, and whether we can model the I/O quickly enough to be useful is still to be determined. Maybe someone on the mailing list with more experience in how disk drives actually function can provide more definitive information.

Ken

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
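[Editor's note: Ken's back-of-the-envelope figures can be reproduced with a tiny model. This is a sketch only; the 10k RPM rotation speed and 375 8K-blocks-per-track figure are the assumptions from his example, and the small differences from his numbers (166 vs. 167, etc.) come from rounding.]

```python
# Toy model of WAL commit rate: each fsync waits some fraction of a
# disk rotation before the head reaches the next writable WAL block.

def tps(rpm, wait_fraction):
    """Transactions/sec if each commit waits wait_fraction of a rotation."""
    rotation_s = 60.0 / rpm
    return 1.0 / (rotation_s * wait_fraction)

# 10k RPM drive, one full rotation between fsyncs:
print(round(tps(10_000, 1.0)))       # 167 (Ken rounds down to 166)

# Same drive, if we can start writing within 25% of a rotation:
print(round(tps(10_000, 0.25)))      # 667

# Desktop drive from the example: 375 8K blocks per track.
blocks_per_track = 375
for gap in (10, 5, 2):               # start writing within `gap` blocks
    print(gap, round(tps(10_000, gap / blocks_per_track)))
    # prints 6250, 12500, 31250 -- close to Ken's 6225/12450/31125
```

The point of the model survives the rounding: shrinking the wait from a full rotation to a couple of blocks is a two-orders-of-magnitude swing in achievable commit rate.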
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
> The impression I had was that disk drives no longer pay the slightest
> attention to interleave specs, because the logical model implied by the
> concept is too far removed from modern reality (on-disk buffering,
> variable numbers of sectors per track, transparently remapped bad
> sectors, yadda yadda).

Entirely true. Interleave was an issue back when the controller wasn't fast enough to keep up with 3600 RPM disks, and is now completely obscured from the bus. I don't know if the ATA spec includes interleave control; I suspect it does not.

> And that's just at the hardware level ... who knows where the filesystem
> is putting your data, or what the kernel I/O scheduler is doing with
> your requests :-(
>
> Basically I see the TODO item as a blue-sky research topic, not
> something we have any idea how to implement. That doesn't mean it can't
> be on the TODO list ...

I think that if we also take into consideration various hardware and software RAID configurations, this is just too far removed from the database level to be at all practical to throw code at. Perhaps this should be rewritten as a documentation change: recommendations about performance hardware?

What we recommend for our highest-volume customers (alas, on a proprietary RDBMS, and only x86) is something like this:

- Because drive capacity is so huge now, choose faster drives over larger drives. 15K RPM isn't three times faster than 5400, but there is a noticeable difference.

- More spindles reduce delays even further. Mirroring allows reads to happen faster because they can come from either side of the mirror, and spanning reduces problems with rotational delays.

- The ideal disk configuration that we recommend is a 14-drive chassis with a split backplane. Run each backplane to a separate channel on the controller, and mirror the channels.
Use the first drive on each channel for the OS and swap, the second drive for transaction logs, and the remaining drives spanned (and already mirrored) for data. With a reasonable write cache on the controller, this has proven to be a pretty fast configuration despite a less-than-ideal engine.

One other thought: how does static RAM compare to disk speed nowadays? A 1GB flash drive might be reasonable for the WAL if it can keep up.
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
Bort, Paul [EMAIL PROTECTED] writes:
> One other thought: how does static RAM compare to disk speed nowadays?
> A 1GB flash drive might be reasonable for the WAL if it can keep up.

Flash RAM wears out; it's not suitable for a continuously updated application like WAL.

-Doug
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
> From: Doug McNaught [mailto:[EMAIL PROTECTED]]
>
> Bort, Paul [EMAIL PROTECTED] writes:
> > One other thought: how does static RAM compare to disk speed nowadays?
> > A 1GB flash drive might be reasonable for the WAL if it can keep up.
>
> Flash RAM wears out; it's not suitable for a continuously updated
> application like WAL.
>
> -Doug

But if it's even 2x faster than a disk, that might be worth wearing them out. Given that they have published write-count limits, one could reasonably plan to replace the memory after half of that time and be comfortable with the lifecycle. I saw somewhere that even with continuous writes over USB 2.0, it would take about twelve years to exhaust the write life of a typical flash drive. Even an order-of-magnitude increase in throughput beyond that only calls for a new drive every year. (Or every six months if you're paranoid. If you're that paranoid, you can mirror them, too.) Whether USB 2.0 is fast enough for the WAL is a separate discussion.
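[Editor's note: the wear-life argument above is easy to put on paper. The sketch below is illustrative only; the erase-cycle rating, drive size, and sustained write rate are assumed numbers chosen to land near the quoted "about twelve years", not vendor specifications, and it assumes perfect wear leveling.]

```python
# Rough flash wear-life estimate for a WAL device: total writable bytes
# before the rated erase-cycle count is exhausted, divided by write rate.

def wear_life_years(capacity_bytes, erase_cycles, write_rate_bps):
    """Years until every cell has been rewritten erase_cycles times,
    assuming writes are spread evenly across the whole drive."""
    total_writable = capacity_bytes * erase_cycles
    seconds = total_writable / write_rate_bps
    return seconds / (365 * 24 * 3600)

# Assumed: 1 GB drive, 100,000 erase cycles, ~250 kB/s sustained WAL rate.
years = wear_life_years(1e9, 100_000, 250e3)
print(f"about {years:.0f} years")
```

Note how sensitive the result is to the sustained write rate: at a bus-saturating rate the same drive wears out orders of magnitude sooner, which is the "order-of-magnitude increase in throughput" caveat above.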
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
On Tue, Nov 23, 2004 at 12:04:17AM +0000, Simon Riggs wrote:
> On Mon, 2004-11-22 at 23:37, Greg Stark wrote:
> > Simon Riggs [EMAIL PROTECTED] writes:
> > > - Find a way to reduce rotational delay when repeatedly writing last
> > >   WAL page
> > >
> > >   Currently fsync of WAL requires the disk platter to perform a full
> > >   rotation to fsync again. One idea is to write the WAL to different
> > >   offsets that might reduce the rotational delay.
> >
> > Once upon a time when you formatted hard drives you actually gave them
> > an interleave factor for a similar reason. These days you invariably
> > use an interleave of 1, i.e., store the blocks contiguously. Whether
> > that's because controllers have become fast enough to keep up with the
> > burst rate or because the firmware is smart enough to handle the block
> > interleaving invisibly isn't clear to me.
> >
> > I wonder if formatting the drive to have an interleave other than 1
> > would actually improve performance of the WAL log. It would depend a
> > lot on the usage pattern, though. A heavily used system might be able
> > to generate enough WAL traffic to keep up with the burst rate of the
> > drive, and a less-used system might benefit but might lose. Probably
> > now the less-than-saturated system gets close to the average
> > half-rotation-time latency. This idea would only really help if you
> > have a system that happens to be triggering pessimal results worse
> > than that due to unfortunate timing.
>
> I was asking whether that topic should be removed, since Tom had said it
> had been rejected.
>
> If you could tell me how to instrument the system to (better) show
> whether such plans as you suggest are workable, I would be greatly
> interested. Anything we do needs to be able to be monitored for
> success/failure.
>
> -- Best Regards, Simon Riggs

Disk performance has increased so much that the reasons for having an interleave factor other than 1 (no interleaving) have all but disappeared. CPU speed has also increased so much relative to disk speed that using some CPU cycles to improve I/O is a reasonable approach.
I have been considering how this might be accomplished. As Simon so aptly pointed out, we need to show that it materially affects performance or it is not worth doing.

The simplest idea I had was to pre-layout the WAL logs in a contiguous fashion on the disk. Solaris has this ability given appropriate FS parameters, and we should be able to get close on most other OSes. Once that has happened, use something like the FSM map to show the allocated blocks. The CPU can keep track of its current disk rotational position (approximate is okay); then when we need to write a WAL block, start writing at the next area that the disk head will be sweeping. Give it a little leeway for latency in the system and we should be able to get very low latency for the writes. Obviously there would be wasted space, but you could intersperse writes to whatever granularity of space overhead you would like to see. As far as implementation goes, I was reading an interesting article that used a simple theoretical model to estimate disk head position to avoid latency.

Yours truly,
Ken Marshall
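[Editor's note: the placement scheme Ken describes (estimate the head's rotational position from elapsed time, then pick a pre-allocated block slightly ahead of it) can be sketched in miniature. Everything here is illustrative: the rotation speed, blocks per track, guard gap, and the class itself are assumptions for the sake of the sketch, not anything in PostgreSQL.]

```python
import time

class RotationalWalPlacer:
    """Toy model: choose a WAL write offset just ahead of where the disk
    head is estimated to be, per the pre-layout idea above."""

    def __init__(self, rpm=10_000, blocks_per_track=375, guard_blocks=2):
        self.rotation_s = 60.0 / rpm
        self.blocks_per_track = blocks_per_track
        self.guard_blocks = guard_blocks    # "leeway for latency in the system"
        self.t0 = time.monotonic()          # assume head over block 0 at t0

    def estimated_head_block(self, now=None):
        """Approximate block currently under the head (approx. is okay)."""
        now = time.monotonic() if now is None else now
        frac = ((now - self.t0) % self.rotation_s) / self.rotation_s
        return int(frac * self.blocks_per_track)

    def next_write_block(self, now=None):
        """First block the head has not yet passed, plus the guard gap."""
        head = self.estimated_head_block(now)
        return (head + self.guard_blocks) % self.blocks_per_track

placer = RotationalWalPlacer()
# Exactly half a rotation after t0 the head is over block 187 of 375,
# so we aim two blocks further along the track:
half_turn = placer.t0 + placer.rotation_s / 2
print(placer.next_write_block(now=half_turn))   # 189
```

The real difficulty Ken flags remains: the estimate is only useful if the whole write path (scheduler, controller, cache) is fast and predictable enough that the guard gap stays small.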
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
On Thu, 2004-11-18 at 23:54, Tom Lane wrote:
> I don't think so; WAL is inherently a linear log. (Awhile ago there was
> some talk of nonlinear log writing to get around the one-commit-per-
> disk-revolution syndrome, but the idea basically got rejected as
> unworkably complicated.)

...this appears to still be on the TODO list... should it be removed?

- Find a way to reduce rotational delay when repeatedly writing last WAL page

  Currently fsync of WAL requires the disk platter to perform a full rotation to fsync again. One idea is to write the WAL to different offsets that might reduce the rotational delay.

-- Best Regards, Simon Riggs
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
Simon Riggs [EMAIL PROTECTED] writes:
> - Find a way to reduce rotational delay when repeatedly writing last WAL
>   page
>
>   Currently fsync of WAL requires the disk platter to perform a full
>   rotation to fsync again. One idea is to write the WAL to different
>   offsets that might reduce the rotational delay.

Once upon a time when you formatted hard drives you actually gave them an interleave factor for a similar reason. These days you invariably use an interleave of 1, i.e., store the blocks contiguously. Whether that's because controllers have become fast enough to keep up with the burst rate or because the firmware is smart enough to handle the block interleaving invisibly isn't clear to me.

I wonder if formatting the drive to have an interleave other than 1 would actually improve performance of the WAL log. It would depend a lot on the usage pattern, though. A heavily used system might be able to generate enough WAL traffic to keep up with the burst rate of the drive, and a less-used system might benefit but might lose. Probably now the less-than-saturated system gets close to the average half-rotation-time latency. This idea would only really help if you have a system that happens to be triggering pessimal results worse than that due to unfortunate timing.

-- greg
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
On Mon, 2004-11-22 at 23:37, Greg Stark wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
> > - Find a way to reduce rotational delay when repeatedly writing last
> >   WAL page
> >
> >   Currently fsync of WAL requires the disk platter to perform a full
> >   rotation to fsync again. One idea is to write the WAL to different
> >   offsets that might reduce the rotational delay.
>
> Once upon a time when you formatted hard drives you actually gave them
> an interleave factor for a similar reason. These days you invariably use
> an interleave of 1, i.e., store the blocks contiguously. Whether that's
> because controllers have become fast enough to keep up with the burst
> rate or because the firmware is smart enough to handle the block
> interleaving invisibly isn't clear to me.
>
> I wonder if formatting the drive to have an interleave other than 1
> would actually improve performance of the WAL log. It would depend a lot
> on the usage pattern, though. A heavily used system might be able to
> generate enough WAL traffic to keep up with the burst rate of the drive,
> and a less-used system might benefit but might lose. Probably now the
> less-than-saturated system gets close to the average half-rotation-time
> latency. This idea would only really help if you have a system that
> happens to be triggering pessimal results worse than that due to
> unfortunate timing.

I was asking whether that topic should be removed, since Tom had said it had been rejected.

If you could tell me how to instrument the system to (better) show whether such plans as you suggest are workable, I would be greatly interested. Anything we do needs to be able to be monitored for success/failure.

-- Best Regards, Simon Riggs
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
Simon Riggs wrote:
> On Mon, 2004-11-22 at 23:37, Greg Stark wrote:
> > Once upon a time when you formatted hard drives you actually gave them
> > an interleave factor for a similar reason. These days you invariably
> > use an interleave of 1, i.e., store the blocks contiguously. Whether
> > that's because controllers have become fast enough to keep up with the
> > burst rate or because the firmware is smart enough to handle the block
> > interleaving invisibly isn't clear to me.
> >
> > I wonder if formatting the drive to have an interleave other than 1
> > would actually improve performance of the WAL log. It would depend a
> > lot on the usage pattern, though. A heavily used system might be able
> > to generate enough WAL traffic to keep up with the burst rate of the
> > drive, and a less-used system might benefit but might lose. Probably
> > now the less-than-saturated system gets close to the average
> > half-rotation-time latency. This idea would only really help if you
> > have a system that happens to be triggering pessimal results worse
> > than that due to unfortunate timing.
>
> I was asking whether that topic should be removed, since Tom had said it
> had been rejected.

The method used to fix it was rejected, but the goal of making it better is still a valid one.

--
  Bruce Momjian                     |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                 |  (610) 359-1001
  + If your life is a hard drive,   |  13 Roberts Road
  + Christ can be your backup.      |  Newtown Square, Pennsylvania 19073
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
Greg Stark [EMAIL PROTECTED] writes:
> Once upon a time when you formatted hard drives you actually gave them
> an interleave factor for a similar reason. These days you invariably use
> an interleave of 1, i.e., store the blocks contiguously. Whether that's
> because controllers have become fast enough to keep up with the burst
> rate or because the firmware is smart enough to handle the block
> interleaving invisibly isn't clear to me.

The impression I had was that disk drives no longer pay the slightest attention to interleave specs, because the logical model implied by the concept is too far removed from modern reality (on-disk buffering, variable numbers of sectors per track, transparently remapped bad sectors, yadda yadda).

And that's just at the hardware level ... who knows where the filesystem is putting your data, or what the kernel I/O scheduler is doing with your requests :-(

Basically I see the TODO item as a blue-sky research topic, not something we have any idea how to implement. That doesn't mean it can't be on the TODO list ...

regards, tom lane
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
On Sat, 2004-11-20 at 16:14, Tom Lane wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
> > On Thu, 2004-11-18 at 22:55, Tom Lane wrote:
> > > If it is a problem, the LockBuffer calls in
> > > RelationGetBufferForTuple would be the places showing contention
> > > delays.
> >
> > You say this as if we can easily check that.
>
> I think this can be done with oprofile ...

OK, well that's where this thread started. oprofile only tells us aggregate information. It doesn't tell us how much time is spent waiting because of contention issues; it just tells us how much time is spent, and even that is skewed. There really ought to be a better way to instrument things from inside, based upon knowledge of the code.

-- Best Regards, Simon Riggs
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
On Thu, 2004-11-18 at 22:55, Tom Lane wrote:
> Josh Berkus [EMAIL PROTECTED] writes:
> > > The main problem on INSERTs is that it is usually the same few
> > > pages: the lead data block and the lead index block. There are ways
> > > of spreading the load out across an index, but I'm not sure what
> > > happens on the leading edge of the data relation; I think it hits
> > > the same block each time.
> >
> > I actually have several test cases for this. Can you give me a trace
> > or profile suggestion that would show if this is happening?
>
> If it is a problem, the LockBuffer calls in RelationGetBufferForTuple
> would be the places showing contention delays.

You say this as if we can easily check that. My understanding is that this would require a scripted gdb session to instrument the executable at that point. Is that what you mean? That isn't typically regarded as a great thing to do on a production system. You've cautioned against performance speculation, which I agree with, but what are the alternatives? Compile-time changes usually can't be enabled, since many people work from RPMs.

> It could also be that the contention is for the WALInsertLock, ie, the
> right to stuff a WAL record into the shared buffers. This effect would
> be the same even if you were inserting into N separate tables.

...and how do we check that also? Are we back to simulated workloads and fully rigged executables?

-- Best Regards, Simon Riggs
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
Simon Riggs [EMAIL PROTECTED] writes:
> On Thu, 2004-11-18 at 22:55, Tom Lane wrote:
> > If it is a problem, the LockBuffer calls in RelationGetBufferForTuple
> > would be the places showing contention delays.
>
> You say this as if we can easily check that.

I think this can be done with oprofile ...

regards, tom lane
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
Tom,

> I think you are right that these reflect heap or btree-index extension
> operations. Those do not actually take locks on the *table* however, but
> locks on a single page within it (which are completely orthogonal to
> table locks and don't conflict). The pg_locks output leaves something to
> be desired, because you can't tell the difference between table and page
> locks.

Aside from foreign keys, though, is there any way in which INSERT page locks could block other inserts? I have another system (Lyris) where that appears to be happening with 32 concurrent INSERT streams. It's possible that the problem is somewhere else, but I'm disturbed by the possibility.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
Josh Berkus [EMAIL PROTECTED] writes:
> Aside from foreign keys, though, is there any way in which INSERT page
> locks could block other inserts?

Not for longer than the time needed to physically add a tuple to a page.

regards, tom lane
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
Simon Riggs [EMAIL PROTECTED] writes:
> The main problem on INSERTs is that it is usually the same few pages:
> the lead data block and the lead index block. There are ways of
> spreading the load out across an index, but I'm not sure what happens on
> the leading edge of the data relation; I think it hits the same block
> each time.

FSM does what it can to spread the insertion load across multiple pages, but of course this is not going to help much unless your table has lots of embedded free space. I think it would work pretty well on a table with lots of update turnover, but not on an INSERT-only workload.

regards, tom lane
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
On Thu, 2004-11-18 at 22:12, Tom Lane wrote:
> Josh Berkus [EMAIL PROTECTED] writes:
> > Aside from foreign keys, though, is there any way in which INSERT page
> > locks could block other inserts?
>
> Not for longer than the time needed to physically add a tuple to a page.

The main problem on INSERTs is that it is usually the same few pages: the lead data block and the lead index block. There are ways of spreading the load out across an index, but I'm not sure what happens on the leading edge of the data relation; I think it hits the same block each time.

Only an issue if you have more than one CPU...

-- Best Regards, Simon Riggs
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
Simon, Tom,

> The main problem on INSERTs is that it is usually the same few pages:
> the lead data block and the lead index block. There are ways of
> spreading the load out across an index, but I'm not sure what happens on
> the leading edge of the data relation; I think it hits the same block
> each time.

I actually have several test cases for this. Can you give me a trace or profile suggestion that would show if this is happening?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
Josh Berkus [EMAIL PROTECTED] writes:
> > The main problem on INSERTs is that it is usually the same few pages:
> > the lead data block and the lead index block. There are ways of
> > spreading the load out across an index, but I'm not sure what happens
> > on the leading edge of the data relation; I think it hits the same
> > block each time.
>
> I actually have several test cases for this. Can you give me a trace or
> profile suggestion that would show if this is happening?

If it is a problem, the LockBuffer calls in RelationGetBufferForTuple would be the places showing contention delays.

It could also be that the contention is for the WALInsertLock, ie, the right to stuff a WAL record into the shared buffers. This effect would be the same even if you were inserting into N separate tables.

regards, tom lane
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
On Thu, 2004-11-18 at 22:51, Tom Lane wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
> > The main problem on INSERTs is that it is usually the same few pages:
> > the lead data block and the lead index block. There are ways of
> > spreading the load out across an index, but I'm not sure what happens
> > on the leading edge of the data relation; I think it hits the same
> > block each time.
>
> FSM does what it can to spread the insertion load across multiple pages,
> but of course this is not going to help much unless your table has lots
> of embedded free space. I think it would work pretty well on a table
> with lots of update turnover, but not on an INSERT-only workload.

OK, that's what I thought. So with a table with an INSERT-only workload, the FSM is always empty, so there only ever is one block that gets locked. That means we can't ever go faster than one CPU can go; any other CPUs will just wait for the block lock. [In Josh's case, 32 INSERT streams won't go significantly faster than about 4 streams, allowing for some overlap of other operations.]

Would it be possible to: when a new block is allocated from the relation file (rather than reused), check the FSM; if it is empty, then allocate 8 new blocks and add them all to the FSM. The next few INSERTers will then use the FSM blocks normally.

Doing that would definitely speed up DBT-2 and many other workloads. Many tables have SERIAL defined, or use a monotonically increasing unique key.

-- Best Regards, Simon Riggs
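[Editor's note: Simon's batch-extension proposal can be sketched in miniature. This is a toy model, not PostgreSQL code; the batch size of 8 comes from the post, while the `Relation` class and its data structures are invented for illustration.]

```python
from collections import deque

EXTEND_BATCH = 8  # blocks added per physical extension, per the proposal

class Relation:
    """Toy relation: a physical block count plus an FSM-like free list."""

    def __init__(self):
        self.nblocks = 0     # physical size of the relation file
        self.fsm = deque()   # blocks advertised as having free space

    def extend(self, n):
        """Physically append n empty blocks (one extension operation)."""
        start = self.nblocks
        self.nblocks += n
        return list(range(start, start + n))

    def get_block_for_insert(self):
        # Reuse a block advertised by the FSM if one exists ...
        if self.fsm:
            return self.fsm.popleft()
        # ... otherwise extend by a whole batch and advertise the rest,
        # so concurrent inserters land on different blocks instead of
        # all contending for the single newest block.
        fresh = self.extend(EXTEND_BATCH)
        self.fsm.extend(fresh[1:])
        return fresh[0]

rel = Relation()
blocks = [rel.get_block_for_insert() for _ in range(10)]
print(blocks)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(rel.nblocks)  # 16: ten inserts cost only two physical extensions
```

The sketch shows the intended effect: inserters are spread across a batch of blocks and extension happens once per batch rather than once per block. It does not model Tom's objection in the reply, namely that the serialization may simply move to the WALInsertLock.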
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
Simon Riggs [EMAIL PROTECTED] writes:
> Would it be possible to: when a new block is allocated from the relation
> file (rather than reused), check the FSM; if it is empty, then allocate
> 8 new blocks and add them all to the FSM. The next few INSERTers will
> then use the FSM blocks normally.

Most likely that would just shift the contention to the WALInsertLock.

regards, tom lane
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
On Thu, 2004-11-18 at 23:19, Tom Lane wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
> > Would it be possible to: when a new block is allocated from the
> > relation file (rather than reused), check the FSM; if it is empty,
> > then allocate 8 new blocks and add them all to the FSM. The next few
> > INSERTers will then use the FSM blocks normally.
>
> Most likely that would just shift the contention to the WALInsertLock.

Well, removing any performance bottleneck shifts the bottleneck to another place, though that is not an argument against removing it.

Can we subdivide the WALInsertLock so there are multiple entry points to wal_buffers, based upon hashing the xid? That would allow WAL to be written sequentially by each transaction, though slightly out of order across different transactions. Commit/abort would all go through the same lock to guarantee serializability.

-- Best Regards, Simon Riggs
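[Editor's note: the subdivision Simon floats here is simple to illustrate in the abstract. This sketch is purely hypothetical, with an invented slot count and data layout; it models only the routing idea, and Tom's reply explains why it does not work for a WAL that must stay totally ordered.]

```python
import threading

N_SLOTS = 4  # illustrative; a real design would size this carefully

# One insert lock + buffer per slot, instead of a single WALInsertLock.
slots = [{"lock": threading.Lock(), "buf": []} for _ in range(N_SLOTS)]

def wal_insert(xid, record):
    """Route a WAL record to a slot by hashing the xid, so concurrent
    transactions contend on different locks. Records from one xid stay
    in order; ordering across xids is only per-slot."""
    slot = slots[xid % N_SLOTS]
    with slot["lock"]:
        slot["buf"].append((xid, record))

for xid in range(8):
    wal_insert(xid, f"insert by xact {xid}")
print([len(s["buf"]) for s in slots])   # [2, 2, 2, 2]
```

Per the post, commit and abort records would still funnel through a single lock for serializability; per Tom's reply, records like btree page splits and sequence nextvals must also remain globally time-ordered, which this partitioning cannot provide.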
Re: [Testperf-general] Re: [HACKERS] ExclusiveLock
Simon Riggs [EMAIL PROTECTED] writes:
> Can we subdivide the WALInsertLock so there are multiple entry points to
> wal_buffers, based upon hashing the xid?

I don't think so; WAL is inherently a linear log. (Awhile ago there was some talk of nonlinear log writing to get around the one-commit-per-disk-revolution syndrome, but the idea basically got rejected as unworkably complicated.) What's more, there are a lot of entries that must remain time-ordered independently of transaction ownership. Consider btree index page splits and sequence nextvals for two examples.

Certainly I'd not buy into any such project without incontrovertible proof that it would solve a major bottleneck --- and right now we are only speculating with no evidence.

regards, tom lane