Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
> Naturally, there are other compression and delta encoding schemes. Does
> anyone feel the need to explore further alternatives?
>
> We might eventually find the need for multiple, user-selectable, WAL
> compression strategies. I don't recommend taking that step yet.

My currently implemented compression strategy is to run the WAL block through gzip in the archive command. It compresses pretty nicely and achieved 50%+ savings in my workload (generally closer to 70%). On a multi-core system it will take more CPU time, but on a different core, and so has no effect on tps. General compression should probably only be applied if it has a positive gain on tps.

Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
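Jesper's archive-time gzip idea can be demonstrated outside PostgreSQL. The buffer below is a hypothetical stand-in for an 8 KiB WAL page (a little record data plus zero padding, as in lightly filled WAL); real segments compress well for similar reasons:

```python
import gzip

# Hypothetical stand-in for one 8 KiB WAL page: some repeated record
# payload followed by zero padding.
page = b"UPDATE-record-payload" * 40 + b"\x00" * (8192 - 21 * 40)

# What an archive_command piping the segment through gzip would achieve.
compressed = gzip.compress(page)
savings = 1 - len(compressed) / len(page)
```

The compression happens after the WAL is written, so, as Jesper notes, it costs CPU on another core rather than lengthening the commit path.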
[HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On Tue, Oct 23, 2012 at 08:21:54PM -0400, Noah Misch wrote:
> -Patch-             -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
> HEAD,-F80            816        1644       6528      1821 MiB
> xlogscale,-F80       824        1643       6551      1826 MiB
> xlogscale+lz,-F80    717        1466       5924      1137 MiB
> xlogscale+lz,-F100   753        1508       5948      1548 MiB
>
> Those are short runs with no averaging of multiple iterations; don't put too
> much faith in the absolute numbers.

I decided to rerun those measurements with three 15-minute runs. I removed the -F100 test and added wal_update_changes_v2.patch (delta encoding version) to the mix. Median results:

-Patch-             -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
HEAD,-F80            832        1679       6797      44 GiB
scale,-F80           830        1679       6798      44 GiB
scale+lz,-F80        736        1498       6169      11 GiB
scale+delta,-F80     841        1713       7056      10 GiB

The numbers varied little across runs. So we see the same general trends as with the short runs; overall performance is slightly higher across the board, and the fraction of WAL avoided is much higher. I suspect the patches shrink WAL better in these longer runs because the WAL of a short run contains a higher density of full-page images.

From these results, I think the LZ approach is something we could only provide as an option; CPU-bound workloads may not be our bread and butter, but we shouldn't dock them 10% with no option to disable. Amit's delta encoding approach seems to be something we could safely enable across the board.

Naturally, there are other compression and delta encoding schemes. Does anyone feel the need to explore further alternatives?

We might eventually find the need for multiple, user-selectable, WAL compression strategies. I don't recommend taking that step yet.

nm
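The "median of three runs" reduction Noah applies is straightforward to reproduce; the tps samples below are made-up placeholders, not numbers from this thread:

```python
from statistics import median

# Hypothetical tps samples from three 15-minute runs per configuration.
runs = {
    "HEAD,-F80":        [829, 832, 834],
    "scale+delta,-F80": [838, 841, 845],
}

# Report the median per configuration, which damps outlier runs better
# than the mean when a single iteration hits a checkpoint.
medians = {patch: median(samples) for patch, samples in runs.items()}
```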
Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On Tue, Oct 23, 2012 at 8:21 PM, Noah Misch wrote:
> -Patch-             -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
> HEAD,-F80            816        1644       6528      1821 MiB
> xlogscale,-F80       824        1643       6551      1826 MiB
> xlogscale+lz,-F80    717        1466       5924      1137 MiB
> xlogscale+lz,-F100   753        1508       5948      1548 MiB

Ouch. I've been pretty excited by this patch, but I don't think we want to take an "optimization" that produces a double-digit hit at 1 client and doesn't gain even at 8 clients. I'm surprised this is costing that much, though. It doesn't seem like it should.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On Wed, Oct 24, 2012 at 05:55:56AM +, Amit kapila wrote:
> Wednesday, October 24, 2012 5:51 AM Noah Misch wrote:
> > Stepping back a moment, I would expect this patch to change performance in at
> > least four ways (Heikki largely covered this upthread):
> > a) High-concurrency workloads will improve thanks to reduced WAL insert
> >    contention.
> > b) All workloads will degrade due to the CPU cost of identifying and
> >    implementing the optimization.
> > c) Workloads starved for bulk WAL I/O will improve due to reduced WAL volume.
> > d) Workloads composed primarily of long transactions with high WAL volume
> >    will improve due to having fewer end-of-WAL-segment fsync requests.
>
> All your points are a very good summarization of the work, but I think one
> point can be added:
> e) Reduced cost of doing the CRC and of copying less data into the XLOG
>    buffer in XLogInsert() due to the reduced size of the xlog record.

True.

> > Your benchmark numbers show small gains and losses for single-client
> > workloads, moving to moderate gains for 2-client workloads. This suggests
> > strong influence from (a), some influence from (b), and little influence
> > from (c) and (d). Actually, the response to scale evident in your numbers
> > seems too good to be true; why would (a) have such a large effect over the
> > transition from one client to two clients?
>
> I think if we just see from the point of LZ compression, there are
> predominantly 2 things: your point (b) and point (e) mentioned by me.
> For single threads, the cost of doing compression supersedes the cost of the
> CRC and the other improvements in XLogInsert().
> However, when it comes to multiple threads, the cost reduction due to point
> (e) reduces the time under lock, and hence we see such an effect from
> 1 client to 2 clients.

Note that the CRC calculation over variable-size data in the WAL record happens before taking WALInsertLock.

> > Also, for whatever reason, all your numbers show fairly bad scaling. With
> > the XLOG scale and LZ patches, synchronous_commit=off, -F 80, and rec
> > length 250, 8-client average performance is only 2x that of 1-client
> > average performance.

Correction: with the XLOG scale patch only, your benchmark runs show 8-client average performance as 2x that of 1-client average performance. With both the XLOG scale and LZ patches, it grows to almost 4x. However, both ought to be closer to 8x.

> > -Patch-             -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
> > HEAD,-F80            816        1644       6528      1821 MiB
> > xlogscale,-F80       824        1643       6551      1826 MiB
> > xlogscale+lz,-F80    717        1466       5924      1137 MiB
> > xlogscale+lz,-F100   753        1508       5948      1548 MiB
> >
> > Those are short runs with no averaging of multiple iterations; don't put
> > too much faith in the absolute numbers. Still, I consistently get linear
> > scaling from 1 client to 8 clients. Why might your results have been so
> > different in this regard?
>
> 1. The only reason for you seeing the difference in linear scalability can be
> that the numbers I have posted for 8 threads are from a run with -c16 -j8.
> I shall run with -c8 and post the performance numbers. I am hoping they
> should match the way you see the numbers.

I doubt that. Your 2-client numbers also show scaling well below linear. With 8 cores, 16-client performance should not fall off compared to 8 clients. Perhaps 2 clients saturate your I/O under this workload, but 1 client does not. Granted, that theory doesn't explain all your numbers, such as the improvement for record length 50 @ -c1.

> 2. Now, if we see the results you have posted,
> a) there is not much performance difference between head and xlog scale

Note that the xlog scale patch addresses a different workload:
http://archives.postgresql.org/message-id/505b3648.1040...@vmware.com

> b) with the LZ patch it shows there is a decrease in performance
>    I think this can be because it ran for a very short time, as you have
>    also mentioned.

Yes, that's possible.

> > It's also odd that your -F100 numbers tend to follow your -F80 numbers
> > despite the optimization kicking in far more frequently for the latter.
>
> The results with avg of 3 15-min runs for the LZ patch are:
> -Patch-             -tps@-c1-  -tps@-c2-  -tps@-c16-j8-
> xlogscale+lz,-F80    663        1232       2498
> xlogscale+lz,-F100   660        1221       2361
>
> The result is showing that avg. tps is better with -F80, which is I think
> what is expected.

Yes. Let me elaborate on the point I hoped to make. Based on my test above, -F80 more than doubles the bulk WAL savings compared to -F100. Your benchmark runs showed a 61.8% performance improvement at -F100 and a 62.5% performance improvement at -F80. If shrinking WAL increases performance, shrinking it more should increase performance more.
Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On Wednesday, October 24, 2012 12:15 AM Alvaro Herrera wrote:
Amit kapila wrote:
> Rebased version of patch based on latest code.

> Uhm, how can this patch change a caller of PageAddItem() by adding one
> more argument, yet not touch bufpage.c at all? Are you sure this
> compiles?

It compiles; the same is confirmed even with the latest HEAD. Can you please point me to anything you feel is done wrong in the patch?

> The email subject has a WIP tag; is that still the patch status? If so,
> I assume it's okay to mark this Returned with Feedback and expect a
> later version to be posted.

The WIP word is from the original mail chain discussion. The current status is as follows: I have updated the patch with all bug fixes, and performance results were posted. Noah has also taken performance data. He believes there is a discrepancy in the performance data, but according to me the reason is just the way I have posted the data. Currently there is no clear feedback on which I can work, so I would be very thankful if you could wait for some conclusion of the discussion.

With Regards,
Amit Kapila.
[HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Wednesday, October 24, 2012 5:51 AM Noah Misch wrote:
> Hi Amit,

Noah, thank you for taking the performance data.

> On Tue, Oct 16, 2012 at 09:22:39AM +, Amit kapila wrote:
> > On Saturday, October 06, 2012 7:34 PM Amit Kapila wrote:
> > > Please find the readings of LZ patch along with Xlog-Scale patch.
> > > The comparison, for Update operations, is between:
> > > base code + Xlog Scale Patch
> > > base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)
> >
> > This contains all the consolidated data and comparison for both the
> > approaches:
> >
> > The difference of this testcase as compared to the previous one is that it
> > has the default value of wal_page_size (8K), whereas the previous one used
> > a wal_page_size of 1K.
>
> What is "wal_page_size"? Is that ./configure --with-wal-blocksize?

Yes.

> > Observations From Performance Data
> > ----------------------------------
> > 1. With both the approaches performance data is good.
> >    LZ compression - upto 100% performance improvement.
> >    Offset Approach - upto 160% performance improvement.
> > 2. The performance data is better for the LZ compression approach when the
> >    changed value of the tuple is large. (Refer 500 length changed value.)
> > 3. The performance data is better for the Offset Approach for 1 thread for
> >    any size of data (it dips for the LZ compression approach).
>
> Stepping back a moment, I would expect this patch to change performance in at
> least four ways (Heikki largely covered this upthread):
> a) High-concurrency workloads will improve thanks to reduced WAL insert
>    contention.
> b) All workloads will degrade due to the CPU cost of identifying and
>    implementing the optimization.
> c) Workloads starved for bulk WAL I/O will improve due to reduced WAL volume.
> d) Workloads composed primarily of long transactions with high WAL volume
>    will improve due to having fewer end-of-WAL-segment fsync requests.

All your points are a very good summarization of the work, but I think one point can be added:
e) Reduced cost of doing the CRC and of copying less data into the XLOG buffer in XLogInsert() due to the reduced size of the xlog record.

> Your benchmark numbers show small gains and losses for single-client
> workloads, moving to moderate gains for 2-client workloads. This suggests
> strong influence from (a), some influence from (b), and little influence
> from (c) and (d). Actually, the response to scale evident in your numbers
> seems too good to be true; why would (a) have such a large effect over the
> transition from one client to two clients?

I think if we just see from the point of LZ compression, there are predominantly 2 things: your point (b) and point (e) mentioned by me. For single threads, the cost of doing compression supersedes the cost of the CRC and the other improvements in XLogInsert(). However, when it comes to multiple threads, the cost reduction due to point (e) reduces the time under lock, and hence we see such an effect from 1 client to 2 clients.

> Also, for whatever reason, all your numbers show fairly bad scaling. With
> the XLOG scale and LZ patches, synchronous_commit=off, -F 80, and rec length
> 250, 8-client average performance is only 2x that of 1-client average
> performance.

I am really sorry, this is my mistake in putting together the numbers; the 8-thread number is actually a number with -c16 -j8, meaning 16 clients and 8 threads. That can be the reason it's showing just 2x; otherwise it would have shown numbers similar to what you are seeing.

> Benchmark results:
> -Patch-             -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
> HEAD,-F80            816        1644       6528      1821 MiB
> xlogscale,-F80       824        1643       6551      1826 MiB
> xlogscale+lz,-F80    717        1466       5924      1137 MiB
> xlogscale+lz,-F100   753        1508       5948      1548 MiB
>
> Those are short runs with no averaging of multiple iterations; don't put too
> much faith in the absolute numbers. Still, I consistently get linear scaling
> from 1 client to 8 clients. Why might your results have been so different in
> this regard?

1. The only reason for you seeing the difference in linear scalability can be that the numbers I have posted for 8 threads are from a run with -c16 -j8. I shall run with -c8 and post the performance numbers. I am hoping they should match the way you see the numbers.

2. Now, if we see the results you have posted,
a) there is not much performance difference between head and xlog scale
b) with the LZ patch it shows there is a decrease in performance
   I think this can be because it ran for a very short time, as you have also mentioned.

> It's also odd that your -F100 numbers tend to follow your -F80 numbers
> despite the optimization kicking in far more frequently for the latter.

The results with avg of 3 15-min runs for the LZ patch are:
-Patch-             -tps@-c1-  -tps@-c2-  -tps@-c16-j8-
xlogscale+lz,-F80    663        1232       2498
xlogscale+lz,-F100   660        1221       2361
[HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Hi Amit,

On Tue, Oct 16, 2012 at 09:22:39AM +, Amit kapila wrote:
> On Saturday, October 06, 2012 7:34 PM Amit Kapila wrote:
> > Please find the readings of LZ patch along with Xlog-Scale patch.
> > The comparison is between for Update operations
> > base code + Xlog Scale Patch
> > base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)
>
> This contains all the consolidated data and comparison for both the
> approaches:
>
> The difference of this testcase as compare to previous one is that it has
> default value of wal_page_size ( 8K ) as compare to previous one where
> configuration used for wal_page_size was 1K

What is "wal_page_size"? Is that ./configure --with-wal-blocksize?

> Observations From Performance Data
> ----------------------------------
> 1. With both the approaches Performance data is good.
>    LZ compression - upto 100% performance improvement.
>    Offset Approach - upto 160% performance improvement.
> 2. The performance data is better for LZ compression approach when the
>    changed value of tuple is large. (Refer 500 length changed value).
> 3. The performance data is better for Offset Approach for 1 thread for any
>    size of Data (it dips for LZ compression Approach).

Stepping back a moment, I would expect this patch to change performance in at least four ways (Heikki largely covered this upthread):

a) High-concurrency workloads will improve thanks to reduced WAL insert contention.
b) All workloads will degrade due to the CPU cost of identifying and implementing the optimization.
c) Workloads starved for bulk WAL I/O will improve due to reduced WAL volume.
d) Workloads composed primarily of long transactions with high WAL volume will improve due to having fewer end-of-WAL-segment fsync requests.

Your benchmark numbers show small gains and losses for single-client workloads, moving to moderate gains for 2-client workloads. This suggests strong influence from (a), some influence from (b), and little influence from (c) and (d). Actually, the response to scale evident in your numbers seems too good to be true; why would (a) have such a large effect over the transition from one client to two clients? Also, for whatever reason, all your numbers show fairly bad scaling. With the XLOG scale and LZ patches, synchronous_commit=off, -F 80, and rec length 250, 8-client average performance is only 2x that of 1-client average performance.

I attempted to reproduce this effect on an EC2 m2.4xlarge instance (8 cores, 70 GiB) with the data directory under a tmpfs mount. This should thoroughly isolate effects (a) and (b) from (c) and (d). I used your pgbench_250.c[1] in 30s runs. Configuration:

autovacuum          | off
checkpoint_segments | 500
checkpoint_timeout  | 1h
client_encoding     | UTF8
lc_collate          | C
lc_ctype            | C
max_connections     | 100
server_encoding     | SQL_ASCII
shared_buffers      | 4GB
wal_buffers         | 16MB

Benchmark results:

-Patch-             -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
HEAD,-F80            816        1644       6528      1821 MiB
xlogscale,-F80       824        1643       6551      1826 MiB
xlogscale+lz,-F80    717        1466       5924      1137 MiB
xlogscale+lz,-F100   753        1508       5948      1548 MiB

Those are short runs with no averaging of multiple iterations; don't put too much faith in the absolute numbers. Still, I consistently get linear scaling from 1 client to 8 clients. Why might your results have been so different in this regard?

It's also odd that your -F100 numbers tend to follow your -F80 numbers despite the optimization kicking in far more frequently for the latter.

nm

[1] http://archives.postgresql.org/message-id/001d01cda180$9f1e47a0$dd5ad6e0$@kap...@huawei.com
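The WAL@-c8 column is presumably obtained by differencing the WAL position before and after a run, as PostgreSQL's pg_xlog_location_diff() does. A standalone sketch of that LSN arithmetic (helper names and sample LSNs are ours, not from the thread):

```python
def lsn_to_int(lsn: str) -> int:
    """Parse an LSN of the form 'X/Y': two hex halves of a 64-bit WAL position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def wal_generated(start: str, end: str) -> int:
    """Bytes of WAL written between two LSNs, like pg_xlog_location_diff()."""
    return lsn_to_int(end) - lsn_to_int(start)

# Hypothetical before/after positions captured around a pgbench run.
mib = wal_generated("1/8000000", "1/80A0000") / (1024 * 1024)
```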
Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Amit kapila wrote:
> Rebased version of patch based on latest code.

Uhm, how can this patch change a caller of PageAddItem() by adding one more argument, yet not touch bufpage.c at all? Are you sure this compiles?

The email subject has a WIP tag; is that still the patch status? If so, I assume it's okay to mark this Returned with Feedback and expect a later version to be posted.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On Thursday, October 04, 2012 8:03 PM Heikki Linnakangas wrote:
On Wednesday, October 03, 2012 9:33 PM Amit Kapila wrote:
On Friday, September 28, 2012 7:03 PM Amit Kapila wrote:
> > On Thursday, September 27, 2012 6:39 PM Amit Kapila wrote:
> > > On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
> > > On 25.09.2012 18:27, Amit Kapila wrote:
> > > > If you feel it is must to do the comparison, we can do it in same way
> > > > as we identify for HOT?
> >
> > Now I shall do the various tests for the following and post it here:
> > a. Attached patch in the mode where it takes advantage of the history tuple
> > b. By changing the logic for modified column calculation to use memcmp()
>
> 1. Please find the results (pgbench_test.htm) for point 2, where there is
> one fixed column updation (last few bytes are random) and the second column
> updation is a 32-byte random string. The results for 50, 100 are still going
> on; others are attached with this mail.

Please find the readings of the LZ patch along with the Xlog-Scale patch. The comparison, for Update operations, is between:
base code + Xlog Scale Patch
base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)

The readings have been taken based on the below data.

pgbench_xlog_scale_50:
a. Updated record size 50, total record size 1800
b. Threads 8, 1, 2
c. synchronous_commit - off, on

pgbench_xlog_scale_250:
a. Updated record size 250, total record size 1800
b. Threads 8, 1, 2
c. synchronous_commit - off, on

pgbench_xlog_scale_500:
a. Updated record size 500, total record size 1800
b. Threads 8, 1, 2
c. synchronous_commit - off, on

Observations
------------
a. There is still a good performance improvement even if we do the Update WAL optimization on top of the Xlog Scaling patch.
b. There is a slight performance dip for 1 thread (only in sync mode = off) with the Update WAL optimization (LZ compression), but for 2 threads there is a performance increase.

With Regards,
Amit Kapila.
Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
> On Thursday, October 04, 2012 12:54 PM Heikki Linnakangas
> On 03.10.2012 19:03, Amit Kapila wrote:
> > Any comments/suggestions regarding performance/functionality test?
>
> Hmm. Doing a lot of UPDATEs concurrently can be limited by the
> WALInsertLock, which each inserter holds while copying the WAL record to
> the buffer. Reducing the size of the WAL records, by compression or
> delta encoding, alleviates that bottleneck: when WAL records are
> smaller, the lock needs to be held for a shorter duration. That improves
> throughput, even if individual backends need to do more CPU work to
> compress the records, because that work can be done in parallel. I
> suspect much of the benefit you're seeing in these tests might be
> because of that effect.
>
> As it happens, I've been working on making WAL insertion scale better in
> general:
> http://archives.postgresql.org/message-id/5064779a.3050...@vmware.com.
> That should also help most when inserting large WAL records. The
> question is: assuming we commit the xloginsert-scale patch, how much
> benefit is there left from the compression? It will surely still help to
> reduce the size of WAL, which can certainly help if you're limited by
> the WAL I/O, but I suspect the results from the pgbench tests you run
> might look quite different.
>
> So, could you rerun these tests with the xloginsert-scale patch applied?

I shall take care of doing the performance test with the xloginsert-scale patch, both for single and multi-thread.

With Regards,
Amit Kapila.
Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On 03.10.2012 19:03, Amit Kapila wrote:
> Any comments/suggestions regarding performance/functionality test?

Hmm. Doing a lot of UPDATEs concurrently can be limited by the WALInsertLock, which each inserter holds while copying the WAL record to the buffer. Reducing the size of the WAL records, by compression or delta encoding, alleviates that bottleneck: when WAL records are smaller, the lock needs to be held for a shorter duration. That improves throughput, even if individual backends need to do more CPU work to compress the records, because that work can be done in parallel. I suspect much of the benefit you're seeing in these tests might be because of that effect.

As it happens, I've been working on making WAL insertion scale better in general: http://archives.postgresql.org/message-id/5064779a.3050...@vmware.com. That should also help most when inserting large WAL records. The question is: assuming we commit the xloginsert-scale patch, how much benefit is there left from the compression? It will surely still help to reduce the size of WAL, which can certainly help if you're limited by the WAL I/O, but I suspect the results from the pgbench tests you run might look quite different.

So, could you rerun these tests with the xloginsert-scale patch applied? Reducing the WAL size might still be a good idea even if the patch doesn't have much effect on TPS, but I'd like to make sure that the compression doesn't hurt performance. Also, it would be a good idea to repeat the tests with just a single client; we don't want to hurt the performance in that scenario either.

- Heikki
Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
> On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
> On 25.09.2012 18:27, Amit Kapila wrote:
> > If you feel it is must to do the comparison, we can do it in same way
> > as we identify for HOT?
>
> Yeah. (But as discussed, I think it would be even better to just treat
> the old and new tuple as an opaque chunk of bytes, and run them through
> a generic delta algorithm).

Thank you for the modified patch.

> The conclusion is that there isn't very much difference among the
> patches. They all squeeze the WAL to about the same size, and the
> increase in TPS is roughly the same.
>
> I think more performance testing is required. The modified pgbench test
> isn't necessarily very representative of a real-life application. The
> gain (or loss) of this patch is going to depend a lot on how many
> columns are updated, and in what ways. Need to test more scenarios,
> with many different database schemas.
>
> The LZ approach has the advantage that it can take advantage of all
> kinds of similarities between old and new tuple. For example, if you
> swap the values of two columns, LZ will encode that efficiently. Or if
> you insert a character in the middle of a long string. On the flipside,
> it's probably more expensive. Then again, you have to do a memcmp() to
> detect which columns have changed with your approach, and that's not
> free either. That was not yet included in the patch version I tested.
>
> Another consideration is that when you compress the record more, you
> have less data to calculate CRC for. CRC calculation tends to be quite
> expensive, so even quite aggressive compression might be a win. Yet
> another consideration is that the compression/encoding is done while
> holding a lock on the buffer. For the sake of concurrency, you want to
> keep the duration the lock is held as short as possible.

Now I shall do the various tests for the following and post it here:
a. Attached patch in the mode where it takes advantage of the history tuple
b. By changing the logic for modified column calculation to use memcmp()

With Regards,
Amit Kapila.
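Point (b) above, detecting changed columns the way HOT does by comparing stored byte images, can be sketched as follows. The tuple representation (a list of per-attribute byte strings) is a deliberate simplification of the heap tuple format:

```python
def modified_attnums(old_tuple, new_tuple):
    """Return 0-based attribute numbers whose on-disk byte image differs,
    analogous to a per-column memcmp() of the old and new heap tuples."""
    assert len(old_tuple) == len(new_tuple)
    return [i for i, (o, n) in enumerate(zip(old_tuple, new_tuple)) if o != n]

# Hypothetical 3-column tuple: int id, text name, int balance (little-endian).
old = [b"\x2a\x00\x00\x00", b"alice", b"\xe8\x03\x00\x00"]
new = [b"\x2a\x00\x00\x00", b"alice", b"\xe2\x04\x00\x00"]
changed = modified_attnums(old, new)
```

The cost Heikki flags is visible here: every attribute must be compared, and in the real code that comparison would run while the buffer lock is held.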
Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On 25.09.2012 18:27, Amit Kapila wrote:
> If you feel it is must to do the comparison, we can do it in same way
> as we identify for HOT?

Yeah. (But as discussed, I think it would be even better to just treat the old and new tuple as an opaque chunk of bytes, and run them through a generic delta algorithm).

> Can you please explain me why you think that after doing encoding doing LZ
> compression on it is better, as already we have reduced the amount of WAL
> for update by only storing changed column information?
> a. is it to further reduce the size of WAL
> b. storing diff WAL in some standard format
> c. or does it give any other kind of benefit

Potentially all of those. I don't know if it'd be better or worse, but my gut feeling is that it would be simpler, and produce even more compact WAL.

Attached is a simple patch to apply LZ compression to update WAL records. I modified the LZ compressor so that it can optionally use a separate "history" data, and the same history data must then be passed to the decompression function. That makes it work as a pretty efficient delta encoder, when you use the old tuple as the history data.

I ran some performance tests with the modified version of pgbench that you posted earlier:

Current PostgreSQL master:
  tps = 941.601924 (excluding connections establishing)
  pg_xlog_location_diff = 721227944

pglz_wal_update_records.patch:
  tps = 1039.792527 (excluding connections establishing)
  pg_xlog_location_diff = 419395208

pglz_wal_update_records.patch, COMPRESS_ONLY:
  tps = 1009.682002 (excluding connections establishing)
  pg_xlog_location_diff = 422505104

Amit's wal_update_changes_hot_update.patch:
  tps = 1092.703883 (excluding connections establishing)
  pg_xlog_location_diff = 436031544

The COMPRESS_ONLY result is with the attached patch, but it just uses LZ to compress the new tuple, without taking advantage of the old tuple. The pg_xlog_location_diff value is the amount of WAL generated during the pgbench run. Attached is also the shell script I used to run these tests.

The conclusion is that there isn't very much difference among the patches. They all squeeze the WAL to about the same size, and the increase in TPS is roughly the same.

I think more performance testing is required. The modified pgbench test isn't necessarily very representative of a real-life application. The gain (or loss) of this patch is going to depend a lot on how many columns are updated, and in what ways. Need to test more scenarios, with many different database schemas.

The LZ approach has the advantage that it can take advantage of all kinds of similarities between old and new tuple. For example, if you swap the values of two columns, LZ will encode that efficiently. Or if you insert a character in the middle of a long string. On the flipside, it's probably more expensive. Then again, you have to do a memcmp() to detect which columns have changed with your approach, and that's not free either. That was not yet included in the patch version I tested.

Another consideration is that when you compress the record more, you have less data to calculate CRC for. CRC calculation tends to be quite expensive, so even quite aggressive compression might be a win. Yet another consideration is that the compression/encoding is done while holding a lock on the buffer. For the sake of concurrency, you want to keep the duration the lock is held as short as possible.
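The "history" idea in Heikki's modified LZ compressor, compressing the new tuple against the old one, behaves much like DEFLATE with a preset dictionary, which zlib exposes via `zdict`. A sketch of both modes (the tuple bytes are made up; this imitates the technique, not pglz itself):

```python
import zlib

old = b"id=42|name=alice|comment=the quick brown fox jumps over the lazy dog|balance=1000"
new = b"id=42|name=alice|comment=the quick brown fox jumps over the lazy dog|balance=1250"

# COMPRESS_ONLY: compress the new tuple with no knowledge of the old one.
compress_only = zlib.compress(new)

# With history: the old tuple is a preset dictionary, so the unchanged
# prefix becomes a cheap back-reference instead of fresh literals.
comp = zlib.compressobj(zdict=old)
with_history = comp.compress(new) + comp.flush()

# Decompression must be handed the same history, as Heikki notes.
decomp = zlib.decompressobj(zdict=old)
restored = decomp.decompress(with_history) + decomp.flush()
```

For heap updates the "history" (old tuple) is already on the page at recovery time, so it costs nothing extra to store.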
- Heikki

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5a4591e..56b53a5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,6 +70,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
 
 /* GUC variable */
@@ -85,6 +86,7 @@
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
 				ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+				HeapTuple oldtup,
 				bool all_visible_cleared, bool new_all_visible_cleared);
 static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
 				HeapTuple oldtup, HeapTuple newtup);
@@ -3195,10 +3197,12 @@ l2:
 	/* XLOG stuff */
 	if (RelationNeedsWAL(relation))
 	{
-		XLogRecPtr	recptr = log_heap_update(relation, buffer, oldtup.t_self,
-						newbuf, heaptup,
-						all_visible_cleared,
-						all_visible_cleared_new);
+		XLogRecPtr	recptr;
+
+		recptr = log_heap_update(relation, buffer, oldtup.t_self,
+					newbuf, heaptup, &oldtup,
+					all_visible_cleared,
+
Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On Thursday, September 27, 2012 10:19 AM
> Noah Misch writes:
> > You cannot assume executor-unmodified columns are also unmodified from
> > heap_update()'s perspective. Expansion in one column may instigate TOAST
> > compression of a logically-unmodified column, and that counts as a change
> > for xlog delta purposes.
>
> Um ... what about BEFORE triggers?

This optimization will not apply in case a BEFORE trigger updates the tuple.

> Frankly, I think that expecting the executor to tell you which columns
> have been modified is a non-starter. We have a solution for HOT and
> it's silly to do the same thing differently just a few lines away.

My apprehension is that it can hurt the performance advantage if we compare all attributes to check which have been modified, and that too under a buffer exclusive lock. In the case of HOT, only the index attributes get compared. I agree that doing things differently at 2 nearby places is not good. So I will do it the same way as for HOT and then take the performance data again, and if there is no big impact then we can do it that way.

With Regards,
Amit Kapila.
Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Noah Misch writes:
> You cannot assume executor-unmodified columns are also unmodified from
> heap_update()'s perspective.  Expansion in one column may instigate
> TOAST compression of a logically-unmodified column, and that counts as
> a change for xlog delta purposes.

Um ... what about BEFORE triggers?

Frankly, I think that expecting the executor to tell you which columns
have been modified is a non-starter.  We have a solution for HOT and
it's silly to do the same thing differently just a few lines away.

			regards, tom lane
[HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On Mon, Sep 24, 2012 at 10:57:02AM +0000, Amit kapila wrote:
> Rebased version of patch based on latest code.

I like the direction you're taking with this patch; the gains are
striking, especially considering the isolation of the changes.

You cannot assume executor-unmodified columns are also unmodified from
heap_update()'s perspective.  Expansion in one column may instigate
TOAST compression of a logically-unmodified column, and that counts as
a change for xlog delta purposes.  You do currently skip the
optimization for relations having a TOAST table, but TOAST compression
can still apply.  Observe this with text columns of storage mode PLAIN.
I see two ways out: skip the new behavior when need_toast=true, or
compare all inline column data, not just what the executor modified.
One can probably construct a benchmark favoring either choice.  I'd
lean toward the latter; wide tuples are the kind this change can most
help.  If the marginal advantage of ignoring known-unmodified columns
proves important, we can always bring it back after designing a way to
track which columns changed in the toaster.

Given that, why not treat the tuple as an opaque series of bytes and
not worry about datum boundaries?  When several narrow columns change
together, say a sequence of sixteen smallint columns, you will use
fewer binary delta commands by representing the change with a single
32-byte substitution.  If an UPDATE changes just part of a long datum,
the delta encoding algorithm will still be able to save considerable
space.  That case arises in many forms: changing one word in a long
string, changing one element in a long array, changing one field of a
composite-typed column.

Granted, this makes the choice of delta encoding algorithm more
important.  Like Heikki, I'm left wondering why your custom delta
encoding is preferable to an encoding from the literature.  Your
encoding has much in common with VCDIFF, even sharing two exact command
names.
If a custom encoding is the right thing, code comments or a README
section should at least discuss the advantages over an established
alternative.

Idle thought: it might pay off to use 1-byte sizes and offsets most of
the time.  Tuples shorter than 256 bytes are common; for longer tuples,
we can afford wider offsets.

The benchmarks you posted upthread were helpful.  I think benchmarking
with fsync=off is best if you don't have a battery-backed write
controller or SSD.  Otherwise, fsync time dominates a pgbench run.

Please benchmark recovery.  To do so, set up WAL archiving and take a
base backup from a fresh cluster.  Run pgbench for a while.  Finally,
observe the elapsed time to recover your base backup to the end of
archived WAL.

> *** a/src/backend/access/common/heaptuple.c
> --- b/src/backend/access/common/heaptuple.c

> + /*
> +  * encode_xlog_update
> +  *	Forms a diff tuple from old and new tuple with the modified columns.
> +  *
> +  * att - attribute list.
> +  * oldtup - pointer to the old tuple.
> +  * heaptup - pointer to the modified tuple.
> +  * wal_tup - pointer to the wal record which needs to be formed from old
> +  *		 and new tuples by using the modified columns list.
> +  * modifiedCols - modified columns list by the update command.
> +  */
> + void
> + encode_xlog_update(Form_pg_attribute *att, HeapTuple oldtup,
> +			HeapTuple heaptup, HeapTuple wal_tup,
> +			Bitmapset *modifiedCols)

This name is too generic for an extern function.  Maybe
"heap_delta_encode"?

> + void
> + decode_xlog_update(HeapTupleHeader htup, uint32 old_tup_len, char *data,
> +			uint32 *new_tup_len, char *waldata, uint32 wal_len)

Likewise, maybe "heap_delta_decode" here.

> *** a/src/backend/access/heap/heapam.c
> --- b/src/backend/access/heap/heapam.c
> ***************
> *** 71,77 ****
>   #include "utils/syscache.h"
>   #include "utils/tqual.h"
>   
> - 
>   /* GUC variable */
>   bool	synchronize_seqscans = true;
>   

Spurious whitespace change.
> ***************
> *** 3195,3204 ****
>   l2:
>   		/* XLOG stuff */
>   		if (RelationNeedsWAL(relation))
>   		{
> ! 			XLogRecPtr	recptr = log_heap_update(relation, buffer, oldtup.t_self,
> ! 												 newbuf, heaptup,
> ! 												 all_visible_cleared,
> ! 												 all_visible_cleared_new);
>   
>   			if (newbuf != buffer)
>   			{
> --- 3203,3233 ----
>   		/* XLOG stuff */
>   		if (RelationNeedsWAL(relation))
>   		{
> ! 			XLogRecPtr	recptr;
> ! 
> ! 			/*
> ! 			 * Apply the xlog diff update algorithm only for hot updates.
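The review's suggestion of treating the tuple as an opaque series of bytes and emitting copy/substitution commands, in the spirit of VCDIFF (RFC 3284), can be sketched as follows. This is a hypothetical illustration in Python, not the patch's actual on-disk format; the command names and layout are invented for clarity.

```python
def delta_encode(old: bytes, new: bytes) -> list:
    """Scan old and new images in step, emitting ('COPY', offset, length)
    for byte runs shared with the old tuple and ('SUB', data) for runs
    that differ.  Illustrative sketch only; ignores datum boundaries,
    which is exactly the point of the opaque-bytes approach."""
    cmds = []
    i = 0
    n = min(len(old), len(new))
    while i < n:
        start = i
        if old[i] == new[i]:
            while i < n and old[i] == new[i]:
                i += 1
            cmds.append(('COPY', start, i - start))
        else:
            while i < n and old[i] != new[i]:
                i += 1
            cmds.append(('SUB', new[start:i]))
    if len(new) > n:                 # tail present only in the new tuple
        cmds.append(('SUB', new[n:]))
    return cmds

def delta_decode(old: bytes, cmds: list) -> bytes:
    """Rebuild the new tuple from the old image plus the command list."""
    out = bytearray()
    for cmd in cmds:
        if cmd[0] == 'COPY':
            _, off, length = cmd
            out += old[off:off + length]
        else:
            out += cmd[1]
    return bytes(out)
```

Note how sixteen adjacent changed smallint columns would collapse into a single 32-byte SUB command here, whereas a per-column encoding would emit sixteen commands.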
Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
> On Tuesday, September 25, 2012 7:30 PM Heikki Linnakangas wrote:
> On 24.09.2012 13:57, Amit kapila wrote:
> > Rebased version of patch based on latest code.
>
> When HOT was designed, we decided that heap_update needs to compare
> the old and new attributes directly, with memcmp(), to determine
> whether any of the indexed columns have changed.  It was deemed
> infeasible to pass down that information from the executor.  I don't
> remember the details of why that was, but you seem to be trying the
> same thing in this patch, and pass the bitmap of modified cols from
> the executor to heap_update().  I'm pretty sure that won't work, for
> the same reasons we didn't do it for HOT.

I think the reason for not relying on the modified-columns bitmap is
that it might not give correct information in some cases: BEFORE
triggers can change columns beyond those the executor reports as
modified, which is why HOT does the comparison itself.  In our case we
have taken care of that by skipping the optimization, so we do not rely
on the modified columns there.  If you feel the comparison is a must, we
can do it the same way we identify HOT updates.

> I still feel that it would probably be better to use a generic delta
> encoding scheme, instead of inventing one.  How about VCDIFF
> (http://tools.ietf.org/html/rfc3284), for example?  Or you could reuse
> the LZ compressor that we already have in the source tree.  You can
> use LZ for delta compression by initializing the history buffer of the
> algorithm with the old tuple, and then compressing the new tuple as
> usual.  Or you could still use the knowledge of where the attributes
> begin and end and which attributes were updated, and do the encoding
> similar to how you did in the patch, but use LZ as the output format.
> That way the decoding would be the same as LZ decompression.
Can you please explain why you think applying LZ compression on top of
the encoding is better, given that we have already reduced the amount of
WAL for update by storing only the changed-column information?  Is it:
a. to further reduce the size of WAL,
b. to store the diff WAL in some standard format, or
c. for some other kind of benefit?

With Regards,
Amit Kapila.
Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
On 24.09.2012 13:57, Amit kapila wrote:
> Rebased version of patch based on latest code.

When HOT was designed, we decided that heap_update needs to compare the
old and new attributes directly, with memcmp(), to determine whether any
of the indexed columns have changed.  It was deemed infeasible to pass
down that information from the executor.  I don't remember the details
of why that was, but you seem to be trying the same thing in this patch,
and pass the bitmap of modified cols from the executor to heap_update().
I'm pretty sure that won't work, for the same reasons we didn't do it
for HOT.

I still feel that it would probably be better to use a generic delta
encoding scheme, instead of inventing one.  How about VCDIFF
(http://tools.ietf.org/html/rfc3284), for example?  Or you could reuse
the LZ compressor that we already have in the source tree.  You can use
LZ for delta compression by initializing the history buffer of the
algorithm with the old tuple, and then compressing the new tuple as
usual.  Or you could still use the knowledge of where the attributes
begin and end and which attributes were updated, and do the encoding
similar to how you did in the patch, but use LZ as the output format.
That way the decoding would be the same as LZ decompression.

- Heikki
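The LZ-with-history idea above can be demonstrated with zlib's preset-dictionary feature, which primes the compressor's history window with an arbitrary byte string before compression begins. This is an illustrative sketch using Python's zlib; in the patch, PostgreSQL's in-tree pg_lzcompress would play the role of the compressor, but the principle — seed the history with the old tuple, then compress the new tuple as usual — is the same.

```python
import zlib

def compress_with_old(old: bytes, new: bytes) -> bytes:
    """Delta-compress `new` against `old` by seeding the compressor's
    history window (zlib's preset dictionary) with the old tuple image.
    Unchanged regions of the new tuple become back-references into the
    dictionary instead of literal bytes."""
    c = zlib.compressobj(level=6, zdict=old)
    return c.compress(new) + c.flush()

def decompress_with_old(old: bytes, blob: bytes) -> bytes:
    """Recover the new tuple; the decoder needs the same old image."""
    d = zlib.decompressobj(zdict=old)
    return d.decompress(blob) + d.flush()
```

The appeal of this scheme is that the WAL-replay side is just ordinary LZ decompression seeded with the old page's tuple, so no custom decoder is needed.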
[HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation
From: Heikki Linnakangas [mailto:heikki(dot)linnakangas(at)enterprisedb(dot)com]
Sent: Monday, August 27, 2012 5:58 PM
To: Amit kapila

On 27.08.2012 15:18, Amit kapila wrote:
>>> I have implemented the WAL reduction patch for the case of HOT
>>> update, as pointed out by Simon and Robert.  In this patch it goes
>>> for optimized WAL only in case of HOT update, with the other
>>> restrictions the same as in the previous patch.
>>>
>>> The performance numbers for this patch are attached in this mail.
>>> It has improved by 90% if the page has fillfactor 80.
>>>
>>> Now going forward I have the following options:
>>> a. Upload the patch in the open CF for WAL reduction which contains
>>>    reduction for HOT and non-HOT updates.
>>> b. Upload the patch in the open CF for WAL reduction which contains
>>>    reduction for HOT updates.
>>> c. Upload both the patches as different versions.

>> Let's do it for HOT updates only.  Simon & Robert made good arguments
>> on why this is a bad idea for non-HOT updates.

> Okay, I shall do it that way.
> So now I shall send information about all the testing I have done for
> this patch and then upload it in CF.

Rebased version of patch based on latest code.

With Regards,
Amit Kapila.

wal_update_changes_v2.patch
Description: wal_update_changes_v2.patch