Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-10-25 Thread Jesper Krogh

> Naturally, there are other compression and delta encoding schemes.  Does
> anyone feel the need to explore further alternatives?
> 
> We might eventually find the need for multiple, user-selectable, WAL
> compression strategies.  I don't recommend taking that step yet.
> 

My currently implemented compression strategy is to run the WAL through gzip in the 
archive command. It compresses pretty nicely and achieves 50%+ in my 
workload (generally closer to 70%).

On a multi-core system it takes more CPU time, but on a different core, and so it 
has no effect on tps. 

General compression should probably only be applied if it has a positive gain on 
tps.

Jesper






[HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-10-24 Thread Noah Misch
On Tue, Oct 23, 2012 at 08:21:54PM -0400, Noah Misch wrote:
> -Patch-             -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
> HEAD,-F80            816        1644       6528       1821 MiB
> xlogscale,-F80       824        1643       6551       1826 MiB
> xlogscale+lz,-F80    717        1466       5924       1137 MiB
> xlogscale+lz,-F100   753        1508       5948       1548 MiB
> 
> Those are short runs with no averaging of multiple iterations; don't put too
> much faith in the absolute numbers.

I decided to rerun those measurements with three 15-minute runs.  I removed
the -F100 test and added wal_update_changes_v2.patch (delta encoding version)
to the mix.  Median results:

-Patch-             -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
HEAD,-F80           832        1679       6797       44 GiB
scale,-F80          830        1679       6798       44 GiB
scale+lz,-F80       736        1498       6169       11 GiB
scale+delta,-F80    841        1713       7056       10 GiB

The numbers varied little across runs.  So we see the same general trends as
with the short runs; overall performance is slightly higher across the board,
and the fraction of WAL avoided is much higher.  I'm suspecting the patches
shrink WAL better in these longer runs because the WAL of a short run contains
a higher density of full-page images.

From these results, I think that the LZ approach is something we could only
provide as an option; CPU-bound workloads may not be our bread and butter, but
we shouldn't dock them 10% with no option to disable.  Amit's delta encoding
approach seems to be something we could safely enable across the board.

Naturally, there are other compression and delta encoding schemes.  Does
anyone feel the need to explore further alternatives?

We might eventually find the need for multiple, user-selectable, WAL
compression strategies.  I don't recommend taking that step yet.

nm




Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-10-24 Thread Robert Haas
On Tue, Oct 23, 2012 at 8:21 PM, Noah Misch  wrote:
> -Patch-             -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
> HEAD,-F80            816        1644       6528       1821 MiB
> xlogscale,-F80       824        1643       6551       1826 MiB
> xlogscale+lz,-F80    717        1466       5924       1137 MiB
> xlogscale+lz,-F100   753        1508       5948       1548 MiB

Ouch.  I've been pretty excited by this patch, but I don't think we
want to take an "optimization" that produces a double-digit hit at 1
client and doesn't gain even at 8 clients. I'm surprised this is
costing that much, though.  It doesn't seem like it should.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-10-24 Thread Noah Misch
On Wed, Oct 24, 2012 at 05:55:56AM +, Amit kapila wrote:
> Wednesday, October 24, 2012 5:51 AM Noah Misch wrote:
> > Stepping back a moment, I would expect this patch to change performance in at
> > least four ways (Heikki largely covered this upthread):
> 
> > a) High-concurrency workloads will improve thanks to reduced WAL insert
> > contention.
> > b) All workloads will degrade due to the CPU cost of identifying and
> > implementing the optimization.
> > c) Workloads starved for bulk WAL I/O will improve due to reduced WAL 
> > volume.
> > d) Workloads composed primarily of long transactions with high WAL volume will
> >    improve due to having fewer end-of-WAL-segment fsync requests.
> 
> All your points are a very good summarization of the work, but I think one point
> can be added:
> e) Reduced cost of computing the CRC and copying data into the XLOG buffer in
> XLogInsert(), due to the reduced size of the xlog record.

True.

> > Your benchmark numbers show small gains and losses for single-client
> > workloads, moving to moderate gains for 2-client workloads.  This suggests
> > strong influence from (a), some influence from (b), and little influence from
> > (c) and (d).  Actually, the response to scale evident in your numbers seems
> > too good to be true; why would (a) have such a large effect over the
> > transition from one client to two clients? 
> 
> I think if we just look at it from the point of LZ compression, there are
> predominantly 2 things: your point (b) and point (e) mentioned by me.
> For a single thread, the cost of doing compression supersedes the CRC savings
> and the other improvements in XLogInsert().
> However, when it comes to multiple threads, the cost reduction due to point (e)
> reduces the time under the lock, and hence we see such an effect from
> 1 client to 2 clients.

Note that the CRC calculation over variable-size data in the WAL record
happens before taking WALInsertLock.

> > Also, for whatever reason, all
> > your numbers show fairly bad scaling.  With the XLOG scale and LZ patches,
> > synchronous_commit=off, -F 80, and rec length 250, 8-client average
> > performance is only 2x that of 1-client average performance.

Correction: with the XLOG scale patch only, your benchmark runs show 8-client
average performance as 2x that of 1-client average performance.  With both the
XLOG scale and LZ patches, it grows to almost 4x.  However, both ought to be
closer to 8x.

> > -Patch-             -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
> > HEAD,-F80            816        1644       6528       1821 MiB
> > xlogscale,-F80       824        1643       6551       1826 MiB
> > xlogscale+lz,-F80    717        1466       5924       1137 MiB
> > xlogscale+lz,-F100   753        1508       5948       1548 MiB
> 
> > Those are short runs with no averaging of multiple iterations; don't put too
> > much faith in the absolute numbers.  Still, I consistently get linear scaling
> > from 1 client to 8 clients.  Why might your results have been so different in
> > this regard?
> 
> 1. The only reason you are seeing the difference in linear scalability could be
> that the numbers I posted for 8 threads are from a run with -c16 -j8.
> I shall run with -c8 and post the performance numbers.
> I am hoping they will match the way you see the numbers.

I doubt that.  Your 2-client numbers also show scaling well-below linear.
With 8 cores, 16-client performance should not fall off compared to 8 clients.

Perhaps 2 clients saturate your I/O under this workload, but 1 client does
not.  Granted, that theory doesn't explain all your numbers, such as the
improvement for record length 50 @ -c1.

> 2. Now, if we see that in the results you have posted, 
> a) there is not much performance difference between head and xlog scale 

Note that the xlog scale patch addresses a different workload:
http://archives.postgresql.org/message-id/505b3648.1040...@vmware.com

> b) with the LZ patch it shows a decrease in performance.
>    I think this may be because it ran for a very short time, as you also
> mentioned.

Yes, that's possible.

> > It's also odd that your -F100 numbers tend to follow your -F80 numbers despite
> > the optimization kicking in far more frequently for the latter.
> 
> The results, as the average of three 15-minute runs, for the LZ patch are:
>  -Patch-             -tps@-c1-  -tps@-c2-  -tps@-c16-j8
>  xlogscale+lz,-F80   663        1232       2498
>  xlogscale+lz,-F100  660        1221       2361
> 
> The results show that avg. tps is better with -F80, which is I think what
> is expected.

Yes.  Let me elaborate on the point I hoped to make.  Based on my test above,
-F80 more than doubles the bulk WAL savings compared to -F100.  Your benchmark
runs showed a 61.8% performance improvement at -F100 and a 62.5% performance
improvement at -F80.  If shrinking WAL increases performance, shrinking it
more should increase it further.

Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-10-23 Thread Amit kapila
On Wednesday, October 24, 2012 12:15 AM Alvaro Herrera wrote:

Amit kapila wrote:

> Rebased version of patch based on latest code.

> Uhm, how can this patch change a caller of PageAddItem() by adding one
> more argument, yet not touch bufpage.c at all?  Are you sure this
> compiles?

It compiles; the same is confirmed even with the latest HEAD.
Can you please point me to what you feel is done wrong in the patch?

> The email subject has a WIP tag; is that still the patch status?  If so,
> I assume it's okay to mark this Returned with Feedback and expect a
> later version to be posted.

The WIP word is from the original mail chain discussion. The current status is as 
follows:
I have updated the patch with all bug fixes, and performance results were posted. 
Noah has also taken performance data.
He believes that there is a discrepancy in the performance data, but according to 
me the reason is just the way I have posted the data.

Currently there is no clear feedback on which I can work, so I would be very 
thankful to you if you can wait for some conclusion of the discussion.


With Regards,
Amit Kapila.




[HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-10-23 Thread Amit kapila

Wednesday, October 24, 2012 5:51 AM Noah Misch wrote:
>Hi Amit,

Noah, Thank you for taking the performance data.

>On Tue, Oct 16, 2012 at 09:22:39AM +, Amit kapila wrote:
> On Saturday, October 06, 2012 7:34 PM Amit Kapila wrote:
>> > Please find the readings of LZ patch along with Xlog-Scale patch.
>> > The comparison, for Update operations, is between:
>> > base code + Xlog Scale Patch
>> > base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)
>
>> This contains all the consolidated data and comparison for both the 
>> approaches:
>
>> The difference of this testcase compared to the previous one is that it uses 
>> the default value of wal_page_size (8K), whereas the previous one used a 
>> wal_page_size of 1K.

> What is "wal_page_size"?  Is that ./configure --with-wal-blocksize?
Yes.

> Observations From Performance Data
> --
> 1. With both approaches the performance data is good.
> LZ compression - up to 100% performance improvement.
> Offset approach - up to 160% performance improvement.
> 2. The performance data is better for the LZ compression approach when the 
> changed value of the tuple is large (refer to the 500-length changed value).
> 3. The performance data is better for the offset approach for 1 thread for any 
> size of data (it dips for the LZ compression approach).

> Stepping back a moment, I would expect this patch to change performance in at
> least four ways (Heikki largely covered this upthread):

> a) High-concurrency workloads will improve thanks to reduced WAL insert
> contention.
> b) All workloads will degrade due to the CPU cost of identifying and
> implementing the optimization.
> c) Workloads starved for bulk WAL I/O will improve due to reduced WAL volume.
> d) Workloads composed primarily of long transactions with high WAL volume will
>    improve due to having fewer end-of-WAL-segment fsync requests.

All your points are a very good summarization of the work, but I think one point 
can be added:
e) Reduced cost of computing the CRC and copying data into the XLOG buffer in 
XLogInsert(), due to the reduced size of the xlog record.

> Your benchmark numbers show small gains and losses for single-client
> workloads, moving to moderate gains for 2-client workloads.  This suggests
> strong influence from (a), some influence from (b), and little influence from
> (c) and (d).  Actually, the response to scale evident in your numbers seems
> too good to be true; why would (a) have such a large effect over the
> transition from one client to two clients? 

I think if we just look at it from the point of LZ compression, there are 
predominantly 2 things: your point (b) and point (e) mentioned by me.
For a single thread, the cost of doing compression supersedes the CRC savings 
and the other improvements in XLogInsert().
However, when it comes to multiple threads, the cost reduction due to point (e) 
reduces the time under the lock, and hence we see such an effect from 
1 client to 2 clients.

> Also, for whatever reason, all
> your numbers show fairly bad scaling.  With the XLOG scale and LZ patches,
> synchronous_commit=off, -F 80, and rec length 250, 8-client average
> performance is only 2x that of 1-client average performance.

I am really sorry, this is my mistake in how I presented the numbers; the 8-thread 
number is actually from a run with -c16 -j8,
meaning 16 clients and 8 threads. That can be the reason it's only showing 2X; 
otherwise it would have shown numbers similar to what you are seeing.


> Benchmark results:

> -Patch-             -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
> HEAD,-F80            816        1644       6528       1821 MiB
> xlogscale,-F80       824        1643       6551       1826 MiB
> xlogscale+lz,-F80    717        1466       5924       1137 MiB
> xlogscale+lz,-F100   753        1508       5948       1548 MiB

> Those are short runs with no averaging of multiple iterations; don't put too
> much faith in the absolute numbers.  Still, I consistently get linear scaling
> from 1 client to 8 clients.  Why might your results have been so different in
> this regard?

1. The only reason you are seeing the difference in linear scalability could be 
that the numbers I posted for 8 threads are from a run with -c16 -j8.
I shall run with -c8 and post the performance numbers.
I am hoping they will match the way you see the numbers.
2. Now, if we look at the results you have posted, 
a) there is not much performance difference between HEAD and xlog scale 
b) with the LZ patch it shows a decrease in performance.
   I think this may be because it ran for a very short time, as you also 
mentioned.



> It's also odd that your -F100 numbers tend to follow your -F80 numbers despite
> the optimization kicking in far more frequently for the latter.

The results, as the average of three 15-minute runs, for the LZ patch are:
 -Patch-             -tps@-c1-  -tps@-c2-  -tps@-c16-j8
 xlogscale+lz,-F80   663        1232       2498
 xlogscale+lz,-F100  660        1221       2361

The results show that avg. tps is better with -F80, which is I think what
is expected.

[HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-10-23 Thread Noah Misch
Hi Amit,

On Tue, Oct 16, 2012 at 09:22:39AM +, Amit kapila wrote:
> On Saturday, October 06, 2012 7:34 PM Amit Kapila wrote:
> > Please find the readings of LZ patch along with Xlog-Scale patch.
> > The comparison, for Update operations, is between:
> > base code + Xlog Scale Patch
> > base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)
> 
> This contains all the consolidated data and comparison for both the 
> approaches:
> 
> The difference of this testcase compared to the previous one is that it uses 
> the default value of wal_page_size (8K), whereas the previous one used a 
> wal_page_size of 1K.

What is "wal_page_size"?  Is that ./configure --with-wal-blocksize?

> Observations From Performance Data
> --
> 1. With both approaches the performance data is good.
> LZ compression - up to 100% performance improvement.
> Offset approach - up to 160% performance improvement.
> 2. The performance data is better for the LZ compression approach when the 
> changed value of the tuple is large (refer to the 500-length changed value).
> 3. The performance data is better for the offset approach for 1 thread for any 
> size of data (it dips for the LZ compression approach).

Stepping back a moment, I would expect this patch to change performance in at
least four ways (Heikki largely covered this upthread):

a) High-concurrency workloads will improve thanks to reduced WAL insert
contention.
b) All workloads will degrade due to the CPU cost of identifying and
implementing the optimization.
c) Workloads starved for bulk WAL I/O will improve due to reduced WAL volume.
d) Workloads composed primarily of long transactions with high WAL volume will
improve due to having fewer end-of-WAL-segment fsync requests.

Your benchmark numbers show small gains and losses for single-client
workloads, moving to moderate gains for 2-client workloads.  This suggests
strong influence from (a), some influence from (b), and little influence from
(c) and (d).  Actually, the response to scale evident in your numbers seems
too good to be true; why would (a) have such a large effect over the
transition from one client to two clients?  Also, for whatever reason, all
your numbers show fairly bad scaling.  With the XLOG scale and LZ patches,
synchronous_commit=off, -F 80, and rec length 250, 8-client average
performance is only 2x that of 1-client average performance.

I attempted to reproduce this effect on an EC2 m2.4xlarge instance (8 cores,
70 GiB) with the data directory under a tmpfs mount.  This should thoroughly
isolate effects (a) and (b) from (c) and (d).  I used your pgbench_250.c[1] in
30s runs.  Configuration:

 autovacuum  | off
 checkpoint_segments | 500
 checkpoint_timeout  | 1h
 client_encoding | UTF8
 lc_collate  | C
 lc_ctype| C
 max_connections | 100
 server_encoding | SQL_ASCII
 shared_buffers  | 4GB
 wal_buffers | 16MB

Benchmark results:

-Patch-             -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
HEAD,-F80           816        1644       6528       1821 MiB
xlogscale,-F80      824        1643       6551       1826 MiB
xlogscale+lz,-F80   717        1466       5924       1137 MiB
xlogscale+lz,-F100  753        1508       5948       1548 MiB

Those are short runs with no averaging of multiple iterations; don't put too
much faith in the absolute numbers.  Still, I consistently get linear scaling
from 1 client to 8 clients.  Why might your results have been so different in
this regard?

It's also odd that your -F100 numbers tend to follow your -F80 numbers despite
the optimization kicking in far more frequently for the latter.

nm

[1] 
http://archives.postgresql.org/message-id/001d01cda180$9f1e47a0$dd5ad6e0$@kap...@huawei.com




Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-10-23 Thread Alvaro Herrera
Amit kapila wrote:

> Rebased version of patch based on latest code.

Uhm, how can this patch change a caller of PageAddItem() by adding one
more argument, yet not touch bufpage.c at all?  Are you sure this
compiles?

The email subject has a WIP tag; is that still the patch status?  If so,
I assume it's okay to mark this Returned with Feedback and expect a
later version to be posted.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-10-06 Thread Amit kapila
On Thursday, October 04, 2012 8:03 PM Heikki Linnakangas wrote:
On Wednesday, October 03, 2012 9:33 PM Amit Kapila wrote:
On Friday, September 28, 2012 7:03 PM Amit Kapila wrote:
> > On Thursday, September 27, 2012 6:39 PM Amit Kapila wrote:
> > > On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
> > > On 25.09.2012 18:27, Amit Kapila wrote:
> > > > If you feel it is must to do the comparison, we can do it in same
> > way
> > > > as we identify for HOT?
> > >
>
>
> > Now I shall do the various tests for following and post it here:
> > a. Attached Patch in the mode where it takes advantage of history
> > tuple b. By changing the logic for modified column calculation to use
> > calculation for memcmp()
>
>

> 1. Please find the results (pgbench_test.htm) for point -2, where there is
> one fixed-column update (last few bytes are random) and the second column
> update is a 32-byte random string. The results for 50 and 100 are still going
> on; the others are attached with this mail.

Please find the readings of the LZ patch along with the Xlog-Scale patch. 
The comparison, for Update operations, is between:
base code + Xlog Scale Patch
base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)

The readings have been taken based on below data.
pgbench_xlog_scale_50 -
a. Updated Record size 50, Total Record size 1800
b. Threads 8, 1, 2
c. Synchronous_commit - off, on

pgbench_xlog_scale_250 - 
a. Updated Record size 250, Total Record size 1800
b. Threads 8, 1, 2
c. Synchronous_commit - off, on

pgbench_xlog_scale_500- 
a. Updated Record size 500, Total Record size 1800
b. Threads 8, 1, 2
c. Synchronous_commit - off, on 

Observations
--
a. There is still a good performance improvement even if we do the Update WAL 
optimization on top of the Xlog Scaling patch.
b. There is a slight performance dip for 1 thread (only with synchronous_commit = off) 
with the Update WAL optimization (LZ compression), 
but for 2 threads there is a performance increase.


With Regards,
Amit Kapila.



Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-10-04 Thread Amit Kapila
> On Thursday, October 04, 2012 12:54 PM Heikki Linnakangas
> On 03.10.2012 19:03, Amit Kapila wrote:
> > Any comments/suggestions regarding performance/functionality test?
> 
> Hmm. Doing a lot of UPDATEs concurrently can be limited by the
> WALInsertLock, which each inserter holds while copying the WAL record to
> the buffer. Reducing the size of the WAL records, by compression or
> delta encoding, alleviates that bottleneck: when WAL records are
> smaller, the lock needs to be held for a shorter duration. That improves
> throughput, even if individual backends need to do more CPU work to
> compress the records, because that work can be done in parallel. I
> suspect much of the benefit you're seeing in these tests might be
> because of that effect.
> 
> As it happens, I've been working on making WAL insertion scale better in
> general:
> http://archives.postgresql.org/message-id/5064779a.3050...@vmware.com.
> That should also help most when inserting large WAL records. The
> question is: assuming we commit the xloginsert-scale patch, how much
> benefit is there left from the compression? It will surely still help to
> reduce the size of WAL, which can certainly help if you're limited by
> the WAL I/O, but I suspect the results from the pgbench tests you run
> might look quite different.
> 
> So, could you rerun these tests with the xloginsert-scale patch applied?

I shall take care of doing the performance test with the xloginsert-scale patch
as well, both for single-threaded and multi-threaded runs.

With Regards,
Amit Kapila.





Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-10-04 Thread Heikki Linnakangas

On 03.10.2012 19:03, Amit Kapila wrote:

Any comments/suggestions regarding performance/functionality test?


Hmm. Doing a lot of UPDATEs concurrently can be limited by the 
WALInsertLock, which each inserter holds while copying the WAL record to 
the buffer. Reducing the size of the WAL records, by compression or 
delta encoding, alleviates that bottleneck: when WAL records are 
smaller, the lock needs to be held for a shorter duration. That improves 
throughput, even if individual backends need to do more CPU work to 
compress the records, because that work can be done in parallel. I 
suspect much of the benefit you're seeing in these tests might be 
because of that effect.


As it happens, I've been working on making WAL insertion scale better in 
general: 
http://archives.postgresql.org/message-id/5064779a.3050...@vmware.com. 
That should also help most when inserting large WAL records. The 
question is: assuming we commit the xloginsert-scale patch, how much 
benefit is there left from the compression? It will surely still help to 
reduce the size of WAL, which can certainly help if you're limited by 
the WAL I/O, but I suspect the results from the pgbench tests you run 
might look quite different.


So, could you rerun these tests with the xloginsert-scale patch applied? 
Reducing the WAL size might still be a good idea even if the patch 
doesn't have much effect on TPS, but I'd like to make sure that the 
compression doesn't hurt performance. Also, it would be a good idea to 
repeat the tests with just a single client; we don't want to hurt the 
performance in that scenario either.


- Heikki




Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-09-27 Thread Amit Kapila
> On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
> On 25.09.2012 18:27, Amit Kapila wrote:
> > If you feel it is a must to do the comparison, we can do it in the same way
> > as we identify it for HOT?
> 
> Yeah. (But as discussed, I think it would be even better to just treat
> the old and new tuple as an opaque chunk of bytes, and run them through
> a generic delta algorithm).
> 

Thank you for the modified patch.
 
> The conclusion is that there isn't very much difference among the
> patches. They all squeeze the WAL to about the same size, and the
> increase in TPS is roughly the same.
> 
> I think more performance testing is required. The modified pgbench test
> isn't necessarily very representative of a real-life application. The
> gain (or loss) of this patch is going to depend a lot on how many
> columns are updated, and in what ways. Need to test more scenarios,
> with many different database schemas.
> 
> The LZ approach has the advantage that it can take advantage of all
> kinds of similarities between old and new tuple. For example, if you
> swap the values of two columns, LZ will encode that efficiently. Or if
> you insert a character in the middle of a long string. On the flipside,
> it's probably more expensive. Then again, you have to do a memcmp() to
> detect which columns have changed with your approach, and that's not
> free either. That was not yet included in the patch version I tested.
> Another consideration is that when you compress the record more, you
> have less data to calculate CRC for. CRC calculation tends to be quite
> expensive, so even quite aggressive compression might be a win. Yet
> another consideration is that the compression/encoding is done while
> holding a lock on the buffer. For the sake of concurrency, you want to
> keep the duration the lock is held as short as possible.

Now I shall do the various tests for the following and post them here:
a. The attached patch in the mode where it takes advantage of the history tuple
b. Changing the logic for modified-column calculation to use memcmp()


With Regards,
Amit Kapila.





Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-09-27 Thread Heikki Linnakangas

On 25.09.2012 18:27, Amit Kapila wrote:

If you feel it is a must to do the comparison, we can do it in the same way as we
identify it for HOT?


Yeah. (But as discussed, I think it would be even better to just treat 
the old and new tuple as an opaque chunk of bytes, and run them through 
a generic delta algorithm).



   Can you please explain to me why you think that doing LZ compression on top of
the encoding is better, given that we have already reduced the amount of WAL
for update by only storing changed column information?

a. is it to further reduce the size of WAL
b. to store the diff WAL in some standard format
c. or does it give any other kind of benefit


Potentially all of those. I don't know if it'd be better or worse, but 
my gut feeling is that it would be simpler, and produce even more 
compact WAL.


Attached is a simple patch to apply LZ compression to update WAL 
records. I modified the LZ compressor so that it can optionally use a 
separate "history" data, and the same history data must then be passed 
to the decompression function. That makes it work as a pretty efficient 
delta encoder, when you use the old tuple as the history data.
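
As a rough illustration of the approach described above, the call into such a 
history-aware compressor might look like the sketch below. The *_with_history 
names and signatures are assumed for illustration only; they are not the actual 
functions in the attached patch or in PostgreSQL's pg_lzcompress.c.

/*
 * Illustrative sketch: delta-encoding an updated tuple by priming an LZ
 * compressor with the old tuple as "history".  The declarations below are
 * assumptions made for this sketch, not the patch's real API.
 */
#include "postgres.h"
#include "access/htup.h"			/* HeapTuple */
#include "utils/pg_lzcompress.h"	/* PGLZ_Strategy, PGLZ_strategy_default */

/* assumed: compress 'source', allowing back-references into 'history' */
extern int32 pglz_compress_with_history(const char *source, int32 slen,
										const char *history, int32 hlen,
										char *dest,
										const PGLZ_Strategy *strategy);

/* assumed: the same history must be supplied again at decompression time */
extern int32 pglz_decompress_with_history(const char *source, int32 slen,
										  const char *history, int32 hlen,
										  char *dest);

/*
 * Encode the new tuple relative to the old one.  Matches against the history
 * buffer become copy commands, so the output is effectively a delta; at
 * recovery time the old tuple read from the page supplies the same history
 * for decompression.
 */
static int32
encode_update_delta(HeapTuple oldtup, HeapTuple newtup, char *wal_data)
{
	return pglz_compress_with_history((char *) newtup->t_data, newtup->t_len,
									  (char *) oldtup->t_data, oldtup->t_len,
									  wal_data, PGLZ_strategy_default);
}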


I ran some performance tests with the modified version of pgbench that 
you posted earlier:


Current PostgreSQL master
-

tps = 941.601924 (excluding connections establishing)
 pg_xlog_location_diff
---
 721227944

pglz_wal_update_records.patch
-

tps = 1039.792527 (excluding connections establishing)
 pg_xlog_location_diff
---
 419395208

pglz_wal_update_records.patch, COMPRESS_ONLY


tps = 1009.682002 (excluding connections establishing)
 pg_xlog_location_diff
---
 422505104


Amit's wal_update_changes_hot_update.patch
--

tps = 1092.703883 (excluding connections establishing)
 pg_xlog_location_diff
---
 436031544


The COMPRESS_ONLY result is with the attached patch, but it just uses LZ 
to compress the new tuple, without taking advantage of the old tuple. 
The pg_xlog_location_diff value is the amount of WAL generated during 
the pgbench run. Attached is also the shell script I used to run these 
tests.


The conclusion is that there isn't very much difference among the 
patches. They all squeeze the WAL to about the same size, and the 
increase in TPS is roughly the same.


I think more performance testing is required. The modified pgbench test 
isn't necessarily very representative of a real-life application. The 
gain (or loss) of this patch is going to depend a lot on how many 
columns are updated, and in what ways. Need to test more scenarios, with 
many different database schemas.


The LZ approach has the advantage that it can take advantage of all 
kinds of similarities between old and new tuple. For example, if you 
swap the values of two columns, LZ will encode that efficiently. Or if 
you insert a character in the middle of a long string. On the flipside, 
it's probably more expensive. Then again, you have to do a memcmp() to 
detect which columns have changed with your approach, and that's not 
free either. That was not yet included in the patch version I tested. 
Another consideration is that when you compress the record more, you 
have less data to calculate CRC for. CRC calculation tends to be quite 
expensive, so even quite aggressive compression might be a win. Yet 
another consideration is that the compression/encoding is done while 
holding a lock on the buffer. For the sake of concurrency, you want to 
keep the duration the lock is held as short as possible.


- Heikki
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5a4591e..56b53a5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,6 +70,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
 
 
 /* GUC variable */
@@ -85,6 +86,7 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 	TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
 ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+HeapTuple oldtup,
 bool all_visible_cleared, bool new_all_visible_cleared);
 static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
 	   HeapTuple oldtup, HeapTuple newtup);
@@ -3195,10 +3197,12 @@ l2:
 	/* XLOG stuff */
 	if (RelationNeedsWAL(relation))
 	{
-		XLogRecPtr	recptr = log_heap_update(relation, buffer, oldtup.t_self,
-			 newbuf, heaptup,
-			 all_visible_cleared,
-			 all_visible_cleared_new);
+		XLogRecPtr	recptr;
+
+		recptr = log_heap_update(relation, buffer, oldtup.t_self,
+ newbuf, heaptup, &oldtup,
+ all_visible_cleared,
+		

Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-09-26 Thread Amit Kapila
  On Thursday, September 27, 2012 10:19 AM, Tom Lane wrote:
> Noah Misch  writes:
> > You cannot assume executor-unmodified columns are also unmodified from
> > heap_update()'s perspective.  Expansion in one column may instigate TOAST
> > compression of a logically-unmodified column, and that counts as a change for
> > xlog delta purposes.
> 
> Um ... what about BEFORE triggers?
 
This optimization will not apply in case a BEFORE trigger updates the tuple.
 
> 
> Frankly, I think that expecting the executor to tell you which columns
> have been modified is a non-starter.  We have a solution for HOT and
> it's silly to do the same thing differently just a few lines away.
> 
My apprehension is that it can hurt the performance advantage if we compare
all attributes to check which have been modified, and that too under the buffer
exclusive lock.
In case of HOT, only the index attributes get compared.

I agree that doing things differently at 2 nearby places is not good.
So I will do it the same way as for HOT and then take the performance data again,
and if there is no big impact then
we can do it that way.
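
A minimal sketch of what doing it the same way as for HOT could look like is 
below: a per-attribute equality check in the spirit of the helper used by 
HeapSatisfiesHOTUpdate. It is illustrative only; system columns and detoasting 
of varlena values are not handled, and the attribute access shown uses the 
pre-PG10 TupleDesc layout.

/*
 * Illustrative sketch, not the patch's code: decide whether one attribute is
 * unchanged between the old and new tuple, the way HOT decides whether an
 * indexed column changed.
 */
#include "postgres.h"
#include "access/htup.h"		/* heap_getattr(); "access/htup_details.h" on 9.3+ */
#include "access/tupdesc.h"		/* TupleDesc, Form_pg_attribute */
#include "utils/datum.h"		/* datumIsEqual() */

static bool
tuple_attr_equals(TupleDesc tupdesc, int attrnum, HeapTuple tup1, HeapTuple tup2)
{
	Datum		value1,
				value2;
	bool		isnull1,
				isnull2;
	Form_pg_attribute att;

	value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
	value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);

	/* a NULL/non-NULL mismatch is a change; two NULLs are equal */
	if (isnull1 != isnull2)
		return false;
	if (isnull1)
		return true;

	/* pre-PG10 attribute access; TupleDescAttr(tupdesc, attrnum - 1) today */
	att = tupdesc->attrs[attrnum - 1];
	return datumIsEqual(value1, value2, att->attbyval, att->attlen);
}

heap_update() could loop over all user attributes with such a check (rather than
only the indexed ones, as HOT does) to build the modified-column set under the
buffer lock, which is exactly the cost being weighed here.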


With Regards,
Amit Kapila.





Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-09-26 Thread Tom Lane
Noah Misch  writes:
> You cannot assume executor-unmodified columns are also unmodified from
> heap_update()'s perspective.  Expansion in one column may instigate TOAST
> compression of a logically-unmodified column, and that counts as a change for
> xlog delta purposes.

Um ... what about BEFORE triggers?

Frankly, I think that expecting the executor to tell you which columns
have been modified is a non-starter.  We have a solution for HOT and
it's silly to do the same thing differently just a few lines away.

regards, tom lane




[HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-09-26 Thread Noah Misch
On Mon, Sep 24, 2012 at 10:57:02AM +, Amit kapila wrote:
> Rebased version of patch based on latest code.

I like the direction you're taking with this patch; the gains are striking,
especially considering the isolation of the changes.

You cannot assume executor-unmodified columns are also unmodified from
heap_update()'s perspective.  Expansion in one column may instigate TOAST
compression of a logically-unmodified column, and that counts as a change for
xlog delta purposes.  You do currently skip the optimization for relations
having a TOAST table, but TOAST compression can still apply.  Observe this
with text columns of storage mode PLAIN.  I see two ways out: skip the new
behavior when need_toast=true, or compare all inline column data, not just
what the executor modified.  One can probably construct a benchmark favoring
either choice.  I'd lean toward the latter; wide tuples are the kind this
change can most help.  If the marginal advantage of ignoring known-unmodified
columns proves important, we can always bring it back after designing a way to
track which columns changed in the toaster.

Given that, why not treat the tuple as an opaque series of bytes and not worry
about datum boundaries?  When several narrow columns change together, say a
sequence of sixteen smallint columns, you will use fewer binary delta commands
by representing the change with a single 32-byte substitution.  If an UPDATE
changes just part of a long datum, the delta encoding algorithm will still be
able to save considerable space.  That case arises in many forms:  changing
one word in a long string, changing one element in a long array, changing one
field of a composite-typed column.  Granted, this makes the choice of delta
encoding algorithm more important.

Like Heikki, I'm left wondering why your custom delta encoding is preferable
to an encoding from the literature.  Your encoding has much in common with
VCDIFF, even sharing two exact command names.  If a custom encoding is the
right thing, code comments or a README section should at least discuss the
advantages over an established alternative.  Idle thought: it might pay off to
use 1-byte sizes and offsets most of the time.  Tuples shorter than 256 bytes
are common; for longer tuples, we can afford wider offsets.
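
As a purely illustrative sketch of such a command-based encoding (not the
encoding used by the patch, and not VCDIFF), a toy encoder over the raw bytes of
two equal-length images, using 1-byte lengths, could look like this; note how a
run of adjacent changed columns collapses into a single substitution command:

#include <stdint.h>
#include <string.h>

/*
 * Toy delta format over two same-length byte images (illustration only):
 *   0x00 <len>          copy <len> bytes (1..255) unchanged from the old image
 *   0x01 <len> <bytes>  substitute <len> literal bytes (1..255) from the new image
 * Sixteen adjacent modified smallint columns therefore become one 32-byte
 * substitution rather than sixteen separate commands.  A real encoder would also
 * need wider lengths for long runs and handling for tuples that change length.
 */
static size_t
delta_encode(const uint8_t *oldp, const uint8_t *newp, size_t len, uint8_t *out)
{
	size_t		i = 0;
	size_t		n = 0;

	while (i < len)
	{
		int			changed = (oldp[i] != newp[i]);
		size_t		run = 0;

		/* length of the current run of changed or unchanged bytes, capped at 255 */
		while (i + run < len && run < 255 &&
			   (oldp[i + run] != newp[i + run]) == changed)
			run++;

		out[n++] = changed ? 0x01 : 0x00;
		out[n++] = (uint8_t) run;
		if (changed)
		{
			memcpy(out + n, newp + i, run);
			n += run;
		}
		i += run;
	}
	return n;					/* caller must size 'out' for the worst case */
}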

The benchmarks you posted upthread were helpful.  I think benchmarking with
fsync=off is best if you don't have a battery-backed write controller or SSD.
Otherwise, fsync time dominates a pgbench run.  Please benchmark recovery.  To
do so, set up WAL archiving and take a base backup from a fresh cluster.  Run
pgbench for awhile.  Finally, observe the elapsed time to recover your base
backup to the end of archived WAL.


> *** a/src/backend/access/common/heaptuple.c
> --- b/src/backend/access/common/heaptuple.c

> + /*
> +  * encode_xlog_update
> +  *  Forms a diff tuple from old and new tuple with the modified 
> columns.
> +  *
> +  *  att - attribute list.
> +  *  oldtup - pointer to the old tuple.
> +  *  heaptup - pointer to the modified tuple.
> +  *  wal_tup - pointer to the wal record which needs to be formed 
> from old
> +   and new tuples by using the modified columns 
> list.
> +  *  modifiedCols - modified columns list by the update command.
> +  */
> + void
> + encode_xlog_update(Form_pg_attribute *att, HeapTuple oldtup,
> +HeapTuple heaptup, HeapTuple wal_tup,
> +Bitmapset *modifiedCols)

This name is too generic for an extern function.  Maybe "heap_delta_encode"?

> + void
> + decode_xlog_update(HeapTupleHeader htup, uint32 old_tup_len, char *data,
> +uint32 *new_tup_len, char *waldata, uint32 
> wal_len)

Likewise, maybe "heap_delta_decode" here.

> *** a/src/backend/access/heap/heapam.c
> --- b/src/backend/access/heap/heapam.c
> ***
> *** 71,77 
>   #include "utils/syscache.h"
>   #include "utils/tqual.h"
>   
> - 
>   /* GUC variable */
>   boolsynchronize_seqscans = true;
>   

Spurious whitespace change.

> ***
> *** 3195,3204  l2:
>   /* XLOG stuff */
>   if (RelationNeedsWAL(relation))
>   {
> !		XLogRecPtr	recptr = log_heap_update(relation, buffer, oldtup.t_self,
> !								 newbuf, heaptup,
> !								 all_visible_cleared,
> !								 all_visible_cleared_new);
>   
>   if (newbuf != buffer)
>   {
> --- 3203,3233 
>   /* XLOG stuff */
>   if (RelationNeedsWAL(relation))
>   {
> ! XLogRecPtr  recptr;
> ! 
> ! /*
> !  * Apply the xlog diff update algorithm only for hot updates.
>

Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-09-25 Thread Amit Kapila
> On Tuesday, September 25, 2012 7:30 PM Heikki Linnakangas wrote:
> On 24.09.2012 13:57, Amit kapila wrote:
> > Rebased version of patch based on latest code.
> 
> When HOT was designed, we decided that heap_update needs to compare the
> old and new attributes directly, with memcmp(), to determine whether any
> of the indexed columns have changed. It was not deemed infeasible to
> pass down that information from the executor. I don't remember the
> details of why that was, but you seem to be trying the same thing in this
> patch, and passing the bitmap of modified cols from the executor to
> heap_update(). I'm pretty sure that won't work, for the same reasons we
> didn't do it for HOT.

I think the reason for not relying on modified columns can be some case
where the modified columns might not give the correct information.
It may be because BEFORE triggers can change the modified columns; that's why
for HOT update we need to do the
comparison. In our case we have taken care of such a case by not doing the
optimization, so we are not relying on modified columns.

If you feel it is a must to do the comparison, we can do it in the same way as we
identify it for HOT?

> I still feel that it would probably be better to use a generic delta
> encoding scheme, instead of inventing one. How about VCDIFF
> (http://tools.ietf.org/html/rfc3284), for example? Or you could reuse
> the LZ compressor that we already have in the source tree. You can use
> LZ for delta compression by initializing the history buffer of the
> algorithm with the old tuple, and then compressing the new tuple as
> usual. 

>Or you could still use the knowledge of where the attributes
> begin and end and which attributes were updated, and do the encoding
> similar to how you did in the patch, but use LZ as the output format.
> That way the decoding would be the same as LZ decompression.

  Can you please explain to me why you think that doing LZ compression on top of
the encoding is better, given that we have already reduced the amount of WAL
for update by only storing changed column information?

a. is it to further reduce the size of WAL
b. to store the diff WAL in some standard format
c. or does it give any other kind of benefit

With Regards,
Amit Kapila.





Re: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-09-25 Thread Heikki Linnakangas

On 24.09.2012 13:57, Amit kapila wrote:

Rebased version of patch based on latest code.


When HOT was designed, we decided that heap_update needs to compare the 
old and new attributes directly, with memcmp(), to determine whether any 
of the indexed columns have changed. It was not deemed infeasible to 
pass down that information from the executor. I don't remember the 
details of why that was, but you seem to be trying the same thing in this 
patch, and passing the bitmap of modified cols from the executor to 
heap_update(). I'm pretty sure that won't work, for the same reasons we 
didn't do it for HOT.


I still feel that it would probably be better to use a generic delta 
encoding scheme, instead of inventing one. How about VCDIFF 
(http://tools.ietf.org/html/rfc3284), for example? Or you could reuse 
the LZ compressor that we already have in the source tree. You can use 
LZ for delta compression by initializing the history buffer of the 
algorithm with the old tuple, and then compressing the new tuple as 
usual. Or you could still use the knowledge of where the attributes 
begin and end and which attributes were updated, and do the encoding 
similar to how you did in the patch, but use LZ as the output format. 
That way the decoding would be the same as LZ decompression.


- Heikki




[HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation

2012-09-24 Thread Amit kapila
From: Heikki Linnakangas [mailto:heikki(dot)linnakangas(at)enterprisedb(dot)com]
Sent: Monday, August 27, 2012 5:58 PM
To: Amit kapila
On 27.08.2012 15:18, Amit kapila wrote:
>>> I have implemented the WAL Reduction Patch for the case of HOT Update as
pointed out by Simon and Robert. In this patch it only goes for Optimized
WAL in case of HOT Update with other restrictions same as in previous patch.
>>
>>> The performance numbers for this patch are attached in this mail. It has
improved by 90% if the page has fillfactor 80.
>>
>>> Now going forward I have following options:
>>> a. Upload the patch in Open CF for WAL Reduction which contains
reduction for HOT and non-HOT updates.
>>> b. Upload the patch in Open CF for WAL Reduction which contains
reduction for HOT updates.
>>> c. Upload both the patches as different versions.

>> Let's do it for HOT updates only. Simon & Robert made good arguments on
>> why this is a bad idea for non-HOT updates.

>Okay, I shall do it that way.
>So now I shall send information about all the testing I have done for this
>Patch and then Upload it in CF.



Rebased version of patch based on latest code.




With Regards,

Amit Kapila.


wal_update_changes_v2.patch
Description: wal_update_changes_v2.patch
