Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Sat, Oct 27, 2012 at 04:57:46PM +0530, Amit Kapila wrote:
> On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:
>> Could you elaborate on your reason for continuing to treat TOAST as a
>> special case? As best I recall, the only reason to do so before was the
>> fact that TOAST can change the physical representation of a column even
>> when the executor did not change its logical content. Since you're no
>> longer relying on the executor's opinion of what changed, a TOASTed
>> value is not special.
>
> I thought that for the initial version of the patch, without this change,
> the patch would have less impact and need less testing.

Not that I'm aware. If you still think so, please explain.

>>> For this patch I am interested in going with the delta encoding
>>> approach based on column boundaries.
>>
>> Fair enough. If you conclude that finding sub-column similarity is not
>> worthwhile, at least teach your algorithm to aggregate runs of changing
>> or unchanging columns into fewer delta instructions. If a table contains
>> twenty unchanging bool columns, you currently use at least 80 bytes to
>> encode that fact. By treating the run of columns as a unit for delta
>> encoding purposes, you could encode it in 23 bytes.
>
> Do you mean to say handle it for non-contiguous unchanged columns?

My statement above was a mess.

> I believe that for contiguous unchanged columns this is already handled,
> unless there are alignment changes. Example:
>
>   create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool,
>                    f7 bool, f8 bool, f9 bool, f10 bool, f11 bool,
>                    f12 bool, f13 bool, f14 bool, f15 bool, f16 bool,
>                    f17 bool, f18 bool, f19 bool, f20 bool, f21 bool);
>   insert into tbl values(10, 't','t','t','t','t','t','t','t','t','t',
>                          't','t','t','t','t','t','t','t','t','t');
>   update tbl set f1 = 20;
>
> The delta algorithm for the above operation reduced the size of the
> tuple from 24 bytes to 12 bytes:
>   4 bytes - IGN command and LEN
>   4 bytes - ADD command and LEN
>   4 bytes - data block

I now see that this case is already handled. Sorry for the noise.
Incidentally, I tried this variant:

  create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool,
                   f7 bool, f8 bool, f9 bool, f10 bool, f11 bool, f12 bool,
                   f13 bool, f14 bool, f15 bool, f16 bool, f17 bool,
                   f18 bool, f19 bool, f20 bool, f21 bool, f22 int, f23 int);
  insert into tbl values(1, 't','t','t','t','t','t','t','t','t','t','t',
                         't','t','t','t','t','t','t','t','t', 2, 3);
  update tbl set f1 = 2, f22 = 4, f23 = 6;

It yielded an erroneous delta: IGN 4, ADD 4, COPY 24, IGN 4, ADD 4,
COPY 28, IGN 4, ADD 4. (The delta happens to be longer than the data and
goes unused.)

nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
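The IGN/ADD/COPY command stream discussed in this thread can be illustrated with a small sketch. This is not the patch's code: the helper names and in-memory representation are invented, run-merging is shown at the column level only, and real heap tuples would also carry headers and alignment padding that this toy ignores.

```python
def encode_delta(old_cols, new_cols):
    """Compare two tuples column by column and emit (command, arg) pairs.

    old_cols/new_cols: lists of bytes objects, one entry per column.
      COPY n -> take n bytes unchanged from the old tuple
      IGN n  -> skip n bytes of the old tuple
      ADD b  -> append the literal bytes b from the new tuple
    Adjacent commands of the same kind are merged, so a run of unchanged
    (or changed) columns costs a single instruction.
    """
    ops = []

    def emit(cmd, arg):
        if ops and ops[-1][0] == cmd:
            ops[-1] = (cmd, ops[-1][1] + arg)  # merge runs (int+int or bytes+bytes)
        else:
            ops.append((cmd, arg))

    for old, new in zip(old_cols, new_cols):
        if old == new:
            emit("COPY", len(old))
        else:
            emit("IGN", len(old))
            emit("ADD", new)
    return ops


def apply_delta(ops, old_tuple):
    """Reconstruct the new tuple from the old tuple plus the delta."""
    out, pos = bytearray(), 0
    for cmd, arg in ops:
        if cmd == "COPY":
            out += old_tuple[pos:pos + arg]
            pos += arg
        elif cmd == "IGN":
            pos += arg
        else:  # ADD
            out += arg
    return bytes(out)
```

On the 20-bool example above (one changed int followed by twenty unchanged booleans), this produces exactly three instructions: IGN 4, ADD 4 bytes, COPY 20, matching the thread's observation that the contiguous unchanged run is already handled.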
Re: [HACKERS] Logical to physical page mapping
On 27.10.2012 16:43, Tom Lane wrote:
> Jan Wieck <janwi...@yahoo.com> writes:
>> The reason why we need full_page_writes is that we need to guard
>> against torn pages or partial writes. So what if smgr would manage a
>> mapping between logical page numbers and their physical location in
>> the relation? At the moment where we today require a full page write
>> into WAL, we would mark the buffer as "needs relocation". The smgr
>> would then write this page into another physical location whenever it
>> is time to write it (via the background writer, hopefully). After that
>> page is flushed, it would update the page location pointer, or
>> whatever we want to call it. A thus freed physical page location can
>> be reused, once the location pointer has been flushed to disk. This is
>> a critical ordering of writes: first the page at the new location,
>> second the pointer to the current location. Doing so would make
>> write(2) appear atomic to us, which is exactly what we need for crash
>> recovery.
>
> Hmm, aka copy-on-write.

I think you're just moving the atomic-write problem from the data pages to
wherever you keep these pointers. If the pointers are stored as simple
4-byte integers, you probably could assume that they're atomic, and won't
be torn.

There's a lot of practical problems in adding another level of indirection
to every page access, though. It'll surely add some overhead to every
access, even if the data never changes. And it's not at all clear to me
that it would perform better than full_page_writes; you're writing and
flushing out roughly the same amount of data AFAICS. What exactly is the
problem with full_page_writes that we're trying to solve?

- Heikki
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On 27.10.2012 14:27, Amit Kapila wrote:
> On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:
>> In my previous review, I said:
>>> Given [not relying on the executor to know which columns changed],
>>> why not treat the tuple as an opaque series of bytes and not worry
>>> about datum boundaries? When several narrow columns change together,
>>> say a sequence of sixteen smallint columns, you will use fewer binary
>>> delta commands by representing the change with a single 32-byte
>>> substitution.
>>
>> If an UPDATE changes just part of a long datum, the delta encoding
>> algorithm will still be able to save considerable space. That case
>> arises in many forms: changing one word in a long string, changing one
>> element in a long array, changing one field of a composite-typed
>> column. Granted, this makes the choice of delta encoding algorithm
>> more important. We may be leaving considerable savings on the table by
>> assuming that column boundaries are the only modified-range boundaries
>> worth recognizing. What is your willingness to explore general
>> algorithms for choosing such boundaries? Such an investigation may, of
>> course, be a dead end.
>
> For this patch I am interested in going with the delta encoding approach
> based on column boundaries. However, I shall try the other approach
> separately, and if it gives positive results then I will share them with
> hackers. I will try with VCDiff, or let me know if you have any other
> algorithm in mind.

One idea is to use the LZ format in the WAL record, but use your memcmp()
code to construct it. I believe the slow part in LZ compression is trying
to locate matches in the history, so if you just replace that with your
code that's aware of the column boundaries and uses simple memcmp() to
detect what parts changed, you could create LZ-compressed output just as
quickly as the custom encoded format. It would leave the door open for
making the encoding smarter, or for doing actual compression in the
future, without changing the format and the code to decode it.
- Heikki
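Heikki's suggestion, keeping the standard LZ output stream of literals and history matches but finding the matches by memcmp() at column boundaries instead of a history search, might look roughly like this. The item representation is schematic (pglz's real format is a bit-packed control byte plus offset/length encodings) and the function names are invented:

```python
def lz_items_from_columns(old_cols, new_cols):
    """Emit an LZ-style item stream for the new tuple, using the old tuple
    as the history.  Unchanged columns become matches into the history;
    changed columns become literals.  Contiguous matches are merged, so a
    run of unchanged columns costs one match item."""
    items, off = [], 0  # off = offset of the current column in the old tuple
    for old, new in zip(old_cols, new_cols):
        if old == new:
            last = items[-1] if items else None
            if last and last[0] == "match" and last[1] + last[2] == off:
                items[-1] = ("match", last[1], last[2] + len(old))
            else:
                items.append(("match", off, len(old)))
        else:
            items.append(("literal", new))
        off += len(old)
    return items


def lz_expand(items, history):
    """Decode the item stream -- the shape a generic LZ decoder would use,
    independent of how the matches were found."""
    out = bytearray()
    for item in items:
        if item[0] == "literal":
            out += item[1]
        else:
            _, off, length = item
            out += history[off:off + length]
    return bytes(out)
```

The point of the design is visible in `lz_expand`: the decoder never needs to know that the encoder cheated by only looking at column boundaries, so the encoder could later be made smarter without touching the decode path.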
Re: [HACKERS] Logical to physical page mapping
On Sat, Oct 27, 2012 at 3:41 PM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
> I think you're just moving the atomic-write problem from the data pages
> to wherever you keep these pointers. If the pointers are stored as
> simple 4-byte integers, you probably could assume that they're atomic,
> and won't be torn.

That could backfire.
Re: [HACKERS] Logical to physical page mapping
Claudio Freire <klaussfre...@gmail.com> writes:
> On Sat, Oct 27, 2012 at 3:41 PM, Heikki Linnakangas
> <hlinnakan...@vmware.com> wrote:
>> I think you're just moving the atomic-write problem from the data
>> pages to wherever you keep these pointers. If the pointers are stored
>> as simple 4-byte integers, you probably could assume that they're
>> atomic, and won't be torn.
>
> That could backfire.

Yeah, the potential loss in resiliency is scary. Assume for the sake of
argument that we're storing these indirection pointers in 8K pages, 2000
or so to the page. If you get a read failure on a regular heap page,
you've lost 8K worth of data. If you get a read failure on an indirection
page, you've lost 16MB worth. (Though perhaps the pointers could be
reconstructed, given additional overhead data on each regular heap page;
but that wouldn't be very cheap.)

Also, the write traffic on a pointer page is 2000X as much as it is on an
average heap page, offering 2000X the opportunities for the disk hardware
to drop a bit, or for SSD storage to wear down, etc. (Perhaps it's not
that bad for typical access patterns, but then again perhaps it's worse,
given the already-noted strict requirements for write ordering.) So it
seems like there'd be a rather nasty magnification of consequences for
hardware errors. You've also got something like a 2000X concentration of
access contention for pointer pages compared to regular pages.

regards, tom lane
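For concreteness, the blast-radius arithmetic behind Tom's "2000 or so" and "16MB" figures, assuming 8 kB pages and 4-byte pointers:

```python
# Back-of-the-envelope numbers for the indirection-page resiliency concern.
PAGE = 8192                     # bytes per page
PTR = 4                         # bytes per logical->physical pointer

ptrs_per_page = PAGE // PTR     # pointers packed into one indirection page
heap_covered = ptrs_per_page * PAGE  # heap bytes addressed by that one page

assert ptrs_per_page == 2048                  # Tom's "2000 or so"
assert heap_covered == 16 * 1024 * 1024       # losing one map page loses 16 MB
```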
Re: [HACKERS] Logical to physical page mapping
On 28/10/12 07:41, Heikki Linnakangas wrote:
> I think you're just moving the atomic-write problem from the data pages
> to wherever you keep these pointers. If the pointers are stored as
> simple 4-byte integers, you probably could assume that they're atomic,
> and won't be torn.
> [...]

Would a 4-byte pointer be adequate for a 64-bit machine with well over
4GB used by Postgres?

Cheers,
Gavin
Re: [HACKERS] My first patch! (to \df output)
> This was actually kind of anti-climactic, since it only took about 5
> minutes to make the change and get it working. Didn't really feel the
> way I expected it to ;)

Well, we can reject your patch and start bike-shedding it for the next
four months, if that makes you feel better! :)

Congrats!

-- 
Greg Sabino Mullane g...@turnstep.com
PGP Key: 0x14964AC8 201210271914
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8