On 03/17/2014 03:20 PM, Fujii Masao wrote:
On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
<aekorot...@gmail.com> wrote:
On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:

I ran "pg_xlogdump | grep Gin" and checked the size of the GIN-related WAL
records, and found that the maximum seems to be more than 256B. Am I missing something?

What I observed is

[In HEAD]
At first, the size of GIN-related WAL records gradually increases up to about 1400B.
     rmgr: Gin         len (rec/tot):     48/    80, tx:       1813,
lsn: 0/020020D8, prev 0/02000070, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: F
     rmgr: Gin         len (rec/tot):     56/    88, tx:       1813,
lsn: 0/02002440, prev 0/020023F8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
     rmgr: Gin         len (rec/tot):     64/    96, tx:       1813,
lsn: 0/020044D8, prev 0/02004490, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
     ...
     rmgr: Gin         len (rec/tot):   1376/  1408, tx:       1813,
lsn: 0/02A7EE90, prev 0/02A7E910, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 2 isdata: F isleaf: T isdelete: T
     rmgr: Gin         len (rec/tot):   1392/  1424, tx:       1813,
lsn: 0/02A7F458, prev 0/02A7F410, bkp: 0000, desc: Create posting
tree, node: 1663/12945/16441 blkno: 4

This corresponds to the stage where the items are stored in-line in the entry-tree. After it reaches a certain size, a posting tree is created.

Then the size decreases to about 100B and gradually increases
again up to 320B.

     rmgr: Gin         len (rec/tot):    116/   148, tx:       1813,
lsn: 0/02A7F9E8, prev 0/02A7F458, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length:
1372 (compressed)
     rmgr: Gin         len (rec/tot):     40/    72, tx:       1813,
lsn: 0/02A7FA80, prev 0/02A7F9E8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 3 isdata: F isleaf: T isdelete: T
     ...
     rmgr: Gin         len (rec/tot):    118/   150, tx:       1813,
lsn: 0/02A83BA0, prev 0/02A83B58, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length:
1374 (compressed)
     ...
     rmgr: Gin         len (rec/tot):    288/   320, tx:       1813,
lsn: 0/02AEDE28, prev 0/02AEDCE8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 14 isdata: T isleaf: T unmodified: 1280
length: 1544 (compressed)

Then the size decreases to 66B and gradually increases again up to 320B.
This cycle of increasing and decreasing WAL record size seems to continue.

Here the new items are appended to posting tree pages. This is where the maximum of 256 bytes I mentioned applies. 256 bytes is the max size of one compressed posting list; the WAL record containing it includes some other stuff too, which adds up to that 320 bytes.
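As a rough check of the sizes quoted above, here is a small sketch (Python, not part of PostgreSQL) that pulls the rec/tot fields out of pg_xlogdump lines in the format shown in this message. The helper name record_sizes is hypothetical, and the sample lines are truncated copies of records quoted earlier.

```python
import re

# Matches the "rmgr: Gin   len (rec/tot):  48/  80" prefix of a pg_xlogdump
# line, in the format quoted in this message.
LEN_RE = re.compile(r"rmgr:\s+(\w+)\s+len \(rec/tot\):\s*(\d+)/\s*(\d+)")


def record_sizes(dump_text, rmgr="Gin"):
    """Return (rec_len, tot_len) pairs for records of one resource manager."""
    return [
        (int(rec), int(tot))
        for name, rec, tot in LEN_RE.findall(dump_text)
        if name == rmgr
    ]


# Truncated copies of three records quoted above (HEAD run).
sample = """\
rmgr: Gin         len (rec/tot):     48/    80, tx:       1813,
rmgr: Gin         len (rec/tot):   1392/  1424, tx:       1813,
rmgr: Gin         len (rec/tot):    288/   320, tx:       1813,
"""

sizes = record_sizes(sample)
# Difference between total and record length for each sampled record.
overheads = {tot - rec for rec, tot in sizes}
largest_total = max(tot for _, tot in sizes)
```

In these samples tot minus rec is the same 32 bytes for every record, so the growth seen in the dump comes from the rec part, which is where the posting list data travels.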

[In 9.3]
At first, the size of GIN-related WAL records gradually increases up to about 2700B.

     rmgr: Gin         len (rec/tot):     52/    84, tx:       1812,
lsn: 0/02000430, prev 0/020003D8, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 1 offset: 11 nitem: 1 isdata: F isleaf T
isdelete F updateBlkno:4294967295
     rmgr: Gin         len (rec/tot):     60/    92, tx:       1812,
lsn: 0/020004D0, prev 0/02000488, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 1 offset: 1 nitem: 1 isdata: F isleaf T
isdelete T updateBlkno:4294967295
     ...
     rmgr: Gin         len (rec/tot):   2740/  2772, tx:       1812,
lsn: 0/026D1670, prev 0/026D0B98, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 5 offset: 2 nitem: 1 isdata: F isleaf T
isdelete T updateBlkno:4294967295
     rmgr: Gin         len (rec/tot):   2714/  2746, tx:       1812,
lsn: 0/026D21A8, prev 0/026D2160, bkp: 0000, desc: Create posting
tree, node: 1663/12896/16441 blkno: 6

The size then decreases to 66B and never changes after that.

Same mechanism on 9.3, but the insertions to the posting tree pages are a constant size.

That could be optimized, but I figured we can live with it, thanks to the
fastupdate feature. Fastupdate allows amortizing that cost over several
insertions. But of course, you explicitly disabled that...

Let me know if you want me to write a patch addressing this issue.

Yeah, I really want you to address this problem! That's definitely useful
for every user who disables the FASTUPDATE option for some reason.

Ok, let's think about it a little bit. I think there are three fairly simple ways to address this:

1. The GIN data leaf "recompress" record contains an offset called "unmodifiedlength", and the data that comes after that offset. Currently, the record is written so that unmodifiedlength points to the end of the last compressed posting list stored on the page that was not modified, followed by all the modified ones. The straightforward way to cut down the WAL record size would be to be more fine-grained than that, and for the posting lists that were modified, only store the difference between the old and new version.

To make this approach work well for random insertions, not just appending to the end, we would also need to make the logic in leafRepackItems a bit smarter so that it would not re-encode all the posting lists after the first modified one.

2. Instead of storing the new compressed posting list in the WAL record, store only the new item pointers added to the page. WAL replay would then have to duplicate the work done in the main insertion code path: find the right posting lists to insert to, decode them, add the new items, and re-encode.

The upside of that would be that the WAL format would be very compact. It would be quite simple to implement - you just need to call the same functions we use in the main insertion code path to insert the new items. It could be more expensive, CPU-wise, to replay the records, however.

This record format would be higher-level, in the sense that we would not store the physical copy of the compressed posting list as it was formed originally. The same work would be done at WAL replay. As the code stands, it will produce exactly the same result, but that's not guaranteed if we make bugfixes to the code later, and a master and standby are running different minor versions. There's not necessarily anything wrong with that, but it's something to keep in mind.

3. Just reduce the GinPostingListSegmentMaxSize constant from 256 to, say, 128. That would halve the typical size of a WAL record that appends to the end. However, it would not help with insertions in the middle of a posting list, only appends to the end, and it would bloat the pages somewhat, as you would waste more space on the posting list headers.
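To make option 1 a bit more concrete, here is a minimal sketch (Python, not PostgreSQL code) of the difference between logging everything after the unmodified prefix and logging only the bytes that actually changed. The functions make_tail_record, make_delta and apply_delta are hypothetical names, and plain byte strings stand in for the page contents; the real record layout is not modelled.

```python
def make_tail_record(old: bytes, new: bytes):
    """Coarse form: keep the unchanged prefix, log everything after it."""
    prefix = 0
    limit = min(len(old), len(new))
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    return prefix, new[prefix:]


def make_delta(old: bytes, new: bytes):
    """Finer form: also skip the unchanged suffix, log only the middle."""
    prefix, _ = make_tail_record(old, new)
    limit = min(len(old), len(new))
    suffix = 0
    # The suffix must not overlap the prefix in either version.
    while (suffix < limit - prefix
           and old[len(old) - 1 - suffix] == new[len(new) - 1 - suffix]):
        suffix += 1
    return prefix, suffix, new[prefix:len(new) - suffix]


def apply_delta(old: bytes, delta):
    """Replay side: rebuild the new contents from the old ones and the delta."""
    prefix, suffix, middle = delta
    return old[:prefix] + middle + old[len(old) - suffix:]


old_page = b"AAAABBBBCCCC"
new_page = b"AAAABBXBBCCCC"   # one byte inserted in the middle

tail_offset, tail = make_tail_record(old_page, new_page)
delta = make_delta(old_page, new_page)
rebuilt = apply_delta(old_page, delta)
```

For an insertion in the middle, the coarse form has to carry the whole tail of the data, while the finer form carries only the changed bytes. For a pure append the two are the same size, which is why this only pays off together with a smarter leafRepackItems.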


I'm leaning towards option 2. Alexander, what do you think?
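A minimal sketch of what option 2 means in practice (Python, not PostgreSQL code): the WAL record carries only the new items, and replay runs the same insert routine as the original insertion. Item pointers are simplified to plain integers, the varbyte gap encoding here is only a stand-in for the real posting list format, and encode, decode and insert_items are hypothetical names.

```python
def encode(items):
    """Varbyte-encode the gaps between sorted non-negative integers."""
    out = bytearray()
    prev = 0
    for item in items:
        gap = item - prev
        prev = item
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode(data):
    """Inverse of encode: rebuild the sorted integers from the gaps."""
    items, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += gap
            items.append(prev)
            gap, shift = 0, 0
    return items


def insert_items(segment, new_items):
    """Decode, merge in the new items, re-encode. Used by both sides."""
    merged = sorted(set(decode(segment)) | set(new_items))
    return encode(merged)


old_segment = encode([10, 20, 300, 70000])

# Primary: perform the insert and log only the new items.
wal_record = [25, 301]
new_segment = insert_items(old_segment, wal_record)

# Standby: replay repeats the same work from the same starting state.
replayed_segment = insert_items(old_segment, wal_record)
```

The record stays small no matter how large the segment is, at the price of decoding and re-encoding during replay. The two sides only end up byte-identical as long as they run the same insert_items logic, which is the minor-version caveat mentioned above.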

- Heikki

