The issue commonly referred to as 'btree index bloat' is better after
the release of 7.4 but still exists.  What follows is a description
of the problem (for the non-developers or those that have forgotten),
how it can occur in real tables, current work-arounds (REINDEX), and a
possible solution that doesn't hold long-term exclusive locks.  Skip to
the prototype solution section if you like.  I'd like some comments
from developers.

[Please CC me on any replies, I'll check the archives but I'm not
 currently subscribed to this list, thanks.]


Problem Description
===================
The bloating issue is that when tuples are deleted index pages may
become sparsely populated and under some circumstances this can affect
the performance of the index.  PostgreSQL-7.4 addressed part of this
problem by introducing the ability of VACUUM to identify and mark for
reuse completely free index pages.  Before 7.4 the possible size of an
index was completely unbounded; under extreme circumstances you could
generate an arbitrarily large index file with a very small set of rows;
now the worst you can possibly get is a single index key per index page,
8 KB of index per row (still very bad, but at least it's bounded now).
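As back-of-envelope arithmetic on that bound (an illustration only, using the standard 8 KB page size):

```python
# Back-of-envelope for the post-7.4 worst case described above: one index
# key per 8 KB page means the index is bounded by ~8 KB per live row.
PAGE_SIZE_KB = 8

def worst_case_index_kb(live_rows):
    """Upper bound on index size once wholly empty pages are recycled."""
    return live_rows * PAGE_SIZE_KB

# 100,000 live rows can still leave up to ~780 MB of index behind --
# bounded, but painful without some way to merge sparse pages.
```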

Deletes should not be a problem for an index of some arbitrary
general-use columns since future updates/inserts for index keys in the
same range can re-use the space in existing index pages.  Indeed, this
is why REINDEX and CREATE INDEX don't pack the index pages full to
start with, to avoid needing to split pages right away on a following
insert or update.  The problem arises if the insert and delete patterns
are such that this space is never reused.  It's exactly this pattern
that many of the tables in the schema I work with use, and the
resulting sparse btree indexes are causing us issues on production
databases.

The only current solution for this is that once the index becomes sparse
REINDEX is run; this locks readers and writers from accessing the table.
CREATE INDEX/DROP INDEX can be used to build a replacement index for
non-primary key indices instead with the benefit that this still allows
readers, but it still locks out writers.  With large tables either of
these operations can take hours which makes this undesirable (or in my
case unusable).

Real-Life Example
=================
We have tables that hold timestamped data.  The data is inserted
with the current time stamp such that this field is monotonically
incrementing.  Additionally, there are serial fields that also
monotonically increment.  The granularity of the data during insert is
high (e.g. 1 row per item every 5 minutes) but as the data ages the
granularity can be lower (e.g. after a week we only need 1 row per item
every hour; and after a month we only need 1 row per item every day;
etc).  The tables are designed such that the granularity can be changed
by simply deleting a subset of rows (e.g. delete 11 of the 12 rows
inserted for an item during an hour).  Old rows are never updated.

Keys for indices starting with the time stamp or id field always get
inserted into the right-most index page.  The current btree code splits
this page 70/30 (instead of the normal 50/50) so that after repeated
inserts the index pages are ~70% full of index keys (note that a
REINDEX fills index pages to 90%).  As the data ages we delete a
subset of the old data which makes those pages sparse (or, in some
cases, empty; the empty ones are reused).  Since no updates or inserts
occur on old time stamp or serial values these pages become and stay
quite sparse.  The attached example SQL script has a case where index
pages end up only ~20% full on average.
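The shape of the problem can be sketched numerically.  This is a toy model, not the real btree code: abstract pages holding up to 100 keys, monotonic inserts with rightmost splits that leave the left page 70% full, then thinning the aged data to 1 row in 12.  All numbers are illustrative.

```python
# Toy model of the insert/delete pattern described above (not backend code).
PAGE_CAPACITY = 100   # keys per page (abstract, not 8 KB)
LEFT_FILL = 70        # the 70 side of the 70/30 rightmost split

def monotonic_insert(n_keys):
    """Return per-page key counts after n_keys strictly increasing inserts."""
    pages = [0]
    for _ in range(n_keys):
        if pages[-1] == PAGE_CAPACITY:
            # rightmost split: left page keeps 70%, inserts continue right
            pages[-1] = LEFT_FILL
            pages.append(PAGE_CAPACITY - LEFT_FILL)
        pages[-1] += 1
    return pages

pages = monotonic_insert(12000)
# Every settled (non-rightmost) page is exactly 70% full, as described.
# Now age the data: keep only 1 row in 12 on the old pages.
old = [max(1, p // 12) for p in pages[:-1]]
avg_fill = sum(old) / len(old)
```

In this model the aged pages drop to 5 keys each; no later insert ever lands in their key range, so without some kind of merge that space is never reclaimed.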


Prototype Solution
==================
The solution I've toyed with to improve performance and make the space
re-usable but NOT lock out readers or writers for any length of time is
a naive btree index page merger.  This does the opposite of an index
page split.  The merger/compressor scans the pages of the index and
compares
each page with its right-page.  If the items on the two pages can fit
in one page at a capacity less than some target amount (e.g. 90%) the
items from the left-page are moved into the right-page, the parent-page
updated to reflect this, and then all three pages written out.
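As a sketch of the merge pass on abstract per-page key counts (the names and structure here are mine, not the prototype's; the real code must also update the parent page and handle WAL, which this model ignores):

```python
# Sketch of the naive pairwise merge described above, operating on a list
# of per-page key counts in left-to-right leaf order.
def merge_pass(fills, capacity=100, target=0.90):
    """Walk leaf pages left to right; whenever a page plus its right
    neighbour fits at or below the target capacity, move the left page's
    items into the right page.  Emptied pages are left at 0 for a later
    VACUUM to put on the free space map.  Returns pages freed."""
    freed = 0
    limit = capacity * target
    for i in range(len(fills) - 1):
        if fills[i] and fills[i] + fills[i + 1] <= limit:
            fills[i + 1] += fills[i]   # items move into the right page
            fills[i] = 0               # left page is now empty
            freed += 1
    return freed
```

On fills like `[30, 30, 30, 45, 48, 20]` this frees three pages, while the 45/48 pair stays untouched because together the two pages would exceed the 90% target -- exactly the limitation of the naive approach.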

The reason I call it naive is that it only considers pairs of pages,
not runs of several pages that could be merged (e.g. compress 3 pages
into 2).  The code locks the three pages in question but because
I wasn't sure if this would be safe in all cases it also grabs an
exclusive lock on the table and index during the scan.  However, unlike
REINDEX, the scan can be incremental, e.g. scan pages 1-1000 then
release the lock, then scan pages 1001-2000 and release the lock, etc.,
so that writers are only periodically and briefly locked out.

After this a VACUUM of the table will put the now empty index pages on
the FSM (Free Space Map) for later re-use.  (I.e. this doesn't make the
size of the index smaller yet; it just helps limit further growth and
increase performance.)


Attached is an SQL script that generates a small table with several
sparse btree indexes that I've used to test the effectiveness of my
btree compressor.  Comments in the script contain the numbers I've seen
and seem to indicate that my naive merge strategy works well.

Also attached is a prototype of the btree page compressor that can be
compiled as a shared object and added as a database function.  The
biggest issue with it is that as a loadable object it can't generate
WAL records (since it would need a new type of record); therefore if the
database crashes before the next checkpoint the index will likely be
corrupt (a REINDEX should fix this).  Also, since no WAL records are
written, this won't play nicely with WAL archiving for PITR in 8.0.  A
shell script would be used to call the function repeatedly to scan the
whole index a small chunk at a time (see README.btree_compress in the
attached tar file).
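The chunked driving loop can be sketched as follows.  I'm assuming, from the example call in the attached test script, that the last two arguments of btree_index_compress() are a start page and a page count; check README.btree_compress for the actual interface.

```python
# Hypothetical sketch of the chunked driver described above: emit one
# btree_index_compress() call per range of pages so the exclusive lock is
# only held briefly for each chunk.  The meaning of the last two arguments
# (start page, page count) is assumed from the example call in the test
# script, not confirmed against the prototype.
def compress_commands(index_name, total_pages, chunk=1000, target=0.9):
    """Return the SQL statements a wrapper script would feed to psql."""
    return [
        f"SELECT btree_index_compress("
        f"'{index_name}'::regclass::oid, {target}, {start}, {chunk});"
        for start in range(0, total_pages, chunk)
    ]

# e.g. for a 2,500 page index this yields three statements covering
# pages 0-999, 1000-1999, and 2000-2499.
```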

I'm looking for comments from the developers as to
* the desirability of doing such btree page merges,
* issues with the approach I'm using to implement it,
* if I'm doing the locking correctly, 
* if the exclusive lock can be removed so that it doesn't need to
  be incremental, 
* if a patch to do this in the back-end would be welcome,
etc.

My goal is to take any comments provided and produce a patch that
implements this feature in the back-end with a new WAL record so
that it's safe during a database crash and as an easy to use command
(e.g. VACUUM INDEX foo, or COMPRESS INDEX foo).  If the command still
needs to operate incrementally I'd like to do that all in the command
(like the way VACUUM and/or ANALYZE currently get and release locks
while running).

On a related note I don't see why a version of VACUUM FULL couldn't
be implemented to work incrementally with a series of shorter lived
exclusive locks as well.

-- 
Dave Chapeskie
OpenPGP Key ID: 0x3D2B6B34
-- Script to test btree bloating and compression

BEGIN;

CREATE TABLE test_btree (
        id      serial          PRIMARY KEY,
        ts      timestamptz     NOT NULL DEFAULT CURRENT_TIMESTAMP,
        txt     text
);
-- The primary key index is created when the table is defined, before
-- the data is added.  Create this ts index before the data gets added
-- too; for inserts that increase monotonically this should result
-- in btree pages ~70% full.
CREATE INDEX test_btree_ts_idx ON test_btree(ts);

-- generate_series() was added in PostgreSQL 8.0; it returns a set of
-- integers.  Use it to generate a suitable number of rows in increasing
-- ts ending 'now'.
-- If you don't have generate_series() you'll need to replace this
-- with another method of inserting the required number of rows.
\echo 'Inserting 8*32*1024 rows...'
INSERT INTO test_btree (ts)
    SELECT date_trunc('seconds',current_timestamp)
            - cast(generate_series||' seconds' as interval)
        FROM generate_series (8*32*1024,1,-1);

-- Create these indices after the data is added...
-- This is similar to what would be done by a REINDEX and
-- should result in btree pages ~90% full.
CREATE INDEX test_btree_id_idx2 ON test_btree(id);
CREATE INDEX test_btree_ts_idx2 ON test_btree(ts);

COMMIT;

-- Now we remove some fraction of 'old' rows simulating an expiration
-- policy where we only need to keep some shrinking fraction of rows
-- as they get older.
-- The specific values used here are bogus but illustrate the point.

\echo 'Leave 1/2 of the rows < 6*32*1024'
DELETE FROM test_btree
    WHERE id&1 != 0 AND id < 6*32*1024;

\echo 'Leave 1/4 of the rows < 5*32*1024'
DELETE FROM test_btree
    WHERE id&3 != 0 AND id < 5*32*1024;

\echo 'Leave 1/8 of the rows < 4*32*1024'
DELETE FROM test_btree
    WHERE id&7 != 0 AND id < 4*32*1024;

\echo 'Leave 1/16 of the rows < 3*32*1024'
DELETE FROM test_btree
    WHERE id&15 != 0 AND id < 3*32*1024;

\echo 'Leave 1/32 of the rows < 2*32*1024'
DELETE FROM test_btree
    WHERE id&31 != 0 AND id < 2*32*1024;

-- VACUUM FULL isn't required here.  In a real application, if a normal
-- VACUUM were to occur after each expiration run and expiration were
-- to happen regularly during the inserts, then the table would end up
-- approximately the same size (although with the rows in different
-- locations).
\echo 'Doing VACUUM FULL ...'
VACUUM FULL test_btree;

-- This is similar to what would be done by a REINDEX and
-- should result in btree pages ~90% full.
CREATE INDEX test_btree_id_idx3 ON test_btree(id);
CREATE INDEX test_btree_ts_idx3 ON test_btree(ts);

\echo 'Now compare the on disk sizes of the various test_btree indices...'
-- The  pre-insert indices should end up ~4.6 times as large as the last.
-- The post-insert indices should end up ~3.3 times as large as the last.
SELECT pg_class.relname, pg_class.relpages
    FROM pg_class
    WHERE pg_class.relname ~ '^test_btree'
    ORDER BY pg_class.relpages DESC;

-- At this point various in-place compression strategies can be tried
-- on the indices to see how close to the size of *_idx3 they can get
-- without locking the whole table for the duration of a reindex.


-- On PostgreSQL 8.0.1 on FreeBSD with a pagesize of 8k this is what
-- I get:
-- (relsize is just relpages*8KB, avg_page_cap is the average page
--  capacity as a percentage and was arrived at by using pg_filedump)
--
--       relname       | relpages |   relsize    | avg_page_cap
-- --------------------+----------+--------------+--------------
--  test_btree_ts_idx  |     1252 |    9.781 MB  |  20.11%
--  test_btree_pkey    |     1001 |    7.820 MB  |  20.03%
--  test_btree_ts_idx2 |      901 |    7.039 MB  |  27.59%
--  test_btree_id_idx2 |      720 |    5.625 MB  |  27.53%
--  test_btree         |      579 |    4.523 MB  |
--  test_btree_ts_idx3 |      271 |    2.117 MB  |  90.04%
--  test_btree_id_idx3 |      217 |    1.695 MB  |  90.09%

-- Running btree_index_compress('xx'::regclass::oid, 0.9, 0, 2000)
-- (i.e. with a target capacity <= 90%)
-- on each of the indices gives:
-- (free is the number of currently reusable pages reported by vacuum,
--  used is relpages-free, non_empty_avg is the average capacity of
--  just the non-empty pages)
--
--       relname       | relpages | free | used | non_empty_avg
-- --------------------+----------+------+------+---------------
--  test_btree_ts_idx  |     1252 |  886 |  366 |  66.68%
--  test_btree_pkey    |     1001 |  709 |  292 |  67.03%
--  test_btree_ts_idx2 |      900 |  319 |  581 |  52.32%
--  test_btree_id_idx2 |      719 |  432 |  287 |  49.07%
--  test_btree         |      579 |      |      |
--  test_btree_ts_idx3 |      271 |    0 |  271 | 
--  test_btree_id_idx3 |      217 |    0 |  217 |
--
-- In test_btree_id_idx2 (as an example) there are 347 pages between
-- 43% and 48% full, but btree_index_compress() doesn't try to merge
-- these as they are not adjacent to pages they can merge with while
-- staying below the 90% target.
-- The used page count in the resulting indices is ~1.35 times that
-- of a REINDEXed index.

-- Recreating the table and indices from scratch and then running
-- btree_index_compress('xx'::regclass::oid, 1, 0, 2000) on each gives:
-- (i.e. with a target capacity <= 100%)
-- (full is the number of pages > 95% full, e.g. likely to be split, bad)
--
--       relname       | relpages | free | used | full | non_empty_avg
-- --------------------+----------+------+------+------+---------------
--  test_btree_ts_idx  |     1251 |  912 |  339 |   80 |  71.93%
--  test_btree_pkey    |     1001 |  730 |  271 |   63 |  72.20%
--  test_btree_ts_idx2 |      901 |  627 |  274 |   12 |  88.83%
--  test_btree_id_idx2 |      720 |  501 |  219 |   11 |  89.31%
--  test_btree         |      579 |      |      |      |
--  test_btree_ts_idx3 |      271 |    0 |  271 |      |
--  test_btree_id_idx3 |      217 |    0 |  217 |      |
--
-- The used page count in the resulting indices is ~1.24 times that of
-- a REINDEXed index, and the number of nearly-full pages isn't too bad.

-- The above are small tables, where the time a REINDEX holds an
-- exclusive lock is short and the btree_index_compress() call does
-- the whole index in a single scan.  It's assumed that these numbers
-- would hold for larger tables, where btree_index_compress() could be
-- run repeatedly, doing say 1000 pages per run and thus only holding
-- the exclusive lock briefly, unlike a REINDEX.

Attachment: btree_compress.tar.gz
Description: PgSQL contrib module for proto-type btree compressor
