Re: [HACKERS] Index-only scans with btree_gist
On 03/29/2015 04:30 AM, Andreas Karlsson wrote: On 03/29/2015 03:25 AM, Andreas Karlsson wrote: On 03/28/2015 09:36 PM, Andreas Karlsson wrote: Thanks! Do you know if it is possible to add index-only scan support to range indexes? I have never looked at those and do not know if they are lossy. Seems like range types are not compressed at all so implementing index-only scans was trivial. A patch is attached. Noticed a couple of typos, so to avoid having them sneak into the code here is a version without them. Thanks, pushed. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans with btree_gist
On 03/28/2015 09:36 PM, Andreas Karlsson wrote: Thanks! Do you know if it is possible to add index-only scan support to range indexes? I have never looked at those and do not know if they are lossy. Seems like range types are not compressed at all so implementing index-only scans was trivial. A patch is attached. -- Andreas Karlsson diff --git a/src/backend/utils/adt/rangetypes_gist.c b/src/backend/utils/adt/rangetypes_gist.c index c1ff471..dea8f04 100644 --- a/src/backend/utils/adt/rangetypes_gist.c +++ b/src/backend/utils/adt/rangetypes_gist.c @@ -216,7 +216,7 @@ range_gist_union(PG_FUNCTION_ARGS) PG_RETURN_RANGE(result_range); } -/* compress, decompress are no-ops */ +/* compress, decompress, fecth are no-ops */ Datum range_gist_compress(PG_FUNCTION_ARGS) { @@ -233,6 +233,14 @@ range_gist_decompress(PG_FUNCTION_ARGS) PG_RETURN_POINTER(entry); } +Datum +range_gist_fetch(PG_FUNCTION_ARGS) +{ + GISTENTRY *entry = (GISTENTRY *) PG_GETARG_POINTER(0); + + PG_RETURN_POINTER(entry); +} + /* * GiST page split penalty function. * diff --git a/src/include/catalog/pg_amproc.h b/src/include/catalog/pg_amproc.h index 037684c..e80c897 100644 --- a/src/include/catalog/pg_amproc.h +++ b/src/include/catalog/pg_amproc.h @@ -235,6 +235,7 @@ DATA(insert ( 3919 3831 3831 4 3878 )); DATA(insert ( 3919 3831 3831 5 3879 )); DATA(insert ( 3919 3831 3831 6 3880 )); DATA(insert ( 3919 3831 3831 7 3881 )); +DATA(insert ( 3919 3831 3831 9 3996 )); /* gin */ diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index a96d369..3cd7851 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -4951,6 +4951,8 @@ DATA(insert OID = 3877 ( range_gist_compress PGNSP PGUID 12 1 0 0 0 f f f f t f DESCR(GiST support); DATA(insert OID = 3878 ( range_gist_decompress PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 2281 2281 _null_ _null_ _null_ _null_ range_gist_decompress _null_ _null_ _null_ )); DESCR(GiST support); +DATA(insert OID = 3996 ( range_gist_fetch PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 2281 2281 _null_ _null_ _null_ _null_ range_gist_fetch _null_ _null_ _null_ )); +DESCR(GiST support); DATA(insert OID = 3879 ( range_gist_penalty PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 2281 2281 2281 2281 _null_ _null_ _null_ _null_ range_gist_penalty _null_ _null_ _null_ )); DESCR(GiST support); DATA(insert OID = 3880 ( range_gist_picksplit PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 2281 2281 _null_ _null_ _null_ _null_ range_gist_picksplit _null_ _null_ _null_ )); diff --git a/src/include/utils/rangetypes.h b/src/include/utils/rangetypes.h index 173bf74..43c80f4 100644 --- a/src/include/utils/rangetypes.h +++ b/src/include/utils/rangetypes.h @@ -209,6 +209,7 @@ extern RangeType *make_empty_range(TypeCacheEntry *typcache); extern Datum range_gist_consistent(PG_FUNCTION_ARGS); extern Datum range_gist_compress(PG_FUNCTION_ARGS); extern Datum range_gist_decompress(PG_FUNCTION_ARGS); +extern Datum range_gist_fetch(PG_FUNCTION_ARGS); extern Datum range_gist_union(PG_FUNCTION_ARGS); extern Datum range_gist_penalty(PG_FUNCTION_ARGS); extern Datum range_gist_picksplit(PG_FUNCTION_ARGS); diff --git a/src/test/regress/expected/rangetypes.out b/src/test/regress/expected/rangetypes.out index 35d0dd3..8654e03 100644 --- a/src/test/regress/expected/rangetypes.out +++ b/src/test/regress/expected/rangetypes.out @@ -1072,6 +1072,25 @@ select count(*) from test_range_spgist where ir -|- int4range(100,500); 5 (1 row) +-- test index-only scans +explain (costs off) +select ir from test_range_spgist where ir -|- int4range(10,20) order by ir; + QUERY PLAN + + Sort + Sort Key: ir + - Index Only Scan using test_range_spgist_idx on test_range_spgist + Index Cond: (ir -|- '[10,20)'::int4range) +(4 rows) + +select ir from test_range_spgist where ir -|- int4range(10,20) order by ir; + ir + + [20,30) + [20,30) + [20,10020) +(3 rows) + RESET enable_seqscan; RESET enable_indexscan; RESET enable_bitmapscan; diff --git a/src/test/regress/sql/rangetypes.sql b/src/test/regress/sql/rangetypes.sql index aa026ca..af13352 100644 --- a/src/test/regress/sql/rangetypes.sql +++ b/src/test/regress/sql/rangetypes.sql @@ -286,6 +286,11 @@ select count(*) from test_range_spgist where ir int4range(100,500); select count(*) from test_range_spgist where ir int4range(100,500); select count(*) from test_range_spgist where ir -|- int4range(100,500); +-- test index-only scans +explain (costs off) +select ir from test_range_spgist where ir -|- int4range(10,20) order by ir; +select ir from test_range_spgist where ir -|- int4range(10,20) order by ir; + RESET enable_seqscan; RESET enable_indexscan; RESET enable_bitmapscan; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your
Re: [HACKERS] Index-only scans with btree_gist
On 03/29/2015 03:25 AM, Andreas Karlsson wrote: On 03/28/2015 09:36 PM, Andreas Karlsson wrote: Thanks! Do you know if it is possible to add index-only scan support to range indexes? I have never looked at those and do not know if they are lossy. Seems like range types are not compressed at all so implementing index-only scans was trivial. A patch is attached. Noticed a couple of typos, so to avoid having them sneak into the code here is a version without them. -- Andreas Karlsson diff --git a/src/backend/utils/adt/rangetypes_gist.c b/src/backend/utils/adt/rangetypes_gist.c index c1ff471..ef84121 100644 --- a/src/backend/utils/adt/rangetypes_gist.c +++ b/src/backend/utils/adt/rangetypes_gist.c @@ -216,7 +216,7 @@ range_gist_union(PG_FUNCTION_ARGS) PG_RETURN_RANGE(result_range); } -/* compress, decompress are no-ops */ +/* compress, decompress, fetch are no-ops */ Datum range_gist_compress(PG_FUNCTION_ARGS) { @@ -233,6 +233,14 @@ range_gist_decompress(PG_FUNCTION_ARGS) PG_RETURN_POINTER(entry); } +Datum +range_gist_fetch(PG_FUNCTION_ARGS) +{ + GISTENTRY *entry = (GISTENTRY *) PG_GETARG_POINTER(0); + + PG_RETURN_POINTER(entry); +} + /* * GiST page split penalty function. * diff --git a/src/include/catalog/pg_amproc.h b/src/include/catalog/pg_amproc.h index 037684c..8a43f64 100644 --- a/src/include/catalog/pg_amproc.h +++ b/src/include/catalog/pg_amproc.h @@ -235,6 +235,7 @@ DATA(insert ( 3919 3831 3831 4 3878 )); DATA(insert ( 3919 3831 3831 5 3879 )); DATA(insert ( 3919 3831 3831 6 3880 )); DATA(insert ( 3919 3831 3831 7 3881 )); +DATA(insert ( 3919 3831 3831 9 3996 )); /* gin */ diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index a96d369..3cd7851 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -4951,6 +4951,8 @@ DATA(insert OID = 3877 ( range_gist_compress PGNSP PGUID 12 1 0 0 0 f f f f t f DESCR(GiST support); DATA(insert OID = 3878 ( range_gist_decompress PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 2281 2281 _null_ _null_ _null_ _null_ range_gist_decompress _null_ _null_ _null_ )); DESCR(GiST support); +DATA(insert OID = 3996 ( range_gist_fetch PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 2281 2281 _null_ _null_ _null_ _null_ range_gist_fetch _null_ _null_ _null_ )); +DESCR(GiST support); DATA(insert OID = 3879 ( range_gist_penalty PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 2281 2281 2281 2281 _null_ _null_ _null_ _null_ range_gist_penalty _null_ _null_ _null_ )); DESCR(GiST support); DATA(insert OID = 3880 ( range_gist_picksplit PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 2281 2281 _null_ _null_ _null_ _null_ range_gist_picksplit _null_ _null_ _null_ )); diff --git a/src/include/utils/rangetypes.h b/src/include/utils/rangetypes.h index 173bf74..43c80f4 100644 --- a/src/include/utils/rangetypes.h +++ b/src/include/utils/rangetypes.h @@ -209,6 +209,7 @@ extern RangeType *make_empty_range(TypeCacheEntry *typcache); extern Datum range_gist_consistent(PG_FUNCTION_ARGS); extern Datum range_gist_compress(PG_FUNCTION_ARGS); extern Datum range_gist_decompress(PG_FUNCTION_ARGS); +extern Datum range_gist_fetch(PG_FUNCTION_ARGS); extern Datum range_gist_union(PG_FUNCTION_ARGS); extern Datum range_gist_penalty(PG_FUNCTION_ARGS); extern Datum range_gist_picksplit(PG_FUNCTION_ARGS); diff --git a/src/test/regress/expected/rangetypes.out b/src/test/regress/expected/rangetypes.out index 35d0dd3..8654e03 100644 --- a/src/test/regress/expected/rangetypes.out +++ b/src/test/regress/expected/rangetypes.out @@ -1072,6 +1072,25 @@ select count(*) from test_range_spgist where ir -|- int4range(100,500); 5 (1 row) +-- test index-only scans +explain (costs off) +select ir from test_range_spgist where ir -|- int4range(10,20) order by ir; + QUERY PLAN + + Sort + Sort Key: ir + - Index Only Scan using test_range_spgist_idx on test_range_spgist + Index Cond: (ir -|- '[10,20)'::int4range) +(4 rows) + +select ir from test_range_spgist where ir -|- int4range(10,20) order by ir; + ir + + [20,30) + [20,30) + [20,10020) +(3 rows) + RESET enable_seqscan; RESET enable_indexscan; RESET enable_bitmapscan; diff --git a/src/test/regress/sql/rangetypes.sql b/src/test/regress/sql/rangetypes.sql index aa026ca..af13352 100644 --- a/src/test/regress/sql/rangetypes.sql +++ b/src/test/regress/sql/rangetypes.sql @@ -286,6 +286,11 @@ select count(*) from test_range_spgist where ir int4range(100,500); select count(*) from test_range_spgist where ir int4range(100,500); select count(*) from test_range_spgist where ir -|- int4range(100,500); +-- test index-only scans +explain (costs off) +select ir from test_range_spgist where ir -|- int4range(10,20) order by ir; +select ir from test_range_spgist where ir -|- int4range(10,20) order by ir; + RESET
Re: [HACKERS] Index-only scans with btree_gist
On 03/28/2015 02:12 PM, Heikki Linnakangas wrote: Looks good to me. Committed, thanks. Thanks! Do you know if it is possible to add index-only scan support to range indexes? I have never looked at those and do not know if they are lossy. Andreas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans with btree_gist
On 03/28/2015 01:14 AM, Andreas Karlsson wrote: On 03/26/2015 10:31 PM, Heikki Linnakangas wrote: I've pushed Anastasia's patch to support index-only scans with GiST, and it's time to add opclasses support for all the opclasses that are not lossy. I think at least all the btree_gist opclasses need to be supported, it would be pretty surprising if they didn't support index-only scans, while some more complex opclasses did. Attached is a WIP patch for that. It covers all the varlen types that btree_gist supports, and int2, int4 and oid. The rest of the fixed-width types should be just a matter of copy-pasting. I'll continue adding those, but wanted to let people know I'm working on this. Would it also be worth doing the same for the inet_ops class for inet/cidr? I have hacked a quick WIP patch which I believe should work, but have not looked into the index only scan code enough to be sure. Looks good to me. Committed, thanks. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans with btree_gist
On 03/26/2015 10:31 PM, Heikki Linnakangas wrote: I've pushed Anastasia's patch to support index-only scans with GiST, and it's time to add opclasses support for all the opclasses that are not lossy. I think at least all the btree_gist opclasses need to be supported, it would be pretty surprising if they didn't support index-only scans, while some more complex opclasses did. Attached is a WIP patch for that. It covers all the varlen types that btree_gist supports, and int2, int4 and oid. The rest of the fixed-width types should be just a matter of copy-pasting. I'll continue adding those, but wanted to let people know I'm working on this. Would it also be worth doing the same for the inet_ops class for inet/cidr? I have hacked a quick WIP patch which I believe should work, but have not looked into the index only scan code enough to be sure. -- Andreas Karlsson diff --git a/src/backend/utils/adt/network_gist.c b/src/backend/utils/adt/network_gist.c index 14dd62b..0133032 100644 --- a/src/backend/utils/adt/network_gist.c +++ b/src/backend/utils/adt/network_gist.c @@ -105,6 +105,27 @@ typedef struct GistInetKey #define SET_GK_VARSIZE(dst) \ SET_VARSIZE_SHORT(dst, offsetof(GistInetKey, ipaddr) + gk_ip_addrsize(dst)) +Datum +inet_gist_fetch(PG_FUNCTION_ARGS) +{ + GISTENTRY *entry = (GISTENTRY *) PG_GETARG_POINTER(0); + GistInetKey *key = DatumGetInetKeyP(entry-key); + GISTENTRY *retval; + inet *dst; + + dst = (inet *) palloc0(sizeof(inet)); + + ip_family(dst) = gk_ip_family(key); + ip_bits(dst) = gk_ip_minbits(key); + memcpy(ip_addr(dst), gk_ip_addr(key), ip_addrsize(dst)); + SET_INET_VARSIZE(dst); + + retval = palloc(sizeof(GISTENTRY)); + gistentryinit(*retval, InetPGetDatum(dst), entry-rel, entry-page, + entry-offset, FALSE); + + PG_RETURN_POINTER(retval); +} /* * The GiST query consistency check diff --git a/src/include/catalog/pg_amproc.h b/src/include/catalog/pg_amproc.h index 612a9d2..3b87f90 100644 --- a/src/include/catalog/pg_amproc.h +++ b/src/include/catalog/pg_amproc.h @@ -411,6 +411,7 @@ DATA(insert ( 3550 869 869 4 3556 )); DATA(insert ( 3550 869 869 5 3557 )); DATA(insert ( 3550 869 869 6 3558 )); DATA(insert ( 3550 869 869 7 3559 )); +DATA(insert ( 3550 869 869 9 3573 )); /* sp-gist */ DATA(insert ( 3474 3831 3831 1 3469 )); diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index 77b7717..a96d369 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -2240,6 +2240,8 @@ DATA(insert OID = 3555 ( inet_gist_compress PGNSP PGUID 12 1 0 0 0 f f f f t f DESCR(GiST support); DATA(insert OID = 3556 ( inet_gist_decompress PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 2281 2281 _null_ _null_ _null_ _null_ inet_gist_decompress _null_ _null_ _null_ )); DESCR(GiST support); +DATA(insert OID = 3573 ( inet_gist_fetch PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 2281 2281 _null_ _null_ _null_ _null_ inet_gist_fetch _null_ _null_ _null_ )); +DESCR(GiST support); DATA(insert OID = 3557 ( inet_gist_penalty PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 2281 2281 2281 2281 _null_ _null_ _null_ _null_ inet_gist_penalty _null_ _null_ _null_ )); DESCR(GiST support); DATA(insert OID = 3558 ( inet_gist_picksplit PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 2281 2281 _null_ _null_ _null_ _null_ inet_gist_picksplit _null_ _null_ _null_ )); diff --git a/src/include/utils/inet.h b/src/include/utils/inet.h index 6694688..2d2bae4 100644 --- a/src/include/utils/inet.h +++ b/src/include/utils/inet.h @@ -123,6 +123,7 @@ extern int bitncommon(const unsigned char *l, const unsigned char *r, int n); /* * GiST support functions in network_gist.c */ +extern Datum inet_gist_fetch(PG_FUNCTION_ARGS); extern Datum inet_gist_consistent(PG_FUNCTION_ARGS); extern Datum inet_gist_union(PG_FUNCTION_ARGS); extern Datum inet_gist_compress(PG_FUNCTION_ARGS); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans for GiST.
On 03/02/2015 12:43 AM, Heikki Linnakangas wrote: On 02/27/2015 04:19 PM, Anastasia Lubennikova wrote: I add MemoryContext listCxt to avoid memory leak. listCxt is created once in gistrescan (only for index-only scan plan ) and reseted when scan of the leaf page is finished. I do not sure if the problem was completely solved, so I wait for feedback. Yeah, I think that solves it. On further testing, there was actually still a leak with kNN-searches. Fixed. I spent a little time cleaning this up. I reverted that pageData change so that it's an array again, put back the gist_indexonly.sql and expected output files that were missing from your latest version, removed a couple of unused local variables that gcc complained about. I refactored gistFetchTuple a bit, because it was doing IMHO quite bogus things with NULLs. It was passing NULLs to the opclass' fetch function, but it didn't pass the isNull flag correctly. I changed it so that the fetch function is not called at all for NULLs. I think this is pretty close to being committable. I'll make a round of editorializing over the docs, and the code comments as well. The opr_sanity regression test is failing, there's apparently something wrong with the pg_proc entries of the *canreturn functions. I haven't looked into that yet; could you fix that? I have pushed this, after fixing the opr_sanity failure, some bug fixes, and documentation and comment cleanup. Thanks for the patch! - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Index-only scans with btree_gist
I've pushed Anastasia's patch to support index-only scans with GiST, and it's time to add opclasses support for all the opclasses that are not lossy. I think at least all the btree_gist opclasses need to be supported, it would be pretty surprising if they didn't support index-only scans, while some more complex opclasses did. Attached is a WIP patch for that. It covers all the varlen types that btree_gist supports, and int2, int4 and oid. The rest of the fixed-width types should be just a matter of copy-pasting. I'll continue adding those, but wanted to let people know I'm working on this. - Heikki From d0a1cd0aff05ac3fdfc3d5cea4d3bc6c738ffc23 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas heikki.linnakangas@iki.fi Date: Thu, 26 Mar 2015 23:10:14 +0200 Subject: [PATCH 1/1] Add index-only scan support to btree_gist. WIP: not all datatypes are covered yet. --- contrib/btree_gist/Makefile |3 +- contrib/btree_gist/btree_gist--1.0--1.1.sql | 58 + contrib/btree_gist/btree_gist--1.0.sql | 1491 -- contrib/btree_gist/btree_gist--1.1.sql | 1522 +++ contrib/btree_gist/btree_gist.control |2 +- contrib/btree_gist/btree_int2.c |8 + contrib/btree_gist/btree_int4.c |8 + contrib/btree_gist/btree_oid.c |8 + contrib/btree_gist/btree_utils_num.c| 55 + contrib/btree_gist/btree_utils_num.h|1 + contrib/btree_gist/btree_utils_var.c| 18 + 11 files changed, 1681 insertions(+), 1493 deletions(-) create mode 100644 contrib/btree_gist/btree_gist--1.0--1.1.sql delete mode 100644 contrib/btree_gist/btree_gist--1.0.sql create mode 100644 contrib/btree_gist/btree_gist--1.1.sql diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile index 19a5929..9b7d61d 100644 --- a/contrib/btree_gist/Makefile +++ b/contrib/btree_gist/Makefile @@ -9,7 +9,8 @@ OBJS = btree_gist.o btree_utils_num.o btree_utils_var.o btree_int2.o \ btree_numeric.o $(WIN32RES) EXTENSION = btree_gist -DATA = btree_gist--1.0.sql btree_gist--unpackaged--1.0.sql +DATA = btree_gist--1.1.sql btree_gist--unpackaged--1.0.sql \ + btree_gist--1.0--1.1.sql PGFILEDESC = btree_gist - B-tree equivalent GIST operator classes REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp timestamptz \ diff --git a/contrib/btree_gist/btree_gist--1.0--1.1.sql b/contrib/btree_gist/btree_gist--1.0--1.1.sql new file mode 100644 index 000..6b6f496 --- /dev/null +++ b/contrib/btree_gist/btree_gist--1.0--1.1.sql @@ -0,0 +1,58 @@ +/* contrib/btree_gist/btree_gist--1.0--1.1.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use CREATE EXTENSION btree_gist FROM unpackaged to load this file. \quit + +-- Index-only scan support new in 9.5. +CREATE FUNCTION gbt_var_fetch(internal) +RETURNS internal +AS 'MODULE_PATHNAME' +LANGUAGE C IMMUTABLE STRICT; + +CREATE FUNCTION gbt_oid_fetch(internal) +RETURNS internal +AS 'MODULE_PATHNAME' +LANGUAGE C IMMUTABLE STRICT; + +CREATE FUNCTION gbt_int2_fetch(internal) +RETURNS internal +AS 'MODULE_PATHNAME' +LANGUAGE C IMMUTABLE STRICT; + +CREATE FUNCTION gbt_int4_fetch(internal) +RETURNS internal +AS 'MODULE_PATHNAME' +LANGUAGE C IMMUTABLE STRICT; + +ALTER OPERATOR FAMILY gist_oid_ops USING gist ADD + FUNCTION 9 (oid, oid) gbt_oid_fetch (internal) ; + +ALTER OPERATOR FAMILY gist_int2_ops USING gist ADD + FUNCTION 9 (int2, int2) gbt_int2_fetch (internal) ; + +ALTER OPERATOR FAMILY gist_int4_ops USING gist ADD + FUNCTION 9 (int4, int4) gbt_int4_fetch (internal) ; + +ALTER OPERATOR FAMILY gist_text_ops USING gist ADD + FUNCTION 9 (text, text) gbt_var_fetch (internal) ; + +ALTER OPERATOR FAMILY gist_bpchar_ops USING gist ADD + FUNCTION 9 (bpchar, bpchar) gbt_var_fetch (internal) ; + +ALTER OPERATOR FAMILY gist_bytea_ops USING gist ADD + FUNCTION 9 (bytea, bytea) gbt_var_fetch (internal) ; + +ALTER OPERATOR FAMILY gist_numeric_ops USING gist ADD + FUNCTION 9 (numeric, numeric) gbt_var_fetch (internal) ; + +ALTER OPERATOR FAMILY gist_bit_ops USING gist ADD + FUNCTION 9 (bit, bit) gbt_var_fetch (internal) ; + +ALTER OPERATOR FAMILY gist_vbit_ops USING gist ADD + FUNCTION 9 (varbit, varbit) gbt_var_fetch (internal) ; + +ALTER OPERATOR FAMILY gist_inet_ops USING gist ADD + FUNCTION 9 (inet, inet) gbt_var_fetch (internal) ; + +ALTER OPERATOR FAMILY gist_cidr_ops USING gist ADD + FUNCTION 9 (cidr, cidr) gbt_var_fetch (internal) ; diff --git a/contrib/btree_gist/btree_gist--1.0.sql b/contrib/btree_gist/btree_gist--1.0.sql deleted file mode 100644 index c5c9587..000 --- a/contrib/btree_gist/btree_gist--1.0.sql +++ /dev/null @@ -1,1491 +0,0 @@ -/* contrib/btree_gist/btree_gist--1.0.sql */ - --- complain if script is sourced in psql, rather than via CREATE EXTENSION -\echo Use CREATE EXTENSION btree_gist to load this file. \quit - -CREATE FUNCTION gbtreekey4_in(cstring) -RETURNS gbtreekey4 -AS
Re: [HACKERS] Index-only scans for GiST.
On 02/27/2015 04:19 PM, Anastasia Lubennikova wrote: I add MemoryContext listCxt to avoid memory leak. listCxt is created once in gistrescan (only for index-only scan plan ) and reseted when scan of the leaf page is finished. I do not sure if the problem was completely solved, so I wait for feedback. Yeah, I think that solves it. * What's the reason for turning GISTScanOpaqueData.pageData from an array to a List? This array is field of structure GISTScanOpaqueData. Memory for that structure allocated in function gistbeginscan(). The array is static so it's declared only one time in structure: GISTSearchHeapItem pageData [BLCKSZ/sizeof(IndexTupleData)] But how could we know size of array if we don't know what data would be returned? I mean type and amount. You're only adding a pointer to the IndexTuple to GISTSearchHeapItem. The GISTSearchHeapItem struct itself is still of constant size. I spent a little time cleaning this up. I reverted that pageData change so that it's an array again, put back the gist_indexonly.sql and expected output files that were missing from your latest version, removed a couple of unused local variables that gcc complained about. I refactored gistFetchTuple a bit, because it was doing IMHO quite bogus things with NULLs. It was passing NULLs to the opclass' fetch function, but it didn't pass the isNull flag correctly. I changed it so that the fetch function is not called at all for NULLs. I think this is pretty close to being committable. I'll make a round of editorializing over the docs, and the code comments as well. The opr_sanity regression test is failing, there's apparently something wrong with the pg_proc entries of the *canreturn functions. I haven't looked into that yet; could you fix that? - Heikki diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c index db2a452..53750da 100644 --- a/src/backend/access/gist/gist.c +++ b/src/backend/access/gist/gist.c @@ -1404,6 +1404,14 @@ initGISTstate(Relation index) else giststate-distanceFn[i].fn_oid = InvalidOid; + /* opclasses are not required to provide a Fetch method */ + if (OidIsValid(index_getprocid(index, i + 1, GIST_FETCH_PROC))) + fmgr_info_copy((giststate-fetchFn[i]), + index_getprocinfo(index, i + 1, GIST_FETCH_PROC), + scanCxt); + else + giststate-fetchFn[i].fn_oid = InvalidOid; + /* * If the index column has a specified collation, we should honor that * while doing comparisons. However, we may have a collatable storage @@ -1426,6 +1434,22 @@ initGISTstate(Relation index) return giststate; } +/* + * Gistcanreturn is supposed to be true if ANY FetchFn method is defined. + * If FetchFn exists it would be used in index-only scan + * Thus the responsibility rests with the opclass developer. + */ + +Datum +gistcanreturn(PG_FUNCTION_ARGS) { + Relation index = (Relation) PG_GETARG_POINTER(0); + int i = PG_GETARG_INT32(1); + if (OidIsValid(index_getprocid(index, i+1, GIST_FETCH_PROC))) + PG_RETURN_BOOL(true); + else + PG_RETURN_BOOL(false); +} + void freeGISTstate(GISTSTATE *giststate) { diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c index 717cb85..80532c8 100644 --- a/src/backend/access/gist/gistget.c +++ b/src/backend/access/gist/gistget.c @@ -229,6 +229,8 @@ gistindex_keytest(IndexScanDesc scan, * we're doing a plain or ordered indexscan. For a plain indexscan, heap * tuple TIDs are returned into so-pageData[]. For an ordered indexscan, * heap tuple TIDs are pushed into individual search queue items. + * If index-only scan is possible, heap tuples themselves are returned + * into so-pageData or into search queue. * * If we detect that the index page has split since we saw its downlink * in the parent, we push its new right sibling onto the queue so the @@ -241,6 +243,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances, GISTScanOpaque so = (GISTScanOpaque) scan-opaque; Buffer buffer; Page page; + GISTSTATE *giststate = so-giststate; + Relation r = scan-indexRelation; GISTPageOpaque opaque; OffsetNumber maxoff; OffsetNumber i; @@ -288,6 +292,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances, } so-nPageData = so-curPageData = 0; + if (so-pageDataCxt) + MemoryContextReset(so-pageDataCxt); /* * check all tuples on page @@ -326,10 +332,20 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances, else if (scan-numberOfOrderBys == 0 GistPageIsLeaf(page)) { /* - * Non-ordered scan, so report heap tuples in so-pageData[] + * Non-ordered scan, so report tuples in so-pageData[] */ so-pageData[so-nPageData].heapPtr = it-t_tid; so-pageData[so-nPageData].recheck = recheck; + + /* + * If index-only scan is possible add the data fetched from index. + */ + if (scan-xs_want_itup) + { +oldcxt =
Re: [HACKERS] Index-only scans for GiST.
I add MemoryContext listCxt to avoid memory leak. listCxt is created once in gistrescan (only for index-only scan plan ) and reseted when scan of the leaf page is finished. I do not sure if the problem was completely solved, so I wait for feedback. * What's the reason for turning GISTScanOpaqueData.pageData from an array to a List? This array is field of structure GISTScanOpaqueData. Memory for that structure allocated in function gistbeginscan(). The array is static so it's declared only one time in structure: GISTSearchHeapItem pageData [BLCKSZ/sizeof(IndexTupleData)] But how could we know size of array if we don't know what data would be returned? I mean type and amount. I asked Alexander about that and he offered me to use List instead of Array. indexonlyscan_gist_2.2.patch Description: Binary data indexonlyscan_gist_docs.patch Description: Binary data test_performance.sql Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans for GiST.
Thanks for answer. Now it seems to be applied correctly. 2015-02-12 3:12 GMT+04:00 Thom Brown t...@linux.com: On 11 February 2015 at 22:50, Anastasia Lubennikova lubennikov...@gmail.com wrote: Finally there is a new version of patch (in attachments). It provides multicolumn index-only scan for GiST indexes. - Memory leak is fixed. - little code cleanup - example of performance test in attachmens - function OIDs have debugging values (*) just to avoid merge conflicts while testing patch Wiki page of the project is https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014 Waiting for feedback. Hi Anastasia. Thanks for the updated patch. I've just tried applying it to head and it doesn't appear to apply cleanly. $ patch -p1 ~/Downloads/indexonlyscan_gist_2.0.patch (Stripping trailing CRs from patch; use --binary to disable.) patching file src/backend/access/gist/gist.c Hunk #1 succeeded at 1404 (offset 9 lines). Hunk #2 succeeded at 1434 (offset 9 lines). (Stripping trailing CRs from patch; use --binary to disable.) patching file src/backend/access/gist/gistget.c Hunk #1 succeeded at 227 (offset 1 line). Hunk #2 succeeded at 243 (offset 1 line). Hunk #3 succeeded at 293 (offset -4 lines). Hunk #4 succeeded at 330 (offset -4 lines). Hunk #5 succeeded at 365 (offset -5 lines). Hunk #6 succeeded at 444 (offset -27 lines). Hunk #7 succeeded at 474 (offset -27 lines). Hunk #8 FAILED at 518. Hunk #9 succeeded at 507 (offset -28 lines). Hunk #10 succeeded at 549 with fuzz 1 (offset -28 lines). Hunk #11 FAILED at 601. 2 out of 11 hunks FAILED -- saving rejects to file src/backend/access/gist/gistget.c.rej ... -- Thom -- Best regards, Lubennikova Anastasia diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c index db2a452..53750da 100644 --- a/src/backend/access/gist/gist.c +++ b/src/backend/access/gist/gist.c @@ -1404,6 +1404,14 @@ initGISTstate(Relation index) else giststate-distanceFn[i].fn_oid = InvalidOid; + /* opclasses are not required to provide a Fetch method */ + if (OidIsValid(index_getprocid(index, i + 1, GIST_FETCH_PROC))) + fmgr_info_copy((giststate-fetchFn[i]), + index_getprocinfo(index, i + 1, GIST_FETCH_PROC), + scanCxt); + else + giststate-fetchFn[i].fn_oid = InvalidOid; + /* * If the index column has a specified collation, we should honor that * while doing comparisons. However, we may have a collatable storage @@ -1426,6 +1434,22 @@ initGISTstate(Relation index) return giststate; } +/* + * Gistcanreturn is supposed to be true if ANY FetchFn method is defined. + * If FetchFn exists it would be used in index-only scan + * Thus the responsibility rests with the opclass developer. + */ + +Datum +gistcanreturn(PG_FUNCTION_ARGS) { + Relation index = (Relation) PG_GETARG_POINTER(0); + int i = PG_GETARG_INT32(1); + if (OidIsValid(index_getprocid(index, i+1, GIST_FETCH_PROC))) + PG_RETURN_BOOL(true); + else + PG_RETURN_BOOL(false); +} + void freeGISTstate(GISTSTATE *giststate) { diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c index 717cb85..0925e56 100644 --- a/src/backend/access/gist/gistget.c +++ b/src/backend/access/gist/gistget.c @@ -227,8 +227,10 @@ gistindex_keytest(IndexScanDesc scan, * If tbm/ntids aren't NULL, we are doing an amgetbitmap scan, and heap * tuples should be reported directly into the bitmap. If they are NULL, * we're doing a plain or ordered indexscan. For a plain indexscan, heap - * tuple TIDs are returned into so-pageData[]. For an ordered indexscan, + * tuple TIDs are returned into so-pageData. For an ordered indexscan, * heap tuple TIDs are pushed into individual search queue items. + * If index-only scan is possible, heap tuples themselves are returned + * into so-pageData or into search queue. * * If we detect that the index page has split since we saw its downlink * in the parent, we push its new right sibling onto the queue so the @@ -241,6 +243,10 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances, GISTScanOpaque so = (GISTScanOpaque) scan-opaque; Buffer buffer; Page page; + GISTSTATE *giststate = so-giststate; + Relation r = scan-indexRelation; + boolisnull[INDEX_MAX_KEYS]; + GISTSearchHeapItem *tmpListItem; GISTPageOpaque opaque; OffsetNumber maxoff; OffsetNumber i; @@ -287,8 +293,6 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances, MemoryContextSwitchTo(oldcxt); } - so-nPageData = so-curPageData = 0; - /* * check all tuples on page */ @@ -326,11 +330,20 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances, else if (scan-numberOfOrderBys == 0 GistPageIsLeaf(page)) { /* - * Non-ordered scan, so report heap tuples in so-pageData[] + * Non-ordered scan, so report tuples in so-pageData + */ + + /* form tmpListItem and
Re: [HACKERS] Index-only scans for GiST.
On 12 February 2015 at 10:40, Anastasia Lubennikova lubennikov...@gmail.com wrote: Thanks for answer. Now it seems to be applied correctly. (please avoid top-posting) Thanks for the updated patch. I can confirm that it now cleanly applies and compiles fine. I've run the tested in the SQL file you provided, and it's behaving as expected. Thom
Re: [HACKERS] Index-only scans for GiST.
On 12 February 2015 at 16:40, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/12/2015 12:40 PM, Anastasia Lubennikova wrote: Thanks for answer. Now it seems to be applied correctly. * Documentation is missing. Anastasia provided a documentation patch in the first email. Thom
Re: [HACKERS] Index-only scans for GiST.
On Thu, Feb 12, 2015 at 3:07 PM, Thom Brown t...@linux.com wrote: On 12 February 2015 at 16:40, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/12/2015 12:40 PM, Anastasia Lubennikova wrote: Thanks for answer. Now it seems to be applied correctly. * Documentation is missing. Anastasia provided a documentation patch in the first email. Yeah, but it's a good practice send all the patches together. Will make the life of the reviewers better ;-) Regards, -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog: http://fabriziomello.github.io Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello Github: http://github.com/fabriziomello
Re: [HACKERS] Index-only scans for GiST.
On 02/12/2015 12:40 PM, Anastasia Lubennikova wrote: Thanks for answer. Now it seems to be applied correctly. Thanks, it would be great to get this completed. This still leaks memory with the same test query as earlier. The leak seems to be into a different memory context now; it used to be to the GiST scan context, but now it's to the ExecutorState context. See attached patch which makes the leak evident. Looking at my previous comments from August: * What's the reason for turning GISTScanOpaqueData.pageData from an array to a List? * Documentation is missing. - Heikki diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c index 0925e56..3768c9c 100644 --- a/src/backend/access/gist/gistget.c +++ b/src/backend/access/gist/gistget.c @@ -526,6 +526,12 @@ gistgettuple(PG_FUNCTION_ARGS) * It's always head of so-pageData */ so-pageData = list_delete_first(so-pageData); +{ + static int lastreport = 0; + if ((lastreport++) % 1 == 0) + MemoryContextStats(CurrentMemoryContext); +} + PG_RETURN_BOOL(TRUE); } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans for GiST.
On 11 February 2015 at 22:50, Anastasia Lubennikova lubennikov...@gmail.com wrote: Finally there is a new version of patch (in attachments). It provides multicolumn index-only scan for GiST indexes. - Memory leak is fixed. - little code cleanup - example of performance test in attachmens - function OIDs have debugging values (*) just to avoid merge conflicts while testing patch Wiki page of the project is https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014 Waiting for feedback. Hi Anastasia. Thanks for the updated patch. I've just tried applying it to head and it doesn't appear to apply cleanly. $ patch -p1 ~/Downloads/indexonlyscan_gist_2.0.patch (Stripping trailing CRs from patch; use --binary to disable.) patching file src/backend/access/gist/gist.c Hunk #1 succeeded at 1404 (offset 9 lines). Hunk #2 succeeded at 1434 (offset 9 lines). (Stripping trailing CRs from patch; use --binary to disable.) patching file src/backend/access/gist/gistget.c Hunk #1 succeeded at 227 (offset 1 line). Hunk #2 succeeded at 243 (offset 1 line). Hunk #3 succeeded at 293 (offset -4 lines). Hunk #4 succeeded at 330 (offset -4 lines). Hunk #5 succeeded at 365 (offset -5 lines). Hunk #6 succeeded at 444 (offset -27 lines). Hunk #7 succeeded at 474 (offset -27 lines). Hunk #8 FAILED at 518. Hunk #9 succeeded at 507 (offset -28 lines). Hunk #10 succeeded at 549 with fuzz 1 (offset -28 lines). Hunk #11 FAILED at 601. 2 out of 11 hunks FAILED -- saving rejects to file src/backend/access/gist/gistget.c.rej ... -- Thom
[HACKERS] Index-only scans for GiST.
Finally there is a new version of patch (in attachments). It provides multicolumn index-only scan for GiST indexes. - Memory leak is fixed. - little code cleanup - example of performance test in attachmens - function OIDs have debugging values (*) just to avoid merge conflicts while testing patch Wiki page of the project is https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014 Waiting for feedback. -- Best regards, Lubennikova Anastasia indexonlyscan_gist_2.0.patch Description: Binary data test_performance.sql Description: Binary data indexonlyscan_gist_docs.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans for GIST
On 18 August 2014 09:05, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 08/17/2014 07:15 PM, Anastasia Lubennikova wrote: 2014-08-07 0:30 GMT+04:00 Heikki Linnakangas hlinnakan...@vmware.com: * I'm getting two regression failures with this (opr_sanity and join). opr_sanity failure is corrected. But there is remain question with join. I check the latest version of my github repo and there's no fail in join regression test All 145 tests passed. To tell the truth, I don't understand which changes could led to this failure. Could you show me regression diffs? Sure, here you go. It seems like a change in a plan. At a quick glance it seems harmless: the new plan is identical except that the left and right side of a join have been reversed. But I don't understand either why this patch would change that, so it needs to be investigated. * The regression test queries that use LIMIT are not guaranteed to always return the same rows, hence they're not very good regression test cases. I'd suggest using more restricting WHERE clauses, so that each query only returns a handful of rows. Thank you for comment, I rewrote wrong queries. But could you explain why LIMIT queries may return different results? Is it happens because of different query optimization? Imagine that you have a table with two rows, A and B. If you run a query like SELECT * FROM table LIMIT 1, the system can legally return either row A or B, because there's no ORDER BY. Now, admittedly you have a similar problem even without the LIMIT - the system can legally return the rows in either order - but it's less of an issue because at least you can quickly see from the diff that the result set is in fact the same, the rows are just in different order. You could fix that by adding an ORDER BY to all test queries, but we haven't done that in the regression suite because then we would not have any test coverage for cases where you don't have an ORDER BY. As a compromise, test cases are usually written without an ORDER BY, but if e.g. the buildfarm starts failing because of differences in the result set order across platforms, then we add an ORDER BY to make it stable. * I think it's leaking memory, in GIST scan context. I tested this with a variant of the regression tests: insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i, 0.05*i)), point(0.05*i, 0.05*i) FROM generate_series(0, 1000) as i; CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p); set enable_seqscan=off; set enable_bitmapscan=off; explain analyze select p from gist_tbl where p @ box(point(0,0), point(999,999)) and length(p::text) 10; while the final query runs, 'top' shows constantly increasing memory usage. I don't think it's memory leak. After some increasing, memory using remain the same. It works similar without using indexonlyscan. No, it's definitely a leak caused by the patch. Test with the attached patch, which prints out to the server log the amount of memory used by the GiST scan memory context every 1 rows. It clearly shows increasing memory usage all the way to the end of the query. It's cleaned up at the end of the query, but that's not good enough because for a large query you might accumulate gigabytes of leaked memory until the query has finished. If you (manually) apply the same patch to git master, you'll see that the memory usage stays consistent and small. Hi Anastasia, Do you have time to address the issues brought up in Heikki's review? It would be good if we could your work into PostgreSQL 9.5. Thanks Thom
Re: [HACKERS] Index-only scans for GIST
Updated patch * Compiler, merge and regression fails checked * Regression tests was impoved * GiST and amcanreturn docs updated -- Best regards, Lubennikova Anastasia indexonlyscan_gist2.patch Description: Binary data indexonlyscan_gist_docs.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans for GIST
On 08/17/2014 07:15 PM, Anastasia Lubennikova wrote: 2014-08-07 0:30 GMT+04:00 Heikki Linnakangas hlinnakan...@vmware.com: * I'm getting two regression failures with this (opr_sanity and join). opr_sanity failure is corrected. But there is remain question with join. I check the latest version of my github repo and there's no fail in join regression test All 145 tests passed. To tell the truth, I don't understand which changes could led to this failure. Could you show me regression diffs? Sure, here you go. It seems like a change in a plan. At a quick glance it seems harmless: the new plan is identical except that the left and right side of a join have been reversed. But I don't understand either why this patch would change that, so it needs to be investigated. * The regression test queries that use LIMIT are not guaranteed to always return the same rows, hence they're not very good regression test cases. I'd suggest using more restricting WHERE clauses, so that each query only returns a handful of rows. Thank you for comment, I rewrote wrong queries. But could you explain why LIMIT queries may return different results? Is it happens because of different query optimization? Imagine that you have a table with two rows, A and B. If you run a query like SELECT * FROM table LIMIT 1, the system can legally return either row A or B, because there's no ORDER BY. Now, admittedly you have a similar problem even without the LIMIT - the system can legally return the rows in either order - but it's less of an issue because at least you can quickly see from the diff that the result set is in fact the same, the rows are just in different order. You could fix that by adding an ORDER BY to all test queries, but we haven't done that in the regression suite because then we would not have any test coverage for cases where you don't have an ORDER BY. As a compromise, test cases are usually written without an ORDER BY, but if e.g. the buildfarm starts failing because of differences in the result set order across platforms, then we add an ORDER BY to make it stable. * I think it's leaking memory, in GIST scan context. I tested this with a variant of the regression tests: insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i, 0.05*i)), point(0.05*i, 0.05*i) FROM generate_series(0, 1000) as i; CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p); set enable_seqscan=off; set enable_bitmapscan=off; explain analyze select p from gist_tbl where p @ box(point(0,0), point(999,999)) and length(p::text) 10; while the final query runs, 'top' shows constantly increasing memory usage. I don't think it's memory leak. After some increasing, memory using remain the same. It works similar without using indexonlyscan. No, it's definitely a leak caused by the patch. Test with the attached patch, which prints out to the server log the amount of memory used by the GiST scan memory context every 1 rows. It clearly shows increasing memory usage all the way to the end of the query. It's cleaned up at the end of the query, but that's not good enough because for a large query you might accumulate gigabytes of leaked memory until the query has finished. If you (manually) apply the same patch to git master, you'll see that the memory usage stays consistent and small. - Heikki *** /home/heikki/git-sandbox/postgresql/src/test/regress/expected/join.out 2014-08-18 09:38:55.146171394 +0300 --- /home/heikki/git-sandbox/postgresql/src/test/regress/results/join.out 2014-08-18 10:28:30.326491898 +0300 *** *** 2579,2590 --- Sort Sort Key: t1.q1, t1.q2 !- Hash Left Join ! Hash Cond: (t1.q2 = t2.q1) Filter: (1 = (SubPlan 1)) ! - Seq Scan on int8_tbl t1 - Hash !- Seq Scan on int8_tbl t2 SubPlan 1 - Limit - Result --- 2579,2590 --- Sort Sort Key: t1.q1, t1.q2 !- Hash Right Join ! Hash Cond: (t2.q1 = t1.q2) Filter: (1 = (SubPlan 1)) ! - Seq Scan on int8_tbl t2 - Hash !- Seq Scan on int8_tbl t1 SubPlan 1 - Limit - Result *** *** 3589,3603 lateral (values(x.q1,y.q1,y.q2)) v(xq1,yq1,yq2); q1|q2 |q1|q2 | xq1| yq1|yq2 --+---+--+---+--+--+--- - 123 | 456 | | | 123 | | - 123 | 4567890123456789 | 4567890123456789
Re: [HACKERS] Index-only scans for GIST
2014-08-07 0:30 GMT+04:00 Heikki Linnakangas hlinnakan...@vmware.com: * I'm getting two regression failures with this (opr_sanity and join). opr_sanity failure is corrected. But there is remain question with join. I check the latest version of my github repo and there's no fail in join regression test All 145 tests passed. To tell the truth, I don't understand which changes could led to this failure. Could you show me regression diffs? I want to understand what's wrong with the patch. * The regression test queries that use LIMIT are not guaranteed to always return the same rows, hence they're not very good regression test cases. I'd suggest using more restricting WHERE clauses, so that each query only returns a handful of rows. Thank you for comment, I rewrote wrong queries. But could you explain why LIMIT queries may return different results? Is it happens because of different query optimization? * I think it's leaking memory, in GIST scan context. I tested this with a variant of the regression tests: insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i, 0.05*i)), point(0.05*i, 0.05*i) FROM generate_series(0, 1000) as i; CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p); set enable_seqscan=off; set enable_bitmapscan=off; explain analyze select p from gist_tbl where p @ box(point(0,0), point(999,999)) and length(p::text) 10; while the final query runs, 'top' shows constantly increasing memory usage. I don't think it's memory leak. After some increasing, memory using remain the same. It works similar without using indexonlyscan. -- Best regards, Lubennikova Anastasia
Re: [HACKERS] Index-only scans for GIST
On 08/01/2014 10:58 AM, Anastasia Lubennikova wrote: Hi, hackers! I work on a GSoC project Index-only scans for GIST https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014 Repository is https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments. Thanks! Some comments: * I got this compiler warning: gistget.c:556:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement] ListCell *tmpPageData = so-curPageData; ^ * I'm getting two regression failures with this (opr_sanity and join). * After merging with master, build fails because of duplicate OIDs. * The regression test queries that use LIMIT are not guaranteed to always return the same rows, hence they're not very good regression test cases. I'd suggest using more restricting WHERE clauses, so that each query only returns a handful of rows. * What's the reason for turning GISTScanOpaqueData.pageData from an array to a List? * I think it's leaking memory, in GIST scan context. I tested this with a variant of the regression tests: insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i, 0.05*i)), point(0.05*i, 0.05*i) FROM generate_series(0, 1000) as i; CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p); set enable_seqscan=off; set enable_bitmapscan=off; explain analyze select p from gist_tbl where p @ box(point(0,0), point(999,999)) and length(p::text) 10; while the final query runs, 'top' shows constantly increasing memory usage. It includes index-only scans for multicolumn GIST and new regression test. Fetch() method is realized for box and point opclasses. Can we have Fetch functions for all the datatypes in btree_gist contrib module, please? Do other contrib modules contain GiST opclasses that could have Fetch functions? Documentation is not updated yet, but I'm going to do it till the end of GSoC. I've got one question about query with OR condition. It is the last query in regression test gist_indexonly. It doesn't fail but it doensn't use index-only scans. Could someone explain to me how it works? It seems to depend on build_paths_for_OR http://doxygen.postgresql.org/indxpath_8c.html#ae660d2e886355e53ed3b9ec693e4afd2 function. But I couldn't understand how. The query is: select * from gist_tbl where b @ box(point(5,5), point(6,6)) or p @ box(point(0,0), point(100,100)) limit 10; It cannot use an index(-only) scan for this, because a single index scan can only return rows based on one key. In this case, you need to do two scans, and then return the rows returned by either scan, removing duplicates. A bitmap scan is possible, because it can remove the duplicates, but the planner can't produce a plain index scan plan that would do the same. A common trick when that happens in a real-world application is to re-write the query using UNION: select * from gist_tbl where b @ box(point(5,5), point(6,6)) UNION select * from gist_tbl where p @ box(point(0,0), point(100,100)) limit 10; Although that doesn't seem to actually work: ERROR: could not identify an equality operator for type box LINE 1: select * from gist_tbl ^ but that's not your patch's fault, the same happens with unpatched master. IOW, you don't need to worry about that case. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Index-only scans for GIST
Hi, hackers! I work on a GSoC project Index-only scans for GIST https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014 Repository is https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments. It includes index-only scans for multicolumn GIST and new regression test. Fetch() method is realized for box and point opclasses. Documentation is not updated yet, but I'm going to do it till the end of GSoC. I've got one question about query with OR condition. It is the last query in regression test gist_indexonly. It doesn't fail but it doensn't use index-only scans. Could someone explain to me how it works? It seems to depend on build_paths_for_OR http://doxygen.postgresql.org/indxpath_8c.html#ae660d2e886355e53ed3b9ec693e4afd2 function. But I couldn't understand how. -- Best regards, Lubennikova Anastasia indexonlyscan_gist.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans for GIST
On Fri, Aug 1, 2014 at 4:58 AM, Anastasia Lubennikova lubennikov...@gmail.com wrote: Hi, hackers! I work on a GSoC project Index-only scans for GIST https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014 Repository is https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments. It includes index-only scans for multicolumn GIST and new regression test. Fetch() method is realized for box and point opclasses. Documentation is not updated yet, but I'm going to do it till the end of GSoC. I've got one question about query with OR condition. It is the last query in regression test gist_indexonly. It doesn't fail but it doensn't use index-only scans. Could someone explain to me how it works? It seems to depend on build_paths_for_OR function. But I couldn't understand how. Very nice... please add your patch to the next commit fest [1]. Regards, [1] https://commitfest.postgresql.org/action/commitfest_view?id=23 -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog sobre TI: http://fabriziomello.blogspot.com Perfil Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello
Re: [HACKERS] Index-only scans for GIST
Thank you for comment Patch is already added in Performance topic. 2014-08-01 20:25 GMT+04:00 Fabrízio de Royes Mello fabriziome...@gmail.com : On Fri, Aug 1, 2014 at 4:58 AM, Anastasia Lubennikova lubennikov...@gmail.com wrote: Hi, hackers! I work on a GSoC project Index-only scans for GIST https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014 Repository is https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments. It includes index-only scans for multicolumn GIST and new regression test. Fetch() method is realized for box and point opclasses. Documentation is not updated yet, but I'm going to do it till the end of GSoC. I've got one question about query with OR condition. It is the last query in regression test gist_indexonly. It doesn't fail but it doensn't use index-only scans. Could someone explain to me how it works? It seems to depend on build_paths_for_OR function. But I couldn't understand how. Very nice... please add your patch to the next commit fest [1]. Regards, [1] https://commitfest.postgresql.org/action/commitfest_view?id=23 -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog sobre TI: http://fabriziomello.blogspot.com Perfil Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello -- Best regards, Lubennikova Anastasia
Re: [HACKERS] Index-only scans for multicolumn GIST
That seems like a nonstarter :-(. Index-only scans don't have a license to return approximations. If we drop the behavior for circles, how much functionality do we have left? It should work with exact operator classes, box_ops, point_ops, range_ops, inet_ops. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans for multicolumn GIST
On 07/21/2014 10:47 PM, Anastasia Lubennikova wrote: Hi, hackers! There are new results of my work on GSoC project Index-only scans for GIST. Previous post is here: http://postgresql.1045698.n5.nabble.com/Index-only-scans-for-GIST-td5804892.html Repository is https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments. It includes indexonlyscan for multicolumn GiST. It works correctly - results are the same with another scan plans. Fetch() method is realized for box and circle opclasses Improvement for circle opclass is not such distinct as for box opclass, because of recheck. For a circle, the GiST index stores a bounding box of the circle. The new fetch function reverses that, calculating the radius and center of the circle from the bounding box. Those conversions lose some precision due to rounding. Are we okay with that? Floating point math is always subject to rounding errors, but there's a good argument to be made that it's not acceptable to get a different value back when the user didn't explicitly invoke any math functions. As an example: create table circle_tbl (c circle); create index circle_tbl_idx on circle_tbl using gist (c); insert into circle_tbl values ('1.23456789012345,1.23456789012345,1e300'); postgres=# set enable_seqscan=off; set enable_bitmapscan=off; set enable_indexonlyscan=on; SET SET SET postgres=# select * from circle_tbl ; c (0,0),1e+300 (1 row) postgres=# set enable_indexonlyscan=off; SET postgres=# select * from circle_tbl ; c -- (1.23456789012345,1.23456789012345),1e+300 (1 row) - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans for multicolumn GIST
On Tue, Jul 22, 2014 at 2:55 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 07/21/2014 10:47 PM, Anastasia Lubennikova wrote: Hi, hackers! There are new results of my work on GSoC project Index-only scans for GIST. Previous post is here: http://postgresql.1045698.n5.nabble.com/Index-only-scans-for-GIST-td5804892.html Repository is https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments. It includes indexonlyscan for multicolumn GiST. It works correctly - results are the same with another scan plans. Fetch() method is realized for box and circle opclasses Improvement for circle opclass is not such distinct as for box opclass, because of recheck. For a circle, the GiST index stores a bounding box of the circle. The new fetch function reverses that, calculating the radius and center of the circle from the bounding box. Those conversions lose some precision due to rounding. Are we okay with that? Floating point math is always subject to rounding errors, but there's a good argument to be made that it's not acceptable to get a different value back when the user didn't explicitly invoke any math functions. As an example: create table circle_tbl (c circle); create index circle_tbl_idx on circle_tbl using gist (c); insert into circle_tbl values ('1.23456789012345,1.23456789012345,1e300'); postgres=# set enable_seqscan=off; set enable_bitmapscan=off; set enable_indexonlyscan=on; SET SET SET postgres=# select * from circle_tbl ; c (0,0),1e+300 (1 row) postgres=# set enable_indexonlyscan=off; SET postgres=# select * from circle_tbl ; c -- (1.23456789012345,1.23456789012345),1e+300 (1 row) I really don't think it's ever OK for a query to produce different answers depending on which plan is chosen. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans for multicolumn GIST
Heikki Linnakangas hlinnakan...@vmware.com writes: For a circle, the GiST index stores a bounding box of the circle. The new fetch function reverses that, calculating the radius and center of the circle from the bounding box. Those conversions lose some precision due to rounding. Are we okay with that? That seems like a nonstarter :-(. Index-only scans don't have a license to return approximations. If we drop the behavior for circles, how much functionality do we have left? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Index-only scans for multicolumn GIST
Hi, hackers! There are new results of my work on GSoC project Index-only scans for GIST. Previous post is here: http://postgresql.1045698.n5.nabble.com/Index-only-scans-for-GIST-td5804892.html Repository is https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments. It includes indexonlyscan for multicolumn GiST. It works correctly - results are the same with another scan plans. Fetch() method is realized for box and circle opclasses Improvement for circle opclass is not such distinct as for box opclass, because of recheck. I remember that all elog and other bad comments must be fixed before this patch can be committed. Any comments are welcome -- Best regards, Lubennikova Anastasia indexonlygist_2.1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans and non-MVCC snapshots
On 2014-06-26 22:47:47 -0600, Ryan Johnson wrote: Hi, As part of a research project, I'm trying to change Read Committed isolation to use HeapTupleSatisfiesNow rather than acquiring a new snapshot at every command [1]. Things appear to have gone reasonably well so far, except certain queries fail with ERROR: non-MVCC snapshots are not supported in index-only scans. You're aware that unless you employ additional locking you can simply miss individual rows or see them several times because of concurrent updates? The reason it has worked ( 9.4) for system catalogs is that updates of rows were only performed while objects were locked access exclusively - that's the reason why some places in the code use inplace updates btw... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans and non-MVCC snapshots
Ryan Johnson wrote: On 26/06/2014 11:04 PM, Alvaro Herrera wrote: Ryan Johnson wrote: As part of a research project, I'm trying to change Read Committed isolation to use HeapTupleSatisfiesNow rather than acquiring a new snapshot at every command [1]. Are you aware of this? commit 813fb0315587d32e3b77af1051a0ef517d187763 Author: Robert Haas rh...@postgresql.org Date: Thu Aug 1 10:46:19 2013 -0400 Remove SnapshotNow and HeapTupleSatisfiesNow. That would be wonderful news... if snapshots weren't so darned expensive to create. I take it you aren't aware of this other effort, either: http://archives.postgresql.org/message-id/539ad153.9000...@vmware.com -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans and non-MVCC snapshots
On 27/06/2014 3:14 AM, Andres Freund wrote: On 2014-06-26 22:47:47 -0600, Ryan Johnson wrote: Hi, As part of a research project, I'm trying to change Read Committed isolation to use HeapTupleSatisfiesNow rather than acquiring a new snapshot at every command [1]. Things appear to have gone reasonably well so far, except certain queries fail with ERROR: non-MVCC snapshots are not supported in index-only scans. You're aware that unless you employ additional locking you can simply miss individual rows or see them several times because of concurrent updates? The reason it has worked ( 9.4) for system catalogs is that updates of rows were only performed while objects were locked access exclusively - that's the reason why some places in the code use inplace updates btw... Yes, I was aware of the need for locking. The documentation just made it sound that locking was already in place for non-MVCC index scans. I was hoping I'd missed some easy way to activate it. Regards, Ryan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans and non-MVCC snapshots
On 2014-06-27 08:39:13 -0600, Ryan Johnson wrote: On 27/06/2014 3:14 AM, Andres Freund wrote: On 2014-06-26 22:47:47 -0600, Ryan Johnson wrote: Hi, As part of a research project, I'm trying to change Read Committed isolation to use HeapTupleSatisfiesNow rather than acquiring a new snapshot at every command [1]. Things appear to have gone reasonably well so far, except certain queries fail with ERROR: non-MVCC snapshots are not supported in index-only scans. You're aware that unless you employ additional locking you can simply miss individual rows or see them several times because of concurrent updates? The reason it has worked ( 9.4) for system catalogs is that updates of rows were only performed while objects were locked access exclusively - that's the reason why some places in the code use inplace updates btw... Yes, I was aware of the need for locking. The documentation just made it sound that locking was already in place for non-MVCC index scans. I was hoping I'd missed some easy way to activate it. Well, it is/was for the places (i.e. DDL) that actually use non-MVCC scans. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans and non-MVCC snapshots
On 27/06/2014 8:20 AM, Alvaro Herrera wrote: Ryan Johnson wrote: On 26/06/2014 11:04 PM, Alvaro Herrera wrote: Ryan Johnson wrote: As part of a research project, I'm trying to change Read Committed isolation to use HeapTupleSatisfiesNow rather than acquiring a new snapshot at every command [1]. Are you aware of this? commit 813fb0315587d32e3b77af1051a0ef517d187763 Author: Robert Haas rh...@postgresql.org Date: Thu Aug 1 10:46:19 2013 -0400 Remove SnapshotNow and HeapTupleSatisfiesNow. That would be wonderful news... if snapshots weren't so darned expensive to create. I take it you aren't aware of this other effort, either: http://archives.postgresql.org/message-id/539ad153.9000...@vmware.com That is good news, though from reading the thread it sounds like proc array accesses are being exchanged for accesses to an SLRU, so a lot of lwlock calls will remain. It will definitely help, though. SLRU will get ex-locked a lot less often, so the main source of contention will be for the actual lwlock acquire/release operations. Regards, Ryan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Index-only scans and non-MVCC snapshots
Hi, As part of a research project, I'm trying to change Read Committed isolation to use HeapTupleSatisfiesNow rather than acquiring a new snapshot at every command [1]. Things appear to have gone reasonably well so far, except certain queries fail with ERROR: non-MVCC snapshots are not supported in index-only scans. I'm using v9.3.2, and the docs claim that index-only scans work without MVCC, but require some extra locking to avoid races [2]. Is this not actually implemented? If that is the case, shouldn't the query optimizer avoid selecting index-only scans for non-MVCC snapshots? I realize I'm playing with fire here, but any pointers to sections of code I might look at to either work around or fix this issue would be greatly appreciated. I've been looking around in index_fetch_heap (indexam.c) as well as other locations that use scan-xs_continue_hot; there seems to be code in place to detect when a non-MVCC snapshot is in use, as if that were nothing out of the ordinary, but nothing prevents the error from arising if a hot chain is actually encountered. Thanks, Ryan [1] Right now, Read Committed is significantly *slower* than Repeatable Read---for transactions involving multiple short commands---because the former acquires multiple snapshots per transaction and causes a lwlock bottleneck on my 12-core machine. [2] http://www.postgresql.org/docs/9.3/static/index-locking.html: with a non-MVCC-compliant snapshot (such as SnapshotNow), it would be possible to accept and return a row that does not in fact match the scan keys ... [so] we use a pin on an index page as a proxy to indicate that the reader might still be in flight from the index entry to the matching heap entry. Making ambulkdelete block on such a pin ensures that VACUUM cannot delete the heap entry before the reader is done with it. ... This solution requires that index scans be synchronous: we have to fetch each heap tuple immediately after scanning the corresponding index entry. This is expensive for a number of reasons. An asynchronous scan in which we collect many TIDs from the index, and only visit the heap tuples sometime later, requires much less index locking overhead and can allow a more efficient heap access pattern. Per the above analysis, we must use the synchronous approach for non-MVCC-compliant snapshots. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans and non-MVCC snapshots
Ryan Johnson wrote: As part of a research project, I'm trying to change Read Committed isolation to use HeapTupleSatisfiesNow rather than acquiring a new snapshot at every command [1]. Are you aware of this? commit 813fb0315587d32e3b77af1051a0ef517d187763 Author: Robert Haas rh...@postgresql.org Date: Thu Aug 1 10:46:19 2013 -0400 Remove SnapshotNow and HeapTupleSatisfiesNow. We now use MVCC catalog scans, and, per discussion, have eliminated all other remaining uses of SnapshotNow, so that we can now get rid of it. This will break third-party code which is still using it, which is intentional, as we want such code to be updated to do things the new way. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans and non-MVCC snapshots
On 26/06/2014 11:04 PM, Alvaro Herrera wrote: Ryan Johnson wrote: As part of a research project, I'm trying to change Read Committed isolation to use HeapTupleSatisfiesNow rather than acquiring a new snapshot at every command [1]. Are you aware of this? commit 813fb0315587d32e3b77af1051a0ef517d187763 Author: Robert Haas rh...@postgresql.org Date: Thu Aug 1 10:46:19 2013 -0400 Remove SnapshotNow and HeapTupleSatisfiesNow. That would be wonderful news... if snapshots weren't so darned expensive to create. I guess there's no avoiding that bottleneck now, though. Ryan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans for GIST
On Sun, May 25, 2014 at 6:12 AM, Anastasia Lubennikova lubennikov...@gmail.com wrote: Hi, hackers! There are first results of my work on GSoC project Index-only scans for GIST. Cool. 1. Version of my project code is in forked repository https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments - This version is only for one-column indexes That's probably a limitation that needs to be fixed before this can be committed. - fetch() method is realized only for box opclass (because it's trivial) That might not need to be fixed before this can be committed. 2. I test Index-only scans with SQL script box_test.sql and it works really faster. (results in box_test_out) I'll be glad to get your feedback about this feature. Since this is a GSoC project, it would be nice if one of the people who is knowledgeable about GIST (Heikki, Alexander, etc.) could weigh in on this before too much time goes by, so that Anastasia can press forward with this work. I don't know enough to offer too many substantive comments, but I think you should remove all of the //-style comments (most of which are debugging leftovers) and add some more comments describing what you're actually doing and, more importantly, why. This comment doesn't appear to make sense: + /* +* The offset number on tuples on internal pages is unused. For historical +* reasons, it is set 0x. +*/ The reason this doesn't make sense is because the tuple in question is not on an internal page, or indeed any page at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Index-only scans for GIST
Hi, hackers! There are first results of my work on GSoC project Index-only scans for GIST. 1. Version of my project code is in forked repository https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments - This version is only for one-column indexes - fetch() method is realized only for box opclass (because it's trivial) 2. I test Index-only scans with SQL script box_test.sql and it works really faster. (results in box_test_out) I'll be glad to get your feedback about this feature. -- Best regards, Lubennikova Anastasia indexonlygist_2.0.patch Description: Binary data box_test.sql Description: Binary data box_test_out.out Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index only scans wiki page
On 13 November 2012 01:25, Peter Geoghegan pe...@2ndquadrant.com wrote: Attached is a piece I wrote on the feature. That might form the basis of a new wiki page. Since no one else moved on this, I've completely replaced the existing page with the contents of the user-orientated document I posted. I don't believe that any of the information that has been removed was of general interest, since it only existed as a developer-orientated page. If anyone thinks that I shouldn't have removed everything that was there, or indeed who would like to change what I've added, well, it's a wiki. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index only scans wiki page
On Mon, Nov 12, 2012 at 8:25 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: On 13 November 2012 01:03, Jeff Janes jeff.ja...@gmail.com wrote: https://wiki.postgresql.org/wiki/Index-only_scans This page is seriously out of date. It suggests they are not yet implemented, but only being talked about. Attached is a piece I wrote on the feature. That might form the basis of a new wiki page. Feel free to incorporate this material as you see fit. I found this an interesting read. As one of the people who worked on the feature, I'm sort of curious whether people have any experience yet with how this actually shakes out in the field. Are you (or is anyone) aware of positive/negative field experiences with this feature? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index only scans wiki page
On 13 November 2012 16:37, Robert Haas robertmh...@gmail.com wrote: I found this an interesting read. As one of the people who worked on the feature, I'm sort of curious whether people have any experience yet with how this actually shakes out in the field. Are you (or is anyone) aware of positive/negative field experiences with this feature? Unfortunately, I don't think that I have any original insight about the problems with index-only scans in the field right now. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index only scans wiki page
On 13 November 2012 16:37, Robert Haas robertmh...@gmail.com wrote: I found this an interesting read. As one of the people who worked on the feature, I'm sort of curious whether people have any experience yet with how this actually shakes out in the field. Are you (or is anyone) aware of positive/negative field experiences with this feature? Unfortunately, I don't think that I have any original insight about the problems with index-only scans in the field right now. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Index only scans wiki page
https://wiki.postgresql.org/wiki/Index-only_scans This page is seriously out of date. It suggests they are not yet implemented, but only being talked about. Would someone who knows a lot about the subject (which is why I'm sending this hackers, not web) like to take a whack at updating this? Otherwise, I'd like to just replace the whole page with a discussion of the limitations and trade-offs involved in using them as they currently exist Thanks, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index only scans wiki page
On 13 November 2012 01:03, Jeff Janes jeff.ja...@gmail.com wrote: https://wiki.postgresql.org/wiki/Index-only_scans This page is seriously out of date. It suggests they are not yet implemented, but only being talked about. Attached is a piece I wrote on the feature. That might form the basis of a new wiki page. Feel free to incorporate this material as you see fit. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services index_only_scans.rst Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans versus serializable transactions
Tom Lane wrote: Kevin Grittner writes: By not visiting the heap page for tuples, index-only scans fail to acquire all of the necessary predicate locks for correct behavior at the serializable transaction isolation level. The tag for the tuple-level predicate locks includes the xmin, to avoid possible problems with tid re-use. (This was not covered in initial pre-release versions of SSI, and testing actually hit the problem.) When an index-only scan does need to look at the heap because the visibility map doesn't indicate that the tuple is visible to all transactions, the tuple-level predicate lock is acquired. The best we can do without visiting the heap is a page level lock on the heap page, so that is what the attached patch does. If there are no objections, I will apply to HEAD and 9.2. This isn't right in detail: there are paths through the loop where tuple is not NULL at the beginning of the next iteration (specifically, consider failure of a lossy-qual recheck). I think that only results in wasted work, but it's still not operating as intended. I'd suggest moving the declaration/initialization of the tuple variable to inside the while loop, since there's no desire for its value to carry across loops. You're right. It looks to me like moving the declaration (and initialization) to more local scape (just inside the loop) fixes it. New version attached. Will apply if no further problems are found. -Kevin index-only-serializable-v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans versus serializable transactions
Kevin Grittner wrote: New version attached. Will apply if no further problems are found. Pushed to master and REL9_2_STABLE. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] index-only scans versus serializable transactions
By not visiting the heap page for tuples, index-only scans fail to acquire all of the necessary predicate locks for correct behavior at the serializable transaction isolation level. The tag for the tuple-level predicate locks includes the xmin, to avoid possible problems with tid re-use. (This was not covered in initial pre-release versions of SSI, and testing actually hit the problem.) When an index-only scan does need to look at the heap because the visibility map doesn't indicate that the tuple is visible to all transactions, the tuple-level predicate lock is acquired. The best we can do without visiting the heap is a page level lock on the heap page, so that is what the attached patch does. If there are no objections, I will apply to HEAD and 9.2. -Kevin index-only-serializable-v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans versus serializable transactions
Kevin Grittner kevin.gritt...@wicourts.gov writes: By not visiting the heap page for tuples, index-only scans fail to acquire all of the necessary predicate locks for correct behavior at the serializable transaction isolation level. The tag for the tuple-level predicate locks includes the xmin, to avoid possible problems with tid re-use. (This was not covered in initial pre-release versions of SSI, and testing actually hit the problem.) When an index-only scan does need to look at the heap because the visibility map doesn't indicate that the tuple is visible to all transactions, the tuple-level predicate lock is acquired. The best we can do without visiting the heap is a page level lock on the heap page, so that is what the attached patch does. If there are no objections, I will apply to HEAD and 9.2. This isn't right in detail: there are paths through the loop where tuple is not NULL at the beginning of the next iteration (specifically, consider failure of a lossy-qual recheck). I think that only results in wasted work, but it's still not operating as intended. I'd suggest moving the declaration/initialization of the tuple variable to inside the while loop, since there's no desire for its value to carry across loops. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans vs. Hot Standby, round two
On 2 May 2012 13:41, Robert Haas robertmh...@gmail.com wrote: So on further reflection I'm thinking it may be best just to stick with a hard conflict for now and see what feedback we get from beta testers. Which is what I was expecting y'all to conclude once you'd looked at the task in more detail. And I'm happy with the concept of beta being a period where we listen to feedback, not just bugs, and decide whether further refinements are required. What I think is possible is to alter the conflict as it hits the backend. If a backend has enable_indexonly = off then it wouldn't be affected by those conflicts anyhow. So if we simply record whether we are using an index only plan then we'll know whether to ignore it or abort. I think we can handle that by marking the snapshot at executor startup time. Needs a few other pushups but not that hard. The likelihood is that SQL that uses index only won't run for long enough to be cancelled anyway. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans vs. Hot Standby, round two
On Thu, Apr 26, 2012 at 8:03 PM, Robert Haas robertmh...@gmail.com wrote: So, as a first step, I've committed a patch that just throws a hard conflict. I think we probably want to optimize this further, and I'm going to work investigate that next. But it seemed productive to get this much out of the way first, so I did. I've been thinking about this some more. What's worrying me is that a visibility conflict, however we implement it, could be *worse* from the user's point of view than just killing the query. After all, there's a reasonable likelihood that a single visibility map page covers the whole relation (or all the blocks that the user is interested in), so any sort of conflict is basically going to turn the index-only scan into an index-scan plus some extra overhead. And if the planner had known that the user was going to get an index-only scan rather than just a plain index scan, it might well have picked some other plan in the first place. Another problem is that, if we add a test for visibility conflicts into visibilitymap_test(), I'm afraid we're going to drive up the cost of that function very significantly. Previous testing suggests that that efficiency or lack thereof of that function is already a performance problem for index-only scans, which kinda makes me not that excited about adding another branch in there somewhere (and even less excited about any proposed implementation that would add an lwlock acquire/release or similar). So on further reflection I'm thinking it may be best just to stick with a hard conflict for now and see what feedback we get from beta testers. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans vs. Hot Standby, round two
On Mon, Apr 16, 2012 at 4:13 PM, Robert Haas robertmh...@gmail.com wrote: But fundamentally we all seem to be converging on some variant of the soft conflict idea. So, as a first step, I've committed a patch that just throws a hard conflict. I think we probably want to optimize this further, and I'm going to work investigate that next. But it seemed productive to get this much out of the way first, so I did. In studying this, it strikes me that it would be rather nicer if we recovery conflicts could somehow arrange to roll back the active transaction by some means short of a FATAL error. I think there are some protocol issues with doing that, but I still wish we could figure out a way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans vs. Hot Standby, round two
On Fri, Apr 13, 2012 at 12:33:06PM -0400, Robert Haas wrote: In the department of query cancellations, I believe Noah argued previously that this wasn't really going to cause a problem. And, indeed, if the master has a mix of inserts, updates, and deletes, then it seems likely that any recovery conflicts generated this way would be hitting about the same place in the XID space as the ones caused by pruning away dead tuples. What will be different, if we go this route, is that an insert-only master, which right now only generates conflicts at freezing time, will instead generate conflicts much sooner, as soon as the relation is vacuumed. I do not use Hot Standby enough to know whether this is a problem, and I'm not trying to block this approach, but I do want to make sure that everyone agrees that it's acceptable before we go do it, because inevitably somebody is going to get bit. Given sufficient doubt, we could add a GUC, say hot_standby_use_all_visible. A standby with the GUC off would ignore all-visible indicators and also decline to generate conflicts when flipping them on. As to correctness, after further review of lazy_scan_heap(), I think there are some residual bugs here that need to be fixed whatever we decide to do about the Hot Standby case: 1. I noticed that when we do PageSetAllVisible() today we follow it up with SetBufferCommitInfoNeedsSave(). PD_ALL_VISIBLE is not merely a hint, so I think these should be changed to MarkBufferDirty(), which shouldn't make much difference given current code, but proposals to handle hint changes differently than non-hint changes abound, so it's probably not good to rely on those meaning the same thing forever. Do you refer to PD_ALL_VISIBLE as not merely a hint due to the requirement to prevent a page from simultaneously having a negative PD_ALL_VISIBLE and a positive visibility map bit? That is to say, PD_ALL_VISIBLE is fully a hint in its role as a witness of tuple visibility, but it is not a hint in its role as a witness of the visibility map status? In any event, I agree that those call sites should use MarkBufferDirty(). The large comment in SetBufferCommitInfoNeedsSave() seems incorrect. On systems with weaker memory ordering, the FlushBuffer() process's clearing of BM_JUST_DIRTIED may not be visible to the SetBufferCommitInfoNeedsSave() process until some time after the former retrieved buffer contents from shared memory. Memory barriers could make the comment true, but we should probably just update the comment to explain that a race condition may eat the update entirely. Incidentally, is there a good reason for MarkBufferDirty() to increment pgBufferUsage.shared_blks_dirtied and SetBufferCommitInfoNeedsSave() not to do so? Looks like an oversight. 2. More seriously, lazy_scan_heap() releases the buffer lock before setting the all-visible bit on the heap. This looks just plain wrong (and it's my fault, to be clear). Somebody else can come along after we've released the buffer lock and mark the page not-all-visible. That concurrent process will check the visibility map bit, find it already clear, and go on its merry way. Then, we set the visibility map bit and go on our merry way. Now we have a not-all-visible page with the all-visible bit set in the visibility map. Oops. Hmm, indeed. This deserves its own open item. I think this probably needs to be rewritten so that lazy_scan_heap() acquires a pin on the visibility map page before locking the buffer for cleanup and holds onto that pin until either we exit the range of heap pages covered by that visibility map page or trigger index vac due to maintenance_work_mem exhaustion. With that infrastructure in place, we can set the visibility map bit at the same time we set the page-level bit without risking holding the buffer lock across a buffer I/O (we might have to hold the buffer lock across a WAL flush in the worst case, but that hazard exists in a number of other places as well and fixing it here is well out of scope). Looks reasonable at a glance. 1. vacuum on master sets the page-level bit and the visibility map bit 2. the heap page with the bit is written to disk BEFORE the WAL entry generated in (1) gets flushed; this is allowable, since it's not an error for the page-level bit to be set while the visibility-map bit is unset, only the other way around 3. crash (still before the WAL entry is flushed) 4. some operation on the master emits an FPW for the page without first rendering it not-all-visible At present, I'm not sure there's any real problem here, since normally an all-visible heap page is only going to get another WAL entry if somebody does an insert, update, or delete on it, in which case the visibility map bit is going to get cleared anyway. Freezing the page might emit a new FPW, but that's going to generate conflicts anyway, so no problem there. So I think there's no real problem here, but it
Re: [HACKERS] index-only scans vs. Hot Standby, round two
On Mon, Apr 16, 2012 at 8:02 AM, Noah Misch n...@leadboat.com wrote: On Fri, Apr 13, 2012 at 12:33:06PM -0400, Robert Haas wrote: In the department of query cancellations, I believe Noah argued previously that this wasn't really going to cause a problem. And, indeed, if the master has a mix of inserts, updates, and deletes, then it seems likely that any recovery conflicts generated this way would be hitting about the same place in the XID space as the ones caused by pruning away dead tuples. What will be different, if we go this route, is that an insert-only master, which right now only generates conflicts at freezing time, will instead generate conflicts much sooner, as soon as the relation is vacuumed. I do not use Hot Standby enough to know whether this is a problem, and I'm not trying to block this approach, but I do want to make sure that everyone agrees that it's acceptable before we go do it, because inevitably somebody is going to get bit. Given sufficient doubt, we could add a GUC, say hot_standby_use_all_visible. A standby with the GUC off would ignore all-visible indicators and also decline to generate conflicts when flipping them on. I'm against adding a new GUC, especially for that fairly thin reason. 1. vacuum on master sets the page-level bit and the visibility map bit 2. the heap page with the bit is written to disk BEFORE the WAL entry generated in (1) gets flushed; this is allowable, since it's not an error for the page-level bit to be set while the visibility-map bit is unset, only the other way around 3. crash (still before the WAL entry is flushed) 4. some operation on the master emits an FPW for the page without first rendering it not-all-visible At present, I'm not sure there's any real problem here, since normally an all-visible heap page is only going to get another WAL entry if somebody does an insert, update, or delete on it, in which case the visibility map bit is going to get cleared anyway. Freezing the page might emit a new FPW, but that's going to generate conflicts anyway, so no problem there. So I think there's no real problem here, but it doesn't seem totally future-proof - any future type of WAL record that might modify the page without rendering it not-all-visible would create a very subtle hazard for Hot Standby. We could dodge the whole issue by leaving the hack in heapgetpage() intact and just be happy with making index-only scans work, or maybe somebody has a more clever idea. Good point. We could finagle things so the standby notices end-of-recovery checkpoints and blocks the optimization for older snapshots. For example, maintain an integer count of end-of-recovery checkpoints seen and store that in each Snapshot instead of takenDuringRecovery; use 0 on a master. When the global value is greater than the snapshot's value, disable the optimization for that snapshot. I don't know whether the optimization is powerful enough to justify that level of trouble, but it's an option to consider. For a different approach, targeting the fragility, we could add assertions to detect unexpected cases of a page bearing PD_ALL_VISIBLE submitted as a full-page image. We don't need a future proof solution, especially not at this stage of the release cycle. Every time you add a WAL record, we need to consider the possible conflict impact. We just need a simple and clear solution. I'll work on that. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans vs. Hot Standby, round two
On 16.04.2012 10:38, Simon Riggs wrote: On Mon, Apr 16, 2012 at 8:02 AM, Noah Mischn...@leadboat.com wrote: On Fri, Apr 13, 2012 at 12:33:06PM -0400, Robert Haas wrote: In the department of query cancellations, I believe Noah argued previously that this wasn't really going to cause a problem. And, indeed, if the master has a mix of inserts, updates, and deletes, then it seems likely that any recovery conflicts generated this way would be hitting about the same place in the XID space as the ones caused by pruning away dead tuples. What will be different, if we go this route, is that an insert-only master, which right now only generates conflicts at freezing time, will instead generate conflicts much sooner, as soon as the relation is vacuumed. I do not use Hot Standby enough to know whether this is a problem, and I'm not trying to block this approach, but I do want to make sure that everyone agrees that it's acceptable before we go do it, because inevitably somebody is going to get bit. Given sufficient doubt, we could add a GUC, say hot_standby_use_all_visible. A standby with the GUC off would ignore all-visible indicators and also decline to generate conflicts when flipping them on. I'm against adding a new GUC, especially for that fairly thin reason. Same here. Can we have a soft hot standby conflict that doesn't kill the query, but disables index-only-scans? In the long run, perhaps we need to store XIDs in the visibility map instead of just a bit. If you we only stored one xid per 100 pages or something like that, the storage overhead would not be much higher than what we have at the moment. But that's obviously not going to happen for 9.2... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans vs. Hot Standby, round two
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Can we have a soft hot standby conflict that doesn't kill the query, but disables index-only-scans? Well, there wouldn't be any way for the planner to know whether an index-only scan would be safe or not. I think this would have to look like a run-time fallback. Could it be structured as return that the page's all-visible bit is not set, instead of failing? Or am I confused about where the conflict is coming from? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans vs. Hot Standby, round two
On Mon, Apr 16, 2012 at 3:19 PM, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Can we have a soft hot standby conflict that doesn't kill the query, but disables index-only-scans? Well, there wouldn't be any way for the planner to know whether an index-only scan would be safe or not. I think this would have to look like a run-time fallback. Could it be structured as return that the page's all-visible bit is not set, instead of failing? Or am I confused about where the conflict is coming from? The starting point is that HS now offers 4 different mechanisms for avoiding conflicts, probably the best of which is hot standby feedback. So we only need to improve things if those techniques don't work for people. So initially, my attitude is lets throw a conflict and wait for feedback during beta. If we do need to do something, then introduce concept of a visibility conflict. On replay: If feedback not set, set LSN of visibility conflict on PROCs that conflict, if not already set. On query: If feedback not set, check conflict LSN against page, if page is later, check visibility. In terms of optimisation, I wouldn't expect to have to adjust costs at all. The difference would only show for long running queries and typically they don't touch too much new data as a fraction of total. The cost model for index only is pretty crude anyhow. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans vs. Hot Standby, round two
On Mon, Apr 16, 2012 at 4:26 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Can we have a soft hot standby conflict that doesn't kill the query, but disables index-only-scans? Yeah, something like that seems possible. For example, suppose the master includes, in each mark-heap-page-all-visible record, the newest XID on the page. On the standby, we keep track of the most recent such XID ever seen in shared memory. After noting that a page is all-visible, the standby cross-checks its snapshot against this XID and does the heap fetch anyway if it's too new. Conceivably this can be done with memory barriers but not without locks: we must update the XID in shared memory *before* marking the page all-visible, and we must check the visibility map or page-level bit *before* fetching the XID from shared memory - but the actual reads and writes of the XID are atomic. Now, if an index-only scan suffers a very high number of these soft conflicts and consequently does a lot more heap fetches than expected, performance might suck. But that should be rare, and could be mitigated by turning on hot_standby_feedback. Also, there might be some hit for repeatedly reading that XID from memory. Alternatively, we could try to forbid index-only scan plans from being created in the first place *only when* the underlying snapshot is too old. But then what happens if a conflict happens later, after the plan has been created but before it's fully executed? At that point it's too late to switch gears, so we'd still need something like the above. And the above might be adequate all by itself, since in practice index-only scans are mostly going to be useful when the data isn't changing all that fast, so most of the queries that would be cancelled by hard conflicts wouldn't have actually had a problem anyway. In the long run, perhaps we need to store XIDs in the visibility map instead of just a bit. If you we only stored one xid per 100 pages or something like that, the storage overhead would not be much higher than what we have at the moment. But that's obviously not going to happen for 9.2... Well, that would also have the undesirable side effect of greatly reducing the granularity of the map. As it is, updating a tiny fraction of the tuples in the table could result in the entire table ceasing to be not-all-visible if you happen to update exactly one tuple per page through the entire heap. Or more generally, updating X% of the rows in the table can cause Y% of the rows in the table to no longer be visible for index-only-scan purposes, where Y X. What you're proposing would make that much worse. I think we're actually going to find that the current system isn't tight enough; my suspicion is that users are going to complain that we're not aggressive enough about marking pages all-visible, because autovac won't kick in until updates+deletes20%, but (1) index-only scans also care about inserts and (2) by the time we've got 20% dead tuples index-only scans may well be worthless. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans vs. Hot Standby, round two
On Mon, Apr 16, 2012 at 3:02 AM, Noah Misch n...@leadboat.com wrote: Do you refer to PD_ALL_VISIBLE as not merely a hint due to the requirement to prevent a page from simultaneously having a negative PD_ALL_VISIBLE and a positive visibility map bit? That is to say, PD_ALL_VISIBLE is fully a hint in its role as a witness of tuple visibility, but it is not a hint in its role as a witness of the visibility map status? Exactly. The large comment in SetBufferCommitInfoNeedsSave() seems incorrect. On systems with weaker memory ordering, the FlushBuffer() process's clearing of BM_JUST_DIRTIED may not be visible to the SetBufferCommitInfoNeedsSave() process until some time after the former retrieved buffer contents from shared memory. True. Memory barriers could make the comment true, but we should probably just update the comment to explain that a race condition may eat the update entirely. Agreed. Incidentally, is there a good reason for MarkBufferDirty() to increment pgBufferUsage.shared_blks_dirtied and SetBufferCommitInfoNeedsSave() not to do so? Looks like an oversight. Also agreed. 2. More seriously, lazy_scan_heap() releases the buffer lock before setting the all-visible bit on the heap. This looks just plain wrong (and it's my fault, to be clear). Somebody else can come along after we've released the buffer lock and mark the page not-all-visible. That concurrent process will check the visibility map bit, find it already clear, and go on its merry way. Then, we set the visibility map bit and go on our merry way. Now we have a not-all-visible page with the all-visible bit set in the visibility map. Oops. Hmm, indeed. This deserves its own open item. Also agreed. Good point. We could finagle things so the standby notices end-of-recovery checkpoints and blocks the optimization for older snapshots. For example, maintain an integer count of end-of-recovery checkpoints seen and store that in each Snapshot instead of takenDuringRecovery; use 0 on a master. When the global value is greater than the snapshot's value, disable the optimization for that snapshot. I don't know whether the optimization is powerful enough to justify that level of trouble, but it's an option to consider. I suspect not. It seems we can make index-only scans work without doing something like this; it's only the sequential-scan optimization we might lose. But nobody's ever really complained about not having that, to my knowledge. For a different approach, targeting the fragility, we could add assertions to detect unexpected cases of a page bearing PD_ALL_VISIBLE submitted as a full-page image. The holes are narrow enough that I fear such cases would be detected only on high-velocity production systems, which is not exactly where you want to find out about such problems. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans vs. Hot Standby, round two
On Mon, Apr 16, 2012 at 1:58 PM, Simon Riggs si...@2ndquadrant.com wrote: If we do need to do something, then introduce concept of a visibility conflict. On replay: If feedback not set, set LSN of visibility conflict on PROCs that conflict, if not already set. On query: If feedback not set, check conflict LSN against page, if page is later, check visibility. Hmm, should have read the whole thread before replying. This similar to what I just proposed in response to Heikki's message, but using LSN in lieu of (or maybe you mean in addition to) XID. I don't think we can ignore the need to throw conflicts just because hot_standby_feedback is set; there are going to be corner cases, for example, when it's just recently been turned on and the master has already done cleanup; or if the master and standby have recently gotten disconnected for even just a few seconds. But fundamentally we all seem to be converging on some variant of the soft conflict idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans vs. Hot Standby, round two
On Fri, Apr 13, 2012 at 5:33 PM, Robert Haas robertmh...@gmail.com wrote: Currently, we have a problem with index-only scans in Hot Standby mode: the xmin horizon on the standby might lag the master, and thus an index-only scan might mistakenly conclude that no heap fetch is needed when in fact it is. I suggested that we handle this by suppressing generation of index-only scan plans in Hot Standby mode, but Simon, Noah, and Dimitri were arguing that we should instead do the following, which is now on the open items list: * Make XLOG_HEAP2_VISIBLE records generate recovery snapshot conflicts so that IndexOnlyScans work on Hot Standby ... snip very long email /snip Luckily its much simpler than all of that suggests. It'll take a few hours for me to write a short reply but its Sunday today, so that will happen later. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] index-only scans vs. Hot Standby, round two
Currently, we have a problem with index-only scans in Hot Standby mode: the xmin horizon on the standby might lag the master, and thus an index-only scan might mistakenly conclude that no heap fetch is needed when in fact it is. I suggested that we handle this by suppressing generation of index-only scan plans in Hot Standby mode, but Simon, Noah, and Dimitri were arguing that we should instead do the following, which is now on the open items list: * Make XLOG_HEAP2_VISIBLE records generate recovery snapshot conflicts so that IndexOnlyScans work on Hot Standby Ideally, this would also allow us to remove the following hack from heapgetpage(), which would improve sequential-scan performance for Hot Standby. /* * If the all-visible flag indicates that all tuples on the page are * visible to everyone, we can skip the per-tuple visibility tests. But * not in hot standby mode. A tuple that's already visible to all * transactions in the master might still be invisible to a read-only * transaction in the standby. */ all_visible = PageIsAllVisible(dp) !snapshot-takenDuringRecovery; As far as I can see, to make this work, we'd need to have lazy_scan_heap() compute the latest xmin on any potentially all-visible page and treat that as a cutoff XID, cancelling queries with snapshots whose xmins equal or precede that value, just as we currently do when freezing tuples (cf xlog_heap_freeze). There are two things to worry about: correctness, and more query cancellations. In the department of query cancellations, I believe Noah argued previously that this wasn't really going to cause a problem. And, indeed, if the master has a mix of inserts, updates, and deletes, then it seems likely that any recovery conflicts generated this way would be hitting about the same place in the XID space as the ones caused by pruning away dead tuples. What will be different, if we go this route, is that an insert-only master, which right now only generates conflicts at freezing time, will instead generate conflicts much sooner, as soon as the relation is vacuumed. I do not use Hot Standby enough to know whether this is a problem, and I'm not trying to block this approach, but I do want to make sure that everyone agrees that it's acceptable before we go do it, because inevitably somebody is going to get bit. In making our decision, I think we should keep in mind that we are currently pretty laid-back about marking pages all-visible, and that's probably something we're going to want to change over time: for example, we recently discussed the fact that a page with one dead tuple in it currently needs *two* vacuums to become all-visible, because only the first vacuum pass is smart enough to mark things all-visible, and the first heap pass only truncates the dead tuple to a dead line pointer, so the page isn't all-visible a that point. When we fix that, which I think we're almost certainly going to want to do at some point, then that means these conflicts will occur that much sooner. I think we will likely also want to consider marking things all-visible on the fly at some point in the future; for example, if we pull a buffer in for an index-only scan and dirty it by setting a hint bit, we might want to take the plunge and mark it all-visible before evicting it, to save effort the next time. Again, not trying to dissuade anyone, just making sure we've thought through it enough to avoid being unhappy later. It's worth noting that none of this will normally matter if hot_standby_feedback=on, so part of the analysis here may depend on how many people we think have flipped that switch. As to correctness, after further review of lazy_scan_heap(), I think there are some residual bugs here that need to be fixed whatever we decide to do about the Hot Standby case: 1. I noticed that when we do PageSetAllVisible() today we follow it up with SetBufferCommitInfoNeedsSave(). PD_ALL_VISIBLE is not merely a hint, so I think these should be changed to MarkBufferDirty(), which shouldn't make much difference given current code, but proposals to handle hint changes differently than non-hint changes abound, so it's probably not good to rely on those meaning the same thing forever. 2. More seriously, lazy_scan_heap() releases the buffer lock before setting the all-visible bit on the heap. This looks just plain wrong (and it's my fault, to be clear). Somebody else can come along after we've released the buffer lock and mark the page not-all-visible. That concurrent process will check the visibility map bit, find it already clear, and go on its merry way. Then, we set the visibility map bit and go on our merry way. Now we have a not-all-visible page with the all-visible bit set in the visibility map. Oops. I think this probably needs to be rewritten so that lazy_scan_heap() acquires a pin on the visibility map page before locking the buffer for cleanup and holds onto that pin until either we exit the range
[HACKERS] Index only scans and visibilitymap.c
With index only scans, the comments in src/backend/access/heap/visibilitymap.c are probably out of date, starting with: Currently, the visibility map is only used as a hint Also, is there a discussion of how and why index-only scans is safe? i.e. what lock, if any, has to be held while nodeIndexonlyscan.c calls visibilitymap_test? Do index-only scans need a README file? Thanks, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Fri, Oct 7, 2011 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Please find attached a patch implementing a basic version of index-only scans. I'm making some progress with this, but I notice what seems like a missing feature: there needs to be a way to turn it off. Otherwise performance comparisons will be difficult to impossible. The most obvious solution is a planner control GUC, perhaps enable_indexonlyscan. Anyone object, or want to bikeshed the name? Currently I can't force an indexonlyscan over an indexscan, because of the way the enable_* variables work. Should setting enable_indexscan=off also disable index-only scans (the current behavior) in addition to plain index scans? By way of precedent, enable_indexscan=off does not disable Bitmap Index Scan. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Sat, Oct 15, 2011 at 2:16 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Fri, Oct 7, 2011 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Please find attached a patch implementing a basic version of index-only scans. I'm making some progress with this, but I notice what seems like a missing feature: there needs to be a way to turn it off. Otherwise performance comparisons will be difficult to impossible. The most obvious solution is a planner control GUC, perhaps enable_indexonlyscan. Anyone object, or want to bikeshed the name? Currently I can't force an indexonlyscan over an indexscan, because of the way the enable_* variables work. OK, scratch that. I must have been using the wrong query (for which the index was not covering), as I can't reproduce the behavior nor looking at the code can I see how it could have occurred in the first place. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
I wrote: I was also toying with the notion of pushing the slot fill-in into the AM, so that the AM API is to return a filled TupleSlot not an IndexTuple. This would not save any cycles AFAICT --- at least in btree, we still have to make a palloc'd copy of the IndexTuple so that we can release lock on the index page. The point of it would be to avoid the assumption that the index's internal storage has exactly the format of IndexTuple. Don't know whether we'd ever have any actual use for that flexibility, but it seems like it wouldn't cost much to preserve the option. BTW, I concluded that that would be a bad idea, because it would involve the index AM in some choices that we're likely to want to change. In particular it would foreclose ever doing anything with expression indexes, without an AM API change. Better to just define the AM's responsibility as to hand back a tuple defined according to the index's columns. Although this aspect of the code is now working well enough for btree, I realized that it's going to have a problem if/when we add GiST support. The difficulty is that the index rowtype includes storage datatypes, not underlying-heap-column datatypes, for opclasses where those are different. This is not going to do for cases where we need to reconstruct a heap value from the index contents, as in Alexander's example of gist point_ops using a box as the underlying storage. What we actually want back from the index AM is a rowtype that matches the list of values submitted for indexing (ie, the original output of FormIndexDatum), and only for btree is it the case that that's represented more or less exactly as the IndexTuple stored in the index. So what I'm now thinking is to go back to the idea of having the index AM fill in a TupleTableSlot. For btree this would just amount to moving the existing StoreIndexTuple function into the AM. But it would give GiST the opportunity to do some computation, and it would avoid the problem of the index's rowtype not being a suitable intermediate format. The objection I voiced above is misguided, because it confuses the set of column types that's needed with the format distinction between a Slot and an IndexTuple. BTW, right at the moment I'm not that excited about actually doing any work on GiST itself for index-only scans. Given the current list of available opclasses there don't seem to be very many for which index-only scans would be possible, so the amount of work needed seems rather out of proportion to the benefit. But I don't mind fixing AM API details that are needed to make this workable. I'd rather have the API as right as possible in the first release. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Tue, Oct 11, 2011 at 12:19 AM, Tom Lane t...@sss.pgh.pa.us wrote: I wrote: I have mostly-working code for approach #3, but I haven't tried to make EXPLAIN work yet. While looking at that I realized that there's a pretty good argument for adding the above-mentioned explicit TargetEntry list representing the index columns to index-only plan nodes. Namely, that if we don't do it, EXPLAIN will have to go to the catalogs to find out what's in that index, and this will fall down for hypothetical indexes injected into the planner by index advisors. We could imagine adding some more hooks to let the advisor inject bogus catalog data at EXPLAIN time, but on the whole it seems easier and less fragile to just have the planner include a data structure it has to build anyway into the finished plan. The need for this additional node list field also sways me in a direction that I'd previously been on the fence about, namely that I think index-only scans need to be their own independent plan node type instead of sharing a node type with regular indexscans. It's just too weird that a simple boolean indexonly property would mean completely different contents/interpretation of the tlist and quals. Attached is a draft patch for this. It needs some more review before committing, but it does pass regression tests now. One point worth commenting on is that I chose to rename the OUTER and INNER symbols for special varnos to OUTER_VAR and INNER_VAR, along with adding a new special varno INDEX_VAR. It's bothered me for some time that those macro names were way too generic/susceptible to collision; and since I had to look at all the uses anyway to see if the INDEX case needed to be added, this seemed like a good time to rename them. +1 for that renaming, for sure. Have you given any thought to what would be required to support index-only scans on non-btree indexes - particularly, GIST? ISTM we might have had a thought that some GIST opclasses but not others would be able to regurgitate tuples, or maybe even that it might vary from index tuple to index tuple. But that discussion was a long time ago, and my memory is fuzzy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Tue, Oct 11, 2011 at 2:36 PM, Robert Haas robertmh...@gmail.com wrote: Have you given any thought to what would be required to support index-only scans on non-btree indexes - particularly, GIST? ISTM we might have had a thought that some GIST opclasses but not others would be able to regurgitate tuples, or maybe even that it might vary from index tuple to index tuple. But that discussion was a long time ago, and my memory is fuzzy. In some GiST opclasses original tuple can't be restored from index tuple. For example, polygon can't be restored from it's MBR. In some opclasses, for example box_ops and point_ops, original tuple can be easily restore from index tuple. I can't remeber any implementation where this possibility can vary from index tuple to index tuple. GiST opclass for ts_vector have different representation of leaf index tuple depending on it's length. But, at most, it store array of hashes of words, so it's lossy anyway. I think GiST interface could be extended with optional function which restores original tuple. But there is some additional difficulty, when some opclasses of composite index provide this function, but others - not. -- With best regards, Alexander Korotkov.
Re: [HACKERS] index-only scans
Robert Haas robertmh...@gmail.com writes: Have you given any thought to what would be required to support index-only scans on non-btree indexes - particularly, GIST? ISTM we might have had a thought that some GIST opclasses but not others would be able to regurgitate tuples, or maybe even that it might vary from index tuple to index tuple. But that discussion was a long time ago, and my memory is fuzzy. It would have to be a per-opclass property, for sure, since the basic criterion is whether the opclass's compress function is lossy. I don't think we can tolerate a situation where some of the index tuples might be able to yield data and others not --- that would undo all the improvements I've been working on over the past few days. I haven't thought as far ahead as how we might get the information needed for a per-opclass flag. A syntax addition to CREATE OPERATOR CLASS might be the only way. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Tue, Oct 11, 2011 at 5:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: I haven't thought as far ahead as how we might get the information needed for a per-opclass flag. A syntax addition to CREATE OPERATOR CLASS might be the only way. Shouldn't it be implemented through additional interface function? There are situations when restoring of original tuple requires some transformation. For example, in point_ops we store box in the leaf index tuple, while point can be easily restored from box. -- With best regards, Alexander Korotkov.
Re: [HACKERS] index-only scans
On Oct 7, 2011, at 8:47 PM, Joshua D. Drake wrote: On 10/07/2011 11:40 AM, Tom Lane wrote: Robert Haasrobertmh...@gmail.com writes: Please find attached a patch implementing a basic version of index-only scans. I'm making some progress with this, but I notice what seems like a missing feature: there needs to be a way to turn it off. Otherwise performance comparisons will be difficult to impossible. The most obvious solution is a planner control GUC, perhaps enable_indexonlyscan. Anyone object, or want to bikeshed the name? enable_onlyindexscan I'm kidding. +1 on Tom's proposed name. +1 ... definitely an important thing to do. regards, hans -- Cybertec Schönig Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
Alexander Korotkov aekorot...@gmail.com writes: On Tue, Oct 11, 2011 at 5:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: I haven't thought as far ahead as how we might get the information needed for a per-opclass flag. A syntax addition to CREATE OPERATOR CLASS might be the only way. Shouldn't it be implemented through additional interface function? There are situations when restoring of original tuple requires some transformation. For example, in point_ops we store box in the leaf index tuple, while point can be easily restored from box. Hm. I had been supposing that lossless compress functions would just be no-ops. If that's not necessarily the case then we might need something different from the opclass's decompress function to get back the original data. However, that doesn't really solve the problem I'm concerned about, because the existence and use of such a function would be entirely internal to GiST. There still needs to be a way for the planner to know which opclasses support data retrieval. And I do *not* want to see us hard-wire the presence of opclass function 8 means a GiST opclass can return data into the planner. Maybe, instead of a simple constant amcanreturn column, we need an AM API function that says whether the index can return data. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Wed, Oct 12, 2011 at 12:35 AM, Tom Lane t...@sss.pgh.pa.us wrote: Hm. I had been supposing that lossless compress functions would just be no-ops. If that's not necessarily the case then we might need something different from the opclass's decompress function to get back the original data. However, that doesn't really solve the problem I'm concerned about, because the existence and use of such a function would be entirely internal to GiST. There still needs to be a way for the planner to know which opclasses support data retrieval. And I do *not* want to see us hard-wire the presence of opclass function 8 means a GiST opclass can return data into the planner. Maybe, instead of a simple constant amcanreturn column, we need an AM API function that says whether the index can return data. I like idea of such AM API function. Since single multicolumn index can use multiple opclasses, AM API function should also say *what* data index can return. -- With best regards, Alexander Korotkov.
Re: [HACKERS] index-only scans
Tom Lane t...@sss.pgh.pa.us writes: I haven't thought as far ahead as how we might get the information needed for a per-opclass flag. A syntax addition to CREATE OPERATOR CLASS might be the only way. It looks to me like it's related to the RECHECK property. Maybe it's just too late, though. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
Alexander Korotkov aekorot...@gmail.com writes: On Wed, Oct 12, 2011 at 12:35 AM, Tom Lane t...@sss.pgh.pa.us wrote: Maybe, instead of a simple constant amcanreturn column, we need an AM API function that says whether the index can return data. I like idea of such AM API function. Since single multicolumn index can use multiple opclasses, AM API function should also say *what* data index can return. I was thinking more like amcanreturn(index, column_number) returns bool which says if the index can return the data for that column. The AM would still have to return a full IndexTuple at runtime, but it'd be allowed to insert nulls or garbage for columns it hadn't promised to return. BTW, if we do this, I'm rather strongly tempted to get rid of the name-versus-cstring hack (see index_descriptor_hack() in HEAD) by defining btree name_ops as not capable of returning data. I don't trust that hack much at all. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
I wrote: I have mostly-working code for approach #3, but I haven't tried to make EXPLAIN work yet. While looking at that I realized that there's a pretty good argument for adding the above-mentioned explicit TargetEntry list representing the index columns to index-only plan nodes. Namely, that if we don't do it, EXPLAIN will have to go to the catalogs to find out what's in that index, and this will fall down for hypothetical indexes injected into the planner by index advisors. We could imagine adding some more hooks to let the advisor inject bogus catalog data at EXPLAIN time, but on the whole it seems easier and less fragile to just have the planner include a data structure it has to build anyway into the finished plan. The need for this additional node list field also sways me in a direction that I'd previously been on the fence about, namely that I think index-only scans need to be their own independent plan node type instead of sharing a node type with regular indexscans. It's just too weird that a simple boolean indexonly property would mean completely different contents/interpretation of the tlist and quals. Attached is a draft patch for this. It needs some more review before committing, but it does pass regression tests now. One point worth commenting on is that I chose to rename the OUTER and INNER symbols for special varnos to OUTER_VAR and INNER_VAR, along with adding a new special varno INDEX_VAR. It's bothered me for some time that those macro names were way too generic/susceptible to collision; and since I had to look at all the uses anyway to see if the INDEX case needed to be added, this seemed like a good time to rename them. regards, tom lane binInhGxDA2ea.bin Description: index-only-scan-revisions.patch.gz -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
I wrote: I believe that we should rejigger things so that when an index-only scan is selected, the executor *always* works from the data supplied by the index. Even if it has to visit the heap --- it will do that but just to consult the tuple's visibility data, and then use what it got from the index anyway. This means we'd build the plan node's filter quals and targetlist to reference the index tuple columns not the underlying table's. I've been studying this a bit. The key decision is how to represent Vars that reference columns of the index. We really have to have varattno equal to the index column number, else ExecEvalVar will pull the wrong column from the tuple. However, varno is not so clear cut. There are at least four things we could do: 1. Keep varno = table's rangetable index. The trouble with this is that a Var referencing index column N would look exactly like a Var referencing table column N; so the same Var would mean something different in an index-only scan node than it does in any other type of scan node for the same table. We could maybe make that work, but it seems confusing and fragile as heck. The executor isn't going to care much, but inspection of the plan tree by e.g. EXPLAIN sure will. 2. Set varno = OUTER (or maybe INNER). This is safe because there's no other use for OUTER/INNER in a table scan node. We would have to hack things so that the index tuple gets put into econtext-ecxt_outertuple (resp. ecxt_innertuple) at runtime, but that seems like no big problem. In both setrefs.c and ruleutils.c, it would be desirable to have a TargetEntry list somewhere representing the index columns, which setrefs would want so it could set up the special Var nodes with fix_upper_expr, and ruleutils would want so it could interpret the Vars using existing machinery. I'm not sure whether to hang that list on the index-only plan node or expect EXPLAIN to regenerate it at need. 3. Invent another special varno value similar to OUTER/INNER but representing an index reference. This is just about like #2 except that we could still put the index tuple into econtext-ecxt_scantuple, and ExecEvalVar would do the right thing as it stands. 4. Create a rangetable entry specifically representing the index, and set varno equal to that RTE's number. This has some attractiveness in terms of making the meaning of the Vars clear, but an RTE that represents an index rather than a table seems kind of ugly otherwise. It would likely require changes in unrelated parts of the code. One point here is that we have historically used solution #1 to represent the index keys in index qual expressions. We avoid the ambiguity issues by not asking EXPLAIN to try to interpret the indexqual tree at all: it works from indexqualorig which contains ordinary Vars. So one way to dodge the disadvantages of solution #1 would be to add untransformed targetlistorig and qualorig fields to an index-only plan node, and use those for EXPLAIN. However, those fields would be totally dead weight if the plan were never EXPLAINed, whereas indexqualorig has a legitimate use for rechecking indexquals against the heap tuple in case of a lossy index. (BTW, if we go with any solution other than #1, I'm strongly inclined to change the representation of indexqual to match. See the comments in fix_indexqual_operand.) At the moment I'm leaning to approach #3, but I wonder if anyone has a different opinion or another idea altogether. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Sun, Oct 9, 2011 at 9:03 PM, Tom Lane t...@sss.pgh.pa.us wrote: At the moment I'm leaning to approach #3, but I wonder if anyone has a different opinion or another idea altogether. Would any of these make it more realistic to talk about the crazy plans Heikki suggested like doing two index scans, doing the join between the index tuples, and only then looking up the visibility information and remaining columns for the tuple on the matching rows? -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
Greg Stark st...@mit.edu writes: On Sun, Oct 9, 2011 at 9:03 PM, Tom Lane t...@sss.pgh.pa.us wrote: At the moment I'm leaning to approach #3, but I wonder if anyone has a different opinion or another idea altogether. Would any of these make it more realistic to talk about the crazy plans Heikki suggested like doing two index scans, doing the join between the index tuples, and only then looking up the visibility information and remaining columns for the tuple on the matching rows? I don't think it's particularly relevant --- we would not want to use weird representations of the Vars outside the index scan nodes. Above the scan they'd be just like any other upper-level Vars. (FWIW, that idea isn't crazy; I remember having discussions of it back in 2003 or so.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Sun, Oct 9, 2011 at 10:54 PM, Tom Lane t...@sss.pgh.pa.us wrote: I don't think it's particularly relevant --- we would not want to use weird representations of the Vars outside the index scan nodes. Above the scan they'd be just like any other upper-level Vars. I can't say I fully understand the planner data structures and the implications of the options. I guess what I was imagining was that being able to reference the indexes as regular rangetable entries would make it more convenient for the rest of the planner to keep working as if nothing had changed when working with values extracted from index tuples. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
I wrote: There are at least four things we could do: ... 2. Set varno = OUTER (or maybe INNER). This is safe because there's no other use for OUTER/INNER in a table scan node. We would have to hack things so that the index tuple gets put into econtext-ecxt_outertuple (resp. ecxt_innertuple) at runtime, but that seems like no big problem. In both setrefs.c and ruleutils.c, it would be desirable to have a TargetEntry list somewhere representing the index columns, which setrefs would want so it could set up the special Var nodes with fix_upper_expr, and ruleutils would want so it could interpret the Vars using existing machinery. I'm not sure whether to hang that list on the index-only plan node or expect EXPLAIN to regenerate it at need. 3. Invent another special varno value similar to OUTER/INNER but representing an index reference. This is just about like #2 except that we could still put the index tuple into econtext-ecxt_scantuple, and ExecEvalVar would do the right thing as it stands. I have mostly-working code for approach #3, but I haven't tried to make EXPLAIN work yet. While looking at that I realized that there's a pretty good argument for adding the above-mentioned explicit TargetEntry list representing the index columns to index-only plan nodes. Namely, that if we don't do it, EXPLAIN will have to go to the catalogs to find out what's in that index, and this will fall down for hypothetical indexes injected into the planner by index advisors. We could imagine adding some more hooks to let the advisor inject bogus catalog data at EXPLAIN time, but on the whole it seems easier and less fragile to just have the planner include a data structure it has to build anyway into the finished plan. The need for this additional node list field also sways me in a direction that I'd previously been on the fence about, namely that I think index-only scans need to be their own independent plan node type instead of sharing a node type with regular indexscans. It's just too weird that a simple boolean indexonly property would mean completely different contents/interpretation of the tlist and quals. I've run into one other thing that's going to need to be hacked up a bit: index-only scans on name columns fall over with this modified code, because there's now tighter checking of the implied tuple descriptors: regression=# select relname from pg_class where relname = 'tenk1'; ERROR: attribute 1 has wrong type DETAIL: Table has type cstring, but query expects name. The reason for this is the hack we put in some time back to conserve space in system catalog indexes by having name columns be indexed as though they were cstring, cf commit 5f6f840e93a3649e0d07e85bad188d163e96ec0e. We will probably need some compensatory hack in index-only scans, unless we can think of a less klugy way of representing that optimization. (Basically, the index-only code is assuming that btrees don't have storage type distinct from input type, and that's not the case for the name opclass. I had kind of expected the original patch to have some issues with that too, and I'm still not fully convinced that there aren't corner cases where it'd be an issue even with the currently committed code.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Mon, Oct 10, 2011 at 2:47 AM, Tom Lane t...@sss.pgh.pa.us wrote: The need for this additional node list field also sways me in a direction that I'd previously been on the fence about, namely that I think index-only scans need to be their own independent plan node type instead of sharing a node type with regular indexscans At a superficial PR level it'll go over quite well to have a special plan node be visible in the explain output. People will love to see Fast Index Scan or Covering Index Scan or whatever you call it in their plans. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
I wrote: Robert Haas robertmh...@gmail.com writes: Not really. We have detected a small performance regression when both heap and index fit in shared_buffers and an index-only scan is used. I have a couple of ideas for improving this. One is to store a virtual tuple into the slot instead of building a regular tuple, but what do we do about tuples with OIDs? [ that's done ] I was also toying with the notion of pushing the slot fill-in into the AM, so that the AM API is to return a filled TupleSlot not an IndexTuple. This would not save any cycles AFAICT --- at least in btree, we still have to make a palloc'd copy of the IndexTuple so that we can release lock on the index page. The point of it would be to avoid the assumption that the index's internal storage has exactly the format of IndexTuple. Don't know whether we'd ever have any actual use for that flexibility, but it seems like it wouldn't cost much to preserve the option. BTW, I concluded that that would be a bad idea, because it would involve the index AM in some choices that we're likely to want to change. In particular it would foreclose ever doing anything with expression indexes, without an AM API change. Better to just define the AM's responsibility as to hand back a tuple defined according to the index's columns. Another is to avoid locking the index buffer multiple times - right now it locks the index buffer to get the TID, and then relocks it to extract the index tuple (after checking that nothing disturbing has happened meanwhile). It seems likely that with some refactoring we could get this down to a single lock/unlock cycle, but I haven't figured out exactly where the TID gets copied out. Yeah, maybe, but let's get the patch committed before we start looking for second-order optimizations. On reflection I'm starting to think that the above would be a good idea because there are a couple of bogosities in the basic choices this patch made. In particular, I'm thinking about how we could use an index on f(x) to avoid recalculating f() in something like select f(x) from tab where f(x) 42; assuming that f() is expensive but immutable. The planner side of this is already a bit daunting, because it's not clear how to recognize that an index on f(x) is a covering index (the existing code is going to think that x itself needs to be available). But the executor side is a real problem, because it will fail to make use of the f() value fetched from the index anytime the heap visibility test fails. I believe that we should rejigger things so that when an index-only scan is selected, the executor *always* works from the data supplied by the index. Even if it has to visit the heap --- it will do that but just to consult the tuple's visibility data, and then use what it got from the index anyway. This means we'd build the plan node's filter quals and targetlist to reference the index tuple columns not the underlying table's. (Which in passing gets rid of the behavior you were complaining about that EXPLAIN VERBOSE shows a lot of columns that aren't actually being computed.) In order to make this work, we have to remove the API wart that says the index AM is allowed to choose to not return the index tuple. And that ties into what you were saying above. What we ought to do is have _bt_readpage() save off copies of the whole tuples not only the TIDs when it is extracting data from the page. This is no more net copying work than what happens now (assuming that all the tuples get fetched) since we won't need the per-tuple memcpy that occurs now in bt_getindextuple. The tuples can go into a page-sized workspace buffer associated with the BTScanPosData structure, and then just be referenced in-place in that workspace with no second copy step. I'm inclined to do the last part immediately, since there's a performance argument for it, and then work on revising the executor implementation. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Oct 8, 2011, at 11:52 AM, Tom Lane t...@sss.pgh.pa.us wrote: I'm inclined to do the last part immediately, since there's a performance argument for it, and then work on revising the executor implementation. Sounds great. Thanks for working on this. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
Robert Haas robertmh...@gmail.com writes: Please find attached a patch implementing a basic version of index-only scans. I'm making some progress with this, but I notice what seems like a missing feature: there needs to be a way to turn it off. Otherwise performance comparisons will be difficult to impossible. The most obvious solution is a planner control GUC, perhaps enable_indexonlyscan. Anyone object, or want to bikeshed the name? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On 10/07/2011 11:40 AM, Tom Lane wrote: Robert Haasrobertmh...@gmail.com writes: Please find attached a patch implementing a basic version of index-only scans. I'm making some progress with this, but I notice what seems like a missing feature: there needs to be a way to turn it off. Otherwise performance comparisons will be difficult to impossible. The most obvious solution is a planner control GUC, perhaps enable_indexonlyscan. Anyone object, or want to bikeshed the name? enable_onlyindexscan I'm kidding. +1 on Tom's proposed name. regards, tom lane -- Command Prompt, Inc. - http://www.commandprompt.com/ PostgreSQL Support, Training, Professional Services and Development The PostgreSQL Conference - http://www.postgresqlconference.org/ @cmdpromptinc - @postgresconf - 509-416-6579 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Fri, Oct 7, 2011 at 2:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Please find attached a patch implementing a basic version of index-only scans. I'm making some progress with this, but I notice what seems like a missing feature: there needs to be a way to turn it off. Otherwise performance comparisons will be difficult to impossible. The most obvious solution is a planner control GUC, perhaps enable_indexonlyscan. Anyone object, or want to bikeshed the name? I was expecting you to NOT want that, or I would have put it in to begin with... so go for it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
Robert Haas robertmh...@gmail.com writes: On Fri, Oct 7, 2011 at 2:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: I'm making some progress with this, but I notice what seems like a missing feature: there needs to be a way to turn it off. Otherwise performance comparisons will be difficult to impossible. The most obvious solution is a planner control GUC, perhaps enable_indexonlyscan. Anyone object, or want to bikeshed the name? I was expecting you to NOT want that, or I would have put it in to begin with... so go for it. It seems unlikely to have any use except for testing, but I think we need it for that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
Robert Haas robertmh...@gmail.com writes: Please find attached a patch implementing a basic version of index-only scans. This patch is the work of my colleague Ibrar Ahmed and myself, and also incorporates some code from previous patches posted by Heikki Linnakanagas. I've committed this after some rather substantial editorialization. There's still a lot left to do of course, but it seems to need performance testing next, and that'll be easier if the code is in HEAD. 1. The way that nodeIndexscan.c builds up the faux heap tuple is perhaps susceptible to improvement. I thought about building a virtual tuple, but then what do I do with an OID column, if I have one? Or maybe this should be done some other way altogether. I switched it to use a virtual tuple for now, and just not attempt to use index-only scans if a system column is required. We're likely to want to rethink this anyway, because as currently constituted the code can't do anything with an expression index, and avoiding recalculation of an expensive function could be a nice win. But the approach of just building a faux heap tuple fundamentally doesn't work for that. 2. Suppose we scan one tuple on a not-all-visible page followed by 99 tuples on all-visible pages. The code as written will hold the pin on the first heap page for the entire scan. As soon as we hit the end of the scan or another tuple where we have to actually visit the page, the old pin will be released, but until then we hold onto it. I did not do anything about this issue --- ISTM it needs performance testing. 3. The code in create_index_path() builds up a bitmapset of heap attributes that get used for any purpose anywhere in the query, and hangs it on the RelOptInfo so it doesn't need to be rebuilt for every index under consideration. However, if it were somehow possible to have the rel involved without using any attributes at all, we'd rebuild the cache over and over, since it would never become non-NULL. I dealt with this by the expedient of getting rid of the caching ;-). It's not clear to me that it was worth the trouble, and in any case it's fundamentally wrong to suppose that every index faces the same set of attributes it must supply. It should not need to supply columns that are only needed in indexquals or index predicate conditions. I'm not sure how to deal with those refinements cheaply enough, but the cache isn't helping. 4. There are a couple of cases that use index-only scans even though the EXPLAIN output sort of makes it look like they shouldn't. For example, in the above queries, an index-only scan is chosen even though the query does SELECT * from the table being scanned. Even though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems that the target list of an EXISTS query is in fact discarded, e.g.: The reason it looks that way is that we're choosing to use a physical result tuple to avoid an ExecProject step at runtime. There's nothing wrong with the logic, it's just that EXPLAIN shows something that might mislead people. 5. We haven't made any planner changes at all, not even for cost estimation. It is not clear to me what the right way to do cost estimation here is. Yeah, me either. For the moment I put in a hard-wired estimate that only 90% of the heap pages would actually get fetched. This is conservative and only meant to ensure that the planner picks an index-only-capable plan over an indexscan with a non-covering index, all else being equal. We'll need to do performance testing before we can refine that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Fri, Oct 7, 2011 at 8:14 PM, Tom Lane t...@sss.pgh.pa.us wrote: 1. The way that nodeIndexscan.c builds up the faux heap tuple is perhaps susceptible to improvement. I thought about building a virtual tuple, but then what do I do with an OID column, if I have one? Or maybe this should be done some other way altogether. I switched it to use a virtual tuple for now, and just not attempt to use index-only scans if a system column is required. We're likely to want to rethink this anyway, because as currently constituted the code can't do anything with an expression index, and avoiding recalculation of an expensive function could be a nice win. But the approach of just building a faux heap tuple fundamentally doesn't work for that. Figuring out how to fix that problem likely requires more knowledge of the executor than I have got. 2. Suppose we scan one tuple on a not-all-visible page followed by 99 tuples on all-visible pages. The code as written will hold the pin on the first heap page for the entire scan. As soon as we hit the end of the scan or another tuple where we have to actually visit the page, the old pin will be released, but until then we hold onto it. I did not do anything about this issue --- ISTM it needs performance testing. I'm actually less worried about any performance problem than I am about the possibility of holding up VACUUM. That can happen the old way, too, but now the pin could stay on the same page for quite a while even when the scan is advancing. I think we maybe ought to think seriously about solving the problem at the other end, though - either make VACUUM skip pages that it can't get a cleanup lock on without blocking (except in anti-wraparound mode) or have it just do the amount of work that can be done with an exclusive lock (i.e. prune but not defragment, which would work even in anti-wraparound mode). That would solve the problems of (1) undetected VACUUM deadlock vs. a buffer pin, (2) indefinite VACUUM stall due to a suspended query, and (3) this issue. 3. The code in create_index_path() builds up a bitmapset of heap attributes that get used for any purpose anywhere in the query, and hangs it on the RelOptInfo so it doesn't need to be rebuilt for every index under consideration. However, if it were somehow possible to have the rel involved without using any attributes at all, we'd rebuild the cache over and over, since it would never become non-NULL. I dealt with this by the expedient of getting rid of the caching ;-). It's not clear to me that it was worth the trouble, and in any case it's fundamentally wrong to suppose that every index faces the same set of attributes it must supply. It should not need to supply columns that are only needed in indexquals or index predicate conditions. I'm not sure how to deal with those refinements cheaply enough, but the cache isn't helping. Oh, hmm. 4. There are a couple of cases that use index-only scans even though the EXPLAIN output sort of makes it look like they shouldn't. For example, in the above queries, an index-only scan is chosen even though the query does SELECT * from the table being scanned. Even though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems that the target list of an EXISTS query is in fact discarded, e.g.: The reason it looks that way is that we're choosing to use a physical result tuple to avoid an ExecProject step at runtime. There's nothing wrong with the logic, it's just that EXPLAIN shows something that might mislead people. I wonder if we oughta do something about that. I was also thinking we should probably make EXPLAIN ANALYZE display the number of heap fetches, so that you can see how index-only your index-only scan actually was. 5. We haven't made any planner changes at all, not even for cost estimation. It is not clear to me what the right way to do cost estimation here is. Yeah, me either. For the moment I put in a hard-wired estimate that only 90% of the heap pages would actually get fetched. This is conservative and only meant to ensure that the planner picks an index-only-capable plan over an indexscan with a non-covering index, all else being equal. We'll need to do performance testing before we can refine that. Yep. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
Robert Haas robertmh...@gmail.com writes: Please find attached a patch implementing a basic version of index-only scans. This patch is the work of my colleague Ibrar Ahmed and myself, and also incorporates some code from previous patches posted by Heikki Linnakanagas. I'm starting to review this patch now. Has any further work been done since the first version was posted? Also, was any documentation written? I'm a tad annoyed by having to reverse-engineer the changes in the AM API specification from the code. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Thu, Oct 6, 2011 at 3:15 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Please find attached a patch implementing a basic version of index-only scans. This patch is the work of my colleague Ibrar Ahmed and myself, and also incorporates some code from previous patches posted by Heikki Linnakanagas. I'm starting to review this patch now. Thanks! Has any further work been done since the first version was posted? Also, was any documentation written? I'm a tad annoyed by having to reverse-engineer the changes in the AM API specification from the code. Not really. We have detected a small performance regression when both heap and index fit in shared_buffers and an index-only scan is used. I have a couple of ideas for improving this. One is to store a virtual tuple into the slot instead of building a regular tuple, but what do we do about tuples with OIDs? Another is to avoid locking the index buffer multiple times - right now it locks the index buffer to get the TID, and then relocks it to extract the index tuple (after checking that nothing disturbing has happened meanwhile). It seems likely that with some refactoring we could get this down to a single lock/unlock cycle, but I haven't figured out exactly where the TID gets copied out. With regard to the AM API, the basic idea is we're just adding a Boolean to say whether the AM is capable of returning index tuples. If it's not, then we don't ever try an index-only scan. If it is, then we'll set the want_index_tuple flag if an index-only scan is possible. This requests that the AM attempt to return the tuple; but the AM is also allowed to fail and not return the tuple whenever it wants. This is more or less the interface that Heikki suggested a couple years back, but it might well be vulnerable to improvement. Incidentally, if you happen to feel the urge to beat this around and send it back rather than posting a list of requested changes, feel free. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
Robert Haas robertmh...@gmail.com writes: Not really. We have detected a small performance regression when both heap and index fit in shared_buffers and an index-only scan is used. I have a couple of ideas for improving this. One is to store a virtual tuple into the slot instead of building a regular tuple, but what do we do about tuples with OIDs? Yeah, I was just looking at that. I think it's going to be a clear win to use a virtual tuple slot instead of constructing and deconstructing a HeapTuple. The obvious solution is to decree that you can't use an index-only scan if the query requires fetching OID (or any other system column). This would be slightly annoying for catalog fetches but I'm not sure I believe that catalog queries are that important a use-case. I was also toying with the notion of pushing the slot fill-in into the AM, so that the AM API is to return a filled TupleSlot not an IndexTuple. This would not save any cycles AFAICT --- at least in btree, we still have to make a palloc'd copy of the IndexTuple so that we can release lock on the index page. The point of it would be to avoid the assumption that the index's internal storage has exactly the format of IndexTuple. Don't know whether we'd ever have any actual use for that flexibility, but it seems like it wouldn't cost much to preserve the option. Another is to avoid locking the index buffer multiple times - right now it locks the index buffer to get the TID, and then relocks it to extract the index tuple (after checking that nothing disturbing has happened meanwhile). It seems likely that with some refactoring we could get this down to a single lock/unlock cycle, but I haven't figured out exactly where the TID gets copied out. Yeah, maybe, but let's get the patch committed before we start looking for second-order optimizations. With regard to the AM API, the basic idea is we're just adding a Boolean to say whether the AM is capable of returning index tuples. If it's not, then we don't ever try an index-only scan. If it is, then we'll set the want_index_tuple flag if an index-only scan is possible. This requests that the AM attempt to return the tuple; but the AM is also allowed to fail and not return the tuple whenever it wants. It was the allowed to fail bit that I was annoyed to discover only by reading some well-buried code. Incidentally, if you happen to feel the urge to beat this around and send it back rather than posting a list of requested changes, feel free. You can count on that ;-). I've already found a lot of things I didn't care for. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On 6 October 2011 21:11, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Not really. We have detected a small performance regression when both heap and index fit in shared_buffers and an index-only scan is used. I have a couple of ideas for improving this. One is to store a virtual tuple into the slot instead of building a regular tuple, but what do we do about tuples with OIDs? Yeah, I was just looking at that. I think it's going to be a clear win to use a virtual tuple slot instead of constructing and deconstructing a HeapTuple. The obvious solution is to decree that you can't use an index-only scan if the query requires fetching OID (or any other system column). This would be slightly annoying for catalog fetches but I'm not sure I believe that catalog queries are that important a use-case. I was also toying with the notion of pushing the slot fill-in into the AM, so that the AM API is to return a filled TupleSlot not an IndexTuple. This would not save any cycles AFAICT --- at least in btree, we still have to make a palloc'd copy of the IndexTuple so that we can release lock on the index page. The point of it would be to avoid the assumption that the index's internal storage has exactly the format of IndexTuple. Don't know whether we'd ever have any actual use for that flexibility, but it seems like it wouldn't cost much to preserve the option. Another is to avoid locking the index buffer multiple times - right now it locks the index buffer to get the TID, and then relocks it to extract the index tuple (after checking that nothing disturbing has happened meanwhile). It seems likely that with some refactoring we could get this down to a single lock/unlock cycle, but I haven't figured out exactly where the TID gets copied out. Yeah, maybe, but let's get the patch committed before we start looking for second-order optimizations. +1 Been testing this today with a few regression tests and haven't seen anything noticeably broken. Does what it says on the tin. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Sun, Aug 14, 2011 at 00:31, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: That is somewhat compensated by the fact that tuples that are accessed more often are also more likely to be in cache. Fetching the heap tuple to check visibility is very cheap when the tuple is in cache. I'm not sure how far that compensates it, though. I'm sure there's typically nevertheless a fairly wide range of pages that have been modified since the last vacuum, but not in cache anymore. Would it make sense to re-evaluate the visibility bit just before a page gets flushed out from shared buffers? On a system with no long transactions, it seems likely that a dirty page is already all-visible by the time bgwriter (or shared buffers memory pressure) gets around to writing it out. That way we don't have to wait for vacuum to do it and would make your observation hold more often. Regards, Marti -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Sun, Sep 25, 2011 at 2:43 PM, Marti Raudsepp ma...@juffo.org wrote: On Sun, Aug 14, 2011 at 00:31, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: That is somewhat compensated by the fact that tuples that are accessed more often are also more likely to be in cache. Fetching the heap tuple to check visibility is very cheap when the tuple is in cache. I'm not sure how far that compensates it, though. I'm sure there's typically nevertheless a fairly wide range of pages that have been modified since the last vacuum, but not in cache anymore. Would it make sense to re-evaluate the visibility bit just before a page gets flushed out from shared buffers? On a system with no long transactions, it seems likely that a dirty page is already all-visible by the time bgwriter (or shared buffers memory pressure) gets around to writing it out. That way we don't have to wait for vacuum to do it and would make your observation hold more often. This has been suggested before, and, sure, there might be cases where it helps. But you need to choose your test case fairly carefully. For example, if you're doing a large sequential scan on a table, the ring-buffer logic causes processes to evict their own pages, and so the background writer doesn't get a chance to touch any of those pages. You need some kind of a workload where pages are being evicted from shared buffers slowly enough that it ends up being the background writer, rather than the individual backends, that do the work. But if you have that kind of workload, then we can infer that most of your working set fits into shared buffers. And in that case you don't really need index-only scans in the first place. The main point of index only scans is to optimize the case where you have a gigantic table and you're trying to avoid swamping the system with random I/O. I'm not saying that such a change would be a particularly bad idea, but I'm not personally planning to work on it any time soon because I can't convince myself that it really helps all that much. I think the real solution to getting visibility map bits set is to vacuum more frequently, but have it be cheaper each time. Our default autovacuum settings vacuum the table when the number of updates and deletes reaches 20% of the table size. But those settings were put in place under the assumption that we'll have to scan the entire heap, dirtying every page that contains dead tuples, scan all the indexes and remove the associated index pointers, and then scan and dirty the heap pages that contain now-dead line pointers a second time to remove those. The visibility map has eroded those assumptions to some degree, because now we probably won't have to scan the entire heap every time we vacuum; and I'm hoping we're going to see some further erosion. Pavan has a pending patch which, if we can work out the details, will eliminate the second heap scan; and we've also talked about doing the index scan only when there are enough dead line pointers to justify the effort. That, it seems to me, would open the door to lowering the scale factor, maybe by quite a bit - which, in turn, would help us control bloat better and get visibility map bits set sooner. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
2011/8/16 Anssi Kääriäinen anssi.kaariai...@thl.fi: On 08/14/2011 12:31 AM, Heikki Linnakangas wrote: The same idea could of course be used to calculate the effective cache hit ratio for each table. Cache hit ratio would have the problem of feedback loops, though. Yeah, I'm not excited about making the planner and statistics more dynamic. Feedback loops and plan instability are not fun. I might be a little out of my league here... But I was thinking about the cache hit ratio and feedback loops. I understand automatic tuning would be hard. But making automatic tuning easier (by using pg_tune for example) would be a big plus for most use cases. To make it easier to tune the page read costs automatically, it would be nice if there would be four variables instead of the current two: - random_page_cost is the cost of reading a random page from storage. Currently it is not, it is the cost of accessing a random page, taking in account it might be in memory. - seq_page_cost is the cost of reading pages sequentially from storage - memory_page_cost is the cost of reading a page in memory - cache_hit_ratio is the expected cache hit ratio memory_page_cost would be server global, random and seq page costs tablespace specific, and cache_hit_ratio relation specific. You would get the current behavior by tuning *_page_costs realistically, and setting cache_hit_ratio globally so that the expected random_page_cost / seq_page_cost stays the same as now. The biggest advantage of this would be that the correct values are much easier to detect automatically compared to current situation. This can be done using pg_statio_* views and IO speed testing. They should not be tuned automatically by PostgreSQL, at least not the cache_hit_ratio, as that leads to the possibility of feedback loops and plan instability. The variables would also be much easier to understand. There is the question if one should be allowed to tune the *_page_costs at all. If I am not missing something, it is possible to detect the correct values programmatically and they do not change if you do not change the hardware. Cache hit ratio is the real reason why they are currently so important for tuning. An example why the current random_page_cost and seq_page_cost tuning is not adequate is that you can only set random_page_cost per tablespace. That makes perfect sense if random_page_cost would be the cost of accessing a page in storage. But it is not, it is a combination of that and caching effects, so that it actually varies per relation (and over time). How do you set it correctly for a query where one relation is fully cached and another one not? Another problem is that if you use random_page_cost == seq_page_cost, you are effectively saying that everything is in cache. But if everything is in cache, the cost of page access relative to cpu_*_costs is way off. The more random_page_cost and seq_page_cost are different, the more they mean the storage access costs. When they are the same, they mean the memory page cost. There can be an order of magnitude in difference of a storage page cost and a memory page cost. So it is hard to tune the cpu_*_costs realistically for cases where sometimes data is in cache and sometimes not. Ok, enough hand waving for one post :) Sorry if this all is obvious / discussed before. My googling didn't turn out anything directly related, although these have some similarity: - Per-table random_page_cost for tables that we know are always cached [http://archives.postgresql.org/pgsql-hackers/2008-04/msg01503.php] - Script to compute random page cost [http://archives.postgresql.org/pgsql-hackers/2002-09/msg00503.php] - The science of optimization in practical terms? [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00718.php], getting really interesting starting from here: [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00787.php] late reply. You can add this link to your list: http://archives.postgresql.org/pgsql-hackers/2011-06/msg01140.php -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On 08/14/2011 12:31 AM, Heikki Linnakangas wrote: The same idea could of course be used to calculate the effective cache hit ratio for each table. Cache hit ratio would have the problem of feedback loops, though. Yeah, I'm not excited about making the planner and statistics more dynamic. Feedback loops and plan instability are not fun. I might be a little out of my league here... But I was thinking about the cache hit ratio and feedback loops. I understand automatic tuning would be hard. But making automatic tuning easier (by using pg_tune for example) would be a big plus for most use cases. To make it easier to tune the page read costs automatically, it would be nice if there would be four variables instead of the current two: - random_page_cost is the cost of reading a random page from storage. Currently it is not, it is the cost of accessing a random page, taking in account it might be in memory. - seq_page_cost is the cost of reading pages sequentially from storage - memory_page_cost is the cost of reading a page in memory - cache_hit_ratio is the expected cache hit ratio memory_page_cost would be server global, random and seq page costs tablespace specific, and cache_hit_ratio relation specific. You would get the current behavior by tuning *_page_costs realistically, and setting cache_hit_ratio globally so that the expected random_page_cost / seq_page_cost stays the same as now. The biggest advantage of this would be that the correct values are much easier to detect automatically compared to current situation. This can be done using pg_statio_* views and IO speed testing. They should not be tuned automatically by PostgreSQL, at least not the cache_hit_ratio, as that leads to the possibility of feedback loops and plan instability. The variables would also be much easier to understand. There is the question if one should be allowed to tune the *_page_costs at all. If I am not missing something, it is possible to detect the correct values programmatically and they do not change if you do not change the hardware. Cache hit ratio is the real reason why they are currently so important for tuning. An example why the current random_page_cost and seq_page_cost tuning is not adequate is that you can only set random_page_cost per tablespace. That makes perfect sense if random_page_cost would be the cost of accessing a page in storage. But it is not, it is a combination of that and caching effects, so that it actually varies per relation (and over time). How do you set it correctly for a query where one relation is fully cached and another one not? Another problem is that if you use random_page_cost == seq_page_cost, you are effectively saying that everything is in cache. But if everything is in cache, the cost of page access relative to cpu_*_costs is way off. The more random_page_cost and seq_page_cost are different, the more they mean the storage access costs. When they are the same, they mean the memory page cost. There can be an order of magnitude in difference of a storage page cost and a memory page cost. So it is hard to tune the cpu_*_costs realistically for cases where sometimes data is in cache and sometimes not. Ok, enough hand waving for one post :) Sorry if this all is obvious / discussed before. My googling didn't turn out anything directly related, although these have some similarity: - Per-table random_page_cost for tables that we know are always cached [http://archives.postgresql.org/pgsql-hackers/2008-04/msg01503.php] - Script to compute random page cost [http://archives.postgresql.org/pgsql-hackers/2002-09/msg00503.php] - The science of optimization in practical terms? [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00718.php], getting really interesting starting from here: [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00787.php] - Anssi -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Aug 13, 2011, at 4:31 PM, Heikki Linnakangas wrote: The example is much more realistic if the query is a fetch of N latest rows from a table. Very common use case, and the whole relation's visibility statistics are completely wrong for that query. That is somewhat compensated by the fact that tuples that are accessed more often are also more likely to be in cache. Fetching the heap tuple to check visibility is very cheap when the tuple is in cache. I'm not sure how far that compensates it, though. I'm sure there's typically nevertheless a fairly wide range of pages that have been modified since the last vacuum, but not in cache anymore. http://xkcd.org/937/ :) Could something be added to pg_stats that tracks visibility map usefulness on a per-attribute basis? Perhaps another set of stats buckets that show visibility percentages for each stats bucket? -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers