subject:"\[HACKERS\] index\-only scans"

Re: [HACKERS] Index-only scans with btree_gist

2015-03-30 Thread Heikki Linnakangas


On 03/29/2015 04:30 AM, Andreas Karlsson wrote:

On 03/29/2015 03:25 AM, Andreas Karlsson wrote:

On 03/28/2015 09:36 PM, Andreas Karlsson wrote:

Thanks! Do you know if it is possible to add index-only scan support to
range indexes? I have never looked at those and do not know if they are
lossy.


Seems like range types are not compressed at all so implementing
index-only scans was trivial. A patch is attached.


Noticed a couple of typos, so to avoid having them sneak into the code
here is a version without them.


Thanks, pushed.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans with btree_gist

2015-03-28 Thread Andreas Karlsson


On 03/28/2015 09:36 PM, Andreas Karlsson wrote:

Thanks! Do you know if it is possible to add index-only scan support to
range indexes? I have never looked at those and do not know if they are
lossy.


Seems like range types are not compressed at all so implementing 
index-only scans was trivial. A patch is attached.


--
Andreas Karlsson
diff --git a/src/backend/utils/adt/rangetypes_gist.c b/src/backend/utils/adt/rangetypes_gist.c
index c1ff471..dea8f04 100644
--- a/src/backend/utils/adt/rangetypes_gist.c
+++ b/src/backend/utils/adt/rangetypes_gist.c
@@ -216,7 +216,7 @@ range_gist_union(PG_FUNCTION_ARGS)
 	PG_RETURN_RANGE(result_range);
 }
 
-/* compress, decompress are no-ops */
+/* compress, decompress, fecth are no-ops */
 Datum
 range_gist_compress(PG_FUNCTION_ARGS)
 {
@@ -233,6 +233,14 @@ range_gist_decompress(PG_FUNCTION_ARGS)
 	PG_RETURN_POINTER(entry);
 }
 
+Datum
+range_gist_fetch(PG_FUNCTION_ARGS)
+{
+	GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
+
+	PG_RETURN_POINTER(entry);
+}
+
 /*
  * GiST page split penalty function.
  *
diff --git a/src/include/catalog/pg_amproc.h b/src/include/catalog/pg_amproc.h
index 037684c..e80c897 100644
--- a/src/include/catalog/pg_amproc.h
+++ b/src/include/catalog/pg_amproc.h
@@ -235,6 +235,7 @@ DATA(insert (	3919   3831 3831 4 3878 ));
 DATA(insert (	3919   3831 3831 5 3879 ));
 DATA(insert (	3919   3831 3831 6 3880 ));
 DATA(insert (	3919   3831 3831 7 3881 ));
+DATA(insert (   3919   3831 3831 9 3996 ));
 
 
 /* gin */
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index a96d369..3cd7851 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4951,6 +4951,8 @@ DATA(insert OID = 3877 (  range_gist_compress	PGNSP PGUID 12 1 0 0 0 f f f f t f
 DESCR(GiST support);
 DATA(insert OID = 3878 (  range_gist_decompress PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 2281 2281 _null_ _null_ _null_ _null_ range_gist_decompress _null_ _null_ _null_ ));
 DESCR(GiST support);
+DATA(insert OID = 3996 (  range_gist_fetch		PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 2281 2281 _null_ _null_ _null_ _null_ range_gist_fetch _null_ _null_ _null_ ));
+DESCR(GiST support);
 DATA(insert OID = 3879 (  range_gist_penalty	PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 2281 2281 2281 2281 _null_ _null_ _null_ _null_ range_gist_penalty _null_ _null_ _null_ ));
 DESCR(GiST support);
 DATA(insert OID = 3880 (  range_gist_picksplit	PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 2281 2281 _null_ _null_ _null_ _null_ range_gist_picksplit _null_ _null_ _null_ ));
diff --git a/src/include/utils/rangetypes.h b/src/include/utils/rangetypes.h
index 173bf74..43c80f4 100644
--- a/src/include/utils/rangetypes.h
+++ b/src/include/utils/rangetypes.h
@@ -209,6 +209,7 @@ extern RangeType *make_empty_range(TypeCacheEntry *typcache);
 extern Datum range_gist_consistent(PG_FUNCTION_ARGS);
 extern Datum range_gist_compress(PG_FUNCTION_ARGS);
 extern Datum range_gist_decompress(PG_FUNCTION_ARGS);
+extern Datum range_gist_fetch(PG_FUNCTION_ARGS);
 extern Datum range_gist_union(PG_FUNCTION_ARGS);
 extern Datum range_gist_penalty(PG_FUNCTION_ARGS);
 extern Datum range_gist_picksplit(PG_FUNCTION_ARGS);
diff --git a/src/test/regress/expected/rangetypes.out b/src/test/regress/expected/rangetypes.out
index 35d0dd3..8654e03 100644
--- a/src/test/regress/expected/rangetypes.out
+++ b/src/test/regress/expected/rangetypes.out
@@ -1072,6 +1072,25 @@ select count(*) from test_range_spgist where ir -|- int4range(100,500);
  5
 (1 row)
 
+-- test index-only scans
+explain (costs off)
+select ir from test_range_spgist where ir -|- int4range(10,20) order by ir;
+   QUERY PLAN   
+
+ Sort
+   Sort Key: ir
+   -  Index Only Scan using test_range_spgist_idx on test_range_spgist
+ Index Cond: (ir -|- '[10,20)'::int4range)
+(4 rows)
+
+select ir from test_range_spgist where ir -|- int4range(10,20) order by ir;
+ ir 
+
+ [20,30)
+ [20,30)
+ [20,10020)
+(3 rows)
+
 RESET enable_seqscan;
 RESET enable_indexscan;
 RESET enable_bitmapscan;
diff --git a/src/test/regress/sql/rangetypes.sql b/src/test/regress/sql/rangetypes.sql
index aa026ca..af13352 100644
--- a/src/test/regress/sql/rangetypes.sql
+++ b/src/test/regress/sql/rangetypes.sql
@@ -286,6 +286,11 @@ select count(*) from test_range_spgist where ir  int4range(100,500);
 select count(*) from test_range_spgist where ir  int4range(100,500);
 select count(*) from test_range_spgist where ir -|- int4range(100,500);
 
+-- test index-only scans
+explain (costs off)
+select ir from test_range_spgist where ir -|- int4range(10,20) order by ir;
+select ir from test_range_spgist where ir -|- int4range(10,20) order by ir;
+
 RESET enable_seqscan;
 RESET enable_indexscan;
 RESET enable_bitmapscan;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your

Re: [HACKERS] Index-only scans with btree_gist

2015-03-28 Thread Andreas Karlsson


On 03/29/2015 03:25 AM, Andreas Karlsson wrote:

On 03/28/2015 09:36 PM, Andreas Karlsson wrote:

Thanks! Do you know if it is possible to add index-only scan support to
range indexes? I have never looked at those and do not know if they are
lossy.


Seems like range types are not compressed at all so implementing
index-only scans was trivial. A patch is attached.


Noticed a couple of typos, so to avoid having them sneak into the code 
here is a version without them.


--
Andreas Karlsson
diff --git a/src/backend/utils/adt/rangetypes_gist.c b/src/backend/utils/adt/rangetypes_gist.c
index c1ff471..ef84121 100644
--- a/src/backend/utils/adt/rangetypes_gist.c
+++ b/src/backend/utils/adt/rangetypes_gist.c
@@ -216,7 +216,7 @@ range_gist_union(PG_FUNCTION_ARGS)
 	PG_RETURN_RANGE(result_range);
 }
 
-/* compress, decompress are no-ops */
+/* compress, decompress, fetch are no-ops */
 Datum
 range_gist_compress(PG_FUNCTION_ARGS)
 {
@@ -233,6 +233,14 @@ range_gist_decompress(PG_FUNCTION_ARGS)
 	PG_RETURN_POINTER(entry);
 }
 
+Datum
+range_gist_fetch(PG_FUNCTION_ARGS)
+{
+	GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
+
+	PG_RETURN_POINTER(entry);
+}
+
 /*
  * GiST page split penalty function.
  *
diff --git a/src/include/catalog/pg_amproc.h b/src/include/catalog/pg_amproc.h
index 037684c..8a43f64 100644
--- a/src/include/catalog/pg_amproc.h
+++ b/src/include/catalog/pg_amproc.h
@@ -235,6 +235,7 @@ DATA(insert (	3919   3831 3831 4 3878 ));
 DATA(insert (	3919   3831 3831 5 3879 ));
 DATA(insert (	3919   3831 3831 6 3880 ));
 DATA(insert (	3919   3831 3831 7 3881 ));
+DATA(insert (	3919   3831 3831 9 3996 ));
 
 
 /* gin */
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index a96d369..3cd7851 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4951,6 +4951,8 @@ DATA(insert OID = 3877 (  range_gist_compress	PGNSP PGUID 12 1 0 0 0 f f f f t f
 DESCR(GiST support);
 DATA(insert OID = 3878 (  range_gist_decompress PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 2281 2281 _null_ _null_ _null_ _null_ range_gist_decompress _null_ _null_ _null_ ));
 DESCR(GiST support);
+DATA(insert OID = 3996 (  range_gist_fetch		PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 2281 2281 _null_ _null_ _null_ _null_ range_gist_fetch _null_ _null_ _null_ ));
+DESCR(GiST support);
 DATA(insert OID = 3879 (  range_gist_penalty	PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 2281 2281 2281 2281 _null_ _null_ _null_ _null_ range_gist_penalty _null_ _null_ _null_ ));
 DESCR(GiST support);
 DATA(insert OID = 3880 (  range_gist_picksplit	PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 2281 2281 _null_ _null_ _null_ _null_ range_gist_picksplit _null_ _null_ _null_ ));
diff --git a/src/include/utils/rangetypes.h b/src/include/utils/rangetypes.h
index 173bf74..43c80f4 100644
--- a/src/include/utils/rangetypes.h
+++ b/src/include/utils/rangetypes.h
@@ -209,6 +209,7 @@ extern RangeType *make_empty_range(TypeCacheEntry *typcache);
 extern Datum range_gist_consistent(PG_FUNCTION_ARGS);
 extern Datum range_gist_compress(PG_FUNCTION_ARGS);
 extern Datum range_gist_decompress(PG_FUNCTION_ARGS);
+extern Datum range_gist_fetch(PG_FUNCTION_ARGS);
 extern Datum range_gist_union(PG_FUNCTION_ARGS);
 extern Datum range_gist_penalty(PG_FUNCTION_ARGS);
 extern Datum range_gist_picksplit(PG_FUNCTION_ARGS);
diff --git a/src/test/regress/expected/rangetypes.out b/src/test/regress/expected/rangetypes.out
index 35d0dd3..8654e03 100644
--- a/src/test/regress/expected/rangetypes.out
+++ b/src/test/regress/expected/rangetypes.out
@@ -1072,6 +1072,25 @@ select count(*) from test_range_spgist where ir -|- int4range(100,500);
  5
 (1 row)
 
+-- test index-only scans
+explain (costs off)
+select ir from test_range_spgist where ir -|- int4range(10,20) order by ir;
+   QUERY PLAN   
+
+ Sort
+   Sort Key: ir
+   -  Index Only Scan using test_range_spgist_idx on test_range_spgist
+ Index Cond: (ir -|- '[10,20)'::int4range)
+(4 rows)
+
+select ir from test_range_spgist where ir -|- int4range(10,20) order by ir;
+ ir 
+
+ [20,30)
+ [20,30)
+ [20,10020)
+(3 rows)
+
 RESET enable_seqscan;
 RESET enable_indexscan;
 RESET enable_bitmapscan;
diff --git a/src/test/regress/sql/rangetypes.sql b/src/test/regress/sql/rangetypes.sql
index aa026ca..af13352 100644
--- a/src/test/regress/sql/rangetypes.sql
+++ b/src/test/regress/sql/rangetypes.sql
@@ -286,6 +286,11 @@ select count(*) from test_range_spgist where ir  int4range(100,500);
 select count(*) from test_range_spgist where ir  int4range(100,500);
 select count(*) from test_range_spgist where ir -|- int4range(100,500);
 
+-- test index-only scans
+explain (costs off)
+select ir from test_range_spgist where ir -|- int4range(10,20) order by ir;
+select ir from test_range_spgist where ir -|- int4range(10,20) order by ir;
+
 RESET

Re: [HACKERS] Index-only scans with btree_gist

2015-03-28 Thread Andreas Karlsson


On 03/28/2015 02:12 PM, Heikki Linnakangas wrote:

Looks good to me. Committed, thanks.


Thanks! Do you know if it is possible to add index-only scan support to 
range indexes? I have never looked at those and do not know if they are 
lossy.


Andreas


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans with btree_gist

2015-03-28 Thread Heikki Linnakangas


On 03/28/2015 01:14 AM, Andreas Karlsson wrote:

On 03/26/2015 10:31 PM, Heikki Linnakangas wrote:

I've pushed Anastasia's patch to support index-only scans with GiST, and
it's time to add opclasses support for all the opclasses that are not
lossy. I think at least all the btree_gist opclasses need to be
supported, it would be pretty surprising if they didn't support
index-only scans, while some more complex opclasses did.

Attached is a WIP patch for that. It covers all the varlen types that
btree_gist supports, and int2, int4 and oid. The rest of the fixed-width
types should be just a matter of copy-pasting. I'll continue adding
those, but wanted to let people know I'm working on this.


Would it also be worth doing the same for the inet_ops class for
inet/cidr? I have hacked a quick WIP patch which I believe should work,
but have not looked into the index only scan code enough to be sure.


Looks good to me. Committed, thanks.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans with btree_gist

2015-03-27 Thread Andreas Karlsson


On 03/26/2015 10:31 PM, Heikki Linnakangas wrote:

I've pushed Anastasia's patch to support index-only scans with GiST, and
it's time to add opclasses support for all the opclasses that are not
lossy. I think at least all the btree_gist opclasses need to be
supported, it would be pretty surprising if they didn't support
index-only scans, while some more complex opclasses did.

Attached is a WIP patch for that. It covers all the varlen types that
btree_gist supports, and int2, int4 and oid. The rest of the fixed-width
types should be just a matter of copy-pasting. I'll continue adding
those, but wanted to let people know I'm working on this.


Would it also be worth doing the same for the inet_ops class for 
inet/cidr? I have hacked a quick WIP patch which I believe should work, 
but have not looked into the index only scan code enough to be sure.


--
Andreas Karlsson
diff --git a/src/backend/utils/adt/network_gist.c b/src/backend/utils/adt/network_gist.c
index 14dd62b..0133032 100644
--- a/src/backend/utils/adt/network_gist.c
+++ b/src/backend/utils/adt/network_gist.c
@@ -105,6 +105,27 @@ typedef struct GistInetKey
 #define SET_GK_VARSIZE(dst) \
 	SET_VARSIZE_SHORT(dst, offsetof(GistInetKey, ipaddr) + gk_ip_addrsize(dst))
 
+Datum
+inet_gist_fetch(PG_FUNCTION_ARGS)
+{
+	GISTENTRY	*entry = (GISTENTRY *) PG_GETARG_POINTER(0);
+	GistInetKey	*key = DatumGetInetKeyP(entry-key);
+	GISTENTRY	*retval;
+	inet		*dst;
+
+	dst = (inet *) palloc0(sizeof(inet));
+
+	ip_family(dst) = gk_ip_family(key);
+	ip_bits(dst) = gk_ip_minbits(key);
+	memcpy(ip_addr(dst), gk_ip_addr(key), ip_addrsize(dst));
+	SET_INET_VARSIZE(dst);
+
+	retval = palloc(sizeof(GISTENTRY));
+	gistentryinit(*retval, InetPGetDatum(dst), entry-rel, entry-page,
+  entry-offset, FALSE);
+
+	PG_RETURN_POINTER(retval);
+}
 
 /*
  * The GiST query consistency check
diff --git a/src/include/catalog/pg_amproc.h b/src/include/catalog/pg_amproc.h
index 612a9d2..3b87f90 100644
--- a/src/include/catalog/pg_amproc.h
+++ b/src/include/catalog/pg_amproc.h
@@ -411,6 +411,7 @@ DATA(insert (	3550   869	869  4 3556 ));
 DATA(insert (	3550   869	869  5 3557 ));
 DATA(insert (	3550   869	869  6 3558 ));
 DATA(insert (	3550   869	869  7 3559 ));
+DATA(insert (   3550   869  869  9 3573 ));
 
 /* sp-gist */
 DATA(insert (	3474   3831 3831 1 3469 ));
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 77b7717..a96d369 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2240,6 +2240,8 @@ DATA(insert OID = 3555 (  inet_gist_compress	PGNSP PGUID 12 1 0 0 0 f f f f t f
 DESCR(GiST support);
 DATA(insert OID = 3556 (  inet_gist_decompress	PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 2281 2281 _null_ _null_ _null_ _null_ inet_gist_decompress _null_ _null_ _null_ ));
 DESCR(GiST support);
+DATA(insert OID = 3573 (  inet_gist_fetch		PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 2281 2281 _null_ _null_ _null_ _null_ inet_gist_fetch _null_ _null_ _null_ ));
+DESCR(GiST support);
 DATA(insert OID = 3557 (  inet_gist_penalty		PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 2281 2281 2281 2281 _null_ _null_ _null_ _null_ inet_gist_penalty _null_ _null_ _null_ ));
 DESCR(GiST support);
 DATA(insert OID = 3558 (  inet_gist_picksplit	PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 2281 2281 _null_ _null_ _null_ _null_ inet_gist_picksplit _null_ _null_ _null_ ));
diff --git a/src/include/utils/inet.h b/src/include/utils/inet.h
index 6694688..2d2bae4 100644
--- a/src/include/utils/inet.h
+++ b/src/include/utils/inet.h
@@ -123,6 +123,7 @@ extern int	bitncommon(const unsigned char *l, const unsigned char *r, int n);
 /*
  * GiST support functions in network_gist.c
  */
+extern Datum inet_gist_fetch(PG_FUNCTION_ARGS);
 extern Datum inet_gist_consistent(PG_FUNCTION_ARGS);
 extern Datum inet_gist_union(PG_FUNCTION_ARGS);
 extern Datum inet_gist_compress(PG_FUNCTION_ARGS);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans for GiST.

2015-03-26 Thread Heikki Linnakangas


On 03/02/2015 12:43 AM, Heikki Linnakangas wrote:

On 02/27/2015 04:19 PM, Anastasia Lubennikova wrote:

I add MemoryContext listCxt to avoid memory leak. listCxt is created once
in gistrescan (only for index-only scan plan ) and reseted when scan of the
leaf page is finished.

I do not sure if the problem was completely solved, so I wait for feedback.


Yeah, I think that solves it.


On further testing, there was actually still a leak with kNN-searches. 
Fixed.



I spent a little time cleaning this up. I reverted that pageData change
so that it's an array again, put back the gist_indexonly.sql and
expected output files that were missing from your latest version,
removed a couple of unused local variables that gcc complained about. I
refactored gistFetchTuple a bit, because it was doing IMHO quite bogus
things with NULLs. It was passing NULLs to the opclass' fetch function,
but it didn't pass the isNull flag correctly. I changed it so that the
fetch function is not called at all for NULLs.

I think this is pretty close to being committable. I'll make a round of
editorializing over the docs, and the code comments as well.

The opr_sanity regression test is failing, there's apparently something
wrong with the pg_proc entries of the *canreturn functions. I haven't
looked into that yet; could you fix that?


I have pushed this, after fixing the opr_sanity failure, some bug fixes, 
and documentation and comment cleanup.


Thanks for the patch!

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Index-only scans with btree_gist

2015-03-26 Thread Heikki Linnakangas

I've pushed Anastasia's patch to support index-only scans with GiST, and 
it's time to add opclasses support for all the opclasses that are not 
lossy. I think at least all the btree_gist opclasses need to be 
supported, it would be pretty surprising if they didn't support 
index-only scans, while some more complex opclasses did.


Attached is a WIP patch for that. It covers all the varlen types that 
btree_gist supports, and int2, int4 and oid. The rest of the fixed-width 
types should be just a matter of copy-pasting. I'll continue adding 
those, but wanted to let people know I'm working on this.


- Heikki
From d0a1cd0aff05ac3fdfc3d5cea4d3bc6c738ffc23 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas heikki.linnakangas@iki.fi
Date: Thu, 26 Mar 2015 23:10:14 +0200
Subject: [PATCH 1/1] Add index-only scan support to btree_gist.

WIP: not all datatypes are covered yet.
---
 contrib/btree_gist/Makefile |3 +-
 contrib/btree_gist/btree_gist--1.0--1.1.sql |   58 +
 contrib/btree_gist/btree_gist--1.0.sql  | 1491 --
 contrib/btree_gist/btree_gist--1.1.sql  | 1522 +++
 contrib/btree_gist/btree_gist.control   |2 +-
 contrib/btree_gist/btree_int2.c |8 +
 contrib/btree_gist/btree_int4.c |8 +
 contrib/btree_gist/btree_oid.c  |8 +
 contrib/btree_gist/btree_utils_num.c|   55 +
 contrib/btree_gist/btree_utils_num.h|1 +
 contrib/btree_gist/btree_utils_var.c|   18 +
 11 files changed, 1681 insertions(+), 1493 deletions(-)
 create mode 100644 contrib/btree_gist/btree_gist--1.0--1.1.sql
 delete mode 100644 contrib/btree_gist/btree_gist--1.0.sql
 create mode 100644 contrib/btree_gist/btree_gist--1.1.sql

diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile
index 19a5929..9b7d61d 100644
--- a/contrib/btree_gist/Makefile
+++ b/contrib/btree_gist/Makefile
@@ -9,7 +9,8 @@ OBJS =  btree_gist.o btree_utils_num.o btree_utils_var.o btree_int2.o \
 btree_numeric.o $(WIN32RES)
 
 EXTENSION = btree_gist
-DATA = btree_gist--1.0.sql btree_gist--unpackaged--1.0.sql
+DATA = btree_gist--1.1.sql btree_gist--unpackaged--1.0.sql \
+	btree_gist--1.0--1.1.sql
 PGFILEDESC = btree_gist - B-tree equivalent GIST operator classes
 
 REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp timestamptz \
diff --git a/contrib/btree_gist/btree_gist--1.0--1.1.sql b/contrib/btree_gist/btree_gist--1.0--1.1.sql
new file mode 100644
index 000..6b6f496
--- /dev/null
+++ b/contrib/btree_gist/btree_gist--1.0--1.1.sql
@@ -0,0 +1,58 @@
+/* contrib/btree_gist/btree_gist--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use CREATE EXTENSION btree_gist FROM unpackaged to load this file. \quit
+
+-- Index-only scan support new in 9.5.
+CREATE FUNCTION gbt_var_fetch(internal)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION gbt_oid_fetch(internal)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION gbt_int2_fetch(internal)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION gbt_int4_fetch(internal)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE C IMMUTABLE STRICT;
+
+ALTER OPERATOR FAMILY gist_oid_ops USING gist ADD
+	FUNCTION	9 (oid, oid) gbt_oid_fetch (internal) ;
+
+ALTER OPERATOR FAMILY gist_int2_ops USING gist ADD
+	FUNCTION	9 (int2, int2) gbt_int2_fetch (internal) ;
+
+ALTER OPERATOR FAMILY gist_int4_ops USING gist ADD
+	FUNCTION	9 (int4, int4) gbt_int4_fetch (internal) ;
+
+ALTER OPERATOR FAMILY gist_text_ops USING gist ADD
+	FUNCTION	9 (text, text) gbt_var_fetch (internal) ;
+
+ALTER OPERATOR FAMILY gist_bpchar_ops USING gist ADD
+	FUNCTION	9 (bpchar, bpchar) gbt_var_fetch (internal) ;
+
+ALTER OPERATOR FAMILY gist_bytea_ops USING gist ADD
+	FUNCTION	9 (bytea, bytea) gbt_var_fetch (internal) ;
+
+ALTER OPERATOR FAMILY gist_numeric_ops USING gist ADD
+	FUNCTION	9 (numeric, numeric) gbt_var_fetch (internal) ;
+
+ALTER OPERATOR FAMILY gist_bit_ops USING gist ADD
+	FUNCTION	9 (bit, bit) gbt_var_fetch (internal) ;
+
+ALTER OPERATOR FAMILY gist_vbit_ops USING gist ADD
+	FUNCTION	9 (varbit, varbit) gbt_var_fetch (internal) ;
+
+ALTER OPERATOR FAMILY gist_inet_ops USING gist ADD
+	FUNCTION	9 (inet, inet) gbt_var_fetch  (internal) ;
+
+ALTER OPERATOR FAMILY gist_cidr_ops USING gist ADD
+	FUNCTION	9 (cidr, cidr) gbt_var_fetch (internal) ;
diff --git a/contrib/btree_gist/btree_gist--1.0.sql b/contrib/btree_gist/btree_gist--1.0.sql
deleted file mode 100644
index c5c9587..000
--- a/contrib/btree_gist/btree_gist--1.0.sql
+++ /dev/null
@@ -1,1491 +0,0 @@
-/* contrib/btree_gist/btree_gist--1.0.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use CREATE EXTENSION btree_gist to load this file. \quit
-
-CREATE FUNCTION gbtreekey4_in(cstring)
-RETURNS gbtreekey4
-AS

Re: [HACKERS] Index-only scans for GiST.

2015-03-01 Thread Heikki Linnakangas


On 02/27/2015 04:19 PM, Anastasia Lubennikova wrote:

I add MemoryContext listCxt to avoid memory leak. listCxt is created once
in gistrescan (only for index-only scan plan ) and reseted when scan of the
leaf page is finished.

I do not sure if the problem was completely solved, so I wait for feedback.


Yeah, I think that solves it.


* What's the reason for turning GISTScanOpaqueData.pageData from an array
to a List?

This array is field of structure GISTScanOpaqueData. Memory for that
structure allocated in function gistbeginscan(). The array is static so
it's declared only one time in structure:
GISTSearchHeapItem  pageData [BLCKSZ/sizeof(IndexTupleData)]

But how could we know size of array if we don't know what data would be
returned? I mean type and amount.


You're only adding a pointer to the IndexTuple to GISTSearchHeapItem. 
The GISTSearchHeapItem struct itself is still of constant size.


I spent a little time cleaning this up. I reverted that pageData change 
so that it's an array again, put back the gist_indexonly.sql and 
expected output files that were missing from your latest version, 
removed a couple of unused local variables that gcc complained about. I 
refactored gistFetchTuple a bit, because it was doing IMHO quite bogus 
things with NULLs. It was passing NULLs to the opclass' fetch function, 
but it didn't pass the isNull flag correctly. I changed it so that the 
fetch function is not called at all for NULLs.


I think this is pretty close to being committable. I'll make a round of 
editorializing over the docs, and the code comments as well.


The opr_sanity regression test is failing, there's apparently something 
wrong with the pg_proc entries of the *canreturn functions. I haven't 
looked into that yet; could you fix that?


- Heikki

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index db2a452..53750da 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1404,6 +1404,14 @@ initGISTstate(Relation index)
 		else
 			giststate-distanceFn[i].fn_oid = InvalidOid;
 
+		/* opclasses are not required to provide a Fetch method */
+		if (OidIsValid(index_getprocid(index, i + 1, GIST_FETCH_PROC)))
+			fmgr_info_copy((giststate-fetchFn[i]),
+		 index_getprocinfo(index, i + 1, GIST_FETCH_PROC),
+		   scanCxt);
+		else
+			giststate-fetchFn[i].fn_oid = InvalidOid;
+
 		/*
 		 * If the index column has a specified collation, we should honor that
 		 * while doing comparisons.  However, we may have a collatable storage
@@ -1426,6 +1434,22 @@ initGISTstate(Relation index)
 	return giststate;
 }
 
+/*
+ * Gistcanreturn is supposed to be true if ANY FetchFn method is defined.
+ * If FetchFn exists it would be used in index-only scan
+ * Thus the responsibility rests with the opclass developer.
+ */
+
+Datum
+gistcanreturn(PG_FUNCTION_ARGS) {
+	Relation index = (Relation) PG_GETARG_POINTER(0);
+	int i = PG_GETARG_INT32(1);
+	if (OidIsValid(index_getprocid(index, i+1, GIST_FETCH_PROC)))
+		PG_RETURN_BOOL(true);
+	else
+		PG_RETURN_BOOL(false);
+}
+
 void
 freeGISTstate(GISTSTATE *giststate)
 {
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 717cb85..80532c8 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -229,6 +229,8 @@ gistindex_keytest(IndexScanDesc scan,
  * we're doing a plain or ordered indexscan.  For a plain indexscan, heap
  * tuple TIDs are returned into so-pageData[].  For an ordered indexscan,
  * heap tuple TIDs are pushed into individual search queue items.
+ * If index-only scan is possible, heap tuples themselves are returned
+ * into so-pageData or into search queue.
  *
  * If we detect that the index page has split since we saw its downlink
  * in the parent, we push its new right sibling onto the queue so the
@@ -241,6 +243,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 	GISTScanOpaque so = (GISTScanOpaque) scan-opaque;
 	Buffer		buffer;
 	Page		page;
+	GISTSTATE *giststate = so-giststate;
+	Relation r = scan-indexRelation;
 	GISTPageOpaque opaque;
 	OffsetNumber maxoff;
 	OffsetNumber i;
@@ -288,6 +292,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 	}
 
 	so-nPageData = so-curPageData = 0;
+	if (so-pageDataCxt)
+		MemoryContextReset(so-pageDataCxt);
 
 	/*
 	 * check all tuples on page
@@ -326,10 +332,20 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 		else if (scan-numberOfOrderBys == 0  GistPageIsLeaf(page))
 		{
 			/*
-			 * Non-ordered scan, so report heap tuples in so-pageData[]
+			 * Non-ordered scan, so report tuples in so-pageData[]
 			 */
 			so-pageData[so-nPageData].heapPtr = it-t_tid;
 			so-pageData[so-nPageData].recheck = recheck;
+
+			/*
+			 * If index-only scan is possible add the data fetched from index.
+			 */
+			if (scan-xs_want_itup)
+			{
+oldcxt =

Re: [HACKERS] Index-only scans for GiST.

2015-02-27 Thread Anastasia Lubennikova

I add MemoryContext listCxt to avoid memory leak. listCxt is created once
in gistrescan (only for index-only scan plan ) and reseted when scan of the
leaf page is finished.

I do not sure if the problem was completely solved, so I wait for feedback.

* What's the reason for turning GISTScanOpaqueData.pageData from an array
to a List?

This array is field of structure GISTScanOpaqueData. Memory for that
structure allocated in function gistbeginscan(). The array is static so
it's declared only one time in structure:
GISTSearchHeapItem  pageData [BLCKSZ/sizeof(IndexTupleData)]

But how could we know size of array if we don't know what data would be
returned? I mean type and amount.

I asked Alexander about that and he offered me to use List instead of Array.


indexonlyscan_gist_2.2.patch
Description: Binary data


indexonlyscan_gist_docs.patch
Description: Binary data


test_performance.sql
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans for GiST.

2015-02-12 Thread Anastasia Lubennikova

Thanks for answer.
Now it seems to be applied correctly.

2015-02-12 3:12 GMT+04:00 Thom Brown t...@linux.com:

 On 11 February 2015 at 22:50, Anastasia Lubennikova 
 lubennikov...@gmail.com wrote:

 Finally there is a new version of patch (in attachments).
 It provides multicolumn index-only scan for GiST indexes.

 - Memory leak is fixed.
 - little code cleanup
 - example of performance test in attachmens
 - function OIDs have debugging values (*) just to avoid merge
 conflicts while testing patch

 Wiki page of the project is

 https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014

 Waiting for feedback.


 Hi Anastasia.  Thanks for the updated patch.  I've just tried applying it
 to head and it doesn't appear to apply cleanly.

 $ patch -p1  ~/Downloads/indexonlyscan_gist_2.0.patch
 (Stripping trailing CRs from patch; use --binary to disable.)
 patching file src/backend/access/gist/gist.c
 Hunk #1 succeeded at 1404 (offset 9 lines).
 Hunk #2 succeeded at 1434 (offset 9 lines).
 (Stripping trailing CRs from patch; use --binary to disable.)
 patching file src/backend/access/gist/gistget.c
 Hunk #1 succeeded at 227 (offset 1 line).
 Hunk #2 succeeded at 243 (offset 1 line).
 Hunk #3 succeeded at 293 (offset -4 lines).
 Hunk #4 succeeded at 330 (offset -4 lines).
 Hunk #5 succeeded at 365 (offset -5 lines).
 Hunk #6 succeeded at 444 (offset -27 lines).
 Hunk #7 succeeded at 474 (offset -27 lines).
 Hunk #8 FAILED at 518.
 Hunk #9 succeeded at 507 (offset -28 lines).
 Hunk #10 succeeded at 549 with fuzz 1 (offset -28 lines).
 Hunk #11 FAILED at 601.
 2 out of 11 hunks FAILED -- saving rejects to file
 src/backend/access/gist/gistget.c.rej
 ...

 --
 Thom




-- 
Best regards,
Lubennikova Anastasia
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index db2a452..53750da 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1404,6 +1404,14 @@ initGISTstate(Relation index)
 		else
 			giststate-distanceFn[i].fn_oid = InvalidOid;
 
+		/* opclasses are not required to provide a Fetch method */
+		if (OidIsValid(index_getprocid(index, i + 1, GIST_FETCH_PROC)))
+			fmgr_info_copy((giststate-fetchFn[i]),
+		 index_getprocinfo(index, i + 1, GIST_FETCH_PROC),
+		   scanCxt);
+		else
+			giststate-fetchFn[i].fn_oid = InvalidOid;
+
 		/*
 		 * If the index column has a specified collation, we should honor that
 		 * while doing comparisons.  However, we may have a collatable storage
@@ -1426,6 +1434,22 @@ initGISTstate(Relation index)
 	return giststate;
 }
 
+/*
+ * Gistcanreturn is supposed to be true if ANY FetchFn method is defined.
+ * If FetchFn exists it would be used in index-only scan
+ * Thus the responsibility rests with the opclass developer.
+ */
+
+Datum
+gistcanreturn(PG_FUNCTION_ARGS) {
+	Relation index = (Relation) PG_GETARG_POINTER(0);
+	int i = PG_GETARG_INT32(1);
+	if (OidIsValid(index_getprocid(index, i+1, GIST_FETCH_PROC)))
+		PG_RETURN_BOOL(true);
+	else
+		PG_RETURN_BOOL(false);
+}
+
 void
 freeGISTstate(GISTSTATE *giststate)
 {
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 717cb85..0925e56 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -227,8 +227,10 @@ gistindex_keytest(IndexScanDesc scan,
  * If tbm/ntids aren't NULL, we are doing an amgetbitmap scan, and heap
  * tuples should be reported directly into the bitmap.  If they are NULL,
  * we're doing a plain or ordered indexscan.  For a plain indexscan, heap
- * tuple TIDs are returned into so-pageData[].  For an ordered indexscan,
+ * tuple TIDs are returned into so-pageData. For an ordered indexscan,
  * heap tuple TIDs are pushed into individual search queue items.
+ * If index-only scan is possible, heap tuples themselves are returned
+ * into so-pageData or into search queue.
  *
  * If we detect that the index page has split since we saw its downlink
  * in the parent, we push its new right sibling onto the queue so the
@@ -241,6 +243,10 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 	GISTScanOpaque so = (GISTScanOpaque) scan-opaque;
 	Buffer		buffer;
 	Page		page;
+	GISTSTATE *giststate = so-giststate;
+	Relation r = scan-indexRelation;
+	boolisnull[INDEX_MAX_KEYS];
+	GISTSearchHeapItem *tmpListItem;
 	GISTPageOpaque opaque;
 	OffsetNumber maxoff;
 	OffsetNumber i;
@@ -287,8 +293,6 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 		MemoryContextSwitchTo(oldcxt);
 	}
 
-	so-nPageData = so-curPageData = 0;
-
 	/*
 	 * check all tuples on page
 	 */
@@ -326,11 +330,20 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 		else if (scan-numberOfOrderBys == 0  GistPageIsLeaf(page))
 		{
 			/*
-			 * Non-ordered scan, so report heap tuples in so-pageData[]
+			 * Non-ordered scan, so report tuples in so-pageData
+			 */
+
+			/* form tmpListItem and

Re: [HACKERS] Index-only scans for GiST.

2015-02-12 Thread Thom Brown

On 12 February 2015 at 10:40, Anastasia Lubennikova lubennikov...@gmail.com
 wrote:

 Thanks for answer.
 Now it seems to be applied correctly.


(please avoid top-posting)

Thanks for the updated patch.  I can confirm that it now cleanly applies
and compiles fine.  I've run the tested in the SQL file you provided, and
it's behaving as expected.

Thom

Re: [HACKERS] Index-only scans for GiST.

2015-02-12 Thread Thom Brown

On 12 February 2015 at 16:40, Heikki Linnakangas hlinnakan...@vmware.com
wrote:

 On 02/12/2015 12:40 PM, Anastasia Lubennikova wrote:

 Thanks for answer.
 Now it seems to be applied correctly.


 * Documentation is missing.


Anastasia provided a documentation patch in the first email.

Thom

Re: [HACKERS] Index-only scans for GiST.

2015-02-12 Thread Fabrízio de Royes Mello

On Thu, Feb 12, 2015 at 3:07 PM, Thom Brown t...@linux.com wrote:

 On 12 February 2015 at 16:40, Heikki Linnakangas hlinnakan...@vmware.com
wrote:

 On 02/12/2015 12:40 PM, Anastasia Lubennikova wrote:

 Thanks for answer.
 Now it seems to be applied correctly.


 * Documentation is missing.


 Anastasia provided a documentation patch in the first email.


Yeah, but it's a good practice send all the patches together. Will make the
life of the reviewers better ;-)

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello

Re: [HACKERS] Index-only scans for GiST.

2015-02-12 Thread Heikki Linnakangas


On 02/12/2015 12:40 PM, Anastasia Lubennikova wrote:

Thanks for answer.
Now it seems to be applied correctly.


Thanks, it would be great to get this completed.

This still leaks memory with the same test query as earlier. The leak 
seems to be into a different memory context now; it used to be to the 
GiST scan context, but now it's to the ExecutorState context. See 
attached patch which makes the leak evident.


Looking at my previous comments from August:

* What's the reason for turning GISTScanOpaqueData.pageData from an
array to a List?

* Documentation is missing.

- Heikki
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 0925e56..3768c9c 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -526,6 +526,12 @@ gistgettuple(PG_FUNCTION_ARGS)
  * It's always head of so-pageData
  */
 so-pageData = list_delete_first(so-pageData);
+{
+	static int lastreport = 0;
+	if ((lastreport++) % 1 == 0)
+		MemoryContextStats(CurrentMemoryContext);
+}
+
 PG_RETURN_BOOL(TRUE);
 			}
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans for GiST.

2015-02-11 Thread Thom Brown

On 11 February 2015 at 22:50, Anastasia Lubennikova lubennikov...@gmail.com
 wrote:

 Finally there is a new version of patch (in attachments).
 It provides multicolumn index-only scan for GiST indexes.

 - Memory leak is fixed.
 - little code cleanup
 - example of performance test in attachmens
 - function OIDs have debugging values (*) just to avoid merge
 conflicts while testing patch

 Wiki page of the project is

 https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014

 Waiting for feedback.


Hi Anastasia.  Thanks for the updated patch.  I've just tried applying it
to head and it doesn't appear to apply cleanly.

$ patch -p1  ~/Downloads/indexonlyscan_gist_2.0.patch
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/backend/access/gist/gist.c
Hunk #1 succeeded at 1404 (offset 9 lines).
Hunk #2 succeeded at 1434 (offset 9 lines).
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/backend/access/gist/gistget.c
Hunk #1 succeeded at 227 (offset 1 line).
Hunk #2 succeeded at 243 (offset 1 line).
Hunk #3 succeeded at 293 (offset -4 lines).
Hunk #4 succeeded at 330 (offset -4 lines).
Hunk #5 succeeded at 365 (offset -5 lines).
Hunk #6 succeeded at 444 (offset -27 lines).
Hunk #7 succeeded at 474 (offset -27 lines).
Hunk #8 FAILED at 518.
Hunk #9 succeeded at 507 (offset -28 lines).
Hunk #10 succeeded at 549 with fuzz 1 (offset -28 lines).
Hunk #11 FAILED at 601.
2 out of 11 hunks FAILED -- saving rejects to file
src/backend/access/gist/gistget.c.rej
...

-- 
Thom

[HACKERS] Index-only scans for GiST.

2015-02-11 Thread Anastasia Lubennikova

Finally there is a new version of patch (in attachments).
It provides multicolumn index-only scan for GiST indexes.

- Memory leak is fixed.
- little code cleanup
- example of performance test in attachmens
- function OIDs have debugging values (*) just to avoid merge conflicts
while testing patch

Wiki page of the project is
https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014

Waiting for feedback.
-- 
Best regards,
Lubennikova Anastasia


indexonlyscan_gist_2.0.patch
Description: Binary data


test_performance.sql
Description: Binary data


indexonlyscan_gist_docs.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans for GIST

2014-10-27 Thread Thom Brown

On 18 August 2014 09:05, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 On 08/17/2014 07:15 PM, Anastasia Lubennikova wrote:

 2014-08-07 0:30 GMT+04:00 Heikki Linnakangas hlinnakan...@vmware.com:

 * I'm getting two regression failures with this (opr_sanity and join).



 opr_sanity failure is corrected.
 But there is remain question with join.
 I check the latest version of my github repo and there's no fail in join
 regression test
 All 145 tests passed.
 To tell the truth, I don't understand which changes could led to this
 failure.
 Could you show me regression diffs?


 Sure, here you go. It seems like a change in a plan. At a quick glance it
 seems harmless: the new plan is identical except that the left and right
 side of a join have been reversed. But I don't understand either why this
 patch would change that, so it needs to be investigated.

  * The regression test queries that use LIMIT are not guaranteed to always

 return the same rows, hence they're not very good regression test cases.
 I'd suggest using more restricting WHERE clauses, so that each query only
 returns a handful of rows.


 Thank you for comment, I rewrote wrong queries. But could you explain why
 LIMIT queries may return different results? Is it happens because of
 different query optimization?


 Imagine that you have a table with two rows, A and B. If you run a query
 like SELECT * FROM table LIMIT 1, the system can legally return either
 row A or B, because there's no ORDER BY.

 Now, admittedly you have a similar problem even without the LIMIT - the
 system can legally return the rows in either order - but it's less of an
 issue because at least you can quickly see from the diff that the result
 set is in fact the same, the rows are just in different order. You could
 fix that by adding an ORDER BY to all test queries, but we haven't done
 that in the regression suite because then we would not have any test
 coverage for cases where you don't have an ORDER BY. As a compromise, test
 cases are usually written without an ORDER BY, but if e.g. the buildfarm
 starts failing because of differences in the result set order across
 platforms, then we add an ORDER BY to make it stable.

  * I think it's leaking memory, in GIST scan context. I tested this with a

 variant of the regression tests:

 insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i,
 0.05*i)),
   point(0.05*i, 0.05*i) FROM generate_series(0,
 1000) as i;
 CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p);

 set enable_seqscan=off;
 set enable_bitmapscan=off;

 explain analyze  select p from gist_tbl where p @ box(point(0,0),
 point(999,999)) and length(p::text)  10;

 while the final query runs, 'top' shows constantly increasing memory
 usage.


 I don't think it's memory leak. After some increasing, memory using remain
 the same. It works similar without using indexonlyscan.


 No, it's definitely a leak caused by the patch. Test with the attached
 patch, which prints out to the server log the amount of memory used by the
 GiST scan memory context every 1 rows. It clearly shows increasing
 memory usage all the way to the end of the query.

 It's cleaned up at the end of the query, but that's not good enough
 because for a large query you might accumulate gigabytes of leaked memory
 until the query has finished. If you (manually) apply the same patch to git
 master, you'll see that the memory usage stays consistent and small.


Hi Anastasia,

Do you have time to address the issues brought up in Heikki's review?  It
would be good if we could your work into PostgreSQL 9.5.

Thanks

Thom

Re: [HACKERS] Index-only scans for GIST

2014-08-18 Thread Anastasia Lubennikova

Updated patch
* Compiler, merge and regression fails checked
* Regression tests was impoved
* GiST and amcanreturn docs updated
-- 
Best regards,
Lubennikova Anastasia


indexonlyscan_gist2.patch
Description: Binary data


indexonlyscan_gist_docs.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans for GIST

2014-08-18 Thread Heikki Linnakangas


On 08/17/2014 07:15 PM, Anastasia Lubennikova wrote:

2014-08-07 0:30 GMT+04:00 Heikki Linnakangas hlinnakan...@vmware.com:

* I'm getting two regression failures with this (opr_sanity and join).




opr_sanity failure is corrected.
But there is remain question with join.
I check the latest version of my github repo and there's no fail in join
regression test
All 145 tests passed.
To tell the truth, I don't understand which changes could led to this
failure.
Could you show me regression diffs?


Sure, here you go. It seems like a change in a plan. At a quick glance 
it seems harmless: the new plan is identical except that the left and 
right side of a join have been reversed. But I don't understand either 
why this patch would change that, so it needs to be investigated.



* The regression test queries that use LIMIT are not guaranteed to always

return the same rows, hence they're not very good regression test cases.
I'd suggest using more restricting WHERE clauses, so that each query only
returns a handful of rows.


Thank you for comment, I rewrote wrong queries. But could you explain why
LIMIT queries may return different results? Is it happens because of
different query optimization?


Imagine that you have a table with two rows, A and B. If you run a query 
like SELECT * FROM table LIMIT 1, the system can legally return either 
row A or B, because there's no ORDER BY.


Now, admittedly you have a similar problem even without the LIMIT - the 
system can legally return the rows in either order - but it's less of an 
issue because at least you can quickly see from the diff that the result 
set is in fact the same, the rows are just in different order. You could 
fix that by adding an ORDER BY to all test queries, but we haven't done 
that in the regression suite because then we would not have any test 
coverage for cases where you don't have an ORDER BY. As a compromise, 
test cases are usually written without an ORDER BY, but if e.g. the 
buildfarm starts failing because of differences in the result set order 
across platforms, then we add an ORDER BY to make it stable.



* I think it's leaking memory, in GIST scan context. I tested this with a

variant of the regression tests:

insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i,
0.05*i)),
  point(0.05*i, 0.05*i) FROM generate_series(0,
1000) as i;
CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p);

set enable_seqscan=off;
set enable_bitmapscan=off;

explain analyze  select p from gist_tbl where p @ box(point(0,0),
point(999,999)) and length(p::text)  10;

while the final query runs, 'top' shows constantly increasing memory usage.


I don't think it's memory leak. After some increasing, memory using remain
the same. It works similar without using indexonlyscan.


No, it's definitely a leak caused by the patch. Test with the attached 
patch, which prints out to the server log the amount of memory used by 
the GiST scan memory context every 1 rows. It clearly shows 
increasing memory usage all the way to the end of the query.


It's cleaned up at the end of the query, but that's not good enough 
because for a large query you might accumulate gigabytes of leaked 
memory until the query has finished. If you (manually) apply the same 
patch to git master, you'll see that the memory usage stays consistent 
and small.


- Heikki

*** /home/heikki/git-sandbox/postgresql/src/test/regress/expected/join.out  
2014-08-18 09:38:55.146171394 +0300
--- /home/heikki/git-sandbox/postgresql/src/test/regress/results/join.out   
2014-08-18 10:28:30.326491898 +0300
***
*** 2579,2590 
  ---
   Sort
 Sort Key: t1.q1, t1.q2
!-  Hash Left Join
!  Hash Cond: (t1.q2 = t2.q1)
   Filter: (1 = (SubPlan 1))
!  -  Seq Scan on int8_tbl t1
   -  Hash
!-  Seq Scan on int8_tbl t2
   SubPlan 1
 -  Limit
   -  Result
--- 2579,2590 
  ---
   Sort
 Sort Key: t1.q1, t1.q2
!-  Hash Right Join
!  Hash Cond: (t2.q1 = t1.q2)
   Filter: (1 = (SubPlan 1))
!  -  Seq Scan on int8_tbl t2
   -  Hash
!-  Seq Scan on int8_tbl t1
   SubPlan 1
 -  Limit
   -  Result
***
*** 3589,3603 
lateral (values(x.q1,y.q1,y.q2)) v(xq1,yq1,yq2);
  q1|q2 |q1|q2 
|   xq1|   yq1|yq2
  
--+---+--+---+--+--+---
-   123 |   456 |  |   
|  123 |  |  
-   123 |  4567890123456789 | 4567890123456789

Re: [HACKERS] Index-only scans for GIST

2014-08-17 Thread Anastasia Lubennikova

2014-08-07 0:30 GMT+04:00 Heikki Linnakangas hlinnakan...@vmware.com:

* I'm getting two regression failures with this (opr_sanity and join).


opr_sanity failure is corrected.
But there is remain question with join.
I check the latest version of my github repo and there's no fail in join
regression test
All 145 tests passed.
To tell the truth, I don't understand which changes could led to this
failure.
Could you show me regression diffs?
I want to understand what's wrong with the patch.

* The regression test queries that use LIMIT are not guaranteed to always
 return the same rows, hence they're not very good regression test cases.
 I'd suggest using more restricting WHERE clauses, so that each query only
 returns a handful of rows.


Thank you for comment, I rewrote wrong queries. But could you explain why
LIMIT queries may return different results? Is it happens because of
different query optimization?

* I think it's leaking memory, in GIST scan context. I tested this with a
 variant of the regression tests:

 insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i,
 0.05*i)),
  point(0.05*i, 0.05*i) FROM generate_series(0,
 1000) as i;
 CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p);

 set enable_seqscan=off;
 set enable_bitmapscan=off;

 explain analyze  select p from gist_tbl where p @ box(point(0,0),
 point(999,999)) and length(p::text)  10;

 while the final query runs, 'top' shows constantly increasing memory usage.


I don't think it's memory leak. After some increasing, memory using remain
the same. It works similar without using indexonlyscan.



-- 
Best regards,
Lubennikova Anastasia

Re: [HACKERS] Index-only scans for GIST

2014-08-06 Thread Heikki Linnakangas


On 08/01/2014 10:58 AM, Anastasia Lubennikova wrote:

Hi, hackers!
I work on a GSoC project Index-only scans for GIST
https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014

Repository is
https://github.com/lubennikovaav/postgres/tree/indexonlygist2
Patch is in attachments.


Thanks!

Some comments:

* I got this compiler warning:

gistget.c:556:5: warning: ISO C90 forbids mixed declarations and code 
[-Wdeclaration-after-statement]

 ListCell *tmpPageData = so-curPageData;
 ^

* I'm getting two regression failures with this (opr_sanity and join).

* After merging with master, build fails because of duplicate OIDs.

* The regression test queries that use LIMIT are not guaranteed to 
always return the same rows, hence they're not very good regression test 
cases. I'd suggest using more restricting WHERE clauses, so that each 
query only returns a handful of rows.


* What's the reason for turning GISTScanOpaqueData.pageData from an 
array to a List?


* I think it's leaking memory, in GIST scan context. I tested this with 
a variant of the regression tests:


insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i, 
0.05*i)),
 point(0.05*i, 0.05*i) FROM generate_series(0, 
1000) as i;

CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p);

set enable_seqscan=off;
set enable_bitmapscan=off;

explain analyze  select p from gist_tbl where p @ box(point(0,0), 
point(999,999)) and length(p::text)  10;


while the final query runs, 'top' shows constantly increasing memory usage.


It includes index-only scans for multicolumn GIST and new regression test.
Fetch() method is realized for box and point opclasses.


Can we have Fetch functions for all the datatypes in btree_gist contrib 
module, please? Do other contrib modules contain GiST opclasses that 
could have Fetch functions?



Documentation is not updated yet, but I'm going to do it till the end of
GSoC.

I've got one question about query with OR condition. It is the last query
in regression test gist_indexonly. It doesn't fail but it doensn't use
index-only scans. Could someone explain to me how it works?
It seems to depend on build_paths_for_OR
http://doxygen.postgresql.org/indxpath_8c.html#ae660d2e886355e53ed3b9ec693e4afd2
function.
But I couldn't understand how.


The query is:

select * from gist_tbl
where b @ box(point(5,5),  point(6,6))
or p @ box(point(0,0),  point(100,100)) limit 10;

It cannot use an index(-only) scan for this, because a single index scan 
can only return rows based on one key. In this case, you need to do two 
scans, and then return the rows returned by either scan, removing 
duplicates. A bitmap scan is possible, because it can remove the 
duplicates, but the planner can't produce a plain index scan plan that 
would do the same.


A common trick when that happens in a real-world application is to 
re-write the query using UNION:


select * from gist_tbl
where b @ box(point(5,5),  point(6,6))
UNION
select * from gist_tbl
where p @ box(point(0,0),  point(100,100))
limit 10;

Although that doesn't seem to actually work:

ERROR:  could not identify an equality operator for type box
LINE 1: select * from gist_tbl
   ^

but that's not your patch's fault, the same happens with unpatched master.

IOW, you don't need to worry about that case.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Index-only scans for GIST

2014-08-01 Thread Anastasia Lubennikova

Hi, hackers!
I work on a GSoC project Index-only scans for GIST
https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014

Repository is
https://github.com/lubennikovaav/postgres/tree/indexonlygist2
Patch is in attachments.

It includes index-only scans for multicolumn GIST and new regression test.
Fetch() method is realized for box and point opclasses.

Documentation is not updated yet, but I'm going to do it till the end of
GSoC.

I've got one question about query with OR condition. It is the last query
in regression test gist_indexonly. It doesn't fail but it doensn't use
index-only scans. Could someone explain to me how it works?
It seems to depend on build_paths_for_OR
http://doxygen.postgresql.org/indxpath_8c.html#ae660d2e886355e53ed3b9ec693e4afd2
function.
But I couldn't understand how.

-- 
Best regards,
Lubennikova Anastasia


indexonlyscan_gist.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans for GIST

2014-08-01 Thread Fabrízio de Royes Mello

On Fri, Aug 1, 2014 at 4:58 AM, Anastasia Lubennikova 
lubennikov...@gmail.com wrote:

 Hi, hackers!
 I work on a GSoC project Index-only scans for GIST

https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014

 Repository is
 https://github.com/lubennikovaav/postgres/tree/indexonlygist2
 Patch is in attachments.

 It includes index-only scans for multicolumn GIST and new regression test.
 Fetch() method is realized for box and point opclasses.

 Documentation is not updated yet, but I'm going to do it till the end of
GSoC.

 I've got one question about query with OR condition. It is the last query
in regression test gist_indexonly. It doesn't fail but it doensn't use
index-only scans. Could someone explain to me how it works?
 It seems to depend on build_paths_for_OR function. But I couldn't
understand how.


Very nice... please add your patch to the next commit fest [1].

Regards,


[1] https://commitfest.postgresql.org/action/commitfest_view?id=23

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog sobre TI: http://fabriziomello.blogspot.com
 Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello

Re: [HACKERS] Index-only scans for GIST

2014-08-01 Thread Anastasia Lubennikova

Thank you for comment
Patch is already added in Performance topic.


2014-08-01 20:25 GMT+04:00 Fabrízio de Royes Mello fabriziome...@gmail.com
:


 On Fri, Aug 1, 2014 at 4:58 AM, Anastasia Lubennikova 
 lubennikov...@gmail.com wrote:
 
  Hi, hackers!
  I work on a GSoC project Index-only scans for GIST
 
 https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014
 
  Repository is
  https://github.com/lubennikovaav/postgres/tree/indexonlygist2
  Patch is in attachments.
 
  It includes index-only scans for multicolumn GIST and new regression
 test.
  Fetch() method is realized for box and point opclasses.
 
  Documentation is not updated yet, but I'm going to do it till the end of
 GSoC.
 
  I've got one question about query with OR condition. It is the last
 query in regression test gist_indexonly. It doesn't fail but it doensn't
 use index-only scans. Could someone explain to me how it works?
  It seems to depend on build_paths_for_OR function. But I couldn't
 understand how.
 

 Very nice... please add your patch to the next commit fest [1].

 Regards,


 [1] https://commitfest.postgresql.org/action/commitfest_view?id=23

 --
 Fabrízio de Royes Mello
 Consultoria/Coaching PostgreSQL
  Timbira: http://www.timbira.com.br
  Blog sobre TI: http://fabriziomello.blogspot.com
  Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
  Twitter: http://twitter.com/fabriziomello




-- 
Best regards,
Lubennikova Anastasia

Re: [HACKERS] Index-only scans for multicolumn GIST

2014-07-23 Thread Emre Hasegeli

 That seems like a nonstarter :-(.  Index-only scans don't have a license
 to return approximations.  If we drop the behavior for circles, how much
 functionality do we have left?

It should work with exact operator classes, box_ops, point_ops,
range_ops, inet_ops.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans for multicolumn GIST

2014-07-22 Thread Heikki Linnakangas


On 07/21/2014 10:47 PM, Anastasia Lubennikova wrote:

Hi, hackers!
There are new results of my work on GSoC project Index-only scans for
GIST.
Previous post is here:
http://postgresql.1045698.n5.nabble.com/Index-only-scans-for-GIST-td5804892.html

Repository is
https://github.com/lubennikovaav/postgres/tree/indexonlygist2
Patch is in attachments.
It includes indexonlyscan for multicolumn GiST. It works correctly -
results are the same with another scan plans.
Fetch() method is realized for box and circle opclasses
Improvement for circle opclass is not such distinct as for box opclass,
because of recheck.


For a circle, the GiST index stores a bounding box of the circle. The 
new fetch function reverses that, calculating the radius and center of 
the circle from the bounding box.


Those conversions lose some precision due to rounding. Are we okay with 
that? Floating point math is always subject to rounding errors, but 
there's a good argument to be made that it's not acceptable to get a 
different value back when the user didn't explicitly invoke any math 
functions.


As an example:

create table circle_tbl (c circle);
create index circle_tbl_idx on circle_tbl using gist (c);
insert into circle_tbl values ('1.23456789012345,1.23456789012345,1e300');

postgres=# set enable_seqscan=off; set enable_bitmapscan=off; set 
enable_indexonlyscan=on;

SET
SET
SET
postgres=# select * from circle_tbl ;
   c

 (0,0),1e+300
(1 row)

postgres=# set enable_indexonlyscan=off;
SET
postgres=# select * from circle_tbl ;
  c
--
 (1.23456789012345,1.23456789012345),1e+300
(1 row)

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans for multicolumn GIST

2014-07-22 Thread Robert Haas

On Tue, Jul 22, 2014 at 2:55 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 07/21/2014 10:47 PM, Anastasia Lubennikova wrote:

 Hi, hackers!
 There are new results of my work on GSoC project Index-only scans for
 GIST.
 Previous post is here:

 http://postgresql.1045698.n5.nabble.com/Index-only-scans-for-GIST-td5804892.html

 Repository is
 https://github.com/lubennikovaav/postgres/tree/indexonlygist2
 Patch is in attachments.
 It includes indexonlyscan for multicolumn GiST. It works correctly -
 results are the same with another scan plans.
 Fetch() method is realized for box and circle opclasses
 Improvement for circle opclass is not such distinct as for box opclass,
 because of recheck.


 For a circle, the GiST index stores a bounding box of the circle. The new
 fetch function reverses that, calculating the radius and center of the
 circle from the bounding box.

 Those conversions lose some precision due to rounding. Are we okay with
 that? Floating point math is always subject to rounding errors, but there's
 a good argument to be made that it's not acceptable to get a different value
 back when the user didn't explicitly invoke any math functions.

 As an example:

 create table circle_tbl (c circle);
 create index circle_tbl_idx on circle_tbl using gist (c);
 insert into circle_tbl values ('1.23456789012345,1.23456789012345,1e300');

 postgres=# set enable_seqscan=off; set enable_bitmapscan=off; set
 enable_indexonlyscan=on;
 SET
 SET
 SET
 postgres=# select * from circle_tbl ;
c
 
  (0,0),1e+300
 (1 row)

 postgres=# set enable_indexonlyscan=off;
 SET
 postgres=# select * from circle_tbl ;
   c
 --
  (1.23456789012345,1.23456789012345),1e+300
 (1 row)

I really don't think it's ever OK for a query to produce different
answers depending on which plan is chosen.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans for multicolumn GIST

2014-07-22 Thread Tom Lane

Heikki Linnakangas hlinnakan...@vmware.com writes:
 For a circle, the GiST index stores a bounding box of the circle. The 
 new fetch function reverses that, calculating the radius and center of 
 the circle from the bounding box.

 Those conversions lose some precision due to rounding. Are we okay with 
 that?

That seems like a nonstarter :-(.  Index-only scans don't have a license
to return approximations.  If we drop the behavior for circles, how much
functionality do we have left?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Index-only scans for multicolumn GIST

2014-07-21 Thread Anastasia Lubennikova

Hi, hackers!
There are new results of my work on GSoC project Index-only scans for
GIST.
Previous post is here:
http://postgresql.1045698.n5.nabble.com/Index-only-scans-for-GIST-td5804892.html

Repository is
https://github.com/lubennikovaav/postgres/tree/indexonlygist2
Patch is in attachments.
It includes indexonlyscan for multicolumn GiST. It works correctly -
results are the same with another scan plans.
Fetch() method is realized for box and circle opclasses
Improvement for circle opclass is not such distinct as for box opclass,
because of recheck.

I remember that all elog and other bad comments must be fixed before this
patch can be committed.

Any comments are welcome
-- 
Best regards,
Lubennikova Anastasia


indexonlygist_2.1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans and non-MVCC snapshots

2014-06-27 Thread Andres Freund

On 2014-06-26 22:47:47 -0600, Ryan Johnson wrote:
 Hi,
 
 As part of a research project, I'm trying to change Read Committed isolation
 to use HeapTupleSatisfiesNow rather than acquiring a new snapshot at every
 command [1]. Things appear to have gone reasonably well so far, except
 certain queries fail with ERROR:  non-MVCC snapshots are not supported in
 index-only scans.

You're aware that unless you employ additional locking you can simply
miss individual rows or see them several times because of concurrent
updates?
The reason it has worked ( 9.4) for system catalogs is that updates of
rows were only performed while objects were locked access exclusively -
that's the reason why some places in the code use inplace updates btw...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans and non-MVCC snapshots

2014-06-27 Thread Alvaro Herrera

Ryan Johnson wrote:
 On 26/06/2014 11:04 PM, Alvaro Herrera wrote:
 Ryan Johnson wrote:
 As part of a research project, I'm trying to change Read Committed
 isolation to use HeapTupleSatisfiesNow rather than acquiring a new
 snapshot at every command [1].
 Are you aware of this?
 
 commit 813fb0315587d32e3b77af1051a0ef517d187763
 Author: Robert Haas rh...@postgresql.org
 Date:   Thu Aug 1 10:46:19 2013 -0400
 
  Remove SnapshotNow and HeapTupleSatisfiesNow.

 That would be wonderful news... if snapshots weren't so darned
 expensive to create.

I take it you aren't aware of this other effort, either:
http://archives.postgresql.org/message-id/539ad153.9000...@vmware.com

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans and non-MVCC snapshots

2014-06-27 Thread Ryan Johnson


On 27/06/2014 3:14 AM, Andres Freund wrote:

On 2014-06-26 22:47:47 -0600, Ryan Johnson wrote:

Hi,

As part of a research project, I'm trying to change Read Committed isolation
to use HeapTupleSatisfiesNow rather than acquiring a new snapshot at every
command [1]. Things appear to have gone reasonably well so far, except
certain queries fail with ERROR:  non-MVCC snapshots are not supported in
index-only scans.

You're aware that unless you employ additional locking you can simply
miss individual rows or see them several times because of concurrent
updates?
The reason it has worked ( 9.4) for system catalogs is that updates of
rows were only performed while objects were locked access exclusively -
that's the reason why some places in the code use inplace updates btw...
Yes, I was aware of the need for locking. The documentation just made it 
sound that locking was already in place for non-MVCC index scans. I was 
hoping I'd missed some easy way to activate it.


Regards,
Ryan



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans and non-MVCC snapshots

2014-06-27 Thread Andres Freund

On 2014-06-27 08:39:13 -0600, Ryan Johnson wrote:
 On 27/06/2014 3:14 AM, Andres Freund wrote:
 On 2014-06-26 22:47:47 -0600, Ryan Johnson wrote:
 Hi,
 
 As part of a research project, I'm trying to change Read Committed isolation
 to use HeapTupleSatisfiesNow rather than acquiring a new snapshot at every
 command [1]. Things appear to have gone reasonably well so far, except
 certain queries fail with ERROR:  non-MVCC snapshots are not supported in
 index-only scans.
 You're aware that unless you employ additional locking you can simply
 miss individual rows or see them several times because of concurrent
 updates?
 The reason it has worked ( 9.4) for system catalogs is that updates of
 rows were only performed while objects were locked access exclusively -
 that's the reason why some places in the code use inplace updates btw...

 Yes, I was aware of the need for locking. The documentation just made it
 sound that locking was already in place for non-MVCC index scans. I was
 hoping I'd missed some easy way to activate it.

Well, it is/was for the places (i.e. DDL) that actually use non-MVCC
scans.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans and non-MVCC snapshots

2014-06-27 Thread Ryan Johnson


On 27/06/2014 8:20 AM, Alvaro Herrera wrote:

Ryan Johnson wrote:

On 26/06/2014 11:04 PM, Alvaro Herrera wrote:

Ryan Johnson wrote:

As part of a research project, I'm trying to change Read Committed
isolation to use HeapTupleSatisfiesNow rather than acquiring a new
snapshot at every command [1].

Are you aware of this?

commit 813fb0315587d32e3b77af1051a0ef517d187763
Author: Robert Haas rh...@postgresql.org
Date:   Thu Aug 1 10:46:19 2013 -0400

 Remove SnapshotNow and HeapTupleSatisfiesNow.

That would be wonderful news... if snapshots weren't so darned
expensive to create.

I take it you aren't aware of this other effort, either:
http://archives.postgresql.org/message-id/539ad153.9000...@vmware.com
That is good news, though from reading the thread it sounds like proc 
array accesses are being exchanged for accesses to an SLRU, so a lot of 
lwlock calls will remain. It will definitely help, though. SLRU will get 
ex-locked a lot less often, so the main source of contention will be for 
the actual lwlock acquire/release operations.


Regards,
Ryan


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Index-only scans and non-MVCC snapshots

2014-06-26 Thread Ryan Johnson


Hi,

As part of a research project, I'm trying to change Read Committed 
isolation to use HeapTupleSatisfiesNow rather than acquiring a new 
snapshot at every command [1]. Things appear to have gone reasonably 
well so far, except certain queries fail with ERROR:  non-MVCC 
snapshots are not supported in index-only scans.


I'm using v9.3.2, and the docs claim that index-only scans work without 
MVCC, but require some extra locking to avoid races [2]. Is this not 
actually implemented? If that is the case, shouldn't the query optimizer 
avoid selecting index-only scans for non-MVCC snapshots?


I realize I'm playing with fire here, but any pointers to sections of 
code I might look at to either work around or fix this issue would be 
greatly appreciated. I've been looking around in index_fetch_heap 
(indexam.c) as well as other locations that use scan-xs_continue_hot; 
there seems to be code in place to detect when a non-MVCC snapshot is in 
use, as if that were nothing out of the ordinary, but nothing prevents 
the error from arising if a hot chain is actually encountered.


Thanks,
Ryan

[1] Right now, Read Committed is significantly *slower* than Repeatable 
Read---for transactions involving multiple short commands---because the 
former acquires multiple snapshots per transaction and causes a lwlock 
bottleneck on my 12-core machine.


[2] http://www.postgresql.org/docs/9.3/static/index-locking.html:
with a non-MVCC-compliant snapshot (such as SnapshotNow), it would be 
possible to accept and return a row that does not in fact match the 
scan keys ... [so] we use a pin on an index page as a proxy to 
indicate that the reader might still be in flight from the index 
entry to the matching heap entry. Making ambulkdelete block on such a 
pin ensures that VACUUM cannot delete the heap entry before the reader 
is done with it. ... This solution requires that index scans be 
synchronous: we have to fetch each heap tuple immediately after 
scanning the corresponding index entry. This is expensive for a number 
of reasons. An asynchronous scan in which we collect many TIDs from 
the index, and only visit the heap tuples sometime later, requires 
much less index locking overhead and can allow a more efficient heap 
access pattern. Per the above analysis, we must use the synchronous 
approach for non-MVCC-compliant snapshots. 




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans and non-MVCC snapshots

2014-06-26 Thread Alvaro Herrera

Ryan Johnson wrote:

 As part of a research project, I'm trying to change Read Committed
 isolation to use HeapTupleSatisfiesNow rather than acquiring a new
 snapshot at every command [1].

Are you aware of this?

commit 813fb0315587d32e3b77af1051a0ef517d187763
Author: Robert Haas rh...@postgresql.org
Date:   Thu Aug 1 10:46:19 2013 -0400

Remove SnapshotNow and HeapTupleSatisfiesNow.

We now use MVCC catalog scans, and, per discussion, have eliminated
all other remaining uses of SnapshotNow, so that we can now get rid of
it.  This will break third-party code which is still using it, which
is intentional, as we want such code to be updated to do things the
new way.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans and non-MVCC snapshots

2014-06-26 Thread Ryan Johnson


On 26/06/2014 11:04 PM, Alvaro Herrera wrote:

Ryan Johnson wrote:

As part of a research project, I'm trying to change Read Committed
isolation to use HeapTupleSatisfiesNow rather than acquiring a new
snapshot at every command [1].

Are you aware of this?

commit 813fb0315587d32e3b77af1051a0ef517d187763
Author: Robert Haas rh...@postgresql.org
Date:   Thu Aug 1 10:46:19 2013 -0400

 Remove SnapshotNow and HeapTupleSatisfiesNow.
That would be wonderful news... if snapshots weren't so darned expensive 
to create.


I guess there's no avoiding that bottleneck now, though.

Ryan


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index-only scans for GIST

2014-05-29 Thread Robert Haas

On Sun, May 25, 2014 at 6:12 AM, Anastasia Lubennikova
lubennikov...@gmail.com wrote:
 Hi, hackers!
 There are first results of my work on GSoC project Index-only scans for
 GIST.

Cool.

 1. Version of my project code is in forked repository
 https://github.com/lubennikovaav/postgres/tree/indexonlygist2
 Patch is in attachments
 - This version is only for one-column indexes

That's probably a limitation that needs to be fixed before this can be
committed.

 - fetch() method is realized only for box opclass (because it's trivial)

That might not need to be fixed before this can be committed.

 2. I test Index-only scans with SQL script box_test.sql
 and it works really faster. (results in box_test_out)

 I'll be glad to get your feedback about this feature.

Since this is a GSoC project, it would be nice if one of the people
who is knowledgeable about GIST (Heikki, Alexander, etc.) could weigh
in on this before too much time goes by, so that Anastasia can press
forward with this work.

I don't know enough to offer too many substantive comments, but I
think you should remove all of the //-style comments (most of which
are debugging leftovers) and add some more comments describing what
you're actually doing and, more importantly, why.

This comment doesn't appear to make sense:

+   /*
+* The offset number on tuples on internal pages is unused.
For historical
+* reasons, it is set 0x.
+*/

The reason this doesn't make sense is because the tuple in question is
not on an internal page, or indeed any page at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Index-only scans for GIST

2014-05-25 Thread Anastasia Lubennikova

Hi, hackers!
There are first results of my work on GSoC project Index-only scans for
GIST.

1. Version of my project code is in forked repository
https://github.com/lubennikovaav/postgres/tree/indexonlygist2
Patch is in attachments
- This version is only for one-column indexes
- fetch() method is realized only for box opclass (because it's trivial)

2. I test Index-only scans with SQL script box_test.sql
and it works really faster. (results in box_test_out)

I'll be glad to get your feedback about this feature.

-- 
Best regards,
Lubennikova Anastasia


indexonlygist_2.0.patch
Description: Binary data


box_test.sql
Description: Binary data


box_test_out.out
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index only scans wiki page

2012-11-15 Thread Peter Geoghegan

On 13 November 2012 01:25, Peter Geoghegan pe...@2ndquadrant.com wrote:
 Attached is a piece I wrote on the feature. That might form the basis
 of a new wiki page.

Since no one else moved on this, I've completely replaced the existing
page with the contents of the user-orientated document I posted. I
don't believe that any of the information that has been removed was of
general interest, since it only existed as a developer-orientated
page. If anyone thinks that I shouldn't have removed everything that
was there, or indeed who would like to change what I've added, well,
it's a wiki.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index only scans wiki page

2012-11-13 Thread Robert Haas

On Mon, Nov 12, 2012 at 8:25 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 13 November 2012 01:03, Jeff Janes jeff.ja...@gmail.com wrote:
 https://wiki.postgresql.org/wiki/Index-only_scans

 This page is seriously out of date.  It suggests they are not yet
 implemented, but only being talked about.

 Attached is a piece I wrote on the feature. That might form the basis
 of a new wiki page. Feel free to incorporate this material as you see
 fit.

I found this an interesting read.  As one of the people who worked on
the feature, I'm sort of curious whether people have any experience
yet with how this actually shakes out in the field.  Are you (or is
anyone) aware of positive/negative field experiences with this
feature?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index only scans wiki page

2012-11-13 Thread Peter Geoghegan

On 13 November 2012 16:37, Robert Haas robertmh...@gmail.com wrote:
 I found this an interesting read.  As one of the people who worked on
 the feature, I'm sort of curious whether people have any experience
 yet with how this actually shakes out in the field.  Are you (or is
 anyone) aware of positive/negative field experiences with this
 feature?

Unfortunately, I don't think that I have any original insight about
the problems with index-only scans in the field right now.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index only scans wiki page

2012-11-13 Thread Peter Geoghegan

On 13 November 2012 16:37, Robert Haas robertmh...@gmail.com wrote:
 I found this an interesting read.  As one of the people who worked on
 the feature, I'm sort of curious whether people have any experience
 yet with how this actually shakes out in the field.  Are you (or is
 anyone) aware of positive/negative field experiences with this
 feature?

Unfortunately, I don't think that I have any original insight about
the problems with index-only scans in the field right now.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Index only scans wiki page

2012-11-12 Thread Jeff Janes

https://wiki.postgresql.org/wiki/Index-only_scans

This page is seriously out of date.  It suggests they are not yet
implemented, but only being talked about.

Would someone who knows a lot about the subject (which is why I'm
sending this hackers, not web)  like to take a whack at updating this?
 Otherwise, I'd like to just replace the whole page with a discussion
of the limitations and trade-offs involved in using them as they
currently exist

Thanks,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index only scans wiki page

2012-11-12 Thread Peter Geoghegan

On 13 November 2012 01:03, Jeff Janes jeff.ja...@gmail.com wrote:
 https://wiki.postgresql.org/wiki/Index-only_scans

 This page is seriously out of date.  It suggests they are not yet
 implemented, but only being talked about.

Attached is a piece I wrote on the feature. That might form the basis
of a new wiki page. Feel free to incorporate this material as you see
fit.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


index_only_scans.rst
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans versus serializable transactions

2012-09-04 Thread Kevin Grittner

Tom Lane  wrote:
 Kevin Grittner  writes:
 By not visiting the heap page for tuples, index-only scans fail to
 acquire all of the necessary predicate locks for correct behavior
 at the serializable transaction isolation level. The tag for the
 tuple-level predicate locks includes the xmin, to avoid possible
 problems with tid re-use. (This was not covered in initial
 pre-release versions of SSI, and testing actually hit the
 problem.)  When an index-only scan does need to look at the heap
 because the visibility map doesn't indicate that the tuple is
 visible to all transactions, the tuple-level predicate lock is
 acquired. The best we can do without visiting the heap is a page
 level lock on the heap page, so that is what the attached patch
 does.
 
 If there are no objections, I will apply to HEAD and 9.2.
 
 This isn't right in detail: there are paths through the loop where
 tuple is not NULL at the beginning of the next iteration
 (specifically, consider failure of a lossy-qual recheck). I think
 that only results in wasted work, but it's still not operating as
 intended. I'd suggest moving the declaration/initialization of the
 tuple variable to inside the while loop, since there's no desire
 for its value to carry across loops.
 
You're right.  It looks to me like moving the declaration (and
initialization) to more local scape (just inside the loop) fixes it.
 
New version attached.  Will apply if no further problems are found.
 
-Kevin




index-only-serializable-v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans versus serializable transactions

2012-09-04 Thread Kevin Grittner

Kevin Grittner  wrote:
 
 New version attached. Will apply if no further problems are found.
 
Pushed to master and REL9_2_STABLE.
 
-Kevin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] index-only scans versus serializable transactions

2012-09-03 Thread Kevin Grittner

By not visiting the heap page for tuples, index-only scans fail to
acquire all of the necessary predicate locks for correct behavior at
the serializable transaction isolation level.  The tag for the
tuple-level predicate locks includes the xmin, to avoid possible
problems with tid re-use.  (This was not covered in initial
pre-release versions of SSI, and testing actually hit the problem.) 
When an index-only scan does need to look at the heap because the
visibility map doesn't indicate that the tuple is visible to all
transactions, the tuple-level predicate lock is acquired.  The best
we can do without visiting the heap is a page level lock on the heap
page, so that is what the attached patch does.

If there are no objections, I will apply to HEAD and 9.2.
 
-Kevin




index-only-serializable-v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans versus serializable transactions

2012-09-03 Thread Tom Lane

Kevin Grittner kevin.gritt...@wicourts.gov writes:
 By not visiting the heap page for tuples, index-only scans fail to
 acquire all of the necessary predicate locks for correct behavior at
 the serializable transaction isolation level.  The tag for the
 tuple-level predicate locks includes the xmin, to avoid possible
 problems with tid re-use.  (This was not covered in initial
 pre-release versions of SSI, and testing actually hit the problem.) 
 When an index-only scan does need to look at the heap because the
 visibility map doesn't indicate that the tuple is visible to all
 transactions, the tuple-level predicate lock is acquired.  The best
 we can do without visiting the heap is a page level lock on the heap
 page, so that is what the attached patch does.

 If there are no objections, I will apply to HEAD and 9.2.

This isn't right in detail: there are paths through the loop where
tuple is not NULL at the beginning of the next iteration
(specifically, consider failure of a lossy-qual recheck).  I think
that only results in wasted work, but it's still not operating as
intended.  I'd suggest moving the declaration/initialization of the
tuple variable to inside the while loop, since there's no desire
for its value to carry across loops.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans vs. Hot Standby, round two

2012-05-04 Thread Simon Riggs

On 2 May 2012 13:41, Robert Haas robertmh...@gmail.com wrote:

 So on further reflection I'm thinking it may be best just to stick
 with a hard conflict for now and see what feedback we get from beta
 testers.

Which is what I was expecting y'all to conclude once you'd looked at
the task in more detail.

And I'm happy with the concept of beta being a period where we listen
to feedback, not just bugs, and decide whether further refinements are
required.

What I think is possible is to alter the conflict as it hits the
backend. If a backend has enable_indexonly = off then it wouldn't be
affected by those conflicts anyhow. So if we simply record whether we
are using an index only plan then we'll know whether to ignore it or
abort. I think we can handle that by marking the snapshot at executor
startup time. Needs a few other pushups but not that hard.

The likelihood is that SQL that uses index only won't run for long
enough to be cancelled anyway.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans vs. Hot Standby, round two

2012-05-02 Thread Robert Haas

On Thu, Apr 26, 2012 at 8:03 PM, Robert Haas robertmh...@gmail.com wrote:
 So, as a first step, I've committed a patch that just throws a hard
 conflict.  I think we probably want to optimize this further, and I'm
 going to work investigate that next.  But it seemed productive to get
 this much out of the way first, so I did.

I've been thinking about this some more.  What's worrying me is that a
visibility conflict, however we implement it, could be *worse* from
the user's point of view than just killing the query.  After all,
there's a reasonable likelihood that a single visibility map page
covers the whole relation (or all the blocks that the user is
interested in), so any sort of conflict is basically going to turn the
index-only scan into an index-scan plus some extra overhead.  And if
the planner had known that the user was going to get an index-only
scan rather than just a plain index scan, it might well have picked
some other plan in the first place.

Another problem is that, if we add a test for visibility conflicts
into visibilitymap_test(), I'm afraid we're going to drive up the cost
of that function very significantly.  Previous testing suggests that
that efficiency or lack thereof of that function is already a
performance problem for index-only scans, which kinda makes me not
that excited about adding another branch in there somewhere (and even
less excited about any proposed implementation that would add an
lwlock acquire/release or similar).

So on further reflection I'm thinking it may be best just to stick
with a hard conflict for now and see what feedback we get from beta
testers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans vs. Hot Standby, round two

2012-04-26 Thread Robert Haas

On Mon, Apr 16, 2012 at 4:13 PM, Robert Haas robertmh...@gmail.com wrote:
 But fundamentally we all seem to be converging on some variant of the
 soft conflict idea.

So, as a first step, I've committed a patch that just throws a hard
conflict.  I think we probably want to optimize this further, and I'm
going to work investigate that next.  But it seemed productive to get
this much out of the way first, so I did.

In studying this, it strikes me that it would be rather nicer if we
recovery conflicts could somehow arrange to roll back the active
transaction by some means short of a FATAL error.  I think there are
some protocol issues with doing that, but I still wish we could figure
out a way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans vs. Hot Standby, round two

2012-04-16 Thread Noah Misch

On Fri, Apr 13, 2012 at 12:33:06PM -0400, Robert Haas wrote:
 In the department of query cancellations, I believe Noah argued
 previously that this wasn't really going to cause a problem.  And,
 indeed, if the master has a mix of inserts, updates, and deletes, then
 it seems likely that any recovery conflicts generated this way would
 be hitting about the same place in the XID space as the ones caused by
 pruning away dead tuples.  What will be different, if we go this
 route, is that an insert-only master, which right now only generates
 conflicts at freezing time, will instead generate conflicts much
 sooner, as soon as the relation is vacuumed.  I do not use Hot Standby
 enough to know whether this is a problem, and I'm not trying to block
 this approach, but I do want to make sure that everyone agrees that
 it's acceptable before we go do it, because inevitably somebody is
 going to get bit.

Given sufficient doubt, we could add a GUC, say hot_standby_use_all_visible.
A standby with the GUC off would ignore all-visible indicators and also
decline to generate conflicts when flipping them on.

 As to correctness, after further review of lazy_scan_heap(), I think
 there are some residual bugs here that need to be fixed whatever we
 decide to do about the Hot Standby case:
 
 1. I noticed that when we do PageSetAllVisible() today we follow it up
 with SetBufferCommitInfoNeedsSave().  PD_ALL_VISIBLE is not merely a
 hint, so I think these should be changed to MarkBufferDirty(), which
 shouldn't make much difference given current code, but proposals to
 handle hint changes differently than non-hint changes abound, so it's
 probably not good to rely on those meaning the same thing forever.

Do you refer to PD_ALL_VISIBLE as not merely a hint due to the requirement
to prevent a page from simultaneously having a negative PD_ALL_VISIBLE and a
positive visibility map bit?  That is to say, PD_ALL_VISIBLE is fully a hint
in its role as a witness of tuple visibility, but it is not a hint in its role
as a witness of the visibility map status?  In any event, I agree that those
call sites should use MarkBufferDirty().

The large comment in SetBufferCommitInfoNeedsSave() seems incorrect.  On
systems with weaker memory ordering, the FlushBuffer() process's clearing of
BM_JUST_DIRTIED may not be visible to the SetBufferCommitInfoNeedsSave()
process until some time after the former retrieved buffer contents from shared
memory.  Memory barriers could make the comment true, but we should probably
just update the comment to explain that a race condition may eat the update
entirely.  Incidentally, is there a good reason for MarkBufferDirty() to
increment pgBufferUsage.shared_blks_dirtied and SetBufferCommitInfoNeedsSave()
not to do so?  Looks like an oversight.

 2. More seriously, lazy_scan_heap() releases the buffer lock before
 setting the all-visible bit on the heap.  This looks just plain wrong
 (and it's my fault, to be clear).  Somebody else can come along after
 we've released the buffer lock and mark the page not-all-visible.
 That concurrent process will check the visibility map bit, find it
 already clear, and go on its merry way.  Then, we set the visibility
 map bit and go on our merry way.  Now we have a not-all-visible page
 with the all-visible bit set in the visibility map.   Oops.

Hmm, indeed.  This deserves its own open item.

 I think
 this probably needs to be rewritten so that lazy_scan_heap() acquires
 a pin on the visibility map page before locking the buffer for cleanup
 and holds onto that pin until either we exit the range of heap pages
 covered by that visibility map page or trigger index vac due to
 maintenance_work_mem exhaustion.  With that infrastructure in place,
 we can set the visibility map bit at the same time we set the
 page-level bit without risking holding the buffer lock across a buffer
 I/O (we might have to hold the buffer lock across a WAL flush in the
 worst case, but that hazard exists in a number of other places as well
 and fixing it here is well out of scope).

Looks reasonable at a glance.

 1. vacuum on master sets the page-level bit and the visibility map bit
 2. the heap page with the bit is written to disk BEFORE the WAL entry
 generated in (1) gets flushed; this is allowable, since it's not an
 error for the page-level bit to be set while the visibility-map bit is
 unset, only the other way around
 3. crash (still before the WAL entry is flushed)
 4. some operation on the master emits an FPW for the page without
 first rendering it not-all-visible
 
 At present, I'm not sure there's any real problem here, since normally
 an all-visible heap page is only going to get another WAL entry if
 somebody does an insert, update, or delete on it, in which case the
 visibility map bit is going to get cleared anyway.  Freezing the page
 might emit a new FPW, but that's going to generate conflicts anyway,
 so no problem there.  So I think there's no real problem here, but it

Re: [HACKERS] index-only scans vs. Hot Standby, round two

2012-04-16 Thread Simon Riggs

On Mon, Apr 16, 2012 at 8:02 AM, Noah Misch n...@leadboat.com wrote:
 On Fri, Apr 13, 2012 at 12:33:06PM -0400, Robert Haas wrote:
 In the department of query cancellations, I believe Noah argued
 previously that this wasn't really going to cause a problem.  And,
 indeed, if the master has a mix of inserts, updates, and deletes, then
 it seems likely that any recovery conflicts generated this way would
 be hitting about the same place in the XID space as the ones caused by
 pruning away dead tuples.  What will be different, if we go this
 route, is that an insert-only master, which right now only generates
 conflicts at freezing time, will instead generate conflicts much
 sooner, as soon as the relation is vacuumed.  I do not use Hot Standby
 enough to know whether this is a problem, and I'm not trying to block
 this approach, but I do want to make sure that everyone agrees that
 it's acceptable before we go do it, because inevitably somebody is
 going to get bit.

 Given sufficient doubt, we could add a GUC, say hot_standby_use_all_visible.
 A standby with the GUC off would ignore all-visible indicators and also
 decline to generate conflicts when flipping them on.

I'm against adding a new GUC, especially for that fairly thin reason.


 1. vacuum on master sets the page-level bit and the visibility map bit
 2. the heap page with the bit is written to disk BEFORE the WAL entry
 generated in (1) gets flushed; this is allowable, since it's not an
 error for the page-level bit to be set while the visibility-map bit is
 unset, only the other way around
 3. crash (still before the WAL entry is flushed)
 4. some operation on the master emits an FPW for the page without
 first rendering it not-all-visible

 At present, I'm not sure there's any real problem here, since normally
 an all-visible heap page is only going to get another WAL entry if
 somebody does an insert, update, or delete on it, in which case the
 visibility map bit is going to get cleared anyway.  Freezing the page
 might emit a new FPW, but that's going to generate conflicts anyway,
 so no problem there.  So I think there's no real problem here, but it
 doesn't seem totally future-proof - any future type of WAL record that
 might modify the page without rendering it not-all-visible would
 create a very subtle hazard for Hot Standby.  We could dodge the whole
 issue by leaving the hack in heapgetpage() intact and just be happy
 with making index-only scans work, or maybe somebody has a more clever
 idea.

 Good point.  We could finagle things so the standby notices end-of-recovery
 checkpoints and blocks the optimization for older snapshots.  For example,
 maintain an integer count of end-of-recovery checkpoints seen and store that
 in each Snapshot instead of takenDuringRecovery; use 0 on a master.  When the
 global value is greater than the snapshot's value, disable the optimization
 for that snapshot.  I don't know whether the optimization is powerful enough
 to justify that level of trouble, but it's an option to consider.

 For a different approach, targeting the fragility, we could add assertions to
 detect unexpected cases of a page bearing PD_ALL_VISIBLE submitted as a
 full-page image.

We don't need a future proof solution, especially not at this stage of
the release cycle. Every time you add a WAL record, we need to
consider the possible conflict impact.

We just need a simple and clear solution. I'll work on that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans vs. Hot Standby, round two

2012-04-16 Thread Heikki Linnakangas


On 16.04.2012 10:38, Simon Riggs wrote:

On Mon, Apr 16, 2012 at 8:02 AM, Noah Mischn...@leadboat.com  wrote:

On Fri, Apr 13, 2012 at 12:33:06PM -0400, Robert Haas wrote:

In the department of query cancellations, I believe Noah argued
previously that this wasn't really going to cause a problem.  And,
indeed, if the master has a mix of inserts, updates, and deletes, then
it seems likely that any recovery conflicts generated this way would
be hitting about the same place in the XID space as the ones caused by
pruning away dead tuples.  What will be different, if we go this
route, is that an insert-only master, which right now only generates
conflicts at freezing time, will instead generate conflicts much
sooner, as soon as the relation is vacuumed.  I do not use Hot Standby
enough to know whether this is a problem, and I'm not trying to block
this approach, but I do want to make sure that everyone agrees that
it's acceptable before we go do it, because inevitably somebody is
going to get bit.


Given sufficient doubt, we could add a GUC, say hot_standby_use_all_visible.
A standby with the GUC off would ignore all-visible indicators and also
decline to generate conflicts when flipping them on.


I'm against adding a new GUC, especially for that fairly thin reason.


Same here.

Can we have a soft hot standby conflict that doesn't kill the query, 
but disables index-only-scans?


In the long run, perhaps we need to store XIDs in the visibility map 
instead of just a bit. If you we only stored one xid per 100 pages or 
something like that, the storage overhead would not be much higher than 
what we have at the moment. But that's obviously not going to happen for 
9.2...


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans vs. Hot Standby, round two

2012-04-16 Thread Tom Lane

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Can we have a soft hot standby conflict that doesn't kill the query, 
 but disables index-only-scans?

Well, there wouldn't be any way for the planner to know whether an
index-only scan would be safe or not.  I think this would have to look
like a run-time fallback.  Could it be structured as return that the
page's all-visible bit is not set, instead of failing?  Or am I
confused about where the conflict is coming from?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans vs. Hot Standby, round two

2012-04-16 Thread Simon Riggs

On Mon, Apr 16, 2012 at 3:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Can we have a soft hot standby conflict that doesn't kill the query,
 but disables index-only-scans?

 Well, there wouldn't be any way for the planner to know whether an
 index-only scan would be safe or not.  I think this would have to look
 like a run-time fallback.  Could it be structured as return that the
 page's all-visible bit is not set, instead of failing?  Or am I
 confused about where the conflict is coming from?

The starting point is that HS now offers 4 different mechanisms for
avoiding conflicts, probably the best of which is hot standby
feedback. So we only need to improve things if those techniques don't
work for people. So initially, my attitude is lets throw a conflict
and wait for feedback during beta.

If we do need to do something, then introduce concept of a visibility conflict.

On replay:
If feedback not set, set LSN of visibility conflict on PROCs that
conflict, if not already set.

On query:
If feedback not set, check conflict LSN against page, if page is
later, check visibility.

In terms of optimisation, I wouldn't expect to have to adjust costs at
all. The difference would only show for long running queries and
typically they don't touch too much new data as a fraction of total.
The cost model for index only is pretty crude anyhow.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans vs. Hot Standby, round two

2012-04-16 Thread Robert Haas

On Mon, Apr 16, 2012 at 4:26 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Can we have a soft hot standby conflict that doesn't kill the query, but
 disables index-only-scans?

Yeah, something like that seems possible.

For example, suppose the master includes, in each
mark-heap-page-all-visible record, the newest XID on the page.  On the
standby, we keep track of the most recent such XID ever seen in shared
memory.  After noting that a page is all-visible, the standby
cross-checks its snapshot against this XID and does the heap fetch
anyway if it's too new.  Conceivably this can be done with memory
barriers but not without locks: we must update the XID in shared
memory *before* marking the page all-visible, and we must check the
visibility map or page-level bit *before* fetching the XID from shared
memory - but the actual reads and writes of the XID are atomic.

Now, if an index-only scan suffers a very high number of these soft
conflicts and consequently does a lot more heap fetches than
expected, performance might suck.  But that should be rare, and could
be mitigated by turning on hot_standby_feedback.  Also, there might be
some hit for repeatedly reading that XID from memory.

Alternatively, we could try to forbid index-only scan plans from being
created in the first place *only when* the underlying snapshot is too
old.  But then what happens if a conflict happens later, after the
plan has been created but before it's fully executed?  At that point
it's too late to switch gears, so we'd still need something like the
above.  And the above might be adequate all by itself, since in
practice index-only scans are mostly going to be useful when the data
isn't changing all that fast, so most of the queries that would be
cancelled by hard conflicts wouldn't have actually had a problem
anyway.

 In the long run, perhaps we need to store XIDs in the visibility map instead
 of just a bit. If you we only stored one xid per 100 pages or something like
 that, the storage overhead would not be much higher than what we have at the
 moment. But that's obviously not going to happen for 9.2...

Well, that would also have the undesirable side effect of greatly
reducing the granularity of the map.  As it is, updating a tiny
fraction of the tuples in the table could result in the entire table
ceasing to be not-all-visible if you happen to update exactly one
tuple per page through the entire heap.  Or more generally, updating
X% of the rows in the table can cause Y% of the rows in the table to
no longer be visible for index-only-scan purposes, where Y  X.  What
you're proposing would make that much worse.

I think we're actually going to find that the current system isn't
tight enough; my suspicion is that users are going to complain that
we're not aggressive enough about marking pages all-visible, because
autovac won't kick in until updates+deletes20%, but (1) index-only
scans also care about inserts and (2) by the time we've got 20% dead
tuples index-only scans may well be worthless.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans vs. Hot Standby, round two

2012-04-16 Thread Robert Haas

On Mon, Apr 16, 2012 at 3:02 AM, Noah Misch n...@leadboat.com wrote:
 Do you refer to PD_ALL_VISIBLE as not merely a hint due to the requirement
 to prevent a page from simultaneously having a negative PD_ALL_VISIBLE and a
 positive visibility map bit?  That is to say, PD_ALL_VISIBLE is fully a hint
 in its role as a witness of tuple visibility, but it is not a hint in its role
 as a witness of the visibility map status?

Exactly.

 The large comment in SetBufferCommitInfoNeedsSave() seems incorrect.  On
 systems with weaker memory ordering, the FlushBuffer() process's clearing of
 BM_JUST_DIRTIED may not be visible to the SetBufferCommitInfoNeedsSave()
 process until some time after the former retrieved buffer contents from shared
 memory.

True.

 Memory barriers could make the comment true, but we should probably
 just update the comment to explain that a race condition may eat the update
 entirely.

Agreed.

 Incidentally, is there a good reason for MarkBufferDirty() to
 increment pgBufferUsage.shared_blks_dirtied and SetBufferCommitInfoNeedsSave()
 not to do so?  Looks like an oversight.

Also agreed.

 2. More seriously, lazy_scan_heap() releases the buffer lock before
 setting the all-visible bit on the heap.  This looks just plain wrong
 (and it's my fault, to be clear).  Somebody else can come along after
 we've released the buffer lock and mark the page not-all-visible.
 That concurrent process will check the visibility map bit, find it
 already clear, and go on its merry way.  Then, we set the visibility
 map bit and go on our merry way.  Now we have a not-all-visible page
 with the all-visible bit set in the visibility map.   Oops.

 Hmm, indeed.  This deserves its own open item.

Also agreed.

 Good point.  We could finagle things so the standby notices end-of-recovery
 checkpoints and blocks the optimization for older snapshots.  For example,
 maintain an integer count of end-of-recovery checkpoints seen and store that
 in each Snapshot instead of takenDuringRecovery; use 0 on a master.  When the
 global value is greater than the snapshot's value, disable the optimization
 for that snapshot.  I don't know whether the optimization is powerful enough
 to justify that level of trouble, but it's an option to consider.

I suspect not.  It seems we can make index-only scans work without
doing something like this; it's only the sequential-scan optimization
we might lose.  But nobody's ever really complained about not having
that, to my knowledge.

 For a different approach, targeting the fragility, we could add assertions to
 detect unexpected cases of a page bearing PD_ALL_VISIBLE submitted as a
 full-page image.

The holes are narrow enough that I fear such cases would be detected
only on high-velocity production systems, which is not exactly where
you want to find out about such problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans vs. Hot Standby, round two

2012-04-16 Thread Robert Haas

On Mon, Apr 16, 2012 at 1:58 PM, Simon Riggs si...@2ndquadrant.com wrote:
 If we do need to do something, then introduce concept of a visibility 
 conflict.

 On replay:
 If feedback not set, set LSN of visibility conflict on PROCs that
 conflict, if not already set.

 On query:
 If feedback not set, check conflict LSN against page, if page is
 later, check visibility.

Hmm, should have read the whole thread before replying.  This similar
to what I just proposed in response to Heikki's message, but using LSN
in lieu of (or maybe you mean in addition to) XID.

I don't think we can ignore the need to throw conflicts just because
hot_standby_feedback is set; there are going to be corner cases, for
example, when it's just recently been turned on and the master has
already done cleanup; or if the master and standby have recently
gotten disconnected for even just a few seconds.

But fundamentally we all seem to be converging on some variant of the
soft conflict idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans vs. Hot Standby, round two

2012-04-15 Thread Simon Riggs

On Fri, Apr 13, 2012 at 5:33 PM, Robert Haas robertmh...@gmail.com wrote:
 Currently, we have a problem with index-only scans in Hot Standby
 mode: the xmin horizon on the standby might lag the master, and thus
 an index-only scan might mistakenly conclude that no heap fetch is
 needed when in fact it is.  I suggested that we handle this by
 suppressing generation of index-only scan plans in Hot Standby mode,
 but Simon, Noah, and Dimitri were arguing that we should instead do
 the following, which is now on the open items list:

 * Make XLOG_HEAP2_VISIBLE records generate recovery snapshot conflicts
 so that IndexOnlyScans work on Hot Standby
...
snip very long email /snip

Luckily its much simpler than all of that suggests. It'll take a few
hours for me to write a short reply but its Sunday today, so that will
happen later.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] index-only scans vs. Hot Standby, round two

2012-04-13 Thread Robert Haas

Currently, we have a problem with index-only scans in Hot Standby
mode: the xmin horizon on the standby might lag the master, and thus
an index-only scan might mistakenly conclude that no heap fetch is
needed when in fact it is.  I suggested that we handle this by
suppressing generation of index-only scan plans in Hot Standby mode,
but Simon, Noah, and Dimitri were arguing that we should instead do
the following, which is now on the open items list:

* Make XLOG_HEAP2_VISIBLE records generate recovery snapshot conflicts
so that IndexOnlyScans work on Hot Standby

Ideally, this would also allow us to remove the following hack from
heapgetpage(), which would improve sequential-scan performance for Hot
Standby.

/*
 * If the all-visible flag indicates that all tuples on the page are
 * visible to everyone, we can skip the per-tuple visibility tests. But
 * not in hot standby mode. A tuple that's already visible to all
 * transactions in the master might still be invisible to a read-only
 * transaction in the standby.
 */
all_visible = PageIsAllVisible(dp)  !snapshot-takenDuringRecovery;

As far as I can see, to make this work, we'd need to have
lazy_scan_heap() compute the latest xmin on any potentially
all-visible page and treat that as a cutoff XID, cancelling queries
with snapshots whose xmins equal or precede that value, just as we
currently do when freezing tuples (cf xlog_heap_freeze).  There are
two things to worry about: correctness, and more query cancellations.

In the department of query cancellations, I believe Noah argued
previously that this wasn't really going to cause a problem.  And,
indeed, if the master has a mix of inserts, updates, and deletes, then
it seems likely that any recovery conflicts generated this way would
be hitting about the same place in the XID space as the ones caused by
pruning away dead tuples.  What will be different, if we go this
route, is that an insert-only master, which right now only generates
conflicts at freezing time, will instead generate conflicts much
sooner, as soon as the relation is vacuumed.  I do not use Hot Standby
enough to know whether this is a problem, and I'm not trying to block
this approach, but I do want to make sure that everyone agrees that
it's acceptable before we go do it, because inevitably somebody is
going to get bit.  In making our decision, I think we should keep in
mind that we are currently pretty laid-back about marking pages
all-visible, and that's probably something we're going to want to
change over time: for example, we recently discussed the fact that a
page with one dead tuple in it currently needs *two* vacuums to become
all-visible, because only the first vacuum pass is smart enough to
mark things all-visible, and the first heap pass only truncates the
dead tuple to a dead line pointer, so the page isn't all-visible a
that point.  When we fix that, which I think we're almost certainly
going to want to do at some point, then that means these conflicts
will occur that much sooner.  I think we will likely also want to
consider marking things all-visible on the fly at some point in the
future; for example, if we pull a buffer in for an index-only scan and
dirty it by setting a hint bit, we might want to take the plunge and
mark it all-visible before evicting it, to save effort the next time.
Again, not trying to dissuade anyone, just making sure we've thought
through it enough to avoid being unhappy later.  It's worth noting
that none of this will normally matter if hot_standby_feedback=on, so
part of the analysis here may depend on how many people we think have
flipped that switch.

As to correctness, after further review of lazy_scan_heap(), I think
there are some residual bugs here that need to be fixed whatever we
decide to do about the Hot Standby case:

1. I noticed that when we do PageSetAllVisible() today we follow it up
with SetBufferCommitInfoNeedsSave().  PD_ALL_VISIBLE is not merely a
hint, so I think these should be changed to MarkBufferDirty(), which
shouldn't make much difference given current code, but proposals to
handle hint changes differently than non-hint changes abound, so it's
probably not good to rely on those meaning the same thing forever.

2. More seriously, lazy_scan_heap() releases the buffer lock before
setting the all-visible bit on the heap.  This looks just plain wrong
(and it's my fault, to be clear).  Somebody else can come along after
we've released the buffer lock and mark the page not-all-visible.
That concurrent process will check the visibility map bit, find it
already clear, and go on its merry way.  Then, we set the visibility
map bit and go on our merry way.  Now we have a not-all-visible page
with the all-visible bit set in the visibility map.   Oops.  I think
this probably needs to be rewritten so that lazy_scan_heap() acquires
a pin on the visibility map page before locking the buffer for cleanup
and holds onto that pin until either we exit the range

[HACKERS] Index only scans and visibilitymap.c

2011-10-23 Thread Jeff Janes

With index only scans, the comments in
src/backend/access/heap/visibilitymap.c are probably out of date,
starting with:

Currently, the visibility map is only used as a hint


Also, is there a discussion of how and why index-only scans is safe?
i.e. what lock, if any, has to be held while nodeIndexonlyscan.c
calls  visibilitymap_test?  Do index-only scans need a README file?

Thanks,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-15 Thread Jeff Janes

On Fri, Oct 7, 2011 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Please find attached a patch implementing a basic version of
 index-only scans.

 I'm making some progress with this, but I notice what seems like a
 missing feature: there needs to be a way to turn it off.  Otherwise
 performance comparisons will be difficult to impossible.

 The most obvious solution is a planner control GUC, perhaps
 enable_indexonlyscan.  Anyone object, or want to bikeshed the name?


Currently I can't force an indexonlyscan over an indexscan, because of
the way the enable_* variables work.

Should setting enable_indexscan=off also disable index-only scans (the
current behavior) in addition to plain index scans?

By way of precedent, enable_indexscan=off does not disable Bitmap Index Scan.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-15 Thread Jeff Janes

On Sat, Oct 15, 2011 at 2:16 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Fri, Oct 7, 2011 at 11:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Please find attached a patch implementing a basic version of
 index-only scans.

 I'm making some progress with this, but I notice what seems like a
 missing feature: there needs to be a way to turn it off.  Otherwise
 performance comparisons will be difficult to impossible.

 The most obvious solution is a planner control GUC, perhaps
 enable_indexonlyscan.  Anyone object, or want to bikeshed the name?


 Currently I can't force an indexonlyscan over an indexscan, because of
 the way the enable_* variables work.

OK, scratch that.  I must have been using the wrong query (for
which the index was not covering), as I can't reproduce the
behavior nor looking at the code can I see how it could have
occurred in the first place.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-12 Thread Tom Lane

I wrote:
 I was also toying with the notion of pushing the slot fill-in into the
 AM, so that the AM API is to return a filled TupleSlot not an
 IndexTuple.  This would not save any cycles AFAICT --- at least in
 btree, we still have to make a palloc'd copy of the IndexTuple so that
 we can release lock on the index page.  The point of it would be to
 avoid the assumption that the index's internal storage has exactly the
 format of IndexTuple.  Don't know whether we'd ever have any actual use
 for that flexibility, but it seems like it wouldn't cost much to
 preserve the option.

 BTW, I concluded that that would be a bad idea, because it would involve
 the index AM in some choices that we're likely to want to change.  In
 particular it would foreclose ever doing anything with expression
 indexes, without an AM API change.  Better to just define the AM's
 responsibility as to hand back a tuple defined according to the index's
 columns.

Although this aspect of the code is now working well enough for btree,
I realized that it's going to have a problem if/when we add GiST
support.  The difficulty is that the index rowtype includes storage
datatypes, not underlying-heap-column datatypes, for opclasses where
those are different.  This is not going to do for cases where we need
to reconstruct a heap value from the index contents, as in Alexander's
example of gist point_ops using a box as the underlying storage.

What we actually want back from the index AM is a rowtype that matches
the list of values submitted for indexing (ie, the original output of
FormIndexDatum), and only for btree is it the case that that's
represented more or less exactly as the IndexTuple stored in the index.

So what I'm now thinking is to go back to the idea of having the index
AM fill in a TupleTableSlot.  For btree this would just amount to moving
the existing StoreIndexTuple function into the AM.  But it would give
GiST the opportunity to do some computation, and it would avoid the
problem of the index's rowtype not being a suitable intermediate format.
The objection I voiced above is misguided, because it confuses the set
of column types that's needed with the format distinction between a Slot
and an IndexTuple.

BTW, right at the moment I'm not that excited about actually doing
any work on GiST itself for index-only scans.  Given the current list of
available opclasses there don't seem to be very many for which
index-only scans would be possible, so the amount of work needed seems
rather out of proportion to the benefit.  But I don't mind fixing AM API
details that are needed to make this workable.  I'd rather have the API
as right as possible in the first release.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-11 Thread Robert Haas

On Tue, Oct 11, 2011 at 12:19 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wrote:
 I have mostly-working code for approach #3, but I haven't tried to make
 EXPLAIN work yet.  While looking at that I realized that there's a
 pretty good argument for adding the above-mentioned explicit TargetEntry
 list representing the index columns to index-only plan nodes.  Namely,
 that if we don't do it, EXPLAIN will have to go to the catalogs to find
 out what's in that index, and this will fall down for hypothetical
 indexes injected into the planner by index advisors.  We could imagine
 adding some more hooks to let the advisor inject bogus catalog data at
 EXPLAIN time, but on the whole it seems easier and less fragile to just
 have the planner include a data structure it has to build anyway into
 the finished plan.

 The need for this additional node list field also sways me in a
 direction that I'd previously been on the fence about, namely that
 I think index-only scans need to be their own independent plan node type
 instead of sharing a node type with regular indexscans.  It's just too
 weird that a simple boolean indexonly property would mean completely
 different contents/interpretation of the tlist and quals.

 Attached is a draft patch for this.  It needs some more review before
 committing, but it does pass regression tests now.

 One point worth commenting on is that I chose to rename the OUTER and
 INNER symbols for special varnos to OUTER_VAR and INNER_VAR, along with
 adding a new special varno INDEX_VAR.  It's bothered me for some time
 that those macro names were way too generic/susceptible to collision;
 and since I had to look at all the uses anyway to see if the INDEX case
 needed to be added, this seemed like a good time to rename them.

+1 for that renaming, for sure.

Have you given any thought to what would be required to support
index-only scans on non-btree indexes - particularly, GIST?  ISTM we
might have had a thought that some GIST opclasses but not others would
be able to regurgitate tuples, or maybe even that it might vary from
index tuple to index tuple.  But that discussion was a long time ago,
and my memory is fuzzy.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-11 Thread Alexander Korotkov

On Tue, Oct 11, 2011 at 2:36 PM, Robert Haas robertmh...@gmail.com wrote:

 Have you given any thought to what would be required to support
 index-only scans on non-btree indexes - particularly, GIST?  ISTM we
 might have had a thought that some GIST opclasses but not others would
 be able to regurgitate tuples, or maybe even that it might vary from
 index tuple to index tuple.  But that discussion was a long time ago,
 and my memory is fuzzy.

In some GiST opclasses original tuple can't be restored from index tuple.
For example, polygon can't be restored from it's MBR. In some opclasses, for
example box_ops and point_ops, original tuple can be easily restore from
index tuple.
I can't remeber any implementation where this possibility can vary from
index tuple to index tuple. GiST opclass for ts_vector have different
representation of leaf index tuple depending on it's length. But, at most,
it store array of hashes of words, so it's lossy anyway.
I think GiST interface could be extended with optional function which
restores original tuple. But there is some additional difficulty, when some
opclasses of composite index provide this function, but others - not.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] index-only scans

2011-10-11 Thread Tom Lane

Robert Haas robertmh...@gmail.com writes:
 Have you given any thought to what would be required to support
 index-only scans on non-btree indexes - particularly, GIST?  ISTM we
 might have had a thought that some GIST opclasses but not others would
 be able to regurgitate tuples, or maybe even that it might vary from
 index tuple to index tuple.  But that discussion was a long time ago,
 and my memory is fuzzy.

It would have to be a per-opclass property, for sure, since the basic
criterion is whether the opclass's compress function is lossy.
I don't think we can tolerate a situation where some of the index tuples
might be able to yield data and others not --- that would undo all the
improvements I've been working on over the past few days.

I haven't thought as far ahead as how we might get the information
needed for a per-opclass flag.  A syntax addition to CREATE OPERATOR
CLASS might be the only way.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-11 Thread Alexander Korotkov

On Tue, Oct 11, 2011 at 5:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 I haven't thought as far ahead as how we might get the information
 needed for a per-opclass flag.  A syntax addition to CREATE OPERATOR
 CLASS might be the only way.

Shouldn't it be implemented through additional interface function? There are
situations when restoring of original tuple requires some transformation.
For example, in point_ops we store box in the leaf index tuple, while point
can be easily restored from box.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] index-only scans

2011-10-11 Thread PostgreSQL - Hans-Jürgen Schönig

On Oct 7, 2011, at 8:47 PM, Joshua D. Drake wrote:

 
 On 10/07/2011 11:40 AM, Tom Lane wrote:
 Robert Haasrobertmh...@gmail.com  writes:
 Please find attached a patch implementing a basic version of
 index-only scans.
 
 I'm making some progress with this, but I notice what seems like a
 missing feature: there needs to be a way to turn it off.  Otherwise
 performance comparisons will be difficult to impossible.
 
 The most obvious solution is a planner control GUC, perhaps
 enable_indexonlyscan.  Anyone object, or want to bikeshed the name?
 
 enable_onlyindexscan
 
 I'm kidding.
 
 +1 on Tom's proposed name.


+1 ...
definitely an important thing to do.

regards,

hans

--
Cybertec Schönig  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-11 Thread Tom Lane

Alexander Korotkov aekorot...@gmail.com writes:
 On Tue, Oct 11, 2011 at 5:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I haven't thought as far ahead as how we might get the information
 needed for a per-opclass flag.  A syntax addition to CREATE OPERATOR
 CLASS might be the only way.
 
 Shouldn't it be implemented through additional interface function? There are
 situations when restoring of original tuple requires some transformation.
 For example, in point_ops we store box in the leaf index tuple, while point
 can be easily restored from box.

Hm.  I had been supposing that lossless compress functions would just be
no-ops.  If that's not necessarily the case then we might need something
different from the opclass's decompress function to get back the
original data.  However, that doesn't really solve the problem I'm
concerned about, because the existence and use of such a function would
be entirely internal to GiST.  There still needs to be a way for the
planner to know which opclasses support data retrieval.  And I do *not*
want to see us hard-wire the presence of opclass function 8 means a
GiST opclass can return data into the planner.

Maybe, instead of a simple constant amcanreturn column, we need an AM
API function that says whether the index can return data.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-11 Thread Alexander Korotkov

On Wed, Oct 12, 2011 at 12:35 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Hm.  I had been supposing that lossless compress functions would just be
 no-ops.  If that's not necessarily the case then we might need something
 different from the opclass's decompress function to get back the
 original data.  However, that doesn't really solve the problem I'm
 concerned about, because the existence and use of such a function would
 be entirely internal to GiST.  There still needs to be a way for the
 planner to know which opclasses support data retrieval.  And I do *not*
 want to see us hard-wire the presence of opclass function 8 means a
 GiST opclass can return data into the planner.

 Maybe, instead of a simple constant amcanreturn column, we need an AM
 API function that says whether the index can return data.

I like idea of such AM API function. Since single multicolumn index can use
multiple opclasses, AM API function should also say *what* data index can
return.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] index-only scans

2011-10-11 Thread Dimitri Fontaine

Tom Lane t...@sss.pgh.pa.us writes:
 I haven't thought as far ahead as how we might get the information
 needed for a per-opclass flag.  A syntax addition to CREATE OPERATOR
 CLASS might be the only way.

It looks to me like it's related to the RECHECK property.  Maybe it's
just too late, though.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-11 Thread Tom Lane

Alexander Korotkov aekorot...@gmail.com writes:
 On Wed, Oct 12, 2011 at 12:35 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Maybe, instead of a simple constant amcanreturn column, we need an AM
 API function that says whether the index can return data.

 I like idea of such AM API function. Since single multicolumn index can use
 multiple opclasses, AM API function should also say *what* data index can
 return.

I was thinking more like amcanreturn(index, column_number) returns bool
which says if the index can return the data for that column.  The AM
would still have to return a full IndexTuple at runtime, but it'd be
allowed to insert nulls or garbage for columns it hadn't promised to
return.

BTW, if we do this, I'm rather strongly tempted to get rid of the
name-versus-cstring hack (see index_descriptor_hack() in HEAD) by
defining btree name_ops as not capable of returning data.  I don't
trust that hack much at all.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-10 Thread Tom Lane

I wrote:
 I have mostly-working code for approach #3, but I haven't tried to make
 EXPLAIN work yet.  While looking at that I realized that there's a
 pretty good argument for adding the above-mentioned explicit TargetEntry
 list representing the index columns to index-only plan nodes.  Namely,
 that if we don't do it, EXPLAIN will have to go to the catalogs to find
 out what's in that index, and this will fall down for hypothetical
 indexes injected into the planner by index advisors.  We could imagine
 adding some more hooks to let the advisor inject bogus catalog data at
 EXPLAIN time, but on the whole it seems easier and less fragile to just
 have the planner include a data structure it has to build anyway into
 the finished plan.

 The need for this additional node list field also sways me in a
 direction that I'd previously been on the fence about, namely that
 I think index-only scans need to be their own independent plan node type
 instead of sharing a node type with regular indexscans.  It's just too
 weird that a simple boolean indexonly property would mean completely
 different contents/interpretation of the tlist and quals.

Attached is a draft patch for this.  It needs some more review before
committing, but it does pass regression tests now.

One point worth commenting on is that I chose to rename the OUTER and
INNER symbols for special varnos to OUTER_VAR and INNER_VAR, along with
adding a new special varno INDEX_VAR.  It's bothered me for some time
that those macro names were way too generic/susceptible to collision;
and since I had to look at all the uses anyway to see if the INDEX case
needed to be added, this seemed like a good time to rename them.

regards, tom lane



binInhGxDA2ea.bin
Description: index-only-scan-revisions.patch.gz

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-09 Thread Tom Lane

I wrote:
 I believe that we should rejigger things so that when an index-only scan
 is selected, the executor *always* works from the data supplied by the
 index.  Even if it has to visit the heap --- it will do that but just to
 consult the tuple's visibility data, and then use what it got from the
 index anyway.  This means we'd build the plan node's filter quals and
 targetlist to reference the index tuple columns not the underlying
 table's.

I've been studying this a bit.  The key decision is how to represent
Vars that reference columns of the index.  We really have to have
varattno equal to the index column number, else ExecEvalVar will pull
the wrong column from the tuple.  However, varno is not so clear cut.
There are at least four things we could do:

1. Keep varno = table's rangetable index.  The trouble with this is that
a Var referencing index column N would look exactly like a Var
referencing table column N; so the same Var would mean something
different in an index-only scan node than it does in any other type of
scan node for the same table.  We could maybe make that work, but it
seems confusing and fragile as heck.  The executor isn't going to care
much, but inspection of the plan tree by e.g. EXPLAIN sure will.

2. Set varno = OUTER (or maybe INNER).  This is safe because there's no
other use for OUTER/INNER in a table scan node.  We would have to hack
things so that the index tuple gets put into econtext-ecxt_outertuple
(resp. ecxt_innertuple) at runtime, but that seems like no big problem.
In both setrefs.c and ruleutils.c, it would be desirable to have a
TargetEntry list somewhere representing the index columns, which setrefs
would want so it could set up the special Var nodes with fix_upper_expr,
and ruleutils would want so it could interpret the Vars using existing
machinery.  I'm not sure whether to hang that list on the index-only
plan node or expect EXPLAIN to regenerate it at need.

3. Invent another special varno value similar to OUTER/INNER but
representing an index reference.  This is just about like #2 except that
we could still put the index tuple into econtext-ecxt_scantuple, and
ExecEvalVar would do the right thing as it stands.

4. Create a rangetable entry specifically representing the index,
and set varno equal to that RTE's number.  This has some attractiveness
in terms of making the meaning of the Vars clear, but an RTE that
represents an index rather than a table seems kind of ugly otherwise.
It would likely require changes in unrelated parts of the code.


One point here is that we have historically used solution #1 to
represent the index keys in index qual expressions.  We avoid the
ambiguity issues by not asking EXPLAIN to try to interpret the indexqual
tree at all: it works from indexqualorig which contains ordinary Vars.
So one way to dodge the disadvantages of solution #1 would be to add
untransformed targetlistorig and qualorig fields to an index-only
plan node, and use those for EXPLAIN.  However, those fields would be
totally dead weight if the plan were never EXPLAINed, whereas
indexqualorig has a legitimate use for rechecking indexquals against the
heap tuple in case of a lossy index.  (BTW, if we go with any solution
other than #1, I'm strongly inclined to change the representation of
indexqual to match.  See the comments in fix_indexqual_operand.)

At the moment I'm leaning to approach #3, but I wonder if anyone has
a different opinion or another idea altogether.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-09 Thread Greg Stark

On Sun, Oct 9, 2011 at 9:03 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 At the moment I'm leaning to approach #3, but I wonder if anyone has
 a different opinion or another idea altogether.


Would any of these make it more realistic to talk about the crazy
plans Heikki suggested like doing two index scans, doing the join
between the index tuples, and only then looking up the visibility
information and remaining columns for the tuple on the matching rows?

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-09 Thread Tom Lane

Greg Stark st...@mit.edu writes:
 On Sun, Oct 9, 2011 at 9:03 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 At the moment I'm leaning to approach #3, but I wonder if anyone has
 a different opinion or another idea altogether.

 Would any of these make it more realistic to talk about the crazy
 plans Heikki suggested like doing two index scans, doing the join
 between the index tuples, and only then looking up the visibility
 information and remaining columns for the tuple on the matching rows?

I don't think it's particularly relevant --- we would not want to use
weird representations of the Vars outside the index scan nodes.  Above
the scan they'd be just like any other upper-level Vars.

(FWIW, that idea isn't crazy; I remember having discussions of it back
in 2003 or so.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-09 Thread Greg Stark

On Sun, Oct 9, 2011 at 10:54 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I don't think it's particularly relevant --- we would not want to use
 weird representations of the Vars outside the index scan nodes.  Above
 the scan they'd be just like any other upper-level Vars.

I can't say I fully understand the planner data structures and the
implications of the options. I guess what I was imagining was that
being able to reference the indexes as regular rangetable entries
would make it more convenient for the rest of the planner to keep
working as if nothing had changed when working with values extracted
from index tuples.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-09 Thread Tom Lane

I wrote:
 There are at least four things we could do: ...
 2. Set varno = OUTER (or maybe INNER).  This is safe because there's no
 other use for OUTER/INNER in a table scan node.  We would have to hack
 things so that the index tuple gets put into econtext-ecxt_outertuple
 (resp. ecxt_innertuple) at runtime, but that seems like no big problem.
 In both setrefs.c and ruleutils.c, it would be desirable to have a
 TargetEntry list somewhere representing the index columns, which setrefs
 would want so it could set up the special Var nodes with fix_upper_expr,
 and ruleutils would want so it could interpret the Vars using existing
 machinery.  I'm not sure whether to hang that list on the index-only
 plan node or expect EXPLAIN to regenerate it at need.

 3. Invent another special varno value similar to OUTER/INNER but
 representing an index reference.  This is just about like #2 except that
 we could still put the index tuple into econtext-ecxt_scantuple, and
 ExecEvalVar would do the right thing as it stands.

I have mostly-working code for approach #3, but I haven't tried to make
EXPLAIN work yet.  While looking at that I realized that there's a
pretty good argument for adding the above-mentioned explicit TargetEntry
list representing the index columns to index-only plan nodes.  Namely,
that if we don't do it, EXPLAIN will have to go to the catalogs to find
out what's in that index, and this will fall down for hypothetical
indexes injected into the planner by index advisors.  We could imagine
adding some more hooks to let the advisor inject bogus catalog data at
EXPLAIN time, but on the whole it seems easier and less fragile to just
have the planner include a data structure it has to build anyway into
the finished plan.

The need for this additional node list field also sways me in a
direction that I'd previously been on the fence about, namely that
I think index-only scans need to be their own independent plan node type
instead of sharing a node type with regular indexscans.  It's just too
weird that a simple boolean indexonly property would mean completely
different contents/interpretation of the tlist and quals.

I've run into one other thing that's going to need to be hacked up
a bit: index-only scans on name columns fall over with this modified
code, because there's now tighter checking of the implied tuple
descriptors:

regression=# select relname from pg_class where relname = 'tenk1';
ERROR:  attribute 1 has wrong type
DETAIL:  Table has type cstring, but query expects name.

The reason for this is the hack we put in some time back to conserve
space in system catalog indexes by having name columns be indexed as
though they were cstring, cf commit
5f6f840e93a3649e0d07e85bad188d163e96ec0e.  We will probably need some
compensatory hack in index-only scans, unless we can think of a less
klugy way of representing that optimization.  (Basically, the index-only
code is assuming that btrees don't have storage type distinct from input
type, and that's not the case for the name opclass.  I had kind of
expected the original patch to have some issues with that too, and I'm
still not fully convinced that there aren't corner cases where it'd be
an issue even with the currently committed code.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-09 Thread Greg Stark

On Mon, Oct 10, 2011 at 2:47 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 The need for this additional node list field also sways me in a
 direction that I'd previously been on the fence about, namely that
 I think index-only scans need to be their own independent plan node type
 instead of sharing a node type with regular indexscans

At a superficial PR level it'll go over quite well to have a special
plan node be visible in the explain output. People will love to see
Fast Index Scan or Covering Index Scan or whatever you call it in
their plans.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-08 Thread Tom Lane

I wrote:
 Robert Haas robertmh...@gmail.com writes:
 Not really.  We have detected a small performance regression when both
 heap and index fit in shared_buffers and an index-only scan is used.
 I have a couple of ideas for improving this.  One is to store a
 virtual tuple into the slot instead of building a regular tuple, but
 what do we do about tuples with OIDs?

[ that's done ]

 I was also toying with the notion of pushing the slot fill-in into the
 AM, so that the AM API is to return a filled TupleSlot not an
 IndexTuple.  This would not save any cycles AFAICT --- at least in
 btree, we still have to make a palloc'd copy of the IndexTuple so that
 we can release lock on the index page.  The point of it would be to
 avoid the assumption that the index's internal storage has exactly the
 format of IndexTuple.  Don't know whether we'd ever have any actual use
 for that flexibility, but it seems like it wouldn't cost much to
 preserve the option.

BTW, I concluded that that would be a bad idea, because it would involve
the index AM in some choices that we're likely to want to change.  In
particular it would foreclose ever doing anything with expression
indexes, without an AM API change.  Better to just define the AM's
responsibility as to hand back a tuple defined according to the index's
columns.

 Another is to avoid locking the
 index buffer multiple times - right now it locks the index buffer to
 get the TID, and then relocks it to extract the index tuple (after
 checking that nothing disturbing has happened meanwhile).  It seems
 likely that with some refactoring we could get this down to a single
 lock/unlock cycle, but I haven't figured out exactly where the TID
 gets copied out.

 Yeah, maybe, but let's get the patch committed before we start looking
 for second-order optimizations.

On reflection I'm starting to think that the above would be a good idea
because there are a couple of bogosities in the basic choices this patch
made.  In particular, I'm thinking about how we could use an index on
f(x) to avoid recalculating f() in something like

select f(x) from tab where f(x)  42;

assuming that f() is expensive but immutable.  The planner side of this
is already a bit daunting, because it's not clear how to recognize that
an index on f(x) is a covering index (the existing code is going to
think that x itself needs to be available).  But the executor side is a
real problem, because it will fail to make use of the f() value fetched
from the index anytime the heap visibility test fails.

I believe that we should rejigger things so that when an index-only scan
is selected, the executor *always* works from the data supplied by the
index.  Even if it has to visit the heap --- it will do that but just to
consult the tuple's visibility data, and then use what it got from the
index anyway.  This means we'd build the plan node's filter quals and
targetlist to reference the index tuple columns not the underlying
table's.  (Which in passing gets rid of the behavior you were
complaining about that EXPLAIN VERBOSE shows a lot of columns that
aren't actually being computed.)

In order to make this work, we have to remove the API wart that says the
index AM is allowed to choose to not return the index tuple.  And that
ties into what you were saying above.  What we ought to do is have
_bt_readpage() save off copies of the whole tuples not only the TIDs
when it is extracting data from the page.  This is no more net copying
work than what happens now (assuming that all the tuples get fetched)
since we won't need the per-tuple memcpy that occurs now in
bt_getindextuple.  The tuples can go into a page-sized workspace buffer
associated with the BTScanPosData structure, and then just be referenced
in-place in that workspace with no second copy step.

I'm inclined to do the last part immediately, since there's a
performance argument for it, and then work on revising the executor
implementation.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-08 Thread Robert Haas

On Oct 8, 2011, at 11:52 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm inclined to do the last part immediately, since there's a
 performance argument for it, and then work on revising the executor
 implementation.

Sounds great.  Thanks for working on this.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-07 Thread Tom Lane

Robert Haas robertmh...@gmail.com writes:
 Please find attached a patch implementing a basic version of
 index-only scans.

I'm making some progress with this, but I notice what seems like a
missing feature: there needs to be a way to turn it off.  Otherwise
performance comparisons will be difficult to impossible.

The most obvious solution is a planner control GUC, perhaps
enable_indexonlyscan.  Anyone object, or want to bikeshed the name?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-07 Thread Joshua D. Drake



On 10/07/2011 11:40 AM, Tom Lane wrote:

Robert Haasrobertmh...@gmail.com  writes:

Please find attached a patch implementing a basic version of
index-only scans.


I'm making some progress with this, but I notice what seems like a
missing feature: there needs to be a way to turn it off.  Otherwise
performance comparisons will be difficult to impossible.

The most obvious solution is a planner control GUC, perhaps
enable_indexonlyscan.  Anyone object, or want to bikeshed the name?


enable_onlyindexscan

I'm kidding.

+1 on Tom's proposed name.



regards, tom lane




--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-07 Thread Robert Haas

On Fri, Oct 7, 2011 at 2:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Please find attached a patch implementing a basic version of
 index-only scans.

 I'm making some progress with this, but I notice what seems like a
 missing feature: there needs to be a way to turn it off.  Otherwise
 performance comparisons will be difficult to impossible.

 The most obvious solution is a planner control GUC, perhaps
 enable_indexonlyscan.  Anyone object, or want to bikeshed the name?

I was expecting you to NOT want that, or I would have put it in to
begin with...  so go for it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-07 Thread Tom Lane

Robert Haas robertmh...@gmail.com writes:
 On Fri, Oct 7, 2011 at 2:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm making some progress with this, but I notice what seems like a
 missing feature: there needs to be a way to turn it off.  Otherwise
 performance comparisons will be difficult to impossible.
 
 The most obvious solution is a planner control GUC, perhaps
 enable_indexonlyscan.  Anyone object, or want to bikeshed the name?

 I was expecting you to NOT want that, or I would have put it in to
 begin with...  so go for it.

It seems unlikely to have any use except for testing, but I think we
need it for that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-07 Thread Tom Lane

Robert Haas robertmh...@gmail.com writes:
 Please find attached a patch implementing a basic version of
 index-only scans.  This patch is the work of my colleague Ibrar Ahmed
 and myself, and also incorporates some code from previous patches
 posted by Heikki Linnakanagas.

I've committed this after some rather substantial editorialization.
There's still a lot left to do of course, but it seems to need
performance testing next, and that'll be easier if the code is in HEAD.

 1. The way that nodeIndexscan.c builds up the faux heap tuple is
 perhaps susceptible to improvement.  I thought about building a
 virtual tuple, but then what do I do with an OID column, if I have
 one?  Or maybe this should be done some other way altogether.

I switched it to use a virtual tuple for now, and just not attempt to
use index-only scans if a system column is required.  We're likely to
want to rethink this anyway, because as currently constituted the code
can't do anything with an expression index, and avoiding recalculation
of an expensive function could be a nice win.  But the approach of
just building a faux heap tuple fundamentally doesn't work for that.

 2. Suppose we scan one tuple on a not-all-visible page followed by 99
 tuples on all-visible pages.  The code as written will hold the pin on
 the first heap page for the entire scan.  As soon as we hit the end of
 the scan or another tuple where we have to actually visit the page,
 the old pin will be released, but until then we hold onto it.

I did not do anything about this issue --- ISTM it needs performance
testing.

 3. The code in create_index_path() builds up a bitmapset of heap
 attributes that get used for any purpose anywhere in the query, and
 hangs it on the RelOptInfo so it doesn't need to be rebuilt for every
 index under consideration.  However, if it were somehow possible to
 have the rel involved without using any attributes at all, we'd
 rebuild the cache over and over, since it would never become non-NULL.

I dealt with this by the expedient of getting rid of the caching ;-).
It's not clear to me that it was worth the trouble, and in any case
it's fundamentally wrong to suppose that every index faces the same
set of attributes it must supply.  It should not need to supply columns
that are only needed in indexquals or index predicate conditions.
I'm not sure how to deal with those refinements cheaply enough, but
the cache isn't helping.

 4. There are a couple of cases that use index-only scans even though
 the EXPLAIN output sort of makes it look like they shouldn't.  For
 example, in the above queries, an index-only scan is chosen even
 though the query does SELECT * from the table being scanned.  Even
 though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems
 that the target list of an EXISTS query is in fact discarded, e.g.:

The reason it looks that way is that we're choosing to use a physical
result tuple to avoid an ExecProject step at runtime.  There's nothing
wrong with the logic, it's just that EXPLAIN shows something that might
mislead people.

 5. We haven't made any planner changes at all, not even for cost
 estimation.  It is not clear to me what the right way to do cost
 estimation here is.

Yeah, me either.  For the moment I put in a hard-wired estimate that
only 90% of the heap pages would actually get fetched.  This is
conservative and only meant to ensure that the planner picks an
index-only-capable plan over an indexscan with a non-covering index,
all else being equal.  We'll need to do performance testing before
we can refine that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-07 Thread Robert Haas

On Fri, Oct 7, 2011 at 8:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 1. The way that nodeIndexscan.c builds up the faux heap tuple is
 perhaps susceptible to improvement.  I thought about building a
 virtual tuple, but then what do I do with an OID column, if I have
 one?  Or maybe this should be done some other way altogether.

 I switched it to use a virtual tuple for now, and just not attempt to
 use index-only scans if a system column is required.  We're likely to
 want to rethink this anyway, because as currently constituted the code
 can't do anything with an expression index, and avoiding recalculation
 of an expensive function could be a nice win.  But the approach of
 just building a faux heap tuple fundamentally doesn't work for that.

Figuring out how to fix that problem likely requires more knowledge of
the executor than I have got.

 2. Suppose we scan one tuple on a not-all-visible page followed by 99
 tuples on all-visible pages.  The code as written will hold the pin on
 the first heap page for the entire scan.  As soon as we hit the end of
 the scan or another tuple where we have to actually visit the page,
 the old pin will be released, but until then we hold onto it.

 I did not do anything about this issue --- ISTM it needs performance
 testing.

I'm actually less worried about any performance problem than I am
about the possibility of holding up VACUUM.  That can happen the old
way, too, but now the pin could stay on the same page for quite a
while even when the scan is advancing.

I think we maybe ought to think seriously about solving the problem at
the other end, though - either make VACUUM skip pages that it can't
get a cleanup lock on without blocking (except in anti-wraparound
mode) or have it just do the amount of work that can be done with an
exclusive lock (i.e. prune but not defragment, which would work even
in anti-wraparound mode).  That would solve the problems of (1)
undetected VACUUM deadlock vs. a buffer pin, (2) indefinite VACUUM
stall due to a suspended query, and (3) this issue.

 3. The code in create_index_path() builds up a bitmapset of heap
 attributes that get used for any purpose anywhere in the query, and
 hangs it on the RelOptInfo so it doesn't need to be rebuilt for every
 index under consideration.  However, if it were somehow possible to
 have the rel involved without using any attributes at all, we'd
 rebuild the cache over and over, since it would never become non-NULL.

 I dealt with this by the expedient of getting rid of the caching ;-).
 It's not clear to me that it was worth the trouble, and in any case
 it's fundamentally wrong to suppose that every index faces the same
 set of attributes it must supply.  It should not need to supply columns
 that are only needed in indexquals or index predicate conditions.
 I'm not sure how to deal with those refinements cheaply enough, but
 the cache isn't helping.

Oh, hmm.

 4. There are a couple of cases that use index-only scans even though
 the EXPLAIN output sort of makes it look like they shouldn't.  For
 example, in the above queries, an index-only scan is chosen even
 though the query does SELECT * from the table being scanned.  Even
 though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems
 that the target list of an EXISTS query is in fact discarded, e.g.:

 The reason it looks that way is that we're choosing to use a physical
 result tuple to avoid an ExecProject step at runtime.  There's nothing
 wrong with the logic, it's just that EXPLAIN shows something that might
 mislead people.

I wonder if we oughta do something about that.

I was also thinking we should probably make EXPLAIN ANALYZE display
the number of heap fetches, so that you can see how index-only your
index-only scan actually was.

 5. We haven't made any planner changes at all, not even for cost
 estimation.  It is not clear to me what the right way to do cost
 estimation here is.

 Yeah, me either.  For the moment I put in a hard-wired estimate that
 only 90% of the heap pages would actually get fetched.  This is
 conservative and only meant to ensure that the planner picks an
 index-only-capable plan over an indexscan with a non-covering index,
 all else being equal.  We'll need to do performance testing before
 we can refine that.

Yep.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-06 Thread Tom Lane

Robert Haas robertmh...@gmail.com writes:
 Please find attached a patch implementing a basic version of
 index-only scans.  This patch is the work of my colleague Ibrar Ahmed
 and myself, and also incorporates some code from previous patches
 posted by Heikki Linnakanagas.

I'm starting to review this patch now.  Has any further work been done
since the first version was posted?  Also, was any documentation
written?  I'm a tad annoyed by having to reverse-engineer the changes
in the AM API specification from the code.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-06 Thread Robert Haas

On Thu, Oct 6, 2011 at 3:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Please find attached a patch implementing a basic version of
 index-only scans.  This patch is the work of my colleague Ibrar Ahmed
 and myself, and also incorporates some code from previous patches
 posted by Heikki Linnakanagas.

 I'm starting to review this patch now.

Thanks!

 Has any further work been done
 since the first version was posted?  Also, was any documentation
 written?  I'm a tad annoyed by having to reverse-engineer the changes
 in the AM API specification from the code.

Not really.  We have detected a small performance regression when both
heap and index fit in shared_buffers and an index-only scan is used.
I have a couple of ideas for improving this.  One is to store a
virtual tuple into the slot instead of building a regular tuple, but
what do we do about tuples with OIDs?  Another is to avoid locking the
index buffer multiple times - right now it locks the index buffer to
get the TID, and then relocks it to extract the index tuple (after
checking that nothing disturbing has happened meanwhile).  It seems
likely that with some refactoring we could get this down to a single
lock/unlock cycle, but I haven't figured out exactly where the TID
gets copied out.

With regard to the AM API, the basic idea is we're just adding a
Boolean to say whether the AM is capable of returning index tuples.
If it's not, then we don't ever try an index-only scan.  If it is,
then we'll set the want_index_tuple flag if an index-only scan is
possible.  This requests that the AM attempt to return the tuple; but
the AM is also allowed to fail and not return the tuple whenever it
wants.  This is more or less the interface that Heikki suggested a
couple years back, but it might well be vulnerable to improvement.

Incidentally, if you happen to feel the urge to beat this around and
send it back rather than posting a list of requested changes, feel
free.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-06 Thread Tom Lane

Robert Haas robertmh...@gmail.com writes:
 Not really.  We have detected a small performance regression when both
 heap and index fit in shared_buffers and an index-only scan is used.
 I have a couple of ideas for improving this.  One is to store a
 virtual tuple into the slot instead of building a regular tuple, but
 what do we do about tuples with OIDs?

Yeah, I was just looking at that.  I think it's going to be a clear win
to use a virtual tuple slot instead of constructing and deconstructing
a HeapTuple.  The obvious solution is to decree that you can't use an
index-only scan if the query requires fetching OID (or any other system
column).  This would be slightly annoying for catalog fetches but I'm
not sure I believe that catalog queries are that important a use-case.

I was also toying with the notion of pushing the slot fill-in into the
AM, so that the AM API is to return a filled TupleSlot not an
IndexTuple.  This would not save any cycles AFAICT --- at least in
btree, we still have to make a palloc'd copy of the IndexTuple so that
we can release lock on the index page.  The point of it would be to
avoid the assumption that the index's internal storage has exactly the
format of IndexTuple.  Don't know whether we'd ever have any actual use
for that flexibility, but it seems like it wouldn't cost much to
preserve the option.

 Another is to avoid locking the
 index buffer multiple times - right now it locks the index buffer to
 get the TID, and then relocks it to extract the index tuple (after
 checking that nothing disturbing has happened meanwhile).  It seems
 likely that with some refactoring we could get this down to a single
 lock/unlock cycle, but I haven't figured out exactly where the TID
 gets copied out.

Yeah, maybe, but let's get the patch committed before we start looking
for second-order optimizations.

 With regard to the AM API, the basic idea is we're just adding a
 Boolean to say whether the AM is capable of returning index tuples.
 If it's not, then we don't ever try an index-only scan.  If it is,
 then we'll set the want_index_tuple flag if an index-only scan is
 possible.  This requests that the AM attempt to return the tuple; but
 the AM is also allowed to fail and not return the tuple whenever it
 wants.

It was the allowed to fail bit that I was annoyed to discover only by
reading some well-buried code.

 Incidentally, if you happen to feel the urge to beat this around and
 send it back rather than posting a list of requested changes, feel
 free.

You can count on that ;-).  I've already found a lot of things I didn't
care for.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-10-06 Thread Thom Brown

On 6 October 2011 21:11, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Not really.  We have detected a small performance regression when both
 heap and index fit in shared_buffers and an index-only scan is used.
 I have a couple of ideas for improving this.  One is to store a
 virtual tuple into the slot instead of building a regular tuple, but
 what do we do about tuples with OIDs?

 Yeah, I was just looking at that.  I think it's going to be a clear win
 to use a virtual tuple slot instead of constructing and deconstructing
 a HeapTuple.  The obvious solution is to decree that you can't use an
 index-only scan if the query requires fetching OID (or any other system
 column).  This would be slightly annoying for catalog fetches but I'm
 not sure I believe that catalog queries are that important a use-case.

 I was also toying with the notion of pushing the slot fill-in into the
 AM, so that the AM API is to return a filled TupleSlot not an
 IndexTuple.  This would not save any cycles AFAICT --- at least in
 btree, we still have to make a palloc'd copy of the IndexTuple so that
 we can release lock on the index page.  The point of it would be to
 avoid the assumption that the index's internal storage has exactly the
 format of IndexTuple.  Don't know whether we'd ever have any actual use
 for that flexibility, but it seems like it wouldn't cost much to
 preserve the option.

 Another is to avoid locking the
 index buffer multiple times - right now it locks the index buffer to
 get the TID, and then relocks it to extract the index tuple (after
 checking that nothing disturbing has happened meanwhile).  It seems
 likely that with some refactoring we could get this down to a single
 lock/unlock cycle, but I haven't figured out exactly where the TID
 gets copied out.

 Yeah, maybe, but let's get the patch committed before we start looking
 for second-order optimizations.

+1

Been testing this today with a few regression tests and haven't seen
anything noticeably broken.  Does what it says on the tin.

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-09-25 Thread Marti Raudsepp

On Sun, Aug 14, 2011 at 00:31, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 That is somewhat compensated by the fact that tuples that are accessed more
 often are also more likely to be in cache. Fetching the heap tuple to check
 visibility is very cheap when the tuple is in cache.

 I'm not sure how far that compensates it, though. I'm sure there's typically
 nevertheless a fairly wide range of pages that have been modified since the
 last vacuum, but not in cache anymore.

Would it make sense to re-evaluate the visibility bit just before a
page gets flushed out from shared buffers? On a system with no long
transactions, it seems likely that a dirty page is already all-visible
by the time bgwriter (or shared buffers memory pressure) gets around
to writing it out. That way we don't have to wait for vacuum to do it
and would make your observation hold more often.

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-09-25 Thread Robert Haas

On Sun, Sep 25, 2011 at 2:43 PM, Marti Raudsepp ma...@juffo.org wrote:
 On Sun, Aug 14, 2011 at 00:31, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 That is somewhat compensated by the fact that tuples that are accessed more
 often are also more likely to be in cache. Fetching the heap tuple to check
 visibility is very cheap when the tuple is in cache.

 I'm not sure how far that compensates it, though. I'm sure there's typically
 nevertheless a fairly wide range of pages that have been modified since the
 last vacuum, but not in cache anymore.

 Would it make sense to re-evaluate the visibility bit just before a
 page gets flushed out from shared buffers? On a system with no long
 transactions, it seems likely that a dirty page is already all-visible
 by the time bgwriter (or shared buffers memory pressure) gets around
 to writing it out. That way we don't have to wait for vacuum to do it
 and would make your observation hold more often.

This has been suggested before, and, sure, there might be cases where
it helps.  But you need to choose your test case fairly carefully.
For example, if you're doing a large sequential scan on a table, the
ring-buffer logic causes processes to evict their own pages, and so
the background writer doesn't get a chance to touch any of those
pages.  You need some kind of a workload where pages are being evicted
from shared buffers slowly enough that it ends up being the background
writer, rather than the individual backends, that do the work.  But if
you have that kind of workload, then we can infer that most of your
working set fits into shared buffers.  And in that case you don't
really need index-only scans in the first place.  The main point of
index only scans is to optimize the case where you have a gigantic
table and you're trying to avoid swamping the system with random I/O.
I'm not saying that such a change would be a particularly bad idea,
but I'm not personally planning to work on it any time soon because I
can't convince myself that it really helps all that much.

I think the real solution to getting visibility map bits set is to
vacuum more frequently, but have it be cheaper each time.  Our default
autovacuum settings vacuum the table when the number of updates and
deletes reaches 20% of the table size.  But those settings were put in
place under the assumption that we'll have to scan the entire heap,
dirtying every page that contains dead tuples, scan all the indexes
and remove the associated index pointers, and then scan and dirty the
heap pages that contain now-dead line pointers a second time to remove
those.  The visibility map has eroded those assumptions to some
degree, because now we probably won't have to scan the entire heap
every time we vacuum; and I'm hoping we're going to see some further
erosion.  Pavan has a pending patch which, if we can work out the
details, will eliminate the second heap scan; and we've also talked
about doing the index scan only when there are enough dead line
pointers to justify the effort.  That, it seems to me, would open the
door to lowering the scale factor, maybe by quite a bit - which, in
turn, would help us control bloat better and get visibility map bits
set sooner.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-09-23 Thread Cédric Villemain

2011/8/16 Anssi Kääriäinen anssi.kaariai...@thl.fi:
 On 08/14/2011 12:31 AM, Heikki Linnakangas wrote:

 The same idea could of course be used to calculate the effective cache
 hit ratio for each table. Cache hit ratio would have the problem of feedback
 loops, though.

 Yeah, I'm not excited about making the planner and statistics more
 dynamic. Feedback loops and plan instability are not fun.

 I might be a little out of my league here... But I was thinking about the
 cache hit ratio and feedback loops. I understand automatic tuning would be
 hard. But making automatic tuning easier (by using pg_tune for example)
 would be a big plus for most use cases.

 To make it easier to tune the page read costs automatically, it would be
 nice if there would be four variables instead of the current two:
  - random_page_cost is the cost of reading a random page from storage.
 Currently it is not, it is the cost of accessing a random page, taking in
 account it might be in memory.
  - seq_page_cost is the cost of reading pages sequentially from storage
  - memory_page_cost is the cost of reading a page in memory
  - cache_hit_ratio is the expected cache hit ratio

 memory_page_cost would be server global, random and seq page costs
 tablespace specific, and cache_hit_ratio relation specific. You would get
 the current behavior by tuning *_page_costs realistically, and setting
 cache_hit_ratio globally so that the expected random_page_cost /
 seq_page_cost stays the same as now.

 The biggest advantage of this would be that the correct values are much
 easier to detect automatically compared to current situation. This can be
 done using pg_statio_* views and IO speed testing. They should not be tuned
 automatically by PostgreSQL, at least not the cache_hit_ratio, as that leads
 to the possibility of feedback loops and plan instability. The variables
 would also be much easier to understand.

 There is the question if one should be allowed to tune the *_page_costs at
 all. If I am not missing something, it is possible to detect the correct
 values programmatically and they do not change if you do not change the
 hardware. Cache hit ratio is the real reason why they are currently so
 important for tuning.

 An example why the current random_page_cost and seq_page_cost tuning is not
 adequate is that you can only set random_page_cost per tablespace. That
 makes perfect sense if random_page_cost would be the cost of accessing a
 page in storage. But it is not, it is a combination of that and caching
 effects, so that it actually varies per relation (and over time). How do you
 set it correctly for a query where one relation is fully cached and another
 one not?

 Another problem is that if you use random_page_cost == seq_page_cost, you
 are effectively saying that everything is in cache. But if everything is in
 cache, the cost of page access relative to cpu_*_costs is way off. The more
 random_page_cost and seq_page_cost are different, the more they mean the
 storage access costs. When they are the same, they mean the memory page
 cost. There can be an order of magnitude in difference of a storage page
 cost and a memory page cost. So it is hard to tune the cpu_*_costs
 realistically for cases where sometimes data is in cache and sometimes not.

 Ok, enough hand waving for one post :) Sorry if this all is obvious /
 discussed before. My googling didn't turn out anything directly related,
 although these have some similarity:
  - Per-table random_page_cost for tables that we know are always cached
 [http://archives.postgresql.org/pgsql-hackers/2008-04/msg01503.php]
  - Script to compute random page cost
 [http://archives.postgresql.org/pgsql-hackers/2002-09/msg00503.php]
 -  The science of optimization in practical terms?
 [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00718.php], getting
 really interesting starting from here:
 [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00787.php]

late reply.
You can add this link to your list:
http://archives.postgresql.org/pgsql-hackers/2011-06/msg01140.php


-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-08-16 Thread Anssi Kääriäinen


On 08/14/2011 12:31 AM, Heikki Linnakangas wrote:

The same idea could of course be used to calculate the effective cache hit 
ratio for each table. Cache hit ratio would have the problem of feedback loops, 
though.

Yeah, I'm not excited about making the planner and statistics more
dynamic. Feedback loops and plan instability are not fun.
I might be a little out of my league here... But I was thinking about 
the cache hit ratio and feedback loops. I understand automatic tuning 
would be hard. But making automatic tuning easier (by using pg_tune for 
example) would be a big plus for most use cases.


To make it easier to tune the page read costs automatically, it would be 
nice if there would be four variables instead of the current two:
  - random_page_cost is the cost of reading a random page from storage. 
Currently it is not, it is the cost of accessing a random page, taking 
in account it might be in memory.

  - seq_page_cost is the cost of reading pages sequentially from storage
  - memory_page_cost is the cost of reading a page in memory
  - cache_hit_ratio is the expected cache hit ratio

memory_page_cost would be server global, random and seq page costs 
tablespace specific, and cache_hit_ratio relation specific. You would 
get the current behavior by tuning *_page_costs realistically, and 
setting cache_hit_ratio globally so that the expected random_page_cost / 
seq_page_cost stays the same as now.


The biggest advantage of this would be that the correct values are much 
easier to detect automatically compared to current situation. This can 
be done using pg_statio_* views and IO speed testing. They should not be 
tuned automatically by PostgreSQL, at least not the cache_hit_ratio, as 
that leads to the possibility of feedback loops and plan instability. 
The variables would also be much easier to understand.


There is the question if one should be allowed to tune the *_page_costs 
at all. If I am not missing something, it is possible to detect the 
correct values programmatically and they do not change if you do not 
change the hardware. Cache hit ratio is the real reason why they are 
currently so important for tuning.


An example why the current random_page_cost and seq_page_cost tuning is 
not adequate is that you can only set random_page_cost per tablespace. 
That makes perfect sense if random_page_cost would be the cost of 
accessing a page in storage. But it is not, it is a combination of that 
and caching effects, so that it actually varies per relation (and over 
time). How do you set it correctly for a query where one relation is 
fully cached and another one not?


Another problem is that if you use random_page_cost == seq_page_cost, 
you are effectively saying that everything is in cache. But if 
everything is in cache, the cost of page access relative to cpu_*_costs 
is way off. The more random_page_cost and seq_page_cost are different, 
the more they mean the storage access costs. When they are the same, 
they mean the memory page cost. There can be an order of magnitude in 
difference of a storage page cost and a memory page cost. So it is hard 
to tune the cpu_*_costs realistically for cases where sometimes data is 
in cache and sometimes not.


Ok, enough hand waving for one post :) Sorry if this all is obvious / 
discussed before. My googling didn't turn out anything directly related, 
although these have some similarity:
 - Per-table random_page_cost for tables that we know are always cached 
[http://archives.postgresql.org/pgsql-hackers/2008-04/msg01503.php]

 - Script to compute random page cost
[http://archives.postgresql.org/pgsql-hackers/2002-09/msg00503.php]
-  The science of optimization in practical terms?
[http://archives.postgresql.org/pgsql-hackers/2009-02/msg00718.php], 
getting really interesting starting from here:

[http://archives.postgresql.org/pgsql-hackers/2009-02/msg00787.php]

 - Anssi


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] index-only scans

2011-08-15 Thread Jim Nasby

On Aug 13, 2011, at 4:31 PM, Heikki Linnakangas wrote:
 The example is much more realistic if the query is a fetch of N latest rows 
 from a table. Very common use case, and the whole relation's visibility 
 statistics are completely wrong for that query.
 
 That is somewhat compensated by the fact that tuples that are accessed more 
 often are also more likely to be in cache. Fetching the heap tuple to check 
 visibility is very cheap when the tuple is in cache.
 
 I'm not sure how far that compensates it, though. I'm sure there's typically 
 nevertheless a fairly wide range of pages that have been modified since the 
 last vacuum, but not in cache anymore.

http://xkcd.org/937/ :)

Could something be added to pg_stats that tracks visibility map usefulness on a 
per-attribute basis? Perhaps another set of stats buckets that show visibility 
percentages for each stats bucket?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

1 2 >

1 - 100 of 151 matches

Mail list logo