subject:"\[HACKERS\] KNN\-GiST with recheck"

Re: [HACKERS] KNN-GiST with recheck

2015-05-18 Thread Heikki Linnakangas


On 05/16/2015 12:42 AM, Jim Nasby wrote:

On 5/14/15 6:30 PM, Heikki Linnakangas wrote:

On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:

I think this is now ready for committing, but I'm pretty tired now so
I'll read through this one more time in the morning, so that I won't
wake up to a red buildfarm.


If anyone feels motivated to fix, there's a typo in the comment for
IndexNextWithReorder (s/his/this/):
+ * Like IndexNext, but his version can also re-check any


Fixed, thanks.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2015-05-15 Thread Alexander Korotkov

On Fri, May 15, 2015 at 2:30 AM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:

 I think this is now ready for committing, but I'm pretty tired now so
 I'll read through this one more time in the morning, so that I won't
 wake up to a red buildfarm.


 Forgot to attach the latest patch, here you go.


Looks good for me.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] KNN-GiST with recheck

2015-05-15 Thread Alexander Korotkov

On Fri, May 15, 2015 at 2:48 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 05/15/2015 11:31 AM, Alexander Korotkov wrote:

 On Fri, May 15, 2015 at 2:30 AM, Heikki Linnakangas hlinn...@iki.fi
 wrote:

  On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:

  I think this is now ready for committing, but I'm pretty tired now so
 I'll read through this one more time in the morning, so that I won't
 wake up to a red buildfarm.


 Forgot to attach the latest patch, here you go.



 Looks good for me.


 Ok, pushed after some further minor cleanup.


Great! Thank you!

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] KNN-GiST with recheck

2015-05-15 Thread Heikki Linnakangas


On 05/15/2015 11:31 AM, Alexander Korotkov wrote:

On Fri, May 15, 2015 at 2:30 AM, Heikki Linnakangas hlinn...@iki.fi wrote:


On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:


I think this is now ready for committing, but I'm pretty tired now so
I'll read through this one more time in the morning, so that I won't
wake up to a red buildfarm.



Forgot to attach the latest patch, here you go.



Looks good for me.


Ok, pushed after some further minor cleanup.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2015-05-15 Thread Bruce Momjian

On Fri, May 15, 2015 at 02:48:29PM +0300, Heikki Linnakangas wrote:
 On 05/15/2015 11:31 AM, Alexander Korotkov wrote:
 On Fri, May 15, 2015 at 2:30 AM, Heikki Linnakangas hlinn...@iki.fi wrote:
 
 On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:
 
 I think this is now ready for committing, but I'm pretty tired now so
 I'll read through this one more time in the morning, so that I won't
 wake up to a red buildfarm.
 
 
 Forgot to attach the latest patch, here you go.
 
 
 Looks good for me.
 
 Ok, pushed after some further minor cleanup.

Great!  That PostGIS workaround they had to use for accurate distances
with CTEs and LIMIT 100 was an ugly hack.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2015-05-15 Thread Jim Nasby


On 5/14/15 6:30 PM, Heikki Linnakangas wrote:

On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:

I think this is now ready for committing, but I'm pretty tired now so
I'll read through this one more time in the morning, so that I won't
wake up to a red buildfarm.


If anyone feels motivated to fix, there's a typo in the comment for 
IndexNextWithReorder (s/his/this/):

+ * Like IndexNext, but his version can also re-check any


--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2015-05-15 Thread Alexander Korotkov

On Fri, May 15, 2015 at 2:49 PM, Alexander Korotkov aekorot...@gmail.com
wrote:

 On Fri, May 15, 2015 at 2:48 PM, Heikki Linnakangas hlinn...@iki.fi
 wrote:

 On 05/15/2015 11:31 AM, Alexander Korotkov wrote:

 On Fri, May 15, 2015 at 2:30 AM, Heikki Linnakangas hlinn...@iki.fi
 wrote:

  On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:

  I think this is now ready for committing, but I'm pretty tired now so
 I'll read through this one more time in the morning, so that I won't
 wake up to a red buildfarm.


 Forgot to attach the latest patch, here you go.



 Looks good for me.


 Ok, pushed after some further minor cleanup.


 Great! Thank you!


BTW, I found that now IndexScan node lackof copy and output support for
indexorderbyops.
Attached patch fixes that. Copy and output functions assume that
indexorderbyops has the same length as indexorderby. In order to make this
more evident I move check for best_path-path.pathkeys in create_plan from
if into assertion. AFAICS, pathkeys should always present where there are
indexorderby.

--
With best regards,
Alexander Korotkov.


fix-indexscan-node.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2015-05-14 Thread Heikki Linnakangas


On 05/15/2015 02:28 AM, Heikki Linnakangas wrote:

I think this is now ready for committing, but I'm pretty tired now so
I'll read through this one more time in the morning, so that I won't
wake up to a red buildfarm.


Forgot to attach the latest patch, here you go.

- Heikki
From df00d9c972a760e1ed777a7c9b1603dad1d3f134 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas heikki.linnakangas@iki.fi
Date: Fri, 15 May 2015 00:56:27 +0300
Subject: [PATCH 1/1] Allow GiST distance function to return merely a
 lower-bound.

The distance function can now set *recheck = false, like index quals. The
executor will then re-check the ORDER BY expressions, and use a queue to
reorder the results on the fly.

This makes it possible to do kNN-searches on polygons and circles, which
store a bounding box in the index, rather than the exact value.

Alexander Korotkov and me
---
 doc/src/sgml/gist.sgml |  35 ++-
 src/backend/access/gist/gistget.c  |  22 +-
 src/backend/access/gist/gistproc.c |  37 +++
 src/backend/access/gist/gistscan.c |   5 +
 src/backend/executor/nodeIndexscan.c   | 351 -
 src/backend/optimizer/plan/createplan.c|  69 --
 src/backend/utils/adt/geo_ops.c|  27 +++
 src/include/access/genam.h |   3 +
 src/include/access/relscan.h   |   9 +
 src/include/catalog/catversion.h   |   2 +-
 src/include/catalog/pg_amop.h  |   2 +
 src/include/catalog/pg_amproc.h|   2 +
 src/include/catalog/pg_operator.h  |   8 +-
 src/include/catalog/pg_proc.h  |   4 +
 src/include/nodes/execnodes.h  |  19 ++
 src/include/nodes/plannodes.h  |  12 +-
 src/include/utils/geo_decls.h  |   3 +
 src/test/regress/expected/create_index.out |  78 +++
 src/test/regress/sql/create_index.sql  |  12 +
 19 files changed, 663 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index e7d1ff9..1291f8d 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -105,6 +105,7 @@
literal~=/
   /entry
   entry
+   literallt;-gt;/
   /entry
  /row
  row
@@ -163,6 +164,7 @@
literal~=/
   /entry
   entry
+   literallt;-gt;/
   /entry
  /row
  row
@@ -207,6 +209,12 @@
   /table
 
  para
+  Currently, ordering by the distance operator literallt;-gt;/
+  is supported only with literalpoint/ by the operator classes
+  of the geometric types.
+ /para
+
+ para
   For historical reasons, the literalinet_ops/ operator class is
   not the default class for types typeinet/ and typecidr/.
   To use it, mention the class name in commandCREATE INDEX/,
@@ -780,6 +788,7 @@ my_distance(PG_FUNCTION_ARGS)
 data_type  *query = PG_GETARG_DATA_TYPE_P(1);
 StrategyNumber strategy = (StrategyNumber) PG_GETARG_UINT16(2);
 /* Oid subtype = PG_GETARG_OID(3); */
+/* bool *recheck = (bool *) PG_GETARG_POINTER(4); */
 data_type  *key = DatumGetDataType(entry-gt;key);
 double  retval;
 
@@ -792,14 +801,24 @@ my_distance(PG_FUNCTION_ARGS)
 /programlisting
 
The arguments to the functiondistance/ function are identical to
-   the arguments of the functionconsistent/ function, except that no
-   recheck flag is used.  The distance to a leaf index entry must always
-   be determined exactly, since there is no way to re-order the tuples
-   once they are returned.  Some approximation is allowed when determining
-   the distance to an internal tree node, so long as the result is never
-   greater than any child's actual distance.  Thus, for example, distance
-   to a bounding box is usually sufficient in geometric applications.  The
-   result value can be any finite typefloat8/ value.  (Infinity and
+   the arguments of the functionconsistent/ function.
+  /para
+
+  para
+   Some approximation is allowed when determining the distance, as long as
+   the result is never greater than the entry's actual distance. Thus, for
+   example, distance to a bounding box is usually sufficient in geometric
+   applications.  For an internal tree node, the distance returned must not
+   be greater than the distance to any of the child nodes. If the returned
+   distance is not accurate, the function must set *recheck to false. (This
+   is not necessary for internal tree nodes; for them, the calculation is
+   always assumed to be inaccurate). The executor will calculate the
+   accurate distance after fetching the tuple from the heap, and reorder
+   the tuples if necessary.
+  /para
+
+  para
+   The result value can be any finite typefloat8/ value.  (Infinity and
minus infinity are used internally to handle cases such as nulls, so it
is not recommended that functiondistance/ functions return these
values.)
diff --git

Re: [HACKERS] KNN-GiST with recheck

2015-05-14 Thread Heikki Linnakangas


On 05/14/2015 01:43 PM, Alexander Korotkov wrote:

On Wed, May 13, 2015 at 10:17 PM, Alexander Korotkov aekorot...@gmail.com
wrote:


One quick comment:


It would be good to avoid the extra comparisons of the distances, when
the index doesn't return any lossy items. As the patch stands, it adds one
extra copyDistances() call and a cmp_distances() call for each tuple (in a
knn-search), even if there are no lossy tuples.



I will fix it until Friday.



Attached patch is rebased against current master. Extra extra
copyDistances() call and a cmp_distances() call for each tuple are avoided
in the case of no lossy tuples.


Thanks! I spent some time cleaning this up:

* fixed a memory leak

* fixed a silly bug in rechecking multi-column scans

* I restructured the changes to IndexNext. I actually created a whole 
separate copy of IndexNext, called IndexNextWithReorder, that is used 
when there are ORDER BY expressions that might need to be rechecked. 
There is now some duplicated code between them, but I think they are 
both easier to understand this way. The IndexNext function is now as 
simple as before, and the IndexNextWithReorder doesn't need so many 
if()-checks on whether the reorder queue exists at all.


* I renamed Distance to OrderByValues in the executor parts. We call the 
ORDER BY x - y construct an ORDER BY expression, so let's continue 
using that terminology.


I think this is now ready for committing, but I'm pretty tired now so 
I'll read through this one more time in the morning, so that I won't 
wake up to a red buildfarm.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2015-05-14 Thread Alexander Korotkov

On Wed, May 13, 2015 at 10:17 PM, Alexander Korotkov aekorot...@gmail.com
wrote:

 One quick comment:

 It would be good to avoid the extra comparisons of the distances, when
 the index doesn't return any lossy items. As the patch stands, it adds one
 extra copyDistances() call and a cmp_distances() call for each tuple (in a
 knn-search), even if there are no lossy tuples.


 I will fix it until Friday.


Attached patch is rebased against current master. Extra extra
copyDistances() call and a cmp_distances() call for each tuple are avoided
in the case of no lossy tuples.

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-9.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2015-05-13 Thread Alexander Korotkov

On Wed, May 13, 2015 at 10:16 PM, Heikki Linnakangas hlinn...@iki.fi
wrote:

 On 04/17/2015 12:05 PM, Alexander Korotkov wrote:

 On Wed, Feb 25, 2015 at 12:15 PM, Alexander Korotkov 
 aekorot...@gmail.com
 wrote:

  Hi!

 On Tue, Feb 24, 2015 at 5:39 PM, Tomas Vondra 
 tomas.von...@2ndquadrant.com wrote:

  On 17.2.2015 14:21, Alexander Korotkov wrote:

 On Sun, Feb 15, 2015 at 2:08 PM, Alexander Korotkov
 aekorot...@gmail.com mailto:aekorot...@gmail.com wrote:

 Revised patch with reordering in GiST is attached
 (knn-gist-recheck-in-gist.patch) as well as testing script (test.py).


 I meant to do a bit of testing on this (assuming it's still needed), but
 the patches need rebasing - Heikki fixed a few issues, so they don't
 apply cleanly.


 Both patches are revised.


 Both patches are rebased against current master.


 This looks pretty much ready. I'm going to spend some time on this on
 Friday, and if all looks good, commit. (Thursday's a public holiday here).


Very good, thanks!


 One quick comment:

 It would be good to avoid the extra comparisons of the distances, when the
 index doesn't return any lossy items. As the patch stands, it adds one
 extra copyDistances() call and a cmp_distances() call for each tuple (in a
 knn-search), even if there are no lossy tuples.


I will fix it until Friday.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] KNN-GiST with recheck

2015-05-13 Thread Heikki Linnakangas


On 04/17/2015 12:05 PM, Alexander Korotkov wrote:

On Wed, Feb 25, 2015 at 12:15 PM, Alexander Korotkov aekorot...@gmail.com
wrote:


Hi!

On Tue, Feb 24, 2015 at 5:39 PM, Tomas Vondra 
tomas.von...@2ndquadrant.com wrote:


On 17.2.2015 14:21, Alexander Korotkov wrote:

On Sun, Feb 15, 2015 at 2:08 PM, Alexander Korotkov
aekorot...@gmail.com mailto:aekorot...@gmail.com wrote:

Revised patch with reordering in GiST is attached
(knn-gist-recheck-in-gist.patch) as well as testing script (test.py).


I meant to do a bit of testing on this (assuming it's still needed), but
the patches need rebasing - Heikki fixed a few issues, so they don't
apply cleanly.



Both patches are revised.



Both patches are rebased against current master.


This looks pretty much ready. I'm going to spend some time on this on 
Friday, and if all looks good, commit. (Thursday's a public holiday here).


One quick comment:

It would be good to avoid the extra comparisons of the distances, when 
the index doesn't return any lossy items. As the patch stands, it adds 
one extra copyDistances() call and a cmp_distances() call for each tuple 
(in a knn-search), even if there are no lossy tuples.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2015-04-17 Thread Alexander Korotkov

On Wed, Feb 25, 2015 at 12:15 PM, Alexander Korotkov aekorot...@gmail.com
wrote:

 Hi!

 On Tue, Feb 24, 2015 at 5:39 PM, Tomas Vondra 
 tomas.von...@2ndquadrant.com wrote:

 On 17.2.2015 14:21, Alexander Korotkov wrote:
  On Sun, Feb 15, 2015 at 2:08 PM, Alexander Korotkov
  aekorot...@gmail.com mailto:aekorot...@gmail.com wrote:
 
  Revised patch with reordering in GiST is attached
  (knn-gist-recheck-in-gist.patch) as well as testing script (test.py).

 I meant to do a bit of testing on this (assuming it's still needed), but
 the patches need rebasing - Heikki fixed a few issues, so they don't
 apply cleanly.


 Both patches are revised.


Both patches are rebased against current master.

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-8.patch
Description: Binary data


knn-gist-recheck-in-gist-3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2015-02-25 Thread Alexander Korotkov

Hi!

On Tue, Feb 24, 2015 at 5:39 PM, Tomas Vondra tomas.von...@2ndquadrant.com
wrote:

 On 17.2.2015 14:21, Alexander Korotkov wrote:
  On Sun, Feb 15, 2015 at 2:08 PM, Alexander Korotkov
  aekorot...@gmail.com mailto:aekorot...@gmail.com wrote:
 
  Revised patch with reordering in GiST is attached
  (knn-gist-recheck-in-gist.patch) as well as testing script (test.py).

 I meant to do a bit of testing on this (assuming it's still needed), but
 the patches need rebasing - Heikki fixed a few issues, so they don't
 apply cleanly.


Both patches are revised.

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-in-gist-2.patch
Description: Binary data


knn-gist-recheck-7.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2015-02-24 Thread Tomas Vondra

Hi,

On 17.2.2015 14:21, Alexander Korotkov wrote:
 On Sun, Feb 15, 2015 at 2:08 PM, Alexander Korotkov
 aekorot...@gmail.com mailto:aekorot...@gmail.com wrote:
 
 Revised patch with reordering in GiST is attached
 (knn-gist-recheck-in-gist.patch) as well as testing script (test.py).

I meant to do a bit of testing on this (assuming it's still needed), but
the patches need rebasing - Heikki fixed a few issues, so they don't
apply cleanly.

regards

-- 
Tomas Vondrahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2015-02-17 Thread Alexander Korotkov

On Sun, Feb 15, 2015 at 2:08 PM, Alexander Korotkov aekorot...@gmail.com
wrote:

 Following changes has been made in attached patch:

  * Get sort operators from pathkeys.
  * Recheck argument of distance function has been reverted.


Few comments were added and pairing heap comparison function was fixed in
attached version of patch (knn-gist-recheck-6.patch).
Also I expected that reordering in executor would be slower than reordering
in GiST because of maintaining two heaps instead of one. I've revised
version of patch with reordering in GiST to use pairing heap. I compare two
types of reordering on 10^7 random points and polygons. Results are below.
Test shows that overhead of reordering in executor is insignificant (less
than statistical error).

 Reorder in GiST   Reorder in executor
points
limit=10 0.10615   0.0880125
limit=1000.236668750.2292375
limit=1000   1.514868751.5208375
polygons
limit=10 0.116506250.1347
limit=1000.462793750.45294375
limit=1000   3.5170125 3.54868125

Revised patch with reordering in GiST is attached
(knn-gist-recheck-in-gist.patch) as well as testing script (test.py).

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-6.patch
Description: Binary data


knn-gist-recheck-in-gist.patch
Description: Binary data
#!/usr/bin/env python
import psycopg2
import json
import sys

# Create test tables with following statements.
#
# create table p as (select point(random(), random()) from generate_series(1,1000));
# create index p_idx on p using gist(v);
# create table g as (select polygon(3 + (random()*5)::int, circle(point(random(), random()), 0.001)) v from generate_series(1,1000));
# create index g_idx on g using gist(v);

dbconn = psycopg2.connect(dbname='postgres' user='smagen' host='/tmp' password='' port=5431)

points = []
pointsCount = 16
tableName = None

def generatePoints(n):
	global points
	m = 1
	d = 0.5
	points.append((0, 0))
	while m = n:
		for i in range(0, m):
			points.append((points[i][0] + d, points[i][1]))
			points.append((points[i][0], points[i][1] + d))
			points.append((points[i][0] + d, points[i][1] + d))
		d /= 2.0
		m *= 4

generatePoints(pointsCount)

def runKnn(point, limit):
	cursor = dbconn.cursor()
	cursor.execute(EXPLAIN (ANALYZE, FORMAT JSON) SELECT * FROM  + tableName +  ORDER BY v - %s::point LIMIT %s;, (point, limit))
	plan = cursor.fetchone()[0][0]
	cursor.close()
	return (plan['Planning Time'], plan['Execution Time'])

def makeTests(n, limit):
	planningTime = 0
	executionTime = 0
	for i in range(0, n):
		for j in range(0, pointsCount):
			point = '(' + str(points[j][0]) + ',' + str(points[j][1]) + ')'
			result = runKnn(point, limit)
			planningTime += result[0]
			executionTime += result[1]
	planningTime /= n * pointsCount
	executionTime /= n * pointsCount
	return (planningTime, executionTime)

if (len(sys.argv)  2):
	print Usage: %s table_name % sys.argv[0]
	sys.exit(2)

tableName = sys.argv[1]

for limit in [10, 100, 1000]:
	result = makeTests(10, limit)
	print limit: %s\nplanning:  %s\nexecution: %s % (limit, result[0], result[1])

dbconn.close()

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2015-02-15 Thread Alexander Korotkov

On Thu, Jan 8, 2015 at 1:12 AM, Alexander Korotkov aekorot...@gmail.com
wrote:

 On Tue, Dec 16, 2014 at 4:37 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

 Patch attached. It should be applied on top of my pairing heap patch at
 http://www.postgresql.org/message-id/548ffa2c.7060...@vmware.com. Some
 caveats:

 * The signature of the distance function is unchanged, it doesn't get a
 recheck argument. It is just assumed that if the consistent function sets
 the recheck flag, then the distance needs to be rechecked as well. We might
 want to add the recheck argument, like you Alexander did in your patch, but
 it's not important right now.


 I didn't get how that expected to work if we have only order by qual
 without filter qual. In this case consistent function just isn't called at
 all.

 * I used the distance term in the executor, although the ORDER BY expr
 machinery is more general than that. The value returned by the ORDER BY
 expression doesn't have to be a distance, although that's the only thing
 supported by GiST and the built-in opclasses.

 * I short-circuited the planner to assume that the ORDER BY expression
 always returns a float. That's true today for knn-GiST, but is obviously a
 bogus assumption in general.

 This needs some work to get into a committable state, but from a
 modularity point of view, this is much better than having the indexam to
 peek into the heap.


 Nice idea to put reordering into index scan node. Doesn't look like much
 of overengineering. I'm going to bring it to more commitable state.


Following changes has been made in attached patch:

 * Get sort operators from pathkeys.
 * Recheck argument of distance function has been reverted.

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-5.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2015-01-07 Thread Alexander Korotkov

On Tue, Dec 16, 2014 at 4:37 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 Patch attached. It should be applied on top of my pairing heap patch at
 http://www.postgresql.org/message-id/548ffa2c.7060...@vmware.com. Some
 caveats:

 * The signature of the distance function is unchanged, it doesn't get a
 recheck argument. It is just assumed that if the consistent function sets
 the recheck flag, then the distance needs to be rechecked as well. We might
 want to add the recheck argument, like you Alexander did in your patch, but
 it's not important right now.


I didn't get how that expected to work if we have only order by qual
without filter qual. In this case consistent function just isn't called at
all.

* I used the distance term in the executor, although the ORDER BY expr
 machinery is more general than that. The value returned by the ORDER BY
 expression doesn't have to be a distance, although that's the only thing
 supported by GiST and the built-in opclasses.

 * I short-circuited the planner to assume that the ORDER BY expression
 always returns a float. That's true today for knn-GiST, but is obviously a
 bogus assumption in general.

 This needs some work to get into a committable state, but from a
 modularity point of view, this is much better than having the indexam to
 peek into the heap.


Nice idea to put reordering into index scan node. Doesn't look like much of
overengineering. I'm going to bring it to more commitable state.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] KNN-GiST with recheck

2014-12-16 Thread Heikki Linnakangas


On 10/06/2014 12:36 PM, Emre Hasegeli wrote:

Thanks. The main question now is design of this patch. Currently, it does
all the work inside access method. We already have some discussion of pro
and cons of this method. I would like to clarify alternatives now. I can
see following way:

1. Implement new executor node which performs sorting by priority queue.
Let's call it Priority queue. I think it should be separate node from
Sort node. Despite Priority queue and Sort are essentially similar
from user view, they would be completely different in implementation.
2. Implement some interface to transfer distance values from access
method to Priority queue node.


If we assume that all of them need recheck, maybe it can be done
without passing distance values.


No, the executor needs the lower-bound distance value, as calculated by 
the indexam, so that it knows which tuples it can return from the queue 
already. For example, imagine the following items coming from the index:


tuple # lower bound actual distance
1   1   1
2   2   10
3   30  30
4   40  40

After the executor has fetched tuple 2, and re-checked the distance, it 
pushes the tuple to the queue. It then fetches tuple 3, with lower bound 
30, and it can now immediately return tuple # 2 from the queue. Because 
10  30, so there cannot be any more tuples coming from the index that 
would need to go before tuple # 2.


The executor needs the lower bound as calculated by the index, as well 
as the actual distance it calculates itself, to make those decisions.



3. Somehow tell the planner that it could use Priority queue in
corresponding cases. I see two ways of doing this:
   - Add flag to operator in opclass indicating that index can only
   order by lower bound of col op value, not by col op value itself.
   - Define new relation between operators. Value of one operator could
   be lower bound for value of another operator. So, planner can
put Priority
   queue node when lower bound ordering is possible from index. Also ALTER
   OPERATOR command would be reasonable, so extensions could upgrade.


I think, it would be better to make it a property of the operator
class.  We can add a column to pg_amop or define another value for
amoppurpose on pg_amop.  Syntax can be something like this:

CREATE OPERATOR CLASS circle_ops DEFAULT
FOR TYPE circle USING gist AS
OPERATOR 15  -(circle, point) FOR ORDER BY pg_catalog.float_ops LOWER 
BOUND;

While looking at it, I realize that current version of the patch does
not use the sort operator family defined with the operator class.  It
assumes that the distance function will return values compatible with
the operator.  Operator class definition makes me think that there is
not such an assumption.


Yeah. I also noticed that the type of the argument passed to the 
consistent function varies, and doesn't necessarily match that declared 
in pg_proc. Looking at gist_point_consistent, the argument type can be a 
point, a polygon, or a circle, depending on the strategy group. But 
it's declared as a point in pg_proc.



Besides overhead, this way makes significant infrastructural changes. So,
it may be over-engineering. However, it's probably more clean and beautiful
solution.
I would like to get some feedback from people familiar with KNN-GiST like
Heikki or Tom. What do you think about this? Any other ideas?


I would be happy to test and review the changes.  I think it is nice
to solve the problem in a generalized way improving the access method
infrastructure.  Definitely, we should have a consensus on the design
before working on the infrastructure changes.


I took a stab on this. I added the reorder queue directly to the Index 
Scan node, rather than adding a whole new node type for it. It seems 
reasonable, as Index Scan is responsible for rechecking the quals, too, 
even though re-ordering the tuples is more complicated than rechecking 
quals.


To recap, the idea is that the index can define an ordering op, even if 
it cannot return the tuples in exactly the right order. It is enough 
that for each tuple, it returns a lower bound of the expression that is 
used for sorting. For example, for ORDER BY key - column, it is 
enough that it returns a lower bound of key - column for each tuple. 
The index must return the tuples ordered by the lower bounds. The 
executor re-checks the expressions, and re-orders the tuples to the 
correct order.


Patch attached. It should be applied on top of my pairing heap patch at 
http://www.postgresql.org/message-id/548ffa2c.7060...@vmware.com. Some 
caveats:


* The signature of the distance function is unchanged, it doesn't get a 
recheck argument. It is just assumed that if the consistent function 
sets the recheck flag, then the distance needs to be rechecked as well. 
We might want to add the recheck argument, like you Alexander did in 
your

Re: [HACKERS] KNN-GiST with recheck

2014-12-15 Thread Heikki Linnakangas


On 08/03/2014 04:48 PM, Emre Hasegeli wrote:

1. This patch introduces a new polygon - point operator. That seems
useful on its own, with or without this patch.


Yeah, but exact-knn cant come with no one implementation. But it would
better come in a separate patch.


I tried to split them.  Separated patches are attached.  I changed
the order of the arguments as point - polygon, because point was
the first one on all the others.  Its commutator was required for
the index, so I added it on the second patch.  I also added tests
for the operator.  I think it is ready for committer as a separate
patch.  We can add it to the open CommitFest.


Ok, committed this part now with minor changes. The implementation was 
copy-pasted from circle - polygon, so I put the common logic to a 
dist_ppoly_internal function, and called that in both dist_cpoly and 
dist_ppoly.


I was surprised that there were no documentation changes in the patch, 
but looking at the docs, we just list the geometric operators without 
explaining what the argument types are. That's not very comprehensive, 
might be good to expand the docs on that, but it's not this patch's fault.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2014-12-14 Thread Michael Paquier

On Tue, Jan 28, 2014 at 10:54 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 1. This patch introduces a new polygon - point operator. That seems
 useful on its own, with or without this patch.
This patch is tracked with this entry in the commit fest app and is
marked as Ready for committer. Hence I am moving this specific part
to 2014-12 to keep track of it:

 3. A binary heap would be a better data structure to buffer the rechecked
 values. A Red-Black tree allows random insertions and deletions, but in this
 case you need to insert arbitrary values but only remove the minimum item.
 That's exactly what a binary heap excels at. We have a nice binary heap
 implementation in the backend that you can use, see
 src/backend/lib/binaryheap.c.
Based on those comments, I am marking this entry as returned with feedback:
https://commitfest.postgresql.org/action/patch_view?id=1367
Heikki has sent as well a new patch to use a binary heap method
instead of the red-black tree here:
http://www.postgresql.org/message-id/54886bb8.9040...@vmware.com
IMO this last patch should be added in the CF app, that's not the case now.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2014-11-28 Thread Alexander Korotkov

On Sat, Nov 22, 2014 at 2:20 AM, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Jan 13, 2014 at 9:17 AM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  This patch was split from thread:
 
 http://www.postgresql.org/message-id/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=gon-...@mail.gmail.com
 
  I've split it to separate thead, because it's related to partial sort
 only
  conceptually not technically. Also I renamed it to knn-gist-recheck
 from
  partial-knn as more appropriate name. In the attached version docs are
  updated. Possible weak point of this patch design is that it fetches heap
  tuple from GiST scan. However, I didn't receive any notes about its
 design,
  so, I'm going to put it to commitfest.


 The partial sort thing is not in the current 2014-10 commitfest
 (although this patch is). Is that intentional?


It's not. I just didn't revise partial sort yet :(

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] KNN-GiST with recheck

2014-10-06 Thread Emre Hasegeli

 Thanks. The main question now is design of this patch. Currently, it does
 all the work inside access method. We already have some discussion of pro
 and cons of this method. I would like to clarify alternatives now. I can
 see following way:
 
1. Implement new executor node which performs sorting by priority queue.
Let's call it Priority queue. I think it should be separate node from
Sort node. Despite Priority queue and Sort are essentially similar
from user view, they would be completely different in implementation.
2. Implement some interface to transfer distance values from access
method to Priority queue node.

If we assume that all of them need recheck, maybe it can be done
without passing distance values.

3. Somehow tell the planner that it could use Priority queue in
corresponding cases. I see two ways of doing this:
   - Add flag to operator in opclass indicating that index can only
   order by lower bound of col op value, not by col op value itself.
   - Define new relation between operators. Value of one operator could
   be lower bound for value of another operator. So, planner can
 put Priority
   queue node when lower bound ordering is possible from index. Also 
 ALTER
   OPERATOR command would be reasonable, so extensions could upgrade.

I think, it would be better to make it a property of the operator
class.  We can add a column to pg_amop or define another value for
amoppurpose on pg_amop.  Syntax can be something like this:

CREATE OPERATOR CLASS circle_ops DEFAULT
   FOR TYPE circle USING gist AS
   OPERATOR 15  -(circle, point) FOR ORDER BY pg_catalog.float_ops LOWER 
BOUND;

While looking at it, I realize that current version of the patch does
not use the sort operator family defined with the operator class.  It
assumes that the distance function will return values compatible with
the operator.  Operator class definition makes me think that there is
not such an assumption.

 Besides overhead, this way makes significant infrastructural changes. So,
 it may be over-engineering. However, it's probably more clean and beautiful
 solution.
 I would like to get some feedback from people familiar with KNN-GiST like
 Heikki or Tom. What do you think about this? Any other ideas?

I would be happy to test and review the changes.  I think it is nice
to solve the problem in a generalized way improving the access method
infrastructure.  Definitely, we should have a consensus on the design
before working on the infrastructure changes.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2014-09-29 Thread Alexander Korotkov

On Mon, Sep 29, 2014 at 6:16 AM, Bruce Momjian br...@momjian.us wrote:

 On Fri, Sep 26, 2014 at 10:49:42AM +0400, Alexander Korotkov wrote:
  Does this also fix the identical PostGIS problem or is there
 something
  PostGIS needs to do?
 
 
  This patch provides general infrastructure for recheck in KNN-GiST.
 PostGIS
  need corresponding change in its GiST opclass. Since PostGIS already
 define -
  and # operators as distance to bounding box border and bounding box
 center,
  it can't change their behaviour.
  it has to support new operator exact distance in opclass.

 Ah, OK, so they just need something that can be used for the recheck.  I
 think they currently use ST_Distance() for that.  Does it have to be an
 operator?  If they defined an operator for ST_Distance(), would
 ST_Distance() work too for KNN-GiST?


Currently, ST_Distance is a function, but it's no problem to make it an
operator with KNN-GiST support.


 In summary, you still create a normal GiST index on the column:


 http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html

 CREATE INDEX planet_osm_line_ref_index ON planet_osm_line(ref);

 which indexes by the bounding box.  The new code will allow ordered
 index hits to be filtered by something like ST_Distance(), rather than
 having to a LIMIT 50 in a CTE, then call ST_Distance(), like this:

 EXPLAIN ANALYZE WITH distance AS (
 SELECT way AS road, ref AS route
 FROM planet_osm_line
 WHERE highway = 'secondary'
 ORDER BY ST_GeomFromText('POLYGON((14239931.42
 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08
 3053984.84,14239931.42 3054117.72))', 900913) # way
 LIMIT 50
 )
 SELECT ST_Distance(ST_GeomFromText('POLYGON((14239931.42
 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08
 3053984.84,14239931.42 3054117.72))', 900913), road) AS true_distance, route
 FROM distance
 ORDER BY true_distance
 LIMIT 1;


Yeah. It this query 50 is pure empirical value. It could be both too low or
too high. Too low value can cause wrong query answers. Too high value can
cause lower performance. With patch simple KNN query will work like this
query with always right value in LIMIT clause.


 Notice the CTE uses # (bounding box center), and then the outer query
 uses ST_Distance and LIMIT 1 to find the closest item.

 Excellent!


Thanks. The main question now is design of this patch. Currently, it does
all the work inside access method. We already have some discussion of pro
and cons of this method. I would like to clarify alternatives now. I can
see following way:

   1. Implement new executor node which performs sorting by priority queue.
   Let's call it Priority queue. I think it should be separate node from
   Sort node. Despite Priority queue and Sort are essentially similar
   from user view, they would be completely different in implementation.
   2. Implement some interface to transfer distance values from access
   method to Priority queue node.
   3. Somehow tell the planner that it could use Priority queue in
   corresponding cases. I see two ways of doing this:
  - Add flag to operator in opclass indicating that index can only
  order by lower bound of col op value, not by col op value itself.
  - Define new relation between operators. Value of one operator could
  be lower bound for value of another operator. So, planner can
put Priority
  queue node when lower bound ordering is possible from index. Also ALTER
  OPERATOR command would be reasonable, so extensions could upgrade.

Besides overhead, this way makes significant infrastructural changes. So,
it may be over-engineering. However, it's probably more clean and beautiful
solution.
I would like to get some feedback from people familiar with KNN-GiST like
Heikki or Tom. What do you think about this? Any other ideas?

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] KNN-GiST with recheck

2014-09-28 Thread Bruce Momjian

On Fri, Sep 26, 2014 at 10:49:42AM +0400, Alexander Korotkov wrote:
 Does this also fix the identical PostGIS problem or is there something
 PostGIS needs to do?
 
 
 This patch provides general infrastructure for recheck in KNN-GiST. PostGIS
 need corresponding change in its GiST opclass. Since PostGIS already define 
 -
 and # operators as distance to bounding box border and bounding box center,
 it can't change their behaviour.
 it has to support new operator exact distance in opclass. 

Ah, OK, so they just need something that can be used for the recheck.  I
think they currently use ST_Distance() for that.  Does it have to be an
operator?  If they defined an operator for ST_Distance(), would
ST_Distance() work too for KNN-GiST?

In summary, you still create a normal GiST index on the column:

http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html

CREATE INDEX planet_osm_line_ref_index ON planet_osm_line(ref);

which indexes by the bounding box.  The new code will allow ordered
index hits to be filtered by something like ST_Distance(), rather than
having to a LIMIT 50 in a CTE, then call ST_Distance(), like this:

EXPLAIN ANALYZE WITH distance AS (
SELECT way AS road, ref AS route
FROM planet_osm_line
WHERE highway = 'secondary'
ORDER BY ST_GeomFromText('POLYGON((14239931.42 
3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08 
3053984.84,14239931.42 3054117.72))', 900913) # way
LIMIT 50
)
SELECT ST_Distance(ST_GeomFromText('POLYGON((14239931.42 
3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08 
3053984.84,14239931.42 3054117.72))', 900913), road) AS true_distance, route
FROM distance
ORDER BY true_distance
LIMIT 1;

Notice the CTE uses # (bounding box center), and then the outer query
uses ST_Distance and LIMIT 1 to find the closest item.

Excellent!

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2014-09-26 Thread Alexander Korotkov

On Fri, Sep 26, 2014 at 5:18 AM, Bruce Momjian br...@momjian.us wrote:

 On Sun, Sep 14, 2014 at 11:34:26PM +0400, Alexander Korotkov wrote:
   Cost estimation of GiST is a big problem anyway. It doesn't care
 (and
   can't) about amount of recheck for regular operators. In this
 patch, same
   would be for knn recheck. The problem is that touching heap from
 access

 This is very important work.  While our existing KNN-GiST index code
 works fine for scalar values and point-to-point distance ordering, it
 doesn't work well for 2-dimensional objects because they are only
 indexed by their bounding boxes (a rectangle around the object).  The
 indexed bounding box can't produce accurate distances to other objects.

 As an example, see this PostGIS blog post showing how to use LIMIT in a
 CTE to filter results and then compute the closest object (search for
 LIMIT 50):


 http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html

 This patch fixes our code for distances from a point to indexed 2-D
 objects.

 Does this also fix the identical PostGIS problem or is there something
 PostGIS needs to do?


This patch provides general infrastructure for recheck in KNN-GiST. PostGIS
need corresponding change in its GiST opclass. Since PostGIS already define
- and # operators as distance to bounding box border and bounding box
center, it can't change their behaviour.
it has to support new operator exact distance in opclass.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] KNN-GiST with recheck

2014-09-25 Thread Emre Hasegeli

 Fixed, thanks.

Here are my questions and comments about the code.

doc/src/sgml/gist.sgml:812:
be rechecked from heap tuple before tuple is returned.  If
literalrecheck/ flag isn't set then it's true by default for
compatibility reasons.  The literalrecheck/ flag can be used only

Recheck flag is set to false on gistget.c so I think it should say
false by default.  On the other hand, it is true by default on
the consistent function.  It is written as the safest assumption
on the code comments.  I don't know why the safest is chosen over
the backwards compatible for the consistent function.

src/backend/access/gist/gistget.c:505:
   /* Recheck distance from heap tuple if needed */
   if (GISTSearchItemIsHeap(*item) 
   searchTreeItemNeedDistanceRecheck(scan, 
 so-curTreeItem))
   {
   searchTreeItemDistanceRecheck(scan, 
 so-curTreeItem, item);
   continue;
   }

Why so-curTreeItem is passed to these functions?  They can use
scan-opaque-curTreeItem.

src/backend/access/gist/gistscan.c:49:
   /*
* When all distance values are the same, items without recheck
* can be immediately returned.  So they are placed first.
*/
   if (recheckCmp == 0  distance_a.recheck != distance_b.recheck)
   recheckCmp = distance_a.recheck ? 1 : -1;

I don't understand why items without recheck can be immediately
returned.  Do you think it will work correctly when there is
an operator class which will return recheck true and false for
the items under the same page?

src/backend/access/index/indexam.c:258:
   /* Prepare data structures for getting original indexed values from 
 heap */
   scan-indexInfo = BuildIndexInfo(scan-indexRelation);
   scan-estate = CreateExecutorState();
   scan-slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));

With the changes in indexam.c, heap access become legal for all index
access methods.  I think it is better than the previous version but
I am leaving the judgement to someone experienced.  I will try to
summarize the pros and cons of sorting the rows in the GiST access
method, as far as I understand.

Pros:

* It does not require another queue.  It should be effective to sort
  the rows inside the queue the GiST access method already has.
* It does not complicate index access method infrastructure.

Cons:

* It could be done without additional heap access.
* Other access methods could make use of the sorting infrastructure
  one day.
* It could be more transparent to the users.  Sorting information
  could be shown on the explain output.
* A more suitable data structure like binary heap could be used
  for the queue to sort the rows.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2014-09-25 Thread Alexander Korotkov

On Thu, Sep 25, 2014 at 9:00 PM, Emre Hasegeli e...@hasegeli.com wrote:

  Fixed, thanks.

 Here are my questions and comments about the code.

 doc/src/sgml/gist.sgml:812:
 be rechecked from heap tuple before tuple is returned.  If
 literalrecheck/ flag isn't set then it's true by default for
 compatibility reasons.  The literalrecheck/ flag can be used
 only

 Recheck flag is set to false on gistget.c so I think it should say
 false by default.  On the other hand, it is true by default on
 the consistent function.  It is written as the safest assumption
 on the code comments.  I don't know why the safest is chosen over
 the backwards compatible for the consistent function.


Agree. It should be clarified in docs.

src/backend/access/gist/gistget.c:505:
/* Recheck distance from heap tuple if needed */
if (GISTSearchItemIsHeap(*item) 
searchTreeItemNeedDistanceRecheck(scan,
 so-curTreeItem))
{
searchTreeItemDistanceRecheck(scan,
 so-curTreeItem, item);
continue;
}

 Why so-curTreeItem is passed to these functions?  They can use
 scan-opaque-curTreeItem.


I didn't get the difference. Few lines before:

GISTScanOpaque so = (GISTScanOpaque) scan-opaque;


 src/backend/access/gist/gistscan.c:49:
/*
 * When all distance values are the same, items without
 recheck
 * can be immediately returned.  So they are placed first.
 */
if (recheckCmp == 0  distance_a.recheck !=
 distance_b.recheck)
recheckCmp = distance_a.recheck ? 1 : -1;

 I don't understand why items without recheck can be immediately
 returned.  Do you think it will work correctly when there is
 an operator class which will return recheck true and false for
 the items under the same page?


Yes, I believe so. Item with recheck can't decrease it's distance, it can
only increase it. In the corner case  item can have same distance after
recheck as it was before. Then anyway items which distances are the same
can be returned in any order.


 src/backend/access/index/indexam.c:258:
/* Prepare data structures for getting original indexed values
 from heap */
scan-indexInfo = BuildIndexInfo(scan-indexRelation);
scan-estate = CreateExecutorState();
scan-slot =
 MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));

 With the changes in indexam.c, heap access become legal for all index
 access methods.  I think it is better than the previous version but
 I am leaving the judgement to someone experienced.  I will try to
 summarize the pros and cons of sorting the rows in the GiST access
 method, as far as I understand.

 Pros:

 * It does not require another queue.  It should be effective to sort
   the rows inside the queue the GiST access method already has.
 * It does not complicate index access method infrastructure.

 Cons:

 * It could be done without additional heap access.
 * Other access methods could make use of the sorting infrastructure
   one day.
 * It could be more transparent to the users.  Sorting information
   could be shown on the explain output.


It would be also nice to show some information about KNN itself.


 * A more suitable data structure like binary heap could be used
   for the queue to sort the rows.


Binary heap seems to be better data structure for whole KNN-GiST. But it's
a subject for a separate patch: replace RB-tree to heap in KNN-GiST. It's
not related to recheck stuff.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] KNN-GiST with recheck

2014-09-25 Thread Bruce Momjian

On Sun, Sep 14, 2014 at 11:34:26PM +0400, Alexander Korotkov wrote:
  Cost estimation of GiST is a big problem anyway. It doesn't care (and
  can't) about amount of recheck for regular operators. In this patch, 
 same
  would be for knn recheck. The problem is that touching heap from access

This is very important work.  While our existing KNN-GiST index code
works fine for scalar values and point-to-point distance ordering, it
doesn't work well for 2-dimensional objects because they are only
indexed by their bounding boxes (a rectangle around the object).  The
indexed bounding box can't produce accurate distances to other objects. 

As an example, see this PostGIS blog post showing how to use LIMIT in a
CTE to filter results and then compute the closest object (search for
LIMIT 50):

http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html

This patch fixes our code for distances from a point to indexed 2-D
objects.

Does this also fix the identical PostGIS problem or is there something
PostGIS needs to do?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2014-09-17 Thread Emre Hasegeli

  While looking it at I found a bug.  It returns the second column
  in wrong order when both of the distance functions return recheck = true.
  Test script attached to run on the regression database.  I tried to
  fix but could not.  searchTreeItemDistanceRecheck function is not
  very easy to follow.  I think it deserves more comments.
 
 Fixed, thanks. It was logical error in comparison function implementation.

I managed to break it again by ordering rows only by the second column
of the index.  Test script attached.


knn-gist-recheck-test-secondcolumn.sql
Description: application/sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2014-09-17 Thread Emre Hasegeli

 I managed to break it again by ordering rows only by the second column
 of the index.  Test script attached.

I was confused.  It is undefined behavior.  Sorry for the noise.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2014-09-17 Thread Alexander Korotkov

On Wed, Sep 17, 2014 at 12:30 PM, Emre Hasegeli e...@hasegeli.com wrote:

  I managed to break it again by ordering rows only by the second column
  of the index.  Test script attached.

 I was confused.  It is undefined behavior.  Sorry for the noise.


No problem. Thanks a lot for testing.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] KNN-GiST with recheck

2014-09-14 Thread Emre Hasegeli

I added the point to polygon distance operator patch to the open
CommitFest as ready for committer and added myself as reviewer to
both of the patches.

 I think that for most use cases just some operators require further sorting
 and some of them not. But it could appear one day that some index gives
 part of its knn answers exact and part of them inexact. Same happen to
 recheck of regular operators. Initially recheck flag was defined in
 opclass. But later recheck became runtime flag.

I cannot think of an use case, but it makes sense to add the flag to
the distance function just like the consistent function if we will go
with this implementation.
 
 Cost estimation of GiST is a big problem anyway. It doesn't care (and
 can't) about amount of recheck for regular operators. In this patch, same
 would be for knn recheck. The problem is that touching heap from access
 method breaks incapsulation. One idea about this is to do sorting in
 another nodes. However, I wonder if it would be an overengineering and
 overhead. In attached patch I propose a different approach: put code
 touching heap into separate index_get_heap_values function. Also new
 version of patch includes regression tests and some cleanup.

While looking it at I found a bug.  It returns the second column
in wrong order when both of the distance functions return recheck = true.
Test script attached to run on the regression database.  I tried to
fix but could not.  searchTreeItemDistanceRecheck function is not
very easy to follow.  I think it deserves more comments.


knn-gist-recheck-test-multicolumn.sql
Description: application/sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2014-09-14 Thread Alexander Korotkov

On Sun, Sep 14, 2014 at 10:09 PM, Emre Hasegeli e...@hasegeli.com wrote:

 I added the point to polygon distance operator patch to the open
 CommitFest as ready for committer and added myself as reviewer to
 both of the patches.


Thanks.

 Cost estimation of GiST is a big problem anyway. It doesn't care (and
  can't) about amount of recheck for regular operators. In this patch, same
  would be for knn recheck. The problem is that touching heap from access
  method breaks incapsulation. One idea about this is to do sorting in
  another nodes. However, I wonder if it would be an overengineering and
  overhead. In attached patch I propose a different approach: put code
  touching heap into separate index_get_heap_values function. Also new
  version of patch includes regression tests and some cleanup.

 While looking it at I found a bug.  It returns the second column
 in wrong order when both of the distance functions return recheck = true.
 Test script attached to run on the regression database.  I tried to
 fix but could not.  searchTreeItemDistanceRecheck function is not
 very easy to follow.  I think it deserves more comments.


Fixed, thanks. It was logical error in comparison function implementation.

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2014-08-04 Thread Alexander Korotkov

On Sun, Aug 3, 2014 at 5:48 PM, Emre Hasegeli e...@hasegeli.com wrote:

   1. This patch introduces a new polygon - point operator. That seems
   useful on its own, with or without this patch.
 
  Yeah, but exact-knn cant come with no one implementation. But it would
  better come in a separate patch.

 I tried to split them.  Separated patches are attached.  I changed
 the order of the arguments as point - polygon, because point was
 the first one on all the others.  Its commutator was required for
 the index, so I added it on the second patch.  I also added tests
 for the operator.  I think it is ready for committer as a separate
 patch.  We can add it to the open CommitFest.

 I have made some cosmetic changes on the patches.  I hope they are
 useful.

 I added support to point - circle operator with the same GiST
 distance function you added for polygon.  I can change it, if it is not
 the right way.


Great, thanks!

   2. I wonder how useful it really is to allow mixing exact and non-exact

return values from the distance function. The distance function
 included in
   the patch always returns recheck=true. I have a feeling that all other
   distance function will also always return either true or false.
 
  For geometrical datatypes recheck variations in consistent methods are
 also
  very rare (I can't remember any). But imagine opclass for arrays where
 keys
  have different representation depending on array length. For such opclass
  and knn on similarity recheck flag could be useful.

 I also wonder how useful it is.  Your example is convincing, but maybe
 setting it index-wide will make the decisions on the framework easier.
 For example, how hard would it be to decide if further sorting is
 required or not on the planner?


I think that for most use cases just some operators require further sorting
and some of them not. But it could appear one day that some index gives
part of its knn answers exact and part of them inexact. Same happen to
recheck of regular operators. Initially recheck flag was defined in
opclass. But later recheck became runtime flag.

  4. (as you mentioned in the other thread: ) It's a modularity violation
   that you peek into the heap tuple from gist. I think the proper way to
 do
   this would be to extend the IndexScan executor node to perform the
   re-shuffling of tuples that come from the index in wrong order, or
 perhaps
   add a new node type for it.
  
   Of course that's exactly what your partial sort patch does :-). I
 haven't
   looked at that in detail, but I don't think the approach the partial
 sort
   patch takes will work here as is. In the KNN-GiST case, the index is
   returning tuples roughly in the right order, but a tuple that it
 returns
   might in reality belong somewhere later in the ordering. In the partial
   sort patch, the input stream of tuples is divided into
 non-overlapping
   groups, so that the tuples within the group are not sorted, but the
 groups
   are. I think the partial sort case is a special case of the KNN-GiST
 case,
   if you consider the lower bound of each tuple to be the leading keys
 that
   you don't need to sort.
 
  Yes. But, for instance btree accesses heap for unique checking. Is really
  it so crimilal? :-)
  This is not only question of a new node or extending existing node. We
 need
  to teach planner/executor access method can return value of some
 expression
  which is lower bound of another expression. AFICS now access method can
  return only original indexed datums and TIDs. So, I afraid that enormous
  infrastructure changes are required. And I can hardly imagine what they
  should look like.

 Unfortunately, I am not experienced enough to judge your implementation.
 As far as I understand the problem is partially sorting rows on
 the index scan node.  It can lead the planner to choose non-optimal
 plans, because of not taking into account the cost of sorting.


Cost estimation of GiST is a big problem anyway. It doesn't care (and
can't) about amount of recheck for regular operators. In this patch, same
would be for knn recheck. The problem is that touching heap from access
method breaks incapsulation. One idea about this is to do sorting in
another nodes. However, I wonder if it would be an overengineering and
overhead. In attached patch I propose a different approach: put code
touching heap into separate index_get_heap_values function. Also new
version of patch includes regression tests and some cleanup.

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2014-08-03 Thread Emre Hasegeli

  1. This patch introduces a new polygon - point operator. That seems
  useful on its own, with or without this patch.
 
 Yeah, but exact-knn cant come with no one implementation. But it would
 better come in a separate patch.

I tried to split them.  Separated patches are attached.  I changed
the order of the arguments as point - polygon, because point was
the first one on all the others.  Its commutator was required for
the index, so I added it on the second patch.  I also added tests
for the operator.  I think it is ready for committer as a separate
patch.  We can add it to the open CommitFest.

I have made some cosmetic changes on the patches.  I hope they are
useful.

I added support to point - circle operator with the same GiST
distance function you added for polygon.  I can change it, if it is not
the right way.

  2. I wonder how useful it really is to allow mixing exact and non-exact
  return values from the distance function. The distance function included in
  the patch always returns recheck=true. I have a feeling that all other
  distance function will also always return either true or false.
 
 For geometrical datatypes recheck variations in consistent methods are also
 very rare (I can't remember any). But imagine opclass for arrays where keys
 have different representation depending on array length. For such opclass
 and knn on similarity recheck flag could be useful.

I also wonder how useful it is.  Your example is convincing, but maybe
setting it index-wide will make the decisions on the framework easier.
For example, how hard would it be to decide if further sorting is
required or not on the planner?

  4. (as you mentioned in the other thread: ) It's a modularity violation
  that you peek into the heap tuple from gist. I think the proper way to do
  this would be to extend the IndexScan executor node to perform the
  re-shuffling of tuples that come from the index in wrong order, or perhaps
  add a new node type for it.
 
  Of course that's exactly what your partial sort patch does :-). I haven't
  looked at that in detail, but I don't think the approach the partial sort
  patch takes will work here as is. In the KNN-GiST case, the index is
  returning tuples roughly in the right order, but a tuple that it returns
  might in reality belong somewhere later in the ordering. In the partial
  sort patch, the input stream of tuples is divided into non-overlapping
  groups, so that the tuples within the group are not sorted, but the groups
  are. I think the partial sort case is a special case of the KNN-GiST case,
  if you consider the lower bound of each tuple to be the leading keys that
  you don't need to sort.
 
 Yes. But, for instance btree accesses heap for unique checking. Is really
 it so crimilal? :-)
 This is not only question of a new node or extending existing node. We need
 to teach planner/executor access method can return value of some expression
 which is lower bound of another expression. AFICS now access method can
 return only original indexed datums and TIDs. So, I afraid that enormous
 infrastructure changes are required. And I can hardly imagine what they
 should look like.

Unfortunately, I am not experienced enough to judge your implementation.
As far as I understand the problem is partially sorting rows on
the index scan node.  It can lead the planner to choose non-optimal
plans, because of not taking into account the cost of sorting.
diff --git a/src/backend/utils/adt/geo_ops.c b/src/backend/utils/adt/geo_ops.c
index 54391fd..402ea40 100644
--- a/src/backend/utils/adt/geo_ops.c
+++ b/src/backend/utils/adt/geo_ops.c
@@ -2408,36 +2408,42 @@ lseg_interpt(PG_FUNCTION_ARGS)
  **Routines for position comparisons of differently-typed
  **2D objects.
  **
  ***/
 
 /*-
  * dist_
  * Minimum distance from one object to another.
  *---*/
 
+/*
+ * Distance from a point to a line
+ */
 Datum
 dist_pl(PG_FUNCTION_ARGS)
 {
Point  *pt = PG_GETARG_POINT_P(0);
LINE   *line = PG_GETARG_LINE_P(1);
 
PG_RETURN_FLOAT8(dist_pl_internal(pt, line));
 }
 
 static double
 dist_pl_internal(Point *pt, LINE *line)
 {
return fabs((line-A * pt-x + line-B * pt-y + line-C) /
HYPOT(line-A, line-B));
 }
 
+/*
+ * Distance from a point to a lseg
+ */
 Datum
 dist_ps(PG_FUNCTION_ARGS)
 {
Point  *pt = PG_GETARG_POINT_P(0);
LSEG   *lseg = PG_GETARG_LSEG_P(1);
 
PG_RETURN_FLOAT8(dist_ps_internal(pt, lseg));
 }
 
 static double
@@ -2487,21 +2493,21 @@ dist_ps_internal(Point *pt, LSEG *lseg)
result = point_dt(pt, lseg-p[0]);
tmpdist = point_dt(pt, lseg-p[1]);
if (tmpdist  result)

Re: [HACKERS] KNN-GiST with recheck

2014-02-03 Thread Alexander Korotkov

On Tue, Jan 28, 2014 at 9:32 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 01/28/2014 04:12 PM, Alexander Korotkov wrote:

 On Tue, Jan 28, 2014 at 5:54 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com

 wrote:


  4. (as you mentioned in the other thread: ) It's a modularity violation
 that you peek into the heap tuple from gist. I think the proper way to do
 this would be to extend the IndexScan executor node to perform the
 re-shuffling of tuples that come from the index in wrong order, or
 perhaps
 add a new node type for it.

 Of course that's exactly what your partial sort patch does :-). I haven't
 looked at that in detail, but I don't think the approach the partial sort
 patch takes will work here as is. In the KNN-GiST case, the index is
 returning tuples roughly in the right order, but a tuple that it returns
 might in reality belong somewhere later in the ordering. In the partial
 sort patch, the input stream of tuples is divided into non-overlapping
 groups, so that the tuples within the group are not sorted, but the
 groups
 are. I think the partial sort case is a special case of the KNN-GiST
 case,
 if you consider the lower bound of each tuple to be the leading keys that
 you don't need to sort.


 Yes. But, for instance btree accesses heap for unique checking. Is really
 it so crimilal? :-)


 Well, it is generally considered an ugly hack in b-tree too. I'm not 100%
 opposed to doing such a hack in GiST, but would very much prefer not to.


  This is not only question of a new node or extending existing node. We
 need
 to teach planner/executor access method can return value of some
 expression
 which is lower bound of another expression. AFICS now access method can
 return only original indexed datums and TIDs. So, I afraid that enormous
 infrastructure changes are required. And I can hardly imagine what they
 should look like.


 Yeah, I'm not sure either. Maybe a new field in IndexScanDesc, along with
 xs_itup. Or as an attribute of xs_itup itself.


This shouldn't look like a hack too. Otherwise I see no point of it: it's
better to have some isolated hack in access method than hack in
planner/executor. So I see following changes to be needed to implement this
right way:
1) Implement new relation between operators: operator1 is lower bound of
operator2.
2) Extend am interface to let it return values of operators.
3) Implement new node for knn-sorting.
However, it requires a lot of changes in PostgreSQL infrastructure and can
appear to be not enough general too (we don't know until we have another
application).

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] KNN-GiST with recheck

2014-01-28 Thread Heikki Linnakangas


On 01/13/2014 07:17 PM, Alexander Korotkov wrote:

Here goes a desription of this patch same as in original thread.

KNN-GiST provides ability to get ordered results from index, but this order
is based only on index information. For instance, GiST index contains
bounding rectangles for polygons, and we can't get exact distance to
polygon from index (similar situation is in PostGIS). In attached patch,
GiST distance method can set recheck flag (similar to consistent method).
This flag means that distance method returned lower bound of distance and
we should recheck it from heap.

See an example.

create table test as (select id, polygon(3+(random()*10)::int,
circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
generate_series(1,100) id);
create index test_idx on test using gist (p);

We can get results ordered by distance from polygon to point.

postgres=# select id, p - point(0.5,0.5) from test order by p -
point(0.5,0.5) limit 10;
id   |   ?column?
+--
  755611 | 0.000405855808916853
  807562 | 0.000464123777564343
  437778 | 0.000738524708741959
  947860 |  0.00076250998760724
  389843 | 0.000886362723569568
   17586 | 0.000981960100555216
  411329 |  0.00145338112316853
  894191 |  0.00149399559703506
  391907 |   0.0016647896049741
  235381 |  0.00167554614889509
(10 rows)

It's fast using just index scan.

 QUERY PLAN

--
  Limit  (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
rows=10 loops=1)
-  Index Scan using test_idx on test  (cost=0.29..157672.29
rows=100 width=36) (actual time=0.179..0.228 rows=10 loops=1)
  Order By: (p - '(0.5,0.5)'::point)
  Total runtime: 0.305 ms
(4 rows)


Nice! Some thoughts:

1. This patch introduces a new polygon - point operator. That seems 
useful on its own, with or without this patch.


2. I wonder how useful it really is to allow mixing exact and non-exact 
return values from the distance function. The distance function included 
in the patch always returns recheck=true. I have a feeling that all 
other distance function will also always return either true or false.


3. A binary heap would be a better data structure to buffer the 
rechecked values. A Red-Black tree allows random insertions and 
deletions, but in this case you need to insert arbitrary values but only 
remove the minimum item. That's exactly what a binary heap excels at. We 
have a nice binary heap implementation in the backend that you can use, 
see src/backend/lib/binaryheap.c.


4. (as you mentioned in the other thread: ) It's a modularity violation 
that you peek into the heap tuple from gist. I think the proper way to 
do this would be to extend the IndexScan executor node to perform the 
re-shuffling of tuples that come from the index in wrong order, or 
perhaps add a new node type for it.


Of course that's exactly what your partial sort patch does :-). I 
haven't looked at that in detail, but I don't think the approach the 
partial sort patch takes will work here as is. In the KNN-GiST case, the 
index is returning tuples roughly in the right order, but a tuple that 
it returns might in reality belong somewhere later in the ordering. In 
the partial sort patch, the input stream of tuples is divided into 
non-overlapping groups, so that the tuples within the group are not 
sorted, but the groups are. I think the partial sort case is a special 
case of the KNN-GiST case, if you consider the lower bound of each tuple 
to be the leading keys that you don't need to sort.


BTW, this capability might also be highly useful for the min/max indexes 
as well. A min/max index cannot return an exact ordering of tuples, but 
it can also give a lower bound for a group of tuples.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] KNN-GiST with recheck

2014-01-28 Thread Alexander Korotkov

On Tue, Jan 28, 2014 at 5:54 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 01/13/2014 07:17 PM, Alexander Korotkov wrote:

 Here goes a desription of this patch same as in original thread.

 KNN-GiST provides ability to get ordered results from index, but this
 order
 is based only on index information. For instance, GiST index contains
 bounding rectangles for polygons, and we can't get exact distance to
 polygon from index (similar situation is in PostGIS). In attached patch,
 GiST distance method can set recheck flag (similar to consistent method).
 This flag means that distance method returned lower bound of distance and
 we should recheck it from heap.

 See an example.

 create table test as (select id, polygon(3+(random()*10)::int,
 circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
 generate_series(1,100) id);
 create index test_idx on test using gist (p);

 We can get results ordered by distance from polygon to point.

 postgres=# select id, p - point(0.5,0.5) from test order by p -
 point(0.5,0.5) limit 10;
 id   |   ?column?
 +--
   755611 | 0.000405855808916853
   807562 | 0.000464123777564343
   437778 | 0.000738524708741959
   947860 |  0.00076250998760724
   389843 | 0.000886362723569568
17586 | 0.000981960100555216
   411329 |  0.00145338112316853
   894191 |  0.00149399559703506
   391907 |   0.0016647896049741
   235381 |  0.00167554614889509
 (10 rows)

 It's fast using just index scan.

  QUERY PLAN

 
 --
   Limit  (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
 rows=10 loops=1)
 -  Index Scan using test_idx on test  (cost=0.29..157672.29
 rows=100 width=36) (actual time=0.179..0.228 rows=10 loops=1)
   Order By: (p - '(0.5,0.5)'::point)
   Total runtime: 0.305 ms
 (4 rows)


 Nice! Some thoughts:

 1. This patch introduces a new polygon - point operator. That seems
 useful on its own, with or without this patch.


Yeah, but exact-knn cant come with no one implementation. But it would
better come in a separate patch.

2. I wonder how useful it really is to allow mixing exact and non-exact
 return values from the distance function. The distance function included in
 the patch always returns recheck=true. I have a feeling that all other
 distance function will also always return either true or false.


For geometrical datatypes recheck variations in consistent methods are also
very rare (I can't remember any). But imagine opclass for arrays where keys
have different representation depending on array length. For such opclass
and knn on similarity recheck flag could be useful.



 3. A binary heap would be a better data structure to buffer the rechecked
 values. A Red-Black tree allows random insertions and deletions, but in
 this case you need to insert arbitrary values but only remove the minimum
 item. That's exactly what a binary heap excels at. We have a nice binary
 heap implementation in the backend that you can use, see
 src/backend/lib/binaryheap.c.


Hmm. For me binary heap would be a better data structure for KNN-GiST at
all :-)


 4. (as you mentioned in the other thread: ) It's a modularity violation
 that you peek into the heap tuple from gist. I think the proper way to do
 this would be to extend the IndexScan executor node to perform the
 re-shuffling of tuples that come from the index in wrong order, or perhaps
 add a new node type for it.

 Of course that's exactly what your partial sort patch does :-). I haven't
 looked at that in detail, but I don't think the approach the partial sort
 patch takes will work here as is. In the KNN-GiST case, the index is
 returning tuples roughly in the right order, but a tuple that it returns
 might in reality belong somewhere later in the ordering. In the partial
 sort patch, the input stream of tuples is divided into non-overlapping
 groups, so that the tuples within the group are not sorted, but the groups
 are. I think the partial sort case is a special case of the KNN-GiST case,
 if you consider the lower bound of each tuple to be the leading keys that
 you don't need to sort.


Yes. But, for instance btree accesses heap for unique checking. Is really
it so crimilal? :-)
This is not only question of a new node or extending existing node. We need
to teach planner/executor access method can return value of some expression
which is lower bound of another expression. AFICS now access method can
return only original indexed datums and TIDs. So, I afraid that enormous
infrastructure changes are required. And I can hardly imagine what they
should look like.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] KNN-GiST with recheck

2014-01-28 Thread Heikki Linnakangas


On 01/28/2014 04:12 PM, Alexander Korotkov wrote:

On Tue, Jan 28, 2014 at 5:54 PM, Heikki Linnakangas hlinnakan...@vmware.com

wrote:



4. (as you mentioned in the other thread: ) It's a modularity violation
that you peek into the heap tuple from gist. I think the proper way to do
this would be to extend the IndexScan executor node to perform the
re-shuffling of tuples that come from the index in wrong order, or perhaps
add a new node type for it.

Of course that's exactly what your partial sort patch does :-). I haven't
looked at that in detail, but I don't think the approach the partial sort
patch takes will work here as is. In the KNN-GiST case, the index is
returning tuples roughly in the right order, but a tuple that it returns
might in reality belong somewhere later in the ordering. In the partial
sort patch, the input stream of tuples is divided into non-overlapping
groups, so that the tuples within the group are not sorted, but the groups
are. I think the partial sort case is a special case of the KNN-GiST case,
if you consider the lower bound of each tuple to be the leading keys that
you don't need to sort.


Yes. But, for instance btree accesses heap for unique checking. Is really
it so crimilal? :-)


Well, it is generally considered an ugly hack in b-tree too. I'm not 
100% opposed to doing such a hack in GiST, but would very much prefer 
not to.



This is not only question of a new node or extending existing node. We need
to teach planner/executor access method can return value of some expression
which is lower bound of another expression. AFICS now access method can
return only original indexed datums and TIDs. So, I afraid that enormous
infrastructure changes are required. And I can hardly imagine what they
should look like.


Yeah, I'm not sure either. Maybe a new field in IndexScanDesc, along 
with xs_itup. Or as an attribute of xs_itup itself.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] KNN-GiST with recheck

2014-01-13 Thread Alexander Korotkov

Hackers!

This patch was split from thread:
http://www.postgresql.org/message-id/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=gon-...@mail.gmail.com

I've split it to separate thead, because it's related to partial sort only
conceptually not technically. Also I renamed it to knn-gist-recheck from
partial-knn as more appropriate name. In the attached version docs are
updated. Possible weak point of this patch design is that it fetches heap
tuple from GiST scan. However, I didn't receive any notes about its design,
so, I'm going to put it to commitfest.

Here goes a desription of this patch same as in original thread.

KNN-GiST provides ability to get ordered results from index, but this order
is based only on index information. For instance, GiST index contains
bounding rectangles for polygons, and we can't get exact distance to
polygon from index (similar situation is in PostGIS). In attached patch,
GiST distance method can set recheck flag (similar to consistent method).
This flag means that distance method returned lower bound of distance and
we should recheck it from heap.

See an example.

create table test as (select id, polygon(3+(random()*10)::int,
circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
generate_series(1,100) id);
create index test_idx on test using gist (p);

We can get results ordered by distance from polygon to point.

postgres=# select id, p - point(0.5,0.5) from test order by p -
point(0.5,0.5) limit 10;
   id   |   ?column?
+--
 755611 | 0.000405855808916853
 807562 | 0.000464123777564343
 437778 | 0.000738524708741959
 947860 |  0.00076250998760724
 389843 | 0.000886362723569568
  17586 | 0.000981960100555216
 411329 |  0.00145338112316853
 894191 |  0.00149399559703506
 391907 |   0.0016647896049741
 235381 |  0.00167554614889509
(10 rows)

It's fast using just index scan.

QUERY PLAN

--
 Limit  (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
rows=10 loops=1)
   -  Index Scan using test_idx on test  (cost=0.29..157672.29
rows=100 width=36) (actual time=0.179..0.228 rows=10 loops=1)
 Order By: (p - '(0.5,0.5)'::point)
 Total runtime: 0.305 ms
(4 rows)

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-1.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

41 matches

Mail list logo