Re: [HACKERS] KNN-GiST with recheck

2015-01-07 Thread Alexander Korotkov
On Tue, Dec 16, 2014 at 4:37 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 Patch attached. It should be applied on top of my pairing heap patch at
 http://www.postgresql.org/message-id/548ffa2c.7060...@vmware.com. Some
 caveats:

 * The signature of the distance function is unchanged, it doesn't get a
 recheck argument. It is just assumed that if the consistent function sets
 the recheck flag, then the distance needs to be rechecked as well. We might
 want to add the recheck argument, like you Alexander did in your patch, but
 it's not important right now.


I didn't get how that is expected to work if we have only an ORDER BY qual
without a filter qual. In that case the consistent function just isn't
called at all.

* I used the distance term in the executor, although the ORDER BY expr
 machinery is more general than that. The value returned by the ORDER BY
 expression doesn't have to be a distance, although that's the only thing
 supported by GiST and the built-in opclasses.

 * I short-circuited the planner to assume that the ORDER BY expression
 always returns a float. That's true today for knn-GiST, but is obviously a
 bogus assumption in general.

 This needs some work to get into a committable state, but from a
 modularity point of view, this is much better than having the indexam to
 peek into the heap.


Nice idea to put the reordering into the index scan node. It doesn't look
like overengineering. I'm going to bring it to a more committable state.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Commit fest 2014-12, let's begin!

2014-12-15 Thread Alexander Korotkov
On Mon, Dec 15, 2014 at 4:12 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 12/15/2014 08:37 AM, Michael Paquier wrote:

 On Mon, Dec 15, 2014 at 3:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Michael Paquier michael.paqu...@gmail.com writes:

 - Point to polygon distance operator

 I looked at that briefly during the last fest, but was unsure whether it
 was too entangled with the GiST patches that Heikki was looking at.

 Recalling my impressions from this morning, the two patches are rather independent.


 Right. I also looked at it briefly, but I wasn't sure if we really want
 it. AFAICT, no-one has actually asked for that operator, it was written
 only to be an example of an operator that would benefit from the knn-gist
 with recheck patch. If there is some other, real, use for the knn-gist with
 recheck patch, then I'm OK with that, but otherwise it's dubious to add an
 operator just so that it can then be made faster by another patch. That
 said, it seems quite harmless, so might as well commit it.


Lack of recheck is a major limitation of KNN-GiST now. People are not asking
for it because they don't know it is what's needed to implement exact KNN for
PostGIS. For now they have to invent kluges like this:

WITH closest_candidates AS (
  SELECT
streets.gid,
streets.name,
streets.geom
  FROM
nyc_streets streets
  ORDER BY
    streets.geom <->
    'SRID=26918;POINT(583571.905921312 4506714.34119218)'::geometry
  LIMIT 100
)
SELECT gid, name
FROM closest_candidates
ORDER BY
  ST_Distance(
geom,
'SRID=26918;POINT(583571.905921312 4506714.34119218)'::geometry
)
LIMIT 1;

See blog posts:
http://blog.light42.com/wordpress/?p=102
http://workshops.boundlessgeo.com/postgis-intro/knn.html

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Commit fest 2014-12, let's begin!

2014-12-15 Thread Alexander Korotkov
On Mon, Dec 15, 2014 at 6:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  On Mon, Dec 15, 2014 at 4:12 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com
  wrote:
  Right. I also looked at it briefly, but I wasn't sure if we really want
  it. AFAICT, no-one has actually asked for that operator, it was written
  only to be an example of an operator that would benefit from the
  knn-gist with recheck patch.

  Lack of recheck is a major limitation of KNN-GiST now. People are not
  asking for it because they don't know it is what's needed to implement
  exact KNN for PostGIS. For now they have to invent kluges like this:
  [ query using ORDER BY ST_Distance ]

 It's not apparent to me that the proposed operator is a replacement for
 ST_Distance.  The underlying data in an example like this won't be either
 points or polygons, it'll be PostGIS datatypes.

 In short, I believe that PostGIS could use what you're talking about,
 but I agree with Heikki's objection that nobody has asked for this
 particular operator.


polygon <-> point is for sure not an ST_Distance replacement. I was making
that argument about KNN-GiST with recheck itself. polygon <-> point is
needed just as an in-core example of KNN-GiST with recheck.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Commit fest 2014-12, let's begin!

2014-12-15 Thread Alexander Korotkov
On Mon, Dec 15, 2014 at 7:47 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 12/15/2014 05:22 PM, Alexander Korotkov wrote:

 On Mon, Dec 15, 2014 at 6:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:


 Alexander Korotkov aekorot...@gmail.com writes:

 On Mon, Dec 15, 2014 at 4:12 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:

 Right. I also looked at it briefly, but I wasn't sure if we really want
 it. AFAICT, no-one has actually asked for that operator, it was written
 only to be an example of an operator that would benefit from the
 knn-gist with recheck patch.

 Lack of recheck is a major limitation of KNN-GiST now. People are not
 asking for it because they don't know it is what's needed to implement
 exact KNN for PostGIS. For now they have to invent kluges like this:
 [ query using ORDER BY ST_Distance ]

 It's not apparent to me that the proposed operator is a replacement for
 ST_Distance.  The underlying data in an example like this won't be either
 points or polygons, it'll be PostGIS datatypes.

 In short, I believe that PostGIS could use what you're talking about,
 but I agree with Heikki's objection that nobody has asked for this
 particular operator.

 polygon <-> point is for sure not an ST_Distance replacement. I was making
 that argument about KNN-GiST with recheck itself. polygon <-> point is
 needed just as an in-core example of KNN-GiST with recheck.


 Right. I don't think point <-> polygon is too useful by itself, but we
 need an example in core that can make use of the KNN-GiST recheck patch. We
 can't write a regression test for it otherwise, for starters.

 Actually, we probably could've used circle <-> polygon for that just
 as well...


Did you mean searching for circles or polygons in the last sentence?

--
With best regards,
Alexander Korotkov.


[HACKERS] Report search_path value back to the client.

2014-12-02 Thread Alexander Kukushkin
Hi,

As of now postgres reports only the really important variables, but among
them there are also not-so-important ones, like 'application_name'. The only
reason to report it was to help pgbouncer, and perhaps other connection
poolers. The change was introduced by commit
59ed94ad0c9f74a3f057f359316c845cedc4461e.

This fact makes me wonder why the 'search_path' value is not reported back
to the client. The use-case is absolutely the same as with
'application_name', but a little bit more important.

Our databases provide different versions of stored procedures which are
located in different schemas: 'api_version1', 'api_version2', 'api_version5',
etc. When an application establishes a connection to the database, it sets
the search_path value, for example, to api_version1. At the same time, a new
version of the same application will set the search_path value to
api_version2. Sometimes we have hundreds of instances of applications which
may use different versions of stored procedures located in different
schemas.

It's really crazy to keep so many (hundreds of) connections to the database,
and it would be much better to have something like pgbouncer in front of
postgres. Right now that's not possible, because pgbouncer is not aware of
search_path, and it's not really possible to fix pgbouncer, because there is
no easy way to get the current value of search_path.
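
Today the only way for a pooler or driver to learn the effective search_path
is to issue an extra query, e.g. (a minimal illustration; the schema name
comes from the example above):

SET search_path TO api_version2, public;
SELECT current_setting('search_path');  -- an extra round trip per connection checkout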

I would like to mark 'search_path' as GUC_REPORT:
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2904,7 +2904,7 @@ static struct config_string ConfigureNamesString[] =
 	{"search_path", PGC_USERSET, CLIENT_CONN_STATEMENT,
 		gettext_noop("Sets the schema search order for "
 					 "names that are not schema-qualified."),
 		NULL,
-		GUC_LIST_INPUT | GUC_LIST_QUOTE
+		GUC_LIST_INPUT | GUC_LIST_QUOTE | GUC_REPORT
 	},
 	&namespace_search_path,
 	"\"$user\",public",

What do you think?

Regards,
--
Alexander Kukushkin


Re: [HACKERS] Fillfactor for GIN indexes

2014-11-28 Thread Alexander Korotkov
Hi!

On Fri, Nov 21, 2014 at 8:12 AM, Michael Paquier michael.paqu...@gmail.com
wrote:

 Please find attached a simple patch adding fillfactor as a storage parameter
 for GIN indexes. The default value is the same as the current one, aka 100,
 to have the pages completely packed when a GIN index is created.


That's not true. Let us discuss it a little bit.
In the btree access method, an index is created by sorting index tuples.
Once they are sorted, it can write a tree with any fullness of the pages.
With completely filled pages you will get a flood of index page splits on
table updates or inserts. In order to avoid that, the default fillfactor is
90. Besides index creation, the btree access method tries to follow the
given fillfactor when ordered data is inserted: when the rightmost page
splits, fillfactor percent of the data goes to the left page.
A btree page's life cycle runs between being freshly created by a split
(50% filled) and being fully packed (100% filled) and then split again. Thus
we can roughly estimate the average fullness of a btree page to be 75%. This
equilibrium state is reached after a huge number of inserts into the btree
index.

Fillfactor spread to other access methods, for instance GiST. In GiST, the
tree is always constructed by page splits, so a freshly created GiST index
is already in its equilibrium state. Moreover, GiST doesn't split pages in
halves: the split ratio depends on the opclass, so the equilibrium fullness
of a particular GiST index is even less than 75%. What does fillfactor do
with GiST? It makes GiST pages split at fillfactor fullness during index
build, so the index is even less full. But why? You won't get a flood of
splits on a fresh GiST index anyway, because it's in its equilibrium state.
That's why I would rather remove fillfactor from GiST until we have some
algorithm to construct GiST without page splits.

What is going on with GIN? GIN is, basically, a two-level nested btree, so
fillfactor should be applicable here. But GIN is not constructed like
btree...
GIN accumulates maintenance_work_mem of data and then inserts it into the
tree. So a fresh GIN index is the result of inserting maintenance_work_mem
buckets, each of them sorted. On a small index (or with a large
maintenance_work_mem) there would be only one bucket. GIN has two levels of
btree: the entry tree and the posting tree. Currently the entry tree has no
special logic to control fullness during index build. So, with only one
bucket the entry tree will be 50% filled; with many buckets the entry tree
will be at its equilibrium state. In some specific cases (about two buckets)
the entry tree can be almost fully packed. Insertion into the posting tree
is always ordered during index build, because the posting tree is a tree of
TIDs. When a posting tree page splits, it leaves the left page fully packed
during index build and 75% packed during normal insertion (because it
expects some sequential inserts).

The idea of the GIN build algorithm is that the entry tree is expected to be
relatively small and fit in cache, and insertions into posting trees are
ordered. Thus this algorithm should be IO-efficient and have low overhead.
However, with large entry trees this algorithm can perform much worse.

My summary is the following:
1) In order to have fully correct support of fillfactor in GIN, we need to
rewrite the GIN build algorithm.
2) Without rewriting the GIN build algorithm, not much can be done with the
entry tree. However, you can implement some heuristics.
3) You definitely need to touch the code that selects the split ratio in
dataPlaceToPageLeaf (starting with "if (!btree->isBuild)").
4) GIN data pages are always compressed, except in indexes pg_upgraded from
pre-9.4. Take care about that in the following code:
  if (GinPageIsCompressed(page))
  	freespace = GinDataLeafPageGetFreeSpace(page);
+ else if (btree->isBuild)
+ 	freespace = BLCKSZ * (100 - fillfactor) / 100;
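
For reference, with such a patch the parameter would presumably be set the
same way as btree's existing fillfactor (a sketch; the table and index names
are made up):

CREATE INDEX test_gin_idx ON test USING gin (array_agg) WITH (fillfactor = 90);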

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] KNN-GiST with recheck

2014-11-28 Thread Alexander Korotkov
On Sat, Nov 22, 2014 at 2:20 AM, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Jan 13, 2014 at 9:17 AM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  This patch was split from thread:
 
 http://www.postgresql.org/message-id/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=gon-...@mail.gmail.com
 
  I've split it to a separate thread, because it's related to partial sort
  only conceptually, not technically. Also I renamed it to knn-gist-recheck
  from partial-knn as a more appropriate name. In the attached version the
  docs are updated. A possible weak point of this patch's design is that it
  fetches the heap tuple from the GiST scan. However, I didn't receive any
  notes about its design, so I'm going to put it into the commitfest.


 The partial sort thing is not in the current 2014-10 commitfest
 (although this patch is). Is that intentional?


It's not. I just didn't revise partial sort yet :(

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: Access method extendability

2014-11-24 Thread Alexander Korotkov
Hi, Heikki!

Thank you for summarizing. In general, I agree with your notes with
some exceptions.

On Mon, Nov 24, 2014 at 1:52 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 11/10/2014 10:30 PM, Alexander Korotkov wrote:

 Disallowing the CREATE ACCESS METHOD command seems problematic to me. How
 could it work with pg_upgrade? pg_dump wouldn't dump extra pg_am records,
 so pg_upgrade would break when creating operator classes on the new
 cluster. So, I agree with dropping the CREATE AM command only if we let
 pg_dump dump extra pg_am records...


 pg_dump would dump the CREATE EXTENSION command, and the extension's
 installation script inserts the row to pg_am. pg_dump doesn't dump objects
 that are part of an extension, so that's how this would work with the
 CREATE ACCESS METHOD command, too.


In binary upgrade mode, pg_dump has to guarantee that all database objects
will have the same OIDs. That's why in binary upgrade mode pg_dump dumps the
extension contents instead of just a CREATE EXTENSION command.


 Backtracking a bit, to summarize the discussion so far:

 * It would be nice to have indexes that are not WAL-logged, but are
 automatically re-created after a crash. But it's a completely different and
 orthogonal feature, so there's no need to discuss that further in this
 thread.

 * If an extension is buggy, it can crash and corrupt the whole database.
 There isn't really anything we can do about that, and this patch doesn't
 make that any better or worse.

 * CREATE ACCESS METHOD command isn't worth it.


Taking into account my previous note, how can custom AM extensions survive
pg_upgrade without a CREATE ACCESS METHOD command?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: Access method extendability

2014-11-10 Thread Alexander Korotkov
Hi!

Thanks everybody for the discussion. Sorry for the delay. I'll try to
address most of the questions raised in this discussion.

In general, I agree with the concept of marking an index as invalid in
certain cases.
I see the following possible consequences of a buggy WAL-logged custom AM:
1) Server crash during insert/update.
2) Server crash during select.
3) Invalid query answers.
4) Server crash during vacuum.
5) Server crash in recovery.

#5 is totally unacceptable. I've tried to design the generic WAL record so
that it is always possible to purely mechanically apply the record: it's
always possible to move a piece of memory inside the page, and it's always
possible to copy a piece of memory from the WAL record to the page. A buggy
WAL record could for sure cause index corruption, as could any other bug in
an AM. WAL support doesn't bring anything special to this, except that WAL
is quite hard to debug.

However, in the current implementation I missed some hidden asserts about
page structure. A page could be so corrupted that we can't, for instance,
safely call XLogReadBufferForRedo(). All these cases must be worked out. If
we can't update a page during recovery, the index must be marked as invalid.
It's some amount of work, but it seems very feasible.

#4 seems problematic too. If autovacuum crashes on some index, then postgres
can enter a crash loop. So, if autovacuum crashes on an extension-provided
AM, that index should be marked as invalid.

Consequences #1, #2 and #3 could easily be caused by a buggy opclass as
well. The reason why we're not knee-deep in extension-provided bugs in
GiST/GIN opclasses is not the easiness of writing correct GiST/GIN
opclasses. The actual reason is that we have only a few GiST/GIN opclasses
which weren't written by the GiST/GIN authors.

So, I don't expect PostgreSQL to be flooded with buggy AMs once we add AM
extendability. Our motivation behind this proposal is that we want to be
able to develop custom extensions with AMs. We want to be able to provide
our AMs to our customers without having to push them into PostgreSQL core
or fork PostgreSQL. Bugs in those AMs are our responsibility to our
customers. Some commercial companies could implement patented AMs (for
instance, a fractal tree), and that is their responsibility to their
customers.

Also, I think it's OK to leave adapting custom AMs to changes of the AM
interface to the authors of those custom AMs. AM extensions have to be
adapted to each new major release anyway. AFAIR, the interface of the
RelationOpen() function was changed not too long ago. A custom AM would use
many functions which we use to access relations, and their interfaces can
change in the next release.

The PostGIS GiST opclass has bugs which are reproducible, known and not
fixed. This is PostGIS's responsibility to their customers. We can feel
sorry for PostGIS that they are so slow at fixing them, but we shouldn't
feel sorry for GiST extendability; that would be ridiculous.

Some researchers could write their own extensions. We can hope that they
are smart enough not to recommend them for production use. We can back our
hope with a warning when installing an extension-provided AM. That warning
could say that all corruption caused by an extension-provided AM is up to
the AM developer. This warning could make users beware of using
extension-provided AMs in production unless they fully trust the extension
developer (and have a support subscription if it's commercial).

PostgreSQL doesn't have any kind of "safe" extensions. Every extension must
be trusted; every extension can break (almost) everything. When one types
CREATE EXTENSION, he must completely trust the extension author. This
applies to every extension.

I would be very careful with discouraging commercial AM extensions. We
should always keep in mind how many PostgreSQL developers work for
companies which own commercial PostgreSQL forks, and how big their
contribution is. Patented extensions look scary for sure, but that's a
matter of software patents, not of PostgreSQL extendability...

The particular steps I'm going to take on these patches:
1) Make generic_redo never crash on broken pages.
2) Make the autovacuum launcher mark an index as invalid if a vacuum process
crashed on a custom-AM index. Since we can't write anything into the
postgres cluster when one process has crashed, ITSM autovacuum should have
some separate place to put this information. Thus, after restart postgres
can read it and mark the index as invalid.

Disallowing the CREATE ACCESS METHOD command seems problematic to me. How
could it work with pg_upgrade? pg_dump wouldn't dump extra pg_am records, so
pg_upgrade would break when creating operator classes on the new cluster.
So, I agree with dropping the CREATE AM command only if we let pg_dump dump
extra pg_am records...

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: Access method extendability

2014-10-28 Thread Alexander Korotkov
On Tue, Oct 28, 2014 at 8:04 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 28 October 2014 14:22, Simon Riggs si...@2ndquadrant.com wrote:

  Or put it another way, it will be easier to write new index AMs
  because we'll be able to skip the WAL part until we know we want it.

 To be clear: I am suggesting you do *less* work, not more.

 By allowing AMs to avoid writing WAL we get
 * higher performance unlogged indexes
 * we get fewer bugs in early days of new AMs
 * writers of new AMs are OK to avoid majority of hard work and hard testing

 So overall, we get new AMs working faster because we can skip writing
 the WAL code until we are certain the new AM code is useful and bug
 free.

 For example, if GIN had avoided implementing WAL it would have been
 easier to change on-disk representation.


The major problem with changing the on-disk representation is that we have
to support both binary formats because of pg_upgrade. This problem is even
harder with WAL, because the WAL record's redo function also has to support
both formats. However, it's also quite independent of WAL.

Having access methods as extensions can significantly improve the situation
here. Imagine GIN were an extension. One day we decide to change its binary
format. Then we can issue a new extension, GIN2 for instance. A user can
upgrade from GIN to GIN2 in the following steps:

   1. CREATE EXTENSION gin2;
   2. CREATE INDEX CONCURRENTLY [new_index] USING gin2 ([index_def]);
   3. DROP INDEX CONCURRENTLY [old_index];
   4. DROP EXTENSION gin;

No need to write and debug the code which reads both the old and the new
binary format. For sure, we would need to support the old GIN extension for
some time. But we have to support old in-core versions too.

Unfortunately, I didn't mention this in the first post, though I consider
it a serious argument for extensible AMs.

Also, I'm not sure that many users would have enough courage to use
unlogged AMs. Absence of durability is only half of the trouble; the other
half is the lack of streaming replication. I think if we had only an
unlogged GIN, then external indexing engines would be used by the majority
of users instead of GIN.

--
With best regards,
Alexander Korotkov.


[HACKERS] WIP: Access method extendability

2014-10-15 Thread Alexander Korotkov
Hackers,

Postgres was initially designed to support access method extendability, and
this extendability survives to the present day. However, it is mostly
internal in-core extendability. One can quite easily add a new access method
into PostgreSQL core. But if one tries to implement an access method as an
external module, he will face the following difficulties:

   1. The need to insert directly into pg_am, because there is no CREATE
   ACCESS METHOD command. And no support for dependencies between the AM
   and opclasses etc.
   2. A module can't define xlog records. So, a new AM would not be
   WAL-logged.

The first problem is purely mechanical. Nothing prevents us from
implementing CREATE ACCESS METHOD and DROP ACCESS METHOD commands and
supporting all required dependencies.
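
For illustration, the commands could look roughly like this (a sketch only;
the handler-based grammar shown here is my assumption, not necessarily the
patch's exact syntax):

CREATE ACCESS METHOD bloom TYPE INDEX HANDLER blhandler;
DROP ACCESS METHOD bloom;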

The problem of WAL is a bit more complex. According to previous discussions,
we don't want to let extensions declare their own xlog records: if we did,
the recovery process would depend on extensions, which badly violates
reliability. The solution is to implement some generic xlog record which is
able to represent a difference between blocks in a general manner.

Three patches are attached:

   1. CREATE/DROP ACCESS METHOD commands, with support for dependencies.
   2. A new generic xlog record type.
   3. A Bloom contrib module as an example of the usage of the previous two
   features. This module was posted a few years ago by Teodor Sigaev; now
   it's wrapped as an extension and WAL-logged.

The patches are in a WIP state: no documentation and very few comments.
However, I believe it's enough to do some general concept review.
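
For a taste of the third patch, usage would look roughly like this (a
sketch; the table, index name and column list are made up):

CREATE EXTENSION bloom;
CREATE TABLE t (a int, b int, c int);
CREATE INDEX t_bloom_idx ON t USING bloom (a, b, c);
SELECT * FROM t WHERE a = 1 AND c = 3;  -- lossy bloom match, rechecked against the heap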

Some notes about the generic xlog format. A generic xlog record represents
the difference between pages as a set of operations of two kinds:

   1. Move memory inside the page. An optional flag requests zeroing the
   gap at the previous memory location.
   2. Copy a fragment of memory from the xlog record to the page. As an
   option, bitwise logical operations are supported as well as plain copy.

Generic xlog can open a page in two modes:

   1. Create mode: the page is zeroed independent of its LSN.
   2. Update mode: the page is updated only if its LSN is lower than the
   record's LSN.

Usually, the xlog record is filled in a critical section where memory
allocation is prohibited. Thus, the user has to initialize it beforehand
with knowledge of the page count, the total operation count, and the total
length of data.

--
With best regards,
Alexander Korotkov.


create-am.1.patch.gz
Description: GNU Zip compressed data


generic-xlog.1.patch.gz
Description: GNU Zip compressed data


bloom-contrib.1.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] KNN-GiST with recheck

2014-09-29 Thread Alexander Korotkov
On Mon, Sep 29, 2014 at 6:16 AM, Bruce Momjian br...@momjian.us wrote:

 On Fri, Sep 26, 2014 at 10:49:42AM +0400, Alexander Korotkov wrote:
  Does this also fix the identical PostGIS problem or is there
 something
  PostGIS needs to do?
 
 
  This patch provides general infrastructure for recheck in KNN-GiST.
  PostGIS needs a corresponding change in its GiST opclass. Since PostGIS
  already defines the <-> and <#> operators as bounding-box-based distances
  (to the box border and the box center), it can't change their behaviour:
  it has to support a new exact-distance operator in the opclass.

 Ah, OK, so they just need something that can be used for the recheck.  I
 think they currently use ST_Distance() for that.  Does it have to be an
 operator?  If they defined an operator for ST_Distance(), would
 ST_Distance() work too for KNN-GiST?


Currently, ST_Distance is a function, but it's no problem to make it an
operator with KNN-GiST support.


 In summary, you still create a normal GiST index on the column:


 http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html

 CREATE INDEX planet_osm_line_ref_index ON planet_osm_line(ref);

 which indexes by the bounding box.  The new code will allow ordered
 index hits to be filtered by something like ST_Distance(), rather than
 having to do a LIMIT 50 in a CTE, then call ST_Distance(), like this:

 EXPLAIN ANALYZE WITH distance AS (
 SELECT way AS road, ref AS route
 FROM planet_osm_line
 WHERE highway = 'secondary'
 ORDER BY ST_GeomFromText('POLYGON((14239931.42
 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08
 3053984.84,14239931.42 3054117.72))', 900913) <#> way
 LIMIT 50
 )
 SELECT ST_Distance(ST_GeomFromText('POLYGON((14239931.42
 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08
 3053984.84,14239931.42 3054117.72))', 900913), road) AS true_distance, route
 FROM distance
 ORDER BY true_distance
 LIMIT 1;


Yeah. In this query, 50 is a purely empirical value. It could be either too
low or too high. A too-low value can cause wrong query answers; a too-high
value can cause lower performance. With the patch, a simple KNN query will
work like this query with an always-right value in the LIMIT clause.
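
For illustration, with the patch and an exact-distance operator (spelled
<=> below purely as a placeholder; PostGIS would choose its own name), the
CTE kluge shown earlier in this thread collapses to a plain KNN query:

SELECT gid, name
FROM nyc_streets
ORDER BY geom <=> 'SRID=26918;POINT(583571.905921312 4506714.34119218)'::geometry
LIMIT 1;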


 Notice the CTE uses <#> (bounding box center), and then the outer query
 uses ST_Distance and LIMIT 1 to find the closest item.

 Excellent!


Thanks. The main question now is the design of this patch. Currently, it
does all the work inside the access method. We already have some discussion
of the pros and cons of this approach. I would like to clarify the
alternatives now. I can see the following way:
   1. Implement a new executor node which performs sorting via a priority
   queue. Let's call it "Priority queue". I think it should be a node
   separate from the "Sort" node: despite "Priority queue" and "Sort" being
   essentially similar from the user's view, they would be completely
   different in implementation.
   2. Implement some interface to transfer distance values from the access
   method to the "Priority queue" node.
   3. Somehow tell the planner that it could use "Priority queue" in the
   corresponding cases. I see two ways of doing this:
      - Add a flag to an operator in the opclass indicating that the index
      can only order by a lower bound of "col op value", not by "col op
      value" itself.
      - Define a new relation between operators: the value of one operator
      could be a lower bound for the value of another operator. So, the
      planner can put a "Priority queue" node when lower-bound ordering is
      available from the index. An ALTER OPERATOR command would also be
      reasonable, so extensions could upgrade.

Besides the overhead, this way makes significant infrastructural changes,
so it may be over-engineering. However, it's probably the cleaner and more
beautiful solution.
I would like to get some feedback from people familiar with KNN-GiST, like
Heikki or Tom. What do you think about this? Any other ideas?
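
To make point 1 concrete, the plan shape I have in mind would look
something like this (a hand-written sketch, not real EXPLAIN output; the
operator, node and index names are illustrative):

Limit
  ->  Priority queue
        Sort Key: (geom <=> $1)
        ->  Index Scan using nyc_streets_geom_idx on nyc_streets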

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] KNN-GiST with recheck

2014-09-26 Thread Alexander Korotkov
On Fri, Sep 26, 2014 at 5:18 AM, Bruce Momjian br...@momjian.us wrote:

 On Sun, Sep 14, 2014 at 11:34:26PM +0400, Alexander Korotkov wrote:
   Cost estimation of GiST is a big problem anyway. It doesn't care (and
   can't) about the amount of recheck for regular operators. In this patch,
   the same would be true for KNN recheck. The problem is that touching the
   heap from the access method breaks encapsulation.

 This is very important work.  While our existing KNN-GiST index code
 works fine for scalar values and point-to-point distance ordering, it
 doesn't work well for 2-dimensional objects because they are only
 indexed by their bounding boxes (a rectangle around the object).  The
 indexed bounding box can't produce accurate distances to other objects.

 As an example, see this PostGIS blog post showing how to use LIMIT in a
 CTE to filter results and then compute the closest object (search for
 LIMIT 50):


 http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html

 This patch fixes our code for distances from a point to indexed 2-D
 objects.

 Does this also fix the identical PostGIS problem or is there something
 PostGIS needs to do?


This patch provides general infrastructure for recheck in KNN-GiST. PostGIS
needs a corresponding change in its GiST opclass. Since PostGIS already
defines the <-> and <#> operators as bounding-box-based distances (to the
box border and the box center), it can't change their behaviour:
it has to support a new exact-distance operator in the opclass.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] KNN-GiST with recheck

2014-09-25 Thread Alexander Korotkov
On Thu, Sep 25, 2014 at 9:00 PM, Emre Hasegeli e...@hasegeli.com wrote:

  Fixed, thanks.

 Here are my questions and comments about the code.

 doc/src/sgml/gist.sgml:812:
 be rechecked from heap tuple before tuple is returned.  If
 <literal>recheck</> flag isn't set then it's true by default for
 compatibility reasons.  The <literal>recheck</> flag can be used
 only

 The recheck flag is set to false in gistget.c, so I think the docs should
 say "false by default".  On the other hand, it is true by default in
 the consistent function.  It is written as the safest assumption in
 the code comments.  I don't know why "the safest" was chosen over
 "the backwards compatible" for the consistent function.


Agree. It should be clarified in docs.

 src/backend/access/gist/gistget.c:505:
	/* Recheck distance from heap tuple if needed */
	if (GISTSearchItemIsHeap(*item) &&
		searchTreeItemNeedDistanceRecheck(scan, so->curTreeItem))
	{
		searchTreeItemDistanceRecheck(scan, so->curTreeItem, item);
		continue;
	}

 Why is so->curTreeItem passed to these functions?  They can use
 scan->opaque->curTreeItem.


I didn't get the difference. A few lines before:

GISTScanOpaque so = (GISTScanOpaque) scan->opaque;


 src/backend/access/gist/gistscan.c:49:
	/*
	 * When all distance values are the same, items without recheck
	 * can be immediately returned.  So they are placed first.
	 */
	if (recheckCmp == 0 && distance_a.recheck != distance_b.recheck)
		recheckCmp = distance_a.recheck ? 1 : -1;

 I don't understand why items without recheck can be immediately
 returned.  Do you think it will work correctly when there is
 an operator class which will return recheck true and false for
 the items under the same page?


Yes, I believe so. An item with recheck can't decrease its distance, it can
only increase it. In the corner case, an item can have the same distance
after recheck as it had before. But items whose distances are the same can
anyway be returned in any order.


 src/backend/access/index/indexam.c:258:
	/* Prepare data structures for getting original indexed values from heap */
	scan->indexInfo = BuildIndexInfo(scan->indexRelation);
	scan->estate = CreateExecutorState();
	scan->slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));

 With the changes in indexam.c, heap access becomes legal for all index
 access methods.  I think it is better than the previous version but
 I am leaving the judgement to someone experienced.  I will try to
 summarize the pros and cons of sorting the rows in the GiST access
 method, as far as I understand.

 Pros:

 * It does not require another queue.  It should be effective to sort
   the rows inside the queue the GiST access method already has.
 * It does not complicate index access method infrastructure.

 Cons:

 * It could be done without additional heap access.
 * Other access methods could make use of the sorting infrastructure
   one day.
 * It could be more transparent to the users.  Sorting information
   could be shown on the explain output.


It would also be nice to show some information about the KNN search itself.


 * A more suitable data structure like binary heap could be used
   for the queue to sort the rows.


A binary heap seems to be a better data structure for the whole of
KNN-GiST. But that's a subject for a separate patch: replacing the RB-tree
with a heap in KNN-GiST. It's not related to the recheck stuff.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] KNN-GiST with recheck

2014-09-17 Thread Alexander Korotkov
On Wed, Sep 17, 2014 at 12:30 PM, Emre Hasegeli e...@hasegeli.com wrote:

  I managed to break it again by ordering rows only by the second column
  of the index.  Test script attached.

 I was confused.  It is undefined behavior.  Sorry for the noise.


No problem. Thanks a lot for testing.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collation-aware comparisons in GIN opclasses

2014-09-16 Thread Alexander Korotkov
On Mon, Sep 15, 2014 at 11:45 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Peter Geoghegan p...@heroku.com writes:
  On Mon, Sep 15, 2014 at 8:28 AM, Alexander Korotkov
  aekorot...@gmail.com wrote:
  Rename such opclasses and make them not default.
  Create new default opclasses with bitwise comparison functions.
  Write recommendation to re-create indexes with default opclasses into
  documentation.

  I certainly think this should be fixed if at all possible, but I'm not
  sure about this plan. Can we really rename an opclass without
  consequence, including having that respected across pg_upgrade?

 No.  And we don't know how to change the default opclass without
 breaking things, either.  See previous discussions about how we
 might fix the totally-broken default gist opclass that btree_gist
 creates for the inet type [1].  The motivation for getting rid of that
 is *way* stronger than "it might be slow", but there's no apparent
 way to make something else be the default without creating havoc.


I've read the thread about the gist opclass for the inet type. But that
case is more difficult, because the conflict is between a builtin opclass
and a contrib opclass. This case seems to be much simpler: we need to
change a builtin opclass to a builtin opclass, and a contrib opclass to a
contrib opclass. I realized that it's problematic to rename a builtin
opclass due to pg_upgrade. However, it still seems possible to create a new
opclass and make it the default.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collation-aware comparisons in GIN opclasses

2014-09-16 Thread Alexander Korotkov
On Tue, Sep 16, 2014 at 11:29 AM, Emre Hasegeli e...@hasegeli.com wrote:

 Changing the default opclasses should work if we make
 pg_dump --binary-upgrade dump the default opclasses with indexes
 and exclusion constraints.  I think it makes sense to do so in
 --binary-upgrade mode.  I can try to come up with a patch for this.


Can you explain it in a bit more detail? I didn't get it.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collation-aware comparisons in GIN opclasses

2014-09-16 Thread Alexander Korotkov
On Tue, Sep 16, 2014 at 12:14 PM, Emre Hasegeli e...@hasegeli.com wrote:

   Changing the default opclasses should work if we make
   pg_dump --binary-upgrade dump the default opclasses with indexes
   and exclusion constraints.  I think it makes sense to do so in
  --binary-upgrade mode.  I can try to come up with a patch for this.
 
  Can you explain it a bit more detail? I didn't get it.

 pg_upgrade uses pg_dump --binary-upgrade to dump the schema of
 the old database.  Now, it generates CREATE INDEX statements without
 explicit opclass if opclass is the default.  We can change pg_dump
 to generate the statements with opclass even if opclass is the default
 in --binary-upgrade mode.


Thanks, I get it. I checked the pg_dump implementation. It appears to be
not as easy as it could be: pg_dump doesn't form the index definition by
itself, it calls the pg_get_indexdef function. This function has no option
to dump the names of default opclasses. Since we can't change the behaviour
of old postgres versions, we would have to make pg_dump form the index
definition by itself.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] PoC: Partial sort

2014-09-15 Thread Alexander Korotkov
On Sun, Sep 14, 2014 at 7:39 AM, Peter Geoghegan p...@heroku.com wrote:

 On Fri, Sep 12, 2014 at 2:19 PM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  Actually, higher cardinality skip columns is better. Sorting of smaller
  groups is faster than sorting larger groups of same size. Also, with
 smaller
  groups you achieve limit more accurate (in average), i.e. sort smaller
  amount of total rows.

 Higher cardinality leading key columns are better for what? Do you
 mean they're better in that those cases are more sympathetic to your
 patch, or better in the general sense that they'll perform better for
 the user? The first example query, that you chose to demonstrate your
 patch had a leading, indexed column (column v1) with much lower
 cardinality/n_distinct than the column that had to be sorted on
 (column v2). That was what my comments were based on.

 When this feature is used, there will often be a significantly lower
 cardinality in the leading, indexed column (as in your example).
 Otherwise, the user might well have been happy to just order on the
 indexed/leading column alone. That isn't definitely true, but it's
 often true.


I just meant that higher cardinality is cheaper for partial sort. We may
have misunderstood each other because the notions "high" and "low" are
relative. When the cardinality is 1, partial sort seems to be useless. When
the cardinality is low, it could still be cheaper to do a sequential scan +
sort rather than an index scan + partial sort. When the cardinality is
close to the number of rows, then, as you mentioned, the user probably
doesn't need to sort by the rest of the columns (at least one exception is
when the user needs to force uniqueness of the order). So, we need to
target something in the middle of these two corner cases.


  I'm not sure if that's worth it to more or less
  duplicate heap_tuple_attr_equals() to save a mere n expensive
  comparisons, but it's something to think about (actually, there are
  probably less than even n comparisons in practice because there'll be
  a limit).
 
  Not correct. Smaller groups are not OK. Imagine that two representations
 of
  same skip column value exists. Index may return them in any order, even
  change them one by one. In this case sorting on other column never takes
  place, while it should.

 I think I explained this badly - it wouldn't be okay to make the
 grouping smaller, but as you say we could tie-break with a proper
 B-Tree support function 1 comparison to make sure we really had
 reached the end of our grouping.

 FWIW I want all bttextcmp()/varstr_cmp() comparisons to try a memcmp()
 first, so the use of the equality operator with text in mind that you
 mention may soon not be useful at all. The evidence suggests that
 memcmp() is so much cheaper than special preparatory NUL-termination +
 strcoll() that we should always try it first when sorting text, even
 when we have no good reason to think it will work. The memcmp() is
 virtually free. And so, you see why it might be worth thinking about
 this when we already have reasonable confidence that many comparisons
 will indicate that datums are equal. Other datatypes will have
 expensive real comparators, not just text or numeric.


When strings are not equal, bttextcmp still needs to use strcoll, while
texteq doesn't need that. So there would still be an advantage to using the
equality operator over the comparison function: the equality operator
doesn't have to compare unequal values. However, the real cost of this
advantage depends on the presorted columns' cardinality as well.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] PoC: Partial sort

2014-09-15 Thread Alexander Korotkov
On Sun, Sep 14, 2014 at 9:32 AM, Peter Geoghegan p...@heroku.com wrote:

 I think we might be better off if a tuplesort function was called
 shortly after tuplesort_begin_heap() is called. How top-n heap sorts
 work is something that largely lives in tuplesort's head. Today, we
 call tuplesort_set_bound() to hint to tuplesort "By the way, this is a
 top-n heap sort applicable sort". I think that with this patch, we
 should then hint (where applicable) "by the way, you won't actually be
 required to sort those first n indexed attributes; rather, you can
 expect to scan those in logical order. You may work the rest out
 yourself, and may be clever about exploiting the sorted-ness of the
 first few columns". The idea of managing a bunch of tiny sorts from
 within ExecSort(), and calling the new function tuplesort_reset(), seems
 questionable. tuplesortstate is supposed to be private/opaque to
 nodeSort.c, and the current design strains that.

 I'd like to keep nodeSort.c simple. I think it's pretty clear that the
 guts of this do not belong within ExecSort(), in any case. Also, the
 additions there should be much better commented, wherever they finally
 end up.


As I understand it, you propose to encapsulate the partial sort algorithm
into tuplesort. However, in this case we would anyway need some significant
change to its interface: let tuplesort decide when it's able to return a
tuple. Otherwise, we would miss a significant part of the LIMIT clause
optimization. tuplesort_set_bound() can't solve all the cases: there could
be other planner nodes between the partial sort and the LIMIT.

--
With best regards,
Alexander Korotkov.


[HACKERS] Triconsistent catalog declaration

2014-09-15 Thread Alexander Korotkov
Hi!

9.4 GIN introduces a new triconsistent method which can return one of three
values, as opposed to the plain consistent method. But in the catalog
declaration triconsistent still returns the bool type. This doesn't affect
anything, because nobody calls triconsistent from SQL, but I think it would
be correct to declare it as returning int2.
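
The current declaration can be checked from SQL (a minimal sketch, assuming
GIN's triconsistent support procedure number is 6 in 9.4):

SELECT p.proname, p.prorettype::regtype
FROM pg_amproc amp
JOIN pg_proc p ON p.oid = amp.amproc
JOIN pg_opfamily f ON f.oid = amp.amprocfamily
JOIN pg_am am ON am.oid = f.opfmethod
WHERE am.amname = 'gin' AND amp.amprocnum = 6;
-- currently reports boolean for every triconsistent function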

--
With best regards,
Alexander Korotkov.


tri-consistent-catalog.patch
Description: Binary data



[HACKERS] Collation-aware comparisons in GIN opclasses

2014-09-15 Thread Alexander Korotkov
Hackers,

some GIN opclasses use collation-aware comparisons while they don't
especially need collation-aware comparison. Examples are the text[] and
hstore opclasses. Depending on the collation, this may make them much
slower.

See example.

# show lc_collate ;
 lc_collate
─
 ru_RU.UTF-8
(1 row)

# create table test as (select array_agg(i::text) from
generate_series(1,100) i group by (i-1)/10);
SELECT 10

# create index test_idx on test using gin(array_agg);
CREATE INDEX
Time: *26930,423 ms*

# create index test_idx2 on test using gin(array_agg collate C);
CREATE INDEX
Time: *5143,682 ms*

Index creation with collation ru_RU.UTF-8 is 5 times slower, while the
collation has absolutely no effect on index functionality.

However, we can't just replace the comparison function for those opclasses,
because it would break binary compatibility for pg_upgrade. I see the
following solution:

   1. Rename such opclasses and make them not default.
   2. Create new default opclasses with bitwise comparison functions.
   3. Write recommendation to re-create indexes with default opclasses into
   documentation.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collation-aware comparisons in GIN opclasses

2014-09-15 Thread Alexander Korotkov
On Mon, Sep 15, 2014 at 10:51 PM, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Sep 15, 2014 at 8:28 AM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  some GIN opclasses uses collation-aware comparisons while they don't
 need to
  do especially collation-aware comparison. Examples are text[] and hstore
  opclasses. Depending on collation this may make them a much slower.

 I'm glad that I saw how pointless this was with the jsonb GIN default
 opclass during development.

  Rename such opclasses and make them not default.
  Create new default opclasses with bitwise comparison functions.
  Write recommendation to re-create indexes with default opclasses into
  documentation.

 I certainly think this should be fixed if at all possible, but I'm not
 sure about this plan. Can we really rename an opclass without
 consequence, including having that respected across pg_upgrade?


A plain rename doesn't seem to be safe. Since pg_upgrade uses pg_dump, all
indexes are linked to opclasses using their names. Therefore existing
indexes would become linked to the new opclasses. It's likely we would need
to execute an SQL script renaming opclasses before pg_upgrade. Another
option is to not rename the old opclasses at all, and just create new
default opclasses with new names.
Bruce, what is your opinion regarding pg_upgrade?
Contrib opclasses would be safe to rename using a migration script.
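
For contrib, such a migration script could be as simple as (a sketch with a
made-up target name, using hstore's GIN opclass as the example):

ALTER OPERATOR CLASS gin_hstore_ops USING gin
    RENAME TO gin_hstore_collation_ops;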

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] KNN-GiST with recheck

2014-09-14 Thread Alexander Korotkov
On Sun, Sep 14, 2014 at 10:09 PM, Emre Hasegeli e...@hasegeli.com wrote:

 I added the point to polygon distance operator patch to the open
 CommitFest as ready for committer and added myself as reviewer to
 both of the patches.


Thanks.

 Cost estimation of GiST is a big problem anyway. It doesn't care (and
  can't) about amount of recheck for regular operators. In this patch, same
  would be for knn recheck. The problem is that touching heap from access
  method breaks incapsulation. One idea about this is to do sorting in
  another nodes. However, I wonder if it would be an overengineering and
  overhead. In attached patch I propose a different approach: put code
  touching heap into separate index_get_heap_values function. Also new
  version of patch includes regression tests and some cleanup.

 While looking it at I found a bug.  It returns the second column
 in wrong order when both of the distance functions return recheck = true.
 Test script attached to run on the regression database.  I tried to
 fix but could not.  searchTreeItemDistanceRecheck function is not
 very easy to follow.  I think it deserves more comments.


Fixed, thanks. It was logical error in comparison function implementation.

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-4.patch
Description: Binary data



Re: [HACKERS] PoC: Partial sort

2014-09-13 Thread Alexander Korotkov
explain analyze select x,y,z from b where z=1 order by x,y limit 10;
QUERY PLAN
──
 Limit  (cost=169374.43..263471.04 rows=10 width=12) (actual
time=2237.082..2237.083 rows=10 loops=1)
   ->  Partial sort  (cost=169374.43..338748.34 rows=18 width=12) (actual
time=2237.082..2237.083 rows=10 loops=1)
 Sort Key: x, y
 Presorted Key: x
 Sort Method: quicksort  Memory: 25kB
 ->  Index Scan using b_x_idx on b  (cost=0.43..338748.13 rows=18
width=12) (actual time=0.047..2237.062 rows=10 loops=1)
   Filter: (z = 1)
   Rows Removed by Filter: 990
 Planning time: 0.089 ms
 Execution time: 2237.133 ms
(10 rows)

AFAICS, wrong selectivity estimates are a general problem which causes
optimizer failures. But in your example the filter is x+y=1: if an
expression index on (x+y) existed, then statistics over x+y would be
collected. So, in the case of an expression index the estimation will be
fine.
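
A sketch of that point (the index name is made up): once an expression
index exists, ANALYZE gathers statistics on the expression itself, so the
selectivity of x+y=1 becomes estimable.

CREATE INDEX b_xy_idx ON b ((x + y));
ANALYZE b;  -- pg_stats now has an entry for the (x + y) expression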

--
With best regards,
Alexander Korotkov.


partial-sort-basic-2.patch
Description: Binary data


partial-sort-merge-2.patch
Description: Binary data



[HACKERS] jsonb contains behaviour weirdness

2014-09-12 Thread Alexander Korotkov
Hi!

Let's consider some examples.

# select '[1,2]'::jsonb @> '[1,2,2]'::jsonb;
 ?column?
--
 f
(1 row)

One may think it's because the second jsonb array contains two 2s, i.e.
that containment takes the count of equal elements into account.

# select '[1,1,2]'::jsonb @> '[1,2,2]'::jsonb;
 ?column?
--
 t
(1 row)

But it doesn't. Jsonb containment takes care only about the length of the array.

# select '[[1,2]]'::jsonb @> '[[1,2,2]]'::jsonb;
 ?column?
--
 t
(1 row)

Even more weird :)
The reason why jsonb containment behaves this way is the check at the
beginning of jsonb_contains. It makes a fast check of the jsonb type and
element count before calling JsonbDeepContains:

if (JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) ||
	JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
	PG_RETURN_BOOL(false);

It's likely that JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) should be checked
only for objects, not arrays. Also, should JsonbDeepContains do the same
fast check when it deals with nested objects?
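
Under that fix I would expect the first example to become consistent with
the nested one (my expectation of the corrected behaviour, not tested):

# select '[1,2]'::jsonb @> '[1,2,2]'::jsonb;
 ?column?
--
 t
(1 row)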

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-12 Thread Alexander Korotkov
On Fri, Sep 12, 2014 at 10:38 PM, Peter Geoghegan p...@heroku.com wrote:

 On Fri, Sep 12, 2014 at 11:21 AM, Josh Berkus j...@agliodbs.com wrote:
  # select '[1,1,2]'::jsonb @> '[1,2,2]'::jsonb;
   ?column?
  --
   t
  (1 row)
 
  But it doesn't. Jsonb containment takes care only about the length of the array.
 
  OK, now, that's messed up.

 To be clear: I don't think that this example is messed up (in
 isolation). I think it's the correct behavior. What I find
 objectionable is the inconsistency. I believe that this is Alexander's
 concern too. Alexander's first example exhibits broken behavior.


Agree. I just tried to explain how the current behaviour could look to a
user who sees it for the first time.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] PoC: Partial sort

2014-09-12 Thread Alexander Korotkov
On Sun, Jul 13, 2014 at 6:45 AM, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Feb 10, 2014 at 10:59 AM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  Done. Patch is splitted.

 I took a quick look at this.

 Have you thought about making your new cmpSortSkipCols() function not
 use real comparisons? Since in the circumstances in which this
 optimization is expected to be effective (e.g. your original example)
 we can also expect a relatively low cardinality for the first n
 indexed attributes (again, as in your original example), in general
 when cmpSortSkipCols() is called there is a high chance that it will
 return true. If any pair of tuples (logically adjacent tuples fed in
 to cmpSortSkipCols() by an index scan in logical order) are not fully
 equal (i.e. their leading, indexed attributes are not equal) then we
 don't care about the details -- we just know that a new sort grouping
 is required.


Actually, higher-cardinality skip columns are better. Sorting smaller
groups is faster than sorting larger groups of the same total size. Also,
with smaller groups you satisfy the limit more accurately (on average),
i.e. you sort a smaller number of rows in total.


 The idea here is that you can get away with simple binary equality
 comparisons, as we do when considering HOT-safety. Of course, you
 might find that two bitwise unequal values are equal according to
 their ordinary B-Tree support function 1 comparator (e.g. two numerics
 that differ only in their display scale). AFAICT this should be okay,
 since that just means that you have smaller sort groupings than
 strictly necessary. I'm not sure if that's worth it to more or less
 duplicate heap_tuple_attr_equals() to save a mere n expensive
 comparisons, but it's something to think about (actually, there are
 probably less than even n comparisons in practice because there'll be
 a limit).


Not correct. Smaller groups are not OK. Imagine that two representations of
the same skip column value exist. The index may return them in any order,
even alternating between them. In this case sorting on the other columns
never takes place, while it should. But some optimizations are still
possible:

   1. Use bitwise comparison first, then recheck. But there is no guarantee
   that any acceleration will be achieved.
   2. Use an equality check instead of the btree comparison. For the text
   datatype it would be rather faster, because there is no locale-aware
   comparison.


--
With best regards,
Alexander Korotkov.


Re: [HACKERS] gist vacuum gist access

2014-09-08 Thread Alexander Korotkov
On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 09/07/2014 05:11 PM, Костя Кузнецов wrote:

 hello.
 i recoded vacuum for the gist index.
 all tests are ok.
 also i tested vacuum on a table of 2 million rows. all is ok.
 on my machine the old vacuum takes about 9 seconds. this version takes
 about 6-7 sec.
 review please.


 If I'm reading this correctly, the patch changes gistbulkdelete to scan
 the index in physical order, while the old code starts from the root and
 scans the index from left to right, in logical order.

 Scanning the index in physical order is wrong, if any index pages are
 split while vacuum runs. A page split could move some tuples to a
 lower-numbered page, so that the vacuum will not scan those tuples.

 In the b-tree code, we solved that problem back in 2006, so it can be done
 but requires a bit more code. In b-tree, we solved it with a vacuum cycle
 ID number that's set on the page halves when a page is split. That allows
 VACUUM to identify pages that have been split concurrently when it sees
 them, and jump back to vacuum them too. See commit
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80.
 It should be possible to do something similar in GiST, and in fact you
 might be able to reuse the NSN field that's already set on the page halves
 on split, instead of adding a new vacuum cycle ID.


The idea is right. But in fact, does GiST ever recycle any page? It has the
F_DELETED flag, but ISTM this flag is never set. So, I think it's possible
that this patch works correctly. However, GiST probably sometimes leaves a
new page unused due to a server crash.
Anyway, I'm not a fan of committing the patch in this shape. We need to let
GiST recycle pages first, then implement VACUUM similar to b-tree.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] gist vacuum gist access

2014-09-08 Thread Alexander Korotkov
On Mon, Sep 8, 2014 at 12:08 PM, Alexander Korotkov aekorot...@gmail.com
wrote:

 On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

 On 09/07/2014 05:11 PM, Костя Кузнецов wrote:

 hello.
  i recoded vacuum for the gist index.
  all tests are ok.
  also i tested vacuum on a table of 2 million rows. all is ok.
  on my machine the old vacuum takes about 9 seconds. this version takes
  about 6-7 sec.
 review please.


 If I'm reading this correctly, the patch changes gistbulkdelete to scan
 the index in physical order, while the old code starts from the root and
 scans the index from left to right, in logical order.

 Scanning the index in physical order is wrong, if any index pages are
 split while vacuum runs. A page split could move some tuples to a
 lower-numbered page, so that the vacuum will not scan those tuples.

 In the b-tree code, we solved that problem back in 2006, so it can be
 done but requires a bit more code. In b-tree, we solved it with a vacuum
 cycle ID number that's set on the page halves when a page is split. That
  allows VACUUM to identify pages that have been split concurrently, and
  jump back to vacuum them too. See commit
  http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80.
  It should be possible to do
 something similar in GiST, and in fact you might be able to reuse the NSN
 field that's already set on the page halves on split, instead of adding a
 new vacuum cycle ID.


 The idea is right. But in fact, does GiST ever recycle any page? It has an
 F_DELETED flag, but ISTM this flag is never set. So I think it's possible
 that this patch works correctly as-is. (However, GiST probably sometimes
 leaves a new page unused after a server crash.)
 Anyway, I'm not a fan of committing the patch in this shape. We need to let
 GiST recycle pages first, then implement VACUUM similarly to b-tree.


Another note. Assuming the NSN can play the role of the vacuum cycle ID, can
we implement a sequential (with possible jumps back) index scan for GiST?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] gist vacuum gist access

2014-09-08 Thread Alexander Korotkov
On Mon, Sep 8, 2014 at 12:51 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 09/08/2014 11:19 AM, Alexander Korotkov wrote:

 On Mon, Sep 8, 2014 at 12:08 PM, Alexander Korotkov aekorot...@gmail.com
 
 wrote:

  On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

  In the b-tree code, we solved that problem back in 2006, so it can be
  done but requires a bit more code. In b-tree, we solved it with a vacuum
  cycle ID number that's set on the page halves when a page is split. That
  allows VACUUM to identify pages that have been split concurrently, and
  jump back to vacuum them too. See commit
  http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80.
  It should be possible to do something similar in GiST, and in fact you
  might be able to reuse the NSN field that's already set on the page halves
  on split, instead of adding a new vacuum cycle ID.

 ...


  Another note. Assuming the NSN can play the role of the vacuum cycle ID,
  can we implement a sequential (with possible jumps back) index scan for
  GiST?


 Yeah, I think it would work. It's pretty straightforward, the page split
 code already sets the NSN, just when we need it. Vacuum needs to memorize
 the current NSN when it begins, and whenever it sees a page with a higher
 NSN (or the FOLLOW_RIGHT flag is set), follow the right-link if it points
 to a lower-numbered page.


I mean that a full index scan for SELECT queries might be implemented this
way, as well as the sequential VACUUM.
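
For illustration, a rough sketch in C of the physical-order scan with
NSN-based jump-back described above. It leans on GiST's existing page macros
(GistPageGetNSN, GistFollowRight, GistPageGetOpaque); locking subtleties,
chained splits, and new or deleted pages are all glossed over, so this is an
assumption-laden sketch, not committable code:

#include "postgres.h"
#include "access/gist_private.h"
#include "access/xlog.h"        /* GetInsertRecPtr */
#include "storage/bufmgr.h"

static void
gist_physical_order_vacuum(Relation index)
{
    GistNSN     startNSN = GetInsertRecPtr();   /* split horizon */
    BlockNumber nblocks = RelationGetNumberOfBlocks(index);
    BlockNumber blkno;

    for (blkno = GIST_ROOT_BLKNO; blkno < nblocks; blkno++)
    {
        BlockNumber current = blkno;

        for (;;)
        {
            Buffer      buf = ReadBuffer(index, current);
            Page        page;
            BlockNumber rightlink;

            LockBuffer(buf, GIST_SHARE);
            page = BufferGetPage(buf);

            /* ... remove dead tuples from this page here ... */

            rightlink = GistPageGetOpaque(page)->rightlink;

            /*
             * A page split after the scan began carries an NSN above our
             * horizon (or the FOLLOW_RIGHT flag).  If its right sibling
             * sits at a block number we have already passed, jump back
             * and vacuum it too; otherwise the main loop will reach it.
             */
            if ((GistFollowRight(page) ||
                 GistPageGetNSN(page) > startNSN) &&
                rightlink != InvalidBlockNumber &&
                rightlink < blkno)
            {
                UnlockReleaseBuffer(buf);
                current = rightlink;
                continue;
            }

            UnlockReleaseBuffer(buf);
            break;
        }
    }
}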

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Alexander Korotkov
On Fri, Aug 8, 2014 at 9:14 PM, Teodor Sigaev teo...@sigaev.ru wrote:

 Curious idea: we could swap the JEntry array and the values: values at the
 beginning of the type will be caught by pg_lzcompress. But we will need to
 know the offset of the JEntry array, so the header will grow to 8 bytes
 (actually, it will be a varlena header!)


 Maybe I wasn't clear: the jsonb type will start with the string collection
 instead of the JEntry array; the JEntry array will be placed at the end of
 the object/array. So pg_lzcompress will find repeatable 4-byte pieces in
 the first 1024 bytes of the jsonb.


Another idea I have is that storing an offset in every JEntry is not
necessary to keep the benefit of binary search. Namely, what if we store
offsets in every 8th JEntry and lengths in the others? The speed of binary
search will be about the same: the only overhead is computing offsets within
an 8-entry chunk. But lengths, unlike offsets, will probably repeat.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] KNN-GiST with recheck

2014-08-04 Thread Alexander Korotkov
On Sun, Aug 3, 2014 at 5:48 PM, Emre Hasegeli e...@hasegeli.com wrote:

   1. This patch introduces a new polygon <-> point operator. That seems
   useful on its own, with or without this patch.
 
   Yeah, but exact-knn can't come without at least one implementation. But
   it would better come in a separate patch.

 I tried to split them.  Separated patches are attached.  I changed
  the order of the arguments as point <-> polygon, because point was
 the first one on all the others.  Its commutator was required for
 the index, so I added it on the second patch.  I also added tests
 for the operator.  I think it is ready for committer as a separate
 patch.  We can add it to the open CommitFest.

 I have made some cosmetic changes on the patches.  I hope they are
 useful.

  I added support to the point <-> circle operator with the same GiST
 distance function you added for polygon.  I can change it, if it is not
 the right way.


Great, thanks!

   2. I wonder how useful it really is to allow mixing exact and non-exact
   return values from the distance function. The distance function included
   in the patch always returns recheck=true. I have a feeling that all other
   distance functions will also always return either true or false.
 
  For geometrical datatypes, recheck variations in consistent methods are
  also very rare (I can't remember any). But imagine an opclass for arrays
  where keys have a different representation depending on array length. For
  such an opclass, with knn on similarity, the recheck flag could be useful.

 I also wonder how useful it is.  Your example is convincing, but maybe
 setting it index-wide will make the decisions on the framework easier.
  For example, how hard would it be for the planner to decide whether
  further sorting is required or not?


I think that for most use cases some operators simply require further
sorting and some do not. But it could turn out one day that some index gives
part of its knn answers exactly and part of them inexactly. The same
happened with recheck for regular operators: initially the recheck flag was
defined in the opclass, but later it became a runtime flag.

  4. (as you mentioned in the other thread:) It's a modularity violation
   that you peek into the heap tuple from gist. I think the proper way to do
   this would be to extend the IndexScan executor node to perform the
   re-shuffling of tuples that come from the index in the wrong order, or
   perhaps add a new node type for it.

   Of course that's exactly what your partial sort patch does :-). I haven't
   looked at that in detail, but I don't think the approach the partial sort
   patch takes will work here as is. In the KNN-GiST case, the index is
   returning tuples roughly in the right order, but a tuple that it returns
   might in reality belong somewhere later in the ordering. In the partial
   sort patch, the input stream of tuples is divided into non-overlapping
   groups, so that the tuples within the group are not sorted, but the
   groups are. I think the partial sort case is a special case of the
   KNN-GiST case, if you consider the lower bound of each tuple to be the
   leading keys that you don't need to sort.
 
  Yes. But, for instance, btree accesses the heap for unique checking. Is it
  really so criminal? :-)
  This is not only a question of a new node or of extending an existing
  node. We need to teach the planner/executor that an access method can
  return the value of some expression which is a lower bound of another
  expression. AFAICS, an access method can currently return only the
  original indexed datums and TIDs. So I'm afraid that enormous
  infrastructure changes are required, and I can hardly imagine what they
  should look like.

 Unfortunately, I am not experienced enough to judge your implementation.
 As far as I understand the problem is partially sorting rows on
  the index scan node.  It can lead the planner to choose non-optimal
  plans, because it does not take the cost of sorting into account.


Cost estimation for GiST is a big problem anyway. It doesn't account (and
can't) for the amount of recheck with regular operators; in this patch, the
same would hold for knn recheck. The problem is that touching the heap from
an access method breaks encapsulation. One idea is to do the sorting in a
separate node; however, I wonder if that would be overengineering and
overhead. In the attached patch I propose a different approach: put the code
touching the heap into a separate index_get_heap_values function. The new
version of the patch also includes regression tests and some cleanup.
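
Roughly, the factored-out function could look like the sketch below. This is
a sketch only: the signature is hypothetical, not the patch's actual one,
and expression indexes would need FormIndexDatum instead of plain
heap_getattr:

#include "postgres.h"
#include "access/heapam.h"      /* heap_fetch, heap_getattr */
#include "access/relscan.h"     /* IndexScanDesc */
#include "catalog/pg_index.h"
#include "utils/rel.h"

/*
 * Fetch the indexed columns' current heap values for a TID, so the caller
 * can recompute an exact distance outside the access method proper.
 */
static bool
index_get_heap_values(IndexScanDesc scan, ItemPointer tid,
                      Datum *values, bool *isnull)
{
    Form_pg_index indexForm = scan->indexRelation->rd_index;
    HeapTupleData tup;
    Buffer      buf;
    int         i;

    tup.t_self = *tid;
    if (!heap_fetch(scan->heapRelation, scan->xs_snapshot,
                    &tup, &buf, false, NULL))
        return false;           /* not visible: caller skips this item */

    for (i = 0; i < indexForm->indnatts; i++)
        values[i] = heap_getattr(&tup,
                                 indexForm->indkey.values[i],
                                 RelationGetDescr(scan->heapRelation),
                                 &isnull[i]);

    ReleaseBuffer(buf);
    return true;
}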

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-3.patch
Description: Binary data



[HACKERS] Distance from point to box

2014-07-30 Thread Alexander Korotkov
Hackers,

while reading code written by my GSoC student for knn-spgist, I ran into
several questions about calculating the distance from a point to a box.

1) dist_pb calls close_pb, which calls on_pb, dist_ps_internal, close_ps and
so on. So it's a very complex way to calculate a very simple value. I see
this approach has some mathematical beauty, but is it acceptable in
practice?

2) computeDistance in knn-gist uses its own, somewhat simplified,
implementation of this calculation. But why does it have its own
implementation instead of simplifying dist_pb?

3) The computeDistance implementation still looks overly complex to me. The
right (simple and fast) way to calculate the point-to-box distance seems to
me to be this:

double dx = 0.0, dy = 0.0;

if (point->x < box->low.x)
    dx = box->low.x - point->x;
if (point->x > box->high.x)
    dx = point->x - box->high.x;
if (point->y < box->low.y)
    dy = box->low.y - point->y;
if (point->y > box->high.y)
    dy = point->y - box->high.y;
return HYPOT(dx, dy);

I feel quite tangled up here.
Could anybody clarify it for me? Did I miss something? Thanks.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Distance from point to box

2014-07-30 Thread Alexander Korotkov
On Wed, Jul 30, 2014 at 4:06 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:


  double dx = 0.0, dy = 0.0;

 if (point->x < box->low.x)
     dx = box->low.x - point->x;
 if (point->x > box->high.x)
     dx = point->x - box->high.x;
 if (point->y < box->low.y)
     dy = box->low.y - point->y;
 if (point->y > box->high.y)
     dy = point->y - box->high.y;
 return HYPOT(dx, dy);

 I feel quite tangled up here.
 Could anybody clarify it for me? Did I miss something? Thanks.


 ISTM that you miss the projection on the segment if dx=0 or dy=0.


I don't need to find the projection itself, only the distance. When dx = 0,
the nearest point is on a horizontal edge of the box, so the distance to it
is dy. Likewise when dy = 0. When both are 0, the point is inside the box.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Distance from point to box

2014-07-30 Thread Alexander Korotkov
On Wed, Jul 30, 2014 at 7:26 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:


  ISTM that you miss the projection on the segment if dx=0 or dy=0.


  I don't need to find the projection itself, only the distance. When dx =
  0, the nearest point is on a horizontal edge of the box, so the distance
  to it is dy. Likewise when dy = 0. When both are 0, the point is inside
  the box.


  Indeed. I thought that the box sides were not parallel to the axes, but
  they are. So I do not see why it should be more complex. Maybe there is a
  general algorithm which works for polygons, and they use it for boxes?


Yeah, this answers question #1, but not #2 and #3 :)

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] 9.5 CF1

2014-07-02 Thread Alexander Korotkov
On Wed, Jul 2, 2014 at 10:08 AM, Abhijit Menon-Sen a...@2ndquadrant.com
wrote:

 KNN-GiST with recheck
 Partial sort
 Some earlier discussion, but no conclusions, and as far as I know
 nobody is working on these patches at the moment.


I'm now working on the next revision of KNN-GiST with recheck. Also, I think
partial sort needs a look from somebody more familiar with the planner than
me or Marti.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Negative impact of maintenance_work_mem on GIN size

2014-05-20 Thread Alexander Korotkov
On Tue, May 20, 2014 at 4:50 AM, Alexander Korotkov aekorot...@gmail.com wrote:

 I found that sometimes a larger maintenance_work_mem leads to a larger GIN
 index. That is quite strange. ISTM that it's related to posting list
 compression, but I can't figure out how exactly.


It appears not to be related to posting list compression: I reproduced it on
9.2.

create table test as (select array[(random()*1000)::int]::int[] as v
from generate_series(1,1000) g);
set maintenance_work_mem = '2GB';
create index test_idx1 on test using gin(v);
set maintenance_work_mem = '16MB';
create index test_idx2 on test using gin(v);

 Schema |   Name    | Type  | Owner  | Table |  Size  | Description
--------+-----------+-------+--------+-------+--------+-------------
 public | test_idx1 | index | smagen | test  | 392 MB |
 public | test_idx2 | index | smagen | test  | 268 MB |
(2 rows)

The reason is that we fill the entry tree by plain insertion, without any
special optimization. A large maintenance_work_mem gives us ascending
insertion order, so insertion always goes into the rightmost page, leaving
the rest of the pages half-empty after splits. A small maintenance_work_mem
gives a more random insertion order; such insertions leave pages about 75%
full on average. Posting trees have a special optimization for the ascending
case, while the entry tree doesn't (see the sketch below).
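
The missing optimization is essentially the classic rightmost-page split
rule; a toy sketch for illustration only, not the GIN entry-tree code:

#include "postgres.h"
#include "storage/off.h"

/*
 * Pick where to split a page.  Under ascending insertion every insert hits
 * the rightmost page; a 50/50 split then strands half-empty left pages
 * forever.  Packing the left page full instead yields near-100% fill for
 * sorted input, which is what posting trees already do.
 */
static OffsetNumber
choose_split_point(bool is_rightmost_page, OffsetNumber maxoff)
{
    if (is_rightmost_page)
        return maxoff;          /* keep everything; new right page ~empty */

    return maxoff / 2;          /* random insertion: split evenly */
}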

--
With best regards,
Alexander Korotkov.


[HACKERS] Negative impact of maintenance_work_mem on GIN size

2014-05-19 Thread Alexander Korotkov
Hi!

I found that sometimes a larger maintenance_work_mem leads to a larger GIN
index. That is quite strange. ISTM that it's related to posting list
compression, but I can't figure out how exactly.

See example on delicious bookmarks dataset.

http://mira.sai.msu.ru/~megera/tmp/js.copy.gz

set maintenance_work_mem = '1GB';
create index js_idx1 on js using gin(v jsonb_path_ops);
set maintenance_work_mem = '16MB';
create index js_idx2 on js using gin(v jsonb_path_ops);

                   List of relations
 Schema |  Name   | Type  | Owner  | Table |  Size  | Description
--------+---------+-------+--------+-------+--------+-------------
 public | js_idx1 | index | smagen | js    | 432 MB |
 public | js_idx2 | index | smagen | js    | 309 MB |
(2 rows)

--
With best regards,
Alexander Korotkov.


[HACKERS] Lossy bitmap scan is broken in GIN

2014-05-10 Thread Alexander Korotkov
Hackers,

GIN hangs on lossy bitmap scan. Here is test case:

create extension btree_gin;
create table test as (select random() v from generate_series(1,1000000));
create index test_idx on test using gin(v);
set work_mem = '16MB';
select count(*) from test where v > 0.9;
 count
───
 99916
(1 row)

Time: 63,142 ms

set work_mem = '64kB';
select count(*) from test where v > 0.9;

The last query hangs. I've debugged it: it's another bug in that cursed loop
in entryGetItem. A fix is attached.

--
With best regards,
Alexander Korotkov.


gin_lossy_bitmap_fix.patch
Description: Binary data



Re: [HACKERS] [COMMITTERS] pgsql: Clean up jsonb code.

2014-05-09 Thread Alexander Korotkov
On Thu, May 8, 2014 at 11:45 AM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 05/08/2014 02:25 AM, Peter Geoghegan wrote:

 findJsonbValueFromSuperHeader()'s lowbound argument
 previously served to establish a low bound for searching when
 searching for multiple keys (so the second and subsequent
 user-supplied key could skip much of the object).


 Got that.

  In the case of
 jsonb_exists_any(), say, if you only have a reasonable expectation
 that about 1 key exists, and that happens to be the last key that the
 user passed to the text[] argument (to the existence/? operator), then
 n - 1 calls to what is now findJsonbValueFromContainer() (which now
 does not accept a lowbound) are wasted.


 Check.

  That's elem_count - 1
 top-level binary searches of the entire jsonb. Or elem_count such
 calls rather than 1 call (plus 1 sort of the supplied array) in the
 common case where jsonb_exists_any() will return false.


 Ok, but I don't see any big difference in that regard. It still called
 findJsonbValueFromContainer() elem_count times, before this commit. Each
 call was somewhat cheaper, because the lower bound of the binary search was
 initialized to where the previous search ended, but you still had to
 perform the search.

  Granted, that might not be that bad right now, given that it's only
 ever (say) elem_count or elem_count - 1 wasted binary searches through
 the *top* level, but that might not always be true.


 If we are ever to perform a deep search, I think we'll want to do much
 more to optimize that than just keep track of the lower bound. Like, start
 an iterator over the tree and check for all of the keys in one go.

  And even today,
 sorting a presumably much smaller user-passed lookup array once has to
 be cheaper than searching through the entire jsonb perhaps elem_count
 times per call.


 Well, the quick testing I did suggested otherwise. It's not a big
 difference, but sorting has all kinds of overhead, like pallocing the array
 to sort, copying the elements around etc. For a small array, the startup
 cost of sorting trumps the savings in the binary searches. Possibly the way
 the sorting was done was not optimal, and you could reduce the copying and
 allocations involved in that, but it's hardly worth the trouble.


With current HEAD I can't load the delicious dataset into jsonb format. I
get a segfault. It looks like memory corruption.

$ gzip -c -d js.copy.gz | psql postgres -c 'copy js from stdin;'
WARNING:  problem in alloc set ExprContext: bogus aset link in block
0x14a846000, chunk 0x14a84d278
CONTEXT:  COPY js, line 246766
WARNING:  problem in alloc set ExprContext: bad single-chunk 0x14a804b18 in
block 0x14a7e6000
CONTEXT:  COPY js, line 1009820
WARNING:  problem in alloc set ExprContext: bogus aset link in block
0x14a7e6000, chunk 0x14a804b18
CONTEXT:  COPY js, line 1009820
server closed the connection unexpectedly
 This probably means the server terminated abnormally
before or while processing the request.

You can get dataset here:
http://mira.sai.msu.ru/~megera/tmp/js.copy.gz

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Sequential disk access during VACUUM for GiST/GIN

2014-05-06 Thread Alexander Korotkov
Hi!

On Tue, Apr 29, 2014 at 2:34 PM, Костя Кузнецов chapae...@yandex.ru wrote:

 There is a task Sequential disk access during VACUUM for GiST/GIN in the
 GSoC14 list.
 Is nobody working on this task?


I haven't heard of anybody working on it.


 Do I understand this task correctly?
 I must recode gistbulkdelete.
 The GistBDItem *stack must have items with blkno as sequential as possible.


Yes: make gistbulkdelete and ginbulkdelete access the disk sequentially,
whereas now the tree is traversed in logical order. So these functions need
to be completely reworked; I'm not sure GistBDItem will survive :)
The challenge is concurrency. Vacuum shouldn't block concurrent readers and
writers. You can look at btbulkdelete, which supports sequential disk access
now.


 I have a question:
 where is the disk access in this function? ReadBufferExtended?


Yes, this function reads the buffer into shared memory (if it isn't there
already) and pins it.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] 9.4 Proposal: Initdb creates a single table

2014-04-23 Thread Alexander Korotkov
On Wed, Apr 23, 2014 at 10:11 AM, Simon Riggs si...@2ndquadrant.com wrote:

 We start with a database called Postgres and a schema called Public.
 Yet we don't start up with any usable tables.

 I propose we add a single table called Postgres when we Initdb

   CREATE TABLE Postgres (Id Integer, Data Jsonb);
   COMMENT ON TABLE Postgres IS 'Single table for quick start usage -
 design your database';

 The purpose of this is to make the database immediately usable. By
 including this table in the default  initdb it will mean that programs
 can rely on the existence of this table and begin working quicker.

 By now, some of you will be doubled over laughing as if this is an
 April fool joke.  I don't mean it to be at all.


I can propose a contrib module PostgreNoSQL providing the following:
1) A postgres table as you proposed.
2) Functions: get_postgres(id integer) returns jsonb, set_postgres(id
integer, data jsonb) returns void, search_postgres(query jsonb) returns
setof postgres. search_postgres will have the semantics of the @> jsonb
operator.
3) Background workers which provide an HTTP wrapper over those functions.

--
With best regards,
Alexander Korotkov.


[HACKERS] Partial match fix for fast scan

2014-04-10 Thread Alexander Korotkov
Hi,

GIN partial match appears to be broken after fast scan. The following simple
test case raises an assertion failure.

create extension btree_gin;
create table test as (select id, random() as val from
generate_series(1,100) id);
create index test_idx on test using gin (val);
vacuum test;
select * from test where val between 0.1 and 0.9;

The attached patch fixes bugs in the entryGetItem function.
I would especially point out that continue; re-checks the while condition
even when it's a postfix while (i.e. do ... while). That's why I wrapped
tbm_iterate in another while loop.
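
The gotcha in miniature, as a generic illustration with placeholder helpers
(must_skip_this_item and friends are made-up names, not the patched
entryGetItem code):

extern bool must_skip_this_item(void);
extern bool advance_to_next_item(void);
extern void process_item(void);

static void
scan_items(void)
{
    do
    {
        if (must_skip_this_item())
            continue;           /* jumps to the while() test below, which
                                 * can terminate the loop before the next
                                 * item has even been fetched */
        process_item();
    } while (advance_to_next_item());
}

Wrapping the iteration step in its own inner while loop keeps the skip logic
from accidentally consulting the outer loop's exit condition.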

--
With best regards,
Alexander Korotkov.


ginfastscanfix.patch
Description: Binary data



Re: [HACKERS] Partial match fix for fast scan

2014-04-10 Thread Alexander Korotkov
On Thu, Apr 10, 2014 at 8:22 PM, Fabrízio de Royes Mello 
fabriziome...@gmail.com wrote:


 On Thu, Apr 10, 2014 at 11:09 AM, Alexander Korotkov aekorot...@gmail.com
  wrote:

 Hi,

 GIN partial match appears to be broken after fast scan. Following simple
 test case raises assertion failure.

 create extension btree_gin;
 create table test as (select id, random() as val from
 generate_series(1,100) id);
 create index test_idx on test using gin (val);
 vacuum test;
 select * from test where val between 0.1 and 0.9;

  The attached patch fixes bugs in the entryGetItem function.
  I would especially point out that continue; re-checks the while condition
  even when it's a postfix while (i.e. do ... while). That's why I wrapped
  tbm_iterate in another while loop.


 Interesting... to me (using current master) your test case doesn't fail...

 fabrizio=# select * from test where val between 0.1 and 0.9;
   id |        val
  ----+-------------------
    1 | 0.554413774050772
    2 | 0.767866868525743
    3 | 0.601187175605446
 ...


 But fail if I change the values of between clause:

 fabrizio=# select * from test where val between 0.1 and 0.19;
 ERROR:  tuple offset out of range: 8080



It must be compiled with --enable-cassert to fail on the assertion.

--
With best regards,
Alexander Korotkov.


Re: default opclass for jsonb (was Re: [HACKERS] Call for GIST/GIN/SP-GIST opclass documentation)

2014-04-09 Thread Alexander Korotkov
On Wed, Apr 9, 2014 at 10:37 AM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 The ship has clearly sailed to add parameterized opclasses to 9.4, but
 let's keep it in mind when we decide on the defaults.

 In the absence of parameterizable opclasses, it would be much more
 flexible to have opclasses that index keys, values, key-value pairs and
 paths separately, instead of the current json_ops and json_hash_ops
 opclasses which index all of those in the same index. That way, if you only
 e.g. ever query on the existence of a key, you'd only need to index the
 keys.

 I don't understand how we ended up with the current dichotomy of json_ops
 and json_hash_ops...


+1 for parameterizable opclasses

--
With best regards,
Alexander Korotkov.


Re: default opclass for jsonb (was Re: [HACKERS] Call for GIST/GIN/SP-GIST opclass documentation)

2014-04-09 Thread Alexander Korotkov
On Wed, Apr 9, 2014 at 12:40 PM, Peter Geoghegan p...@heroku.com wrote:

 On Wed, Apr 9, 2014 at 1:21 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
  I didn't say that. On the contrary, I think the shotgun approach
 jsonb_ops
  and jsonb_hash_ops take is too broad. It should be possible to specify
 what
  to index in a more detailed fashion.

 It is - use an expression index. That's by far the most important way
 to specify what to index in a more detailed fashion. There are others,
 but that's the major one. Beyond that, yes, it's necessary to
 carefully write your query predicate a certain way. However, a similar
 situation exists in MongoDB, where there is a distinction between
 Indexes on embedded fields (which must be accessed using special
 dot notation) and indexes on subdocuments (which cannot be
 accessed using dot notation). It's late here, but I'm pretty sure
 that's a feature and not a limitation.


I believe a serious limitation we have now is that we actually specify the
kind of index to be used in the SQL query.
For example, you need to find objects with active = true. You can write:

js @> {active: true}

and then a GIN index on js can be used. Alternatively, you can write:

js->'active' = true

and then a btree expression index on (js->'active') can be used. For sure,
one can write

js @> {active: true} AND js->'active' = true

This query can use either index, but it is:
1) A kludge
2) An excess recheck
3) If both indexes are present, an excess BitmapAnd

Having to choose the index in the SQL query makes our SQL more imperative
and less declarative. Similar things can happen without json/hstore (a user
has to rewrite SQL in order to use an expression index), but now it could
become very common. My opinion is that we have to do something in the
planner to make it understand at least these two kinds of queries to be
equivalent.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GSoC 2014 proposal

2014-04-03 Thread Alexander Korotkov
On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov aekorot...@gmail.com wrote:

 On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

 The BIRCH algorithm as described in the paper describes building a tree
 in memory. If I understood correctly, you're suggesting to use a pre-built
 GiST index instead. Interesting idea!

 There are a couple of significant differences between the CF tree
 described in the paper and GiST:

 1. In GiST, a leaf item always represents one heap tuple. In the CF tree,
 a leaf item represents a cluster, which consists of one or more tuples. So
 the CF tree doesn't store an entry for every input tuple, which makes it
 possible to keep it in memory.

 2. In the CF tree, all entries in a leaf node must satisfy a threshold
 requirement, with respect to a threshold value T: the diameter (or radius)
 has to be less than T. GiST imposes no such restrictions. An item can
 legally be placed anywhere in the tree; placing it badly will just lead to
 degraded search performance, but it's still a legal GiST tree.

 3. A GiST index, like any other index in PostgreSQL, holds entries also
 for deleted tuples, until the index is vacuumed. So you cannot just use
 information from a non-leaf node and use it in the result, as the
 information summarized at a non-leaf level includes noise from the dead
 tuples.

 Can you elaborate how you are planning to use a GiST index to implement
 BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
 strict in where in the tree an item can be stored, and lets the operator
 class to specify exactly when a node is split etc.


 Hmmm, it's likely I've imagined something quite outside of this paper, and
 even already suggested it to Ivan... :)
 I need a little time to rethink it.


Using GiST we can implement BIRCH-like clustering like so:
1) Build a CF tree as a GiST index without the restriction of the T
threshold value.
2) Scan the CF tree with threshold T using some auxiliary operator. If the
consistent method sees a CF entry whose diameter is greater than T, it
returns true. Otherwise it returns false and puts this CF entry into the
output area (which could be either in-memory or a temporary table). See the
sketch below.
3) Process the other steps of the algorithm as usual.
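
A sketch of the auxiliary operator from step 2 (all names here are
hypothetical, invented for illustration):

typedef struct CFEntry CFEntry;   /* clustering-feature summary of a subtree */

extern double cf_diameter(const CFEntry *entry);
extern void   emit_cluster(const CFEntry *entry);   /* to the output area */

/*
 * Consistent-style check driving the scan: descend only into subtrees
 * whose cluster is still too coarse for the requested threshold T.
 */
static bool
cf_consistent(const CFEntry *entry, double T)
{
    if (cf_diameter(entry) > T)
        return true;            /* too coarse: let the scan descend */

    emit_cluster(entry);        /* tight enough: report whole subtree */
    return false;               /* and prune it from the scan */
}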

This modification would have the following advantages:
1) The user can build the GiST index once and then try clustering with
different parameters. The initial GiST index build would be the slowest
operation, while the other steps are expected to be fast.
2) It uses the GiST infrastructure and automatically gets the buffering
build.

The drawback is that building a GiST index is more expensive than building
an in-memory CF tree with a given threshold T (assuming T is well chosen).

Does this make any sense?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GSoC 2014 proposal

2014-04-03 Thread Alexander Korotkov
On Thu, Apr 3, 2014 at 11:21 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 04/03/2014 04:15 PM, Alexander Korotkov wrote:

 On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov aekorot...@gmail.com
 wrote:

  On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

  The BIRCH algorithm as described in the paper describes building a tree
 in memory. If I understood correctly, you're suggesting to use a
 pre-built
 GiST index instead. Interesting idea!

 There are a couple of significant differences between the CF tree
 described in the paper and GiST:

 1. In GiST, a leaf item always represents one heap tuple. In the CF
 tree,
 a leaf item represents a cluster, which consists of one or more tuples.
 So
 the CF tree doesn't store an entry for every input tuple, which makes it
 possible to keep it in memory.

 2. In the CF tree, all entries in a leaf node must satisfy a threshold
 requirement, with respect to a threshold value T: the diameter (or
 radius)
 has to be less than T. GiST imposes no such restrictions. An item can
 legally be placed anywhere in the tree; placing it badly will just lead
 to
 degraded search performance, but it's still a legal GiST tree.

 3. A GiST index, like any other index in PostgreSQL, holds entries also
 for deleted tuples, until the index is vacuumed. So you cannot just use
 information from a non-leaf node and use it in the result, as the
 information summarized at a non-leaf level includes noise from the dead
 tuples.

 Can you elaborate how you are planning to use a GiST index to implement
 BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
 strict in where in the tree an item can be stored, and lets the operator
 class to specify exactly when a node is split etc.


 Hmmm, it's likely I've imagined something quite outside of this paper,
 and
 even already suggested it to Ivan... :)
 I need a little time to rethink it.


 Using GiST we can implement BIRCH-like clustering like so:
 1) Build a CF tree as a GiST index without the restriction of the T
 threshold value.
 2) Scan the CF tree with threshold T using some auxiliary operator. If the
 consistent method sees a CF entry whose diameter is greater than T, it
 returns true. Otherwise it returns false and puts this CF entry into the
 output area (which could be either in-memory or a temporary table).
 3) Process the other steps of the algorithm as usual.


 I still don't understand how that would work. You can't trust the non-leaf
 entries, because their CF value can contain deleted entries. So you have to
 scan every tuple anyway. Once you do that, what's the point of the index?


Yeah, that is a limitation of this idea. It's not going to be an
auto-updatable CF tree. The user can build the index on a freshly vacuumed
table and play with clustering for some time. Updates to the table during
that time would either be allowed (as clustering error) or prohibited.
Another potential solution is to let this index hold some snapshot of the
data, but that doesn't seem possible to do in an extension now.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GSoC 2014 proposal

2014-04-02 Thread Alexander Korotkov
On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

 The BIRCH algorithm as described in the paper describes building a tree in
 memory. If I understood correctly, you're suggesting to use a pre-built
 GiST index instead. Interesting idea!

 There are a couple of significant differences between the CF tree described
 in the paper and GiST:

 1. In GiST, a leaf item always represents one heap tuple. In the CF tree,
 a leaf item represents a cluster, which consists of one or more tuples. So
 the CF tree doesn't store an entry for every input tuple, which makes it
 possible to keep it in memory.

 2. In the CF tree, all entries in a leaf node must satisfy a threshold
 requirement, with respect to a threshold value T: the diameter (or radius)
 has to be less than T. GiST imposes no such restrictions. An item can
 legally be placed anywhere in the tree; placing it badly will just lead to
 degraded search performance, but it's still a legal GiST tree.

 3. A GiST index, like any other index in PostgreSQL, holds entries also
 for deleted tuples, until the index is vacuumed. So you cannot just use
 information from a non-leaf node and use it in the result, as the
 information summarized at a non-leaf level includes noise from the dead
 tuples.

 Can you elaborate how you are planning to use a GiST index to implement
 BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
 strict in where in the tree an item can be stored, and lets the operator
 class to specify exactly when a node is split etc.


Hmmm, it's likely I've imagined something quite outside of this paper, and
even already suggested it to Ivan... :)
I need a little time to rethink it.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Cube extension kNN support

2014-03-31 Thread Alexander Korotkov
On Mon, Mar 31, 2014 at 10:01 PM, Sergey Konoplev gray...@gmail.com wrote:

 On Thu, Mar 27, 2014 at 3:26 PM, Sergey Konoplev gray...@gmail.com
 wrote:
  On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich stas.kelv...@gmail.com
 wrote:
  Here is the patch that introduces kNN search for cubes with euclidean,
 taxicab and chebyshev distances.
 
  What is the status of this patch?

 Referring to our private conversation with Alexander Korotkov, the
 patch is in WIP state currently, and, hopefully, will be ready by 9.5.
 I'm ready to actively participate in its testing on a real world
 production set of data.

 I'm not sure if it is doable at all, but is there any possibility to
 implement it here, or, what would be just great, are there any ready or
 half-ready solutions for a Hamming-distance-based kNN search?


Cube deals with float8 numbers. There is another patch making it work with
other number types. But the Hamming distance is for bit vectors, isn't it?


With best regards,
Alexander Korotkov.


[HACKERS] Proposal: fix range queries in btree_gin

2014-03-28 Thread Alexander Korotkov
Hackers,

after reading Heikki Linnakangas presentation about GIN from Nordic PGDay,
I figure out that btree_gin became much more attractive.
http://hlinnaka.iki.fi/presentations/NordicPGDay2014-GIN.pdf

But it has one weird behaviour: a range restriction like col > const1 AND
col < const2 is handled as the intersection of two separate partial matches,
col > const1 and col < const2. So it's awfully slow :(

The opclass can't handle it better because it only sees col > const1 and
col < const2 separately: the two restrictions are passed to
gin_extract_query independently.

This problem is known, but now I can propose a solution.

We have range types, and the restriction col <@ range can be handled
correctly by gin_extract_query, because it is passed there as a single
restriction. This is a workaround in itself, but it's weird to force users
to express queries like this.

We can add some logic to GIN. If it sees that:
1) the query contains both col > const1 and col < const2 restrictions,
2) there is a range type for this type and comparison operator,
3) the opclass supports col <@ range,
then rewrite these two restrictions as col <@ range(const1, const2).

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] gsoc knn spgist

2014-03-25 Thread Alexander Korotkov
On Tue, Mar 25, 2014 at 8:16 PM, Костя Кузнецов chapae...@yandex.ru wrote:

 Hello.

 I submitted a proposal, but Heikki Linnakangas commented that I don't have
 an implementation plan.

 My project is knn for spgist.

 Can I ask you a few questions?

 1. I researched the gist knn implementation. In the gist implementation, an
 rtree (with a distance comparator) is used in the role of the queue; in the
 spgist implementation it is a List. Can I use an rtree in spgist? If I
 can't, what can I use?


KNN-GiST uses an RB-tree for the queue; an RB-tree is very different from an
R-tree. And yes, it can be used in SP-GiST. However, an alternative is a
heap (the in-memory data structure, not the table heap). I don't know why
GiST doesn't use a heap instead of an RB-tree.


With best regards,
Alexander Korotkov.


Re: [HACKERS] jsonb and nested hstore

2014-03-20 Thread Alexander Korotkov
I've noticed two commits on github.

commit b8199ee3c2506ab81b47a0b440363fc90c0d6956
Author: Peter Geoghegan p...@heroku.com
Date:   Wed Mar 19 02:02:16 2014 -0700

For jsonb_hash_ops, hash less

By limiting the GIN entries to the least-nested level, the delicious.com
sample JSON dataset index shrinks in size from 382MB to 255MB without
any apparent downside.

commit 2cea5213dba011625fc0d5c6b447e838080087b1
Author: Peter Geoghegan p...@heroku.com
Date:   Wed Mar 19 02:13:42 2014 -0700

Revert For jsonb_hash_ops, hash less

This might be workable with another approach, but leave it for now. This
reverts commit b8199ee3c2506ab81b47a0b440363fc90c0d6956.

Setting aside the implementation, what was the idea here? To me, it's
impossible to skip any single element, because a query may include only that
element. If we skip the element, we can no longer answer the corresponding
query.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GSoC proposal. Index-only scans for GIST

2014-03-18 Thread Alexander Korotkov
On Tue, Mar 18, 2014 at 5:12 PM, Anastasia Lubennikova 
lubennikov...@gmail.com wrote:

 2. gistget algorithm (several parts omitted):

 Check the key
 gistindex_keytest() - does this index tuple satisfy the scan key(s)?

 Scan all items on the GiST index page and insert them into the queue (or
 directly to output areas)

 plain scan

 Heap tuple TIDs are returned into so->pageData[]

 ordered scan

 Heap tuple TIDs are pushed into individual search queue items

 If fetch() is specified by the developer, then using it the algorithm can
 retrieve the data directly into the output areas at this stage, without
 referring to the heap.


I think the following changes need to be made to the GiST code:
1) When the next consistent IndexTuple is found, extract the original Datums
using the fetch method.
2) Put those original Datums into the queue.
3) When returning the next ItemPointer from the queue, put the original
Datums into the IndexScanDesc (into xs_itup).

 3. Create a gistcanreturn function to check whether fetch() is specified by
 the user.

 Amcanreturn - function to check whether the index supports index-only
 scans, or zero if none.

 There is the question of supporting index-only scans for multicolumn
 indexes. It would probably require extending the interface to allow
 checking columns separately.

 To solve this question I need the community's help.


 4. Add details for index only scans into gistcostestimate function


Another part of the work worth mentioning is implementing the fetch function
for all suitable opclasses.
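
A sketch of point 3 above, assuming the fetch method gets registered as an
optional opclass support procedure. The GIST_FETCH_PROC number and the exact
signature are assumptions here, not a settled interface:

#include "postgres.h"
#include "fmgr.h"
#include "access/genam.h"       /* index_getprocid */
#include "utils/rel.h"

#define GIST_FETCH_PROC 9       /* assumed support-procedure number */

Datum
gistcanreturn(PG_FUNCTION_ARGS)
{
    Relation    index = (Relation) PG_GETARG_POINTER(0);
    int         attno = PG_GETARG_INT32(1);

    /*
     * An index-only scan is possible for a column iff its opclass
     * registered the optional fetch procedure.  Testing per column also
     * answers the multicolumn question above: each attribute is checked
     * independently.
     */
    PG_RETURN_BOOL(OidIsValid(index_getprocid(index, attno,
                                              GIST_FETCH_PROC)));
}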


With best regards,
Alexander Korotkov.


Re: [HACKERS] GSoC proposal. Index-only scans for GIST

2014-03-18 Thread Alexander Korotkov
Josh,

Anastasia has already consulted with me in person. It is not a big proposal,
but for a newbie who is not familiar with the PostgreSQL code base, and
especially with GiST, it seems fair enough.


With best regards,
Alexander Korotkov.

On Tue, Mar 18, 2014 at 9:16 PM, Josh Berkus j...@agliodbs.com wrote:

 Alexander,

 Can you comment on the below proposal?  I'd like your opinion on how
 difficult it will be.

 On 03/18/2014 06:12 AM, Anastasia Lubennikova wrote:
  Hello!
  Here is the text of my proposal, which I've submitted to GSoC.
  (and link
 
 http://www.google-melange.com/gsoc/proposal/public/google/gsoc2014/lubennikovaav/5629499534213120
 )
  Any suggestions and comments are welcome.
 
  *Project name*
 
  Support for index-only scans for GIST index
 
 
 
  *Brief review*
 
   Currently the GiST index doesn't have index-only scan functionality. Index-only
  scans are a major performance feature added to Postgres 9.2. They allow
  certain types of queries to be satisfied just by retrieving data from
  indexes, and not from tables. This feature for B-trees (implemented since
  PostgreSQL-9.2) allows doing certain types of queries significantly
 faster.
  This is achieved by reduction in the amount of I/O necessary to satisfy
  queries. I think it will be useful to implement this feature for user
  defined data types that use GiST index.
 
 
 
  *Benefits to the PostgreSQL Community*
 
   Faster GiST index search would be relevant for many PostgreSQL
   applications (for example, some GIS systems).
 
 
   *Quantifiable results*
 
  Acceleration of GiST index search.
 
 
  *Project details*
 
   1. The GiST is a balanced tree structure like a B-tree, containing (key,
   pointer) pairs. A GiST key is a member of a user-defined class, and
   represents some property that is true of all data items reachable from
   the pointer associated with the key. GiST provides the possibility to
   create custom data types with indexed access methods and an extensible
   set of queries.
 
  There are seven methods that an index operator class for GiST must
 provide,
  and an eighth that is optional.
 
  -consistent
 
  -union
 
  -compress
 
  -decompress
 
  -penalty
 
  -picksplit
 
  -equal
 
  -distance (optional)
 
   I'm going to create a new optional method, fetch(), in addition to the
   existing ones. Thus the user can provide a way of retrieving data from
   the index without loss. It will be used when performing search queries to
   speed up data retrieval.
 
 
 
  2. gistget algorithm (several parts omitted):
 
  Check the key
  gistindex_keytest() - does this index tuple satisfy the scan key(s)?
 
  Scan all items on the GiST index page and insert them into the queue (or
  directly to output areas)
 
  plain scan
 
   Heap tuple TIDs are returned into so->pageData[]
 
  ordered scan
 
  Heap tuple TIDs are pushed into individual search queue items
 
   If fetch() is specified by the developer, then using it the algorithm can
   retrieve the data directly into the output areas at this stage, without
   referring to the heap.
 
 
   3. Create a gistcanreturn function to check whether fetch() is specified
   by the user.

   Amcanreturn - function to check whether the index supports index-only
   scans, or zero if none.

   There is the question of supporting index-only scans for multicolumn
   indexes. It would probably require extending the interface to allow
   checking columns separately.

   To solve this question I need the community's help.
 
 
  4. Add details for index only scans into gistcostestimate function
 
 
 
   *Links*
 
  1) Hellerstein J. M., Naughton J. F., Pfeffer A. Generalized search
  trees for database systems. - September, 1995.
 
  2) http://www.sai.msu.su/~megera/postgres/gist/
 
  3) PostgreSQL 9.3.3 Documentation: chapters 11. Indexes, 55. GiST
  Indexes.
 


 --
 Josh Berkus
 PostgreSQL Experts Inc.
 http://pgexperts.com



Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-03-15 Thread Alexander Korotkov
On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:

 On 03/15/2014 08:40 PM, Fujii Masao wrote:

 Hi,

 I executed the following statements in HEAD and 9.3, and compared
 the size of WAL which were generated by data insertion in GIN index.

 -
 CREATE EXTENSION pg_trgm;
 CREATE TABLE hoge (col1 text);
 CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
 (FASTUPDATE = off);

 CHECKPOINT;
 SELECT pg_switch_xlog();
 SELECT pg_switch_xlog();

 SELECT pg_current_xlog_location();
 INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 100);
 SELECT pg_current_xlog_location();
 -

 The results of WAL size are

  960 MB (9.3)
2113 MB (HEAD)

 The WAL size in HEAD was more than two times bigger than that in 9.3.
 Recently the source code of GIN index has been changed dramatically.
 Is the increase in GIN-related WAL intentional or a bug?


 It was somewhat expected. Updating individual items on the new-format GIN
 pages requires decompressing and recompressing the page, and the
 recompressed posting lists need to be WAL-logged. Which generates much
 larger WAL records.

 That said, I didn't expect the difference to be quite that big when you're
 appending to the end of the table. When the new entries go to the end of
 the posting lists, you only need to recompress and WAL-log the last posting
 list, which is max 256 bytes long. But I guess that's still a lot more WAL
 than in the old format.

 That could be optimized, but I figured we can live with it, thanks to the
 fastupdate feature. Fastupdate allows amortizing that cost over several
 insertions. But of course, you explicitly disabled that...


Let me know if you want me to write a patch addressing this issue.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] jsonb and nested hstore

2014-03-13 Thread Alexander Korotkov
On Thu, Mar 13, 2014 at 1:21 PM, Greg Stark st...@mit.edu wrote:

 Well these are just normal gin and gist indexes. If we want to come up
 with new index operator classes we can still do that and keep the old
 ones if necessary. Even that seems pretty unlikely from past experience.

 I'm actually pretty sanguine even about keeping the GIST opclass. If
 it has bugs then the bugs only affect people who use this non-default
 opclass and we can fix them. It doesn't risk questioning any basic
 design choices in the patch.

 It does sound like the main question here is which opclass should be
 the default. From the discussion there's a jsonb_hash_ops which works
 on all input values but supports fewer operators and a jsonb_ops which
 supports more operators but can't handle json with larger individual
 elements. Perhaps it's better to make jsonb_hash_ops the default so at
 least it's always safe to create a default gin index?


A couple of thoughts from me:
1) We can evade the length limitation of the GIN index by truncating long
values and setting the recheck flag. We could introduce some indicator of a
truncated value, like a zero byte at the end.
2) jsonb_hash_ops can be extended to handle key queries too. We can reserve
one bit of the hash as a flag indicating whether it's the hash of a key or
the hash of a path to a value (see the sketch below). For sure, such an
index would be a bit larger. Also, jsonb_hash_ops could be split into two
opclasses: with and without keys.
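
A sketch of point 2 (the flag name and helpers are made up for
illustration, not existing jsonb code):

#define JB_PATH_FLAG 0x80000000u    /* reserve the top bit of the hash */

/* entry represents the bare existence of a key */
static uint32
hash_for_key(uint32 h)
{
    return h & ~JB_PATH_FLAG;
}

/* entry represents a full path-to-value pair, as jsonb_hash_ops has today */
static uint32
hash_for_path_value(uint32 h)
{
    return h | JB_PATH_FLAG;
}

Key-existence queries would then probe only unflagged entries and
containment queries only flagged ones, at the price of one bit of hash space
and a somewhat larger index.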

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part2: fast scan

2014-03-13 Thread Alexander Korotkov
On Thu, Mar 13, 2014 at 8:58 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 03/12/2014 07:52 PM, Alexander Korotkov wrote:

 
 * I just noticed that the dummy trueTriConsistentFn returns GIN_MAYBE,
 rather than GIN_TRUE. The equivalent boolean version returns 'true'
 without
 recheck. Is that a typo, or was there some reason for the discrepancy?
 

  Actually, there is no difference in the current implementation, but I
  implemented it so that trueTriConsistentFn can work correctly with
  shimBoolConsistentFn. In that case it should return GIN_MAYBE when it has
  no GIN_MAYBE in the input (as an analogue of setting the recheck flag). So
  it could return GIN_TRUE only if it checked that the input has a
  GIN_MAYBE. However, that check would just be a waste of cycles, so I ended
  up with just GIN_MAYBE :-)


 I don't understand that. As it is, it's inconsistent with the boolean
 trueConsistent function. trueConsistent always returns TRUE with
 recheck=false. And in GIN_SEARCH_MODE_EVERYTHING mode, there are no regular
 scan keys.


Ok, I see. I just messed it up.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part2: fast scan

2014-03-12 Thread Alexander Korotkov
On Wed, Mar 12, 2014 at 8:29 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 03/12/2014 12:09 AM, Tomas Vondra wrote:

 Hi all,

 a quick question that just occured to me - do you plan to tweak the cost
 estimation fot GIN indexes, in this patch?

 IMHO it would be appropriate, given the improvements and gains, but it
 seems to me gincostestimate() was not touched by this patch.


 Good point. We have done two major changes to GIN in this release cycle:
 changed the data page format and made it possible to skip items without
 fetching all the keys (fast scan). gincostestimate doesn't know about
 either change.

 Adjusting gincostestimate for the more compact data page format seems
 easy. When I hacked on that, I assumed all along that gincostestimate
 doesn't need to be changed as the index will just be smaller, which will be
 taken into account automatically. But now that I look at gincostestimate,
 it assumes that the size of one item on a posting tree page is a constant 6
 bytes (SizeOfIptrData), which is no longer true. I'll go fix that.

 Adjusting for the effects of skipping is harder. gincostestimate needs to
 do the same preparation steps as startScanKey: sort the query keys by
 frequency, and call consistent function to split the keys intao required
 and additional sets. And then model that the additional entries only
 need to be fetched when the other keys match. That's doable in principle,
 but requires a bunch of extra code.

 Alexander, any thoughts on that? It's getting awfully late to add new code
 for that, but it sure would be nice somehow take fast scan into account.


The preparation we do in startScanKey requires knowing the estimated sizes
of the posting lists/trees. We make this estimate by descending to the leaf
pages. I think gincostestimate is expected to be much cheaper, so we
probably need a rougher estimate there, don't we?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part2: fast scan

2014-03-12 Thread Alexander Korotkov
On Wed, Mar 12, 2014 at 8:02 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 02/26/2014 11:25 PM, Alexander Korotkov wrote:

 On Thu, Feb 27, 2014 at 1:07 AM, Alexander Korotkov aekorot...@gmail.com
 wrote:

  On Thu, Feb 20, 2014 at 1:48 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

  On 02/09/2014 12:11 PM, Alexander Korotkov wrote:

  I've rebased the catalog changes onto the latest master. The patch is
  attached. I've rerun my test suite with both the latest master
  ('committed') and the attached patch ('ternary-consistent').


 Thanks!


          method          |     sum
  ------------------------+---------------
   committed              | 143491.71501
   fast-scan-11           | 126916.11199
   fast-scan-light        | 137321.211
   fast-scan-light-heikki | 138168.02801
   master                 | 446976.288
   ternary-consistent     | 125923.514

  I attribute the regression in the latest master to the change of
  MAX_MAYBE_ENTRIES from 8 to 4.


 Yeah, probably. I set MAX_MAYBE_ENTRIES to 8 in initial versions to make
 sure we get similar behavior in Tomas' tests that used 6 search terms.
 But
 I always felt that it was too large for real queries, once we have the
 catalog changes, that's why I lowered to 4 when committing. If an
 opclass
 benefits greatly from fast scan, it should provide the ternary
 consistent
 function, and not rely on the shim implementation.


   I'm not sure about decision to reserve separate procedure number for

 ternary consistent. Probably, it would be better to add ginConfig
 method.
 It would be useful for post 9.4 improvements.


 Hmm, it might be useful for an opclass to provide both, a boolean and
 ternary consistent function, if the boolean version is significantly
 more
 efficient when all the arguments are TRUE/FALSE. OTOH, you could also
 do a
 quick check through the array to see if there are any MAYBE arguments,
 within the consistent function. But I'm inclined to keep the
 possibility to
 provide both versions. As long as we support the boolean version at all,
 there's not much difference in terms of the amount of code to support
 having them both for the same opclass.

 A ginConfig could be useful for many other things, but I don't think
 it's
 worth adding it now.


 What's the difference between returning GIN_MAYBE and GIN_TRUE+recheck?
 We discussed that earlier, but didn't reach any conclusion. That needs
 to
 be clarified in the docs. One possibility is to document that they're
 equivalent. Another is to forbid one of them. Yet another is to assign a
 different meaning to each.

 I've been thinking that it might be useful to define them so that a
 MAYBE
 result from the tri-consistent function means that it cannot decide if
 you
 have a match or not, because some of the inputs were MAYBE. And
 TRUE+recheck means that even if all the MAYBE inputs were passed as
 TRUE or
 FALSE, the result would be the same, TRUE+recheck. The practical
 difference
 would be that if the tri-consistent function returns TRUE+recheck,
 ginget.c
 wouldn't need to bother fetching the other entries, it could just return
 the entry with recheck=true immediately. While with MAYBE result, it
 would
 fetch the other entries and call tri-consistent again. ginget.c doesn't
 currently use the tri-consistent function that way - it always fetches
 all
 the entries for a potential match before calling tri-consistent, but it
 could. I had it do that in some of the patch versions, but Tomas'
 testing
 showed that it was a big loss on some queries, because the consistent
 function was called much more often. Still, something like that might be
 sensible in the future, so it might be good to distinguish those cases
 in
 the API now. Note that ginarrayproc is already using the return values
 like
 that: in GinContainedStrategy, it always returns TRUE+recheck
 regardless of
 the inputs, but in other cases it uses GIN_MAYBE.



  The next revision of the patch is attached.

  In this version the opclass should provide at least one consistent
  function: binary or ternary. It's expected to achieve the best performance
  when the opclass provides both of them. However, tests show the opposite
  :( I have to recheck it carefully.


  However, it's not!
  This revision of the patch completes my test case in 123330 ms. This is
  slightly faster than the previous revision.


 Great. Committed, I finally got around to it.

 Some minor changes: I reworded the docs and also updated the table of
 support functions in xindex.sgml. I refactored the query in opr_sanity.sql,
 to make it easier to read, and to check more carefully that the required
 support functions are present. I also added a runtime check to avoid
 crashing if neither is present.

 A few things we ought to still discuss:

 * I just noticed that the dummy trueTriConsistentFn returns GIN_MAYBE,
 rather than GIN_TRUE. The equivalent boolean version returns 'true' without
 recheck. Is that a typo, or was there some reason for the discrepancy?


Actually, there is no difference in the current implementation, but I
implemented it so that trueTriConsistentFn can work correctly with
shimBoolConsistentFn. In that case it should return GIN_MAYBE when it has no
GIN_MAYBE in the input (as an analogue of setting the recheck flag). So it
could return GIN_TRUE only if it checked that the input has a GIN_MAYBE.
However, that check would just be a waste of cycles, so I ended up with just
GIN_MAYBE :-)

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] jsonb and nested hstore

2014-03-11 Thread Alexander Korotkov
On Tue, Mar 11, 2014 at 5:19 AM, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Mar 10, 2014 at 4:19 AM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  Here it is.

 So it looks like what you have here is analogous to the other problems
 that I fixed with both GiST and GIN. That isn't surprising, and this
 does fix my test-case. I'm not terribly happy about the lack of
 explanation for the hashing in that loop, though. Why use COMP_CRC32()
 at all, for one thing?

 Why do this for non-primitive jsonb hashing?

 COMP_CRC32(stack->hash_state, PATH_SEPARATOR, 1);

 Where PATH_SEPARATOR is:

 #define PATH_SEPARATOR ("\0")

 Actually, come to think of it, why not just use one hashing function
 everywhere? i.e., jsonb_hash(PG_FUNCTION_ARGS)? It's already very
 similar. Pretty much every hash operator support function 1 (i.e. a
 particular type's hash function) is implemented with hash_any(). Can't
 we just do the same here? In any case it isn't obvious why the
 requirements for those two things (the hashing mechanism used by the
 jsonb_hash_ops GIN opclass, and the hash operator class support
 function 1 hash function) cannot be the same thing.


It's because the CRC32 interface allows incremental calculation, while
hash_any requires a single chunk of memory. I don't think that unfolding
everything is a good idea, but we could implement an incremental interface
for hash_any.
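
The difference in miniature, using the 9.4-era CRC macros from pg_crc.h (the
function itself is illustrative):

#include "postgres.h"
#include "utils/pg_crc.h"       /* INIT_CRC32 / COMP_CRC32 / FIN_CRC32 */

/*
 * Incremental hashing: the state survives between pieces, so a tree walk
 * can feed values as it descends.  hash_any() would need the pieces copied
 * into one contiguous buffer first.
 */
static uint32
hash_in_pieces(const char *a, int alen, const char *b, int blen)
{
    pg_crc32    crc;

    INIT_CRC32(crc);
    COMP_CRC32(crc, a, alen);
    COMP_CRC32(crc, b, blen);
    FIN_CRC32(crc);

    return (uint32) crc;
}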

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] jsonb and nested hstore

2014-03-10 Thread Alexander Korotkov
On Mon, Mar 10, 2014 at 1:18 PM, Peter Geoghegan p...@heroku.com wrote:

 * The jsonb_hash_ops non-default GIN opclass is broken. It has its own
 idiosyncratic notion of what constitutes containment, that sees it
 only return, say, jsonb arrays that have a matching string as their
 leftmost element (if we ask it if it contains within it another
 array with the same string). Because of the limited number of
 indexable operators (only @>), I'd put this opclass in the same
 category as GiST in terms of my willingness to forgo it for a release,
 even if it did receive a loud applause at pgConf.EU. Again, it might
 be some disparity between the operators as they existed in hstore2 at
 one time, and as they exist in the core code now, but I doubt it, not
 least since the regression tests didn't pick this up, and it's such a
 basic thing. Perhaps Oleg and Teodor just need to explain this to me.


I didn't get the comment about the leftmost element. There is absolutely no
distinction between array elements. All elements are extracted into the same
keys independent of their indexes. It seems to have no change since I wrote
hstore_hash_ops. Could you share a test case to illustrate what you mean?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] jsonb and nested hstore

2014-03-10 Thread Alexander Korotkov
On Mon, Mar 10, 2014 at 2:19 PM, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Mar 10, 2014 at 3:00 AM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  I didn't get the comment about the leftmost element. There is absolutely no
  distinction between array elements. All elements are extracted into the same
  keys independent of their indexes. It seems to have no change since I wrote
  hstore_hash_ops. Could you share a test case to illustrate what you mean?

 I don't have time to post that at the moment, but offhand I *think*
 your confusion may be due to the fact that the json_hash_ops opclass
 (as I call it) was previously consistent with the behavior of the
 other GIN opclass (the default). The problem is that they (well, at
 least the default GIN and GiST opclasses) were inconsistent with how
 the containment operator behaved in respect of jsonb array elements
 generally.

 Here is the commit on our feature branch where I fixed the problem for
 the default GIN opclass:


 https://github.com/feodor/postgres/commit/6f5e4fe9fc34f9512919b1c8b6a54952ab288640s

 If it doesn't explain the problem, you may still wish to comment on
 the correctness of this fix. I am still waiting on feedback from Oleg
 and Teodor.


Apparently, there is a bug in the calculation of hashes. Array elements were
hashed incrementally, while each of them should be hashed separately. That
causes an effect of distinguishing array elements by their indexes. Not sure
when this bug was added.
A fix is attached.
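
In sketch form, the bug and the fix look like this (illustrative C only,
borrowing the toy crc32_update from the sketch earlier in this thread; the
real fix lives in the jsonb GIN extraction code):

#include <stdint.h>
#include <string.h>

uint32_t crc32_update(uint32_t crc, const void *data, size_t len);

/* WRONG: the state is carried across elements, so an element's key depends
 * on which elements precede it, i.e. on its index in the array. */
static void
hash_elements_buggy(uint32_t *keys, const char **elems, size_t nelems)
{
    uint32_t state = 0;

    for (size_t i = 0; i < nelems; i++)
    {
        state = crc32_update(state, elems[i], strlen(elems[i]));
        keys[i] = state;
    }
}

/* RIGHT: each element is hashed from a fresh state, so equal elements get
 * equal keys regardless of their position in the array. */
static void
hash_elements_fixed(uint32_t *keys, const char **elems, size_t nelems)
{
    for (size_t i = 0; i < nelems; i++)
        keys[i] = crc32_update(0, elems[i], strlen(elems[i]));
}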

--
With best regards,
Alexander Korotkov.


jsonb-hash-ops-fix.patch
Description: Binary data



Re: [HACKERS] jsonb and nested hstore

2014-03-10 Thread Alexander Korotkov
On Mon, Mar 10, 2014 at 3:04 PM, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Mar 10, 2014 at 3:47 AM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  Fix is attached.

 Could you post a patch with regression tests, please?


Here it is.

--
With best regards,
Alexander Korotkov.


jsonb-hash-ops-fix-2.patch
Description: Binary data



Re: [HACKERS] trgm regex index peculiarity

2014-03-01 Thread Alexander Korotkov
On Mon, Feb 10, 2014 at 1:01 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  I looked at this patch a bit.  It seems like this:
  +  *BLANK_COLOR_SIZE - How much blank character is more frequent than
  +  *   other character in average
  + #define BLANK_COLOR_SIZE  32
  is a drastic overestimate.

  OK, I would like to make more reasoning for penalty.
  Let's consider word of length n.
  It contains n+1 trigrams including:
  1 __x trigram
  1 _xx trigram
  1 xx_ trigram
  n - 2 xxx trigrams

  Assume an alphabet of size m; those trigrams have the following average
  frequencies:
  __x 1/((n+1)*m)
  _xx 1/((n+1)*m^2)
  xx_ 1/((n+1)*m^2)
  xxx (n-2)/((n+1)*m^3)

 Hmm, but you're assuming that the m letters are all equally frequent,
 which is surely not true in normal text.

 I didn't have a machine-readable copy of Hemingway or Tolstoy at hand,
 but I do have a text file with the collected works of H.P. Lovecraft,
 so I tried analyzing the trigrams in that.  I find that
 * The average word length is 4.78 characters.
 * There are 5652 distinct trigrams (discounting some containing digits),
 the most common one ('  t') occurring 81222 times; the average
 occurrence count is 500.0.
 * Considering only trigrams not containing any blanks, there are
 4937 of them, the most common one ('the') occurring 45832 times,
 with an average count of 266.9.
 * There are (unsurprisingly) exactly 26 trigrams of the form '  x',
 with an average count of 19598.5.
 * There are 689 trigrams of the form ' xx' or 'xx ', the most common one
 (' th') occurring 58200 times, with an average count of 1450.1.

 So, we've rediscovered the fact that 'the' is the most common word in
 English text ;-).  But I think the significant conclusion for our purposes
 here is that single-space trigrams are on average about 1450.1/266.9 = 5.4
 times more common than the average space-free trigram, and two-space
 trigrams about 19598.5/266.9 = 73.4 times more common.

 So this leads me to the conclusion that we should be using a
 BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than
 a square-law rule for two-space trigrams).  Even using your numbers,
 it shouldn't be 32.


Next revision of the patch is attached. The changes are:
1) The notion of penalty is used instead of size.
2) We try to reduce the total penalty to WISH_TRGM_PENALTY, with
MAX_TRGM_COUNT as a restriction on the total trigram count.
3) Penalties are assigned to particular color trigram classes, i.e.
separate penalties for __a, _aa, _a_, aa_. They are based on an analysis of
trigram frequencies in Oscar Wilde's writings. We could end up with different
numbers, but I don't think they will be dramatically different.

--
With best regards,
Alexander Korotkov.


trgm-regex-optimize.3.patch
Description: Binary data



Re: [HACKERS] GIN improvements part2: fast scan

2014-02-26 Thread Alexander Korotkov
On Thu, Feb 20, 2014 at 1:48 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 02/09/2014 12:11 PM, Alexander Korotkov wrote:

  I've rebased the catalog changes onto the latest master. The patch is
  attached. I've rerun my test suite with both the latest master ('committed')
  and the attached patch ('ternary-consistent').


 Thanks!


          method         |     sum
 ------------------------+--------------
  committed              | 143491.71501
  fast-scan-11           | 126916.11199
  fast-scan-light        | 137321.211
  fast-scan-light-heikki | 138168.02801
  master                 | 446976.288
  ternary-consistent     | 125923.514

 I explain the regression in the latest master by the change of
 MAX_MAYBE_ENTRIES from 8 to 4.


 Yeah, probably. I set MAX_MAYBE_ENTRIES to 8 in initial versions to make
 sure we get similar behavior in Tomas' tests that used 6 search terms. But
 I always felt that it was too large for real queries, once we have the
 catalog changes, that's why I lowered to 4 when committing. If an opclass
 benefits greatly from fast scan, it should provide the ternary consistent
 function, and not rely on the shim implementation.


  I'm not sure about the decision to reserve a separate procedure number for
 the ternary consistent function. Probably, it would be better to add a
 ginConfig method. It would be useful for post-9.4 improvements.


 Hmm, it might be useful for an opclass to provide both, a boolean and
 ternary consistent function, if the boolean version is significantly more
 efficient when all the arguments are TRUE/FALSE. OTOH, you could also do a
 quick check through the array to see if there are any MAYBE arguments,
 within the consistent function. But I'm inclined to keep the possibility to
 provide both versions. As long as we support the boolean version at all,
 there's not much difference in terms of the amount of code to support
 having them both for the same opclass.

 A ginConfig could be useful for many other things, but I don't think it's
 worth adding it now.


 What's the difference between returning GIN_MAYBE and GIN_TRUE+recheck? We
 discussed that earlier, but didn't reach any conclusion. That needs to be
 clarified in the docs. One possibility is to document that they're
 equivalent. Another is to forbid one of them. Yet another is to assign a
 different meaning to each.

 I've been thinking that it might be useful to define them so that a MAYBE
 result from the tri-consistent function means that it cannot decide if you
 have a match or not, because some of the inputs were MAYBE. And
 TRUE+recheck means that even if all the MAYBE inputs were passed as TRUE or
 FALSE, the result would be the same, TRUE+recheck. The practical difference
 would be that if the tri-consistent function returns TRUE+recheck, ginget.c
 wouldn't need to bother fetching the other entries, it could just return
 the entry with recheck=true immediately. While with MAYBE result, it would
 fetch the other entries and call tri-consistent again. ginget.c doesn't
 currently use the tri-consistent function that way - it always fetches all
 the entries for a potential match before calling tri-consistent, but it
 could. I had it do that in some of the patch versions, but Tomas' testing
 showed that it was a big loss on some queries, because the consistent
 function was called much more often. Still, something like that might be
 sensible in the future, so it might be good to distinguish those cases in
 the API now. Note that ginarrayproc is already using the return values like
 that: in GinContainedStrategy, it always returns TRUE+recheck regardless of
 the inputs, but in other cases it uses GIN_MAYBE.


Next revision of the patch is attached.

In this version the opclass should provide at least one consistent function:
binary or ternary. It's expected to achieve the best performance when the
opclass provides both of them. However, tests show the opposite :( I have to
recheck it carefully.

I've removed the recheck flag from the ternary consistent function. It has
to return GIN_MAYBE instead.
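
For reference, the ternary convention can be illustrated with a sketch like
this (not the committed ginlogic code; GinTernaryValue is sketched as an
enum here, and the qual is a simple overlap-style "any entry present" check):

typedef enum { GIN_FALSE = 0, GIN_TRUE = 1, GIN_MAYBE = 2 } GinTernaryValue;

static GinTernaryValue
overlap_tri_consistent(const GinTernaryValue *check, int nkeys)
{
    GinTernaryValue res = GIN_FALSE;

    for (int i = 0; i < nkeys; i++)
    {
        if (check[i] == GIN_TRUE)
            return GIN_TRUE;    /* one definite match settles the qual */
        if (check[i] == GIN_MAYBE)
            res = GIN_MAYBE;    /* can't decide without this entry */
    }
    return res;
}

The key property is that GIN_MAYBE is returned exactly when the answer could
still change depending on a MAYBE input, which is what lets the caller skip
fetching entries that cannot affect the result.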

--
With best regards,
Alexander Korotkov.


gin-ternary-consistent-2.patch
Description: Binary data



Re: [HACKERS] GIN improvements part2: fast scan

2014-02-26 Thread Alexander Korotkov
On Thu, Feb 27, 2014 at 1:07 AM, Alexander Korotkov aekorot...@gmail.com wrote:

 [...]

 Next revision of the patch is attached.

 In this version the opclass should provide at least one consistent function:
 binary or ternary. It's expected to achieve the best performance when the
 opclass provides both of them. However, tests show the opposite :( I have
 to recheck it carefully.


However, it's not!
This revision of the patch completes my test case in 123330 ms. This is
slightly faster than the previous revision.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] PoC: Partial sort

2014-02-19 Thread Alexander Korotkov
On Thu, Feb 13, 2014 at 1:54 AM, Marti Raudsepp ma...@juffo.org wrote:

 I think the 1st patch now has a bug in initial_cost_mergejoin; you
 still pass the presorted_keys argument to cost_sort, making it
 calculate a partial sort cost, but generated plans never use partial
 sort. I think 0 should be passed instead. Patch attached, needs to be
 applied on top of partial-sort-basic-1 and then reverse-applied on
 partial-sort-merge-1.


It doesn't look so to me. Merge join doesn't look for partial sort
specifically. But if a path with some presorted pathkeys happens to be
selected, then partial sort will be used. See the create_mergejoin_plan
function. So, I think this cost_sort call is consistent with
create_mergejoin_plan. If we don't want partial sort to be used in such rare
cases, then we should revert it in both places. However, I doubt that it
adds any overhead, so we can leave it as is.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Small GIN optimizations (after 9.4)

2014-02-11 Thread Alexander Korotkov
On Wed, Feb 12, 2014 at 2:58 AM, Bruce Momjian br...@momjian.us wrote:

 On Sun, Feb  9, 2014 at 02:17:12PM +0400, Alexander Korotkov wrote:
  On Thu, Feb 6, 2014 at 8:31 PM, PostgreSQL - Hans-Jürgen Schönig 
  postg...@cybertec.at wrote:
 
  i think there is one more thing which would be really good in GIN and
  which would solve a ton of issues.
  atm GIN entries are sorted by item pointer.
  if we could sort them by a column it would fix a couple of real world
  issues such as ...
 
  SELECT ... FROM foo WHERE tsearch_query ORDER BY price DESC LIMIT 10
 
  ... or so.
  in many cases you want to search for, say, a product and find the cheapest
  / most expensive one.
  if the tsearch_query yields a high number of rows (which it often does)
  the subsequent sort will kill you.
 
 
  This is not intended to be a small change. However, some solution might be
  possible in the post-9.4 GIN improvements or in the new secret indexing
  project which will be presented at PGCon :-)

 Would any of the listed changes cause backward-incompatible changes to
 the on-disk format, causing problems for pg_upgrade?


None of them.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] PoC: Partial sort

2014-02-10 Thread Alexander Korotkov
On Mon, Feb 10, 2014 at 2:33 PM, Marti Raudsepp ma...@juffo.org wrote:

 On Sun, Feb 9, 2014 at 7:37 PM, Alexander Korotkov aekorot...@gmail.com
 wrote:
  This is not the only place that worries me about planning overhead. See
  get_cheapest_fractional_path_for_pathkeys. I had to estimate the number of
  groups for each sorting column in order to get the right fractional path.

 AFAICT this only happens once per plan and the overhead is O(n) to the
 number of pathkeys? I can't get worried about that, but I guess it's
 better to test anyway.

 PS: You didn't answer my questions about splitting the patch. I guess
 I'll have to do that anyway to run the tests.


Done. The patch is split.

--
With best regards,
Alexander Korotkov.


partial-sort-basic-1.patch.gz
Description: GNU Zip compressed data


partial-sort-merge-1.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] GSoC 2014 - mentors, students and admins

2014-02-10 Thread Alexander Korotkov
Hi!

On Tue, Jan 28, 2014 at 9:34 PM, Thom Brown t...@linux.com wrote:

 And I'd be fine with being admin again this year, unless there's
 anyone else who would like to take up the mantle?


Thanks for your work. I would like to see you as admin again this year.

Who would be up for mentoring this year?  And are there any project
 ideas folk would like to suggest?


I would like to be a mentor.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part2: fast scan

2014-02-09 Thread Alexander Korotkov
On Fri, Feb 7, 2014 at 5:33 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

 On 02/06/2014 01:22 PM, Alexander Korotkov wrote:

 Difference is very small. For me, it looks ready for commit.


 Great, committed!

 Now, to review the catalog changes...


I've rebased the catalog changes onto the latest master. The patch is
attached. I've rerun my test suite with both the latest master ('committed')
and the attached patch ('ternary-consistent').

         method         |     sum
------------------------+--------------
 committed              | 143491.71501
 fast-scan-11           | 126916.11199
 fast-scan-light        | 137321.211
 fast-scan-light-heikki | 138168.02801
 master                 | 446976.288
 ternary-consistent     | 125923.514

I explain the regression in the latest master by the change of
MAX_MAYBE_ENTRIES from 8 to 4. However, I'm not sure why ternary-consistent
shows such good results. Probably it's because of some optimizations you did
in the committed version, which weren't visible before because of the change
of MAX_MAYBE_ENTRIES.
I'm not sure about the decision to reserve a separate procedure number for
the ternary consistent function. Probably, it would be better to add a
ginConfig method. It would be useful for post-9.4 improvements.

--
With best regards,
Alexander Korotkov.


gin-ternary-consistent.patch
Description: Binary data



Re: [HACKERS] Small GIN optimizations (after 9.4)

2014-02-09 Thread Alexander Korotkov
On Thu, Feb 6, 2014 at 8:31 PM, PostgreSQL - Hans-Jürgen Schönig 
postg...@cybertec.at wrote:

 i think there is one more thing which would be really good in GIN and
 which would solve a ton of issues.
 atm GIN entries are sorted by item pointer.
 if we could sort them by a column it would fix a couple of real world
 issues such as ...

 SELECT ... FROM foo WHERE tsearch_query ORDER BY price DESC
 LIMIT 10

 ... or so.
 in many cases you want to search for, say, a product and find the cheapest
 / most expensive one.
 if the tsearch_query yields a high number of rows (which it often does)
 the subsequent sort will kill you.


This is not intended to be a small change. However, some solution might be
possible in the post-9.4 GIN improvements or in the new secret indexing
project which will be presented at PGCon :-)

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Small GIN optimizations (after 9.4)

2014-02-09 Thread Alexander Korotkov
On Thu, Feb 6, 2014 at 3:36 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

 While hacking on the GIN patches, I've come up with a few different ideas
 for improving performance. It's too late for 9.4, but I'll list them here
 if someone wants to work on them later:

 * Represent ItemPointers as uint64's, to speed up comparisons.
 ginCompareItemPointers is inlined into only a few instructions, but it's
 still more expensive than a single-instruction 64-bit comparison.
 ginCompareItemPointers is called very heavily in a GIN scan, so even a
 small improvement there would make for a noticeable speedup. It might be an
 improvement in code clarity, too.

 * Keep the entry streams of a GinScanKey in a binary heap, to quickly find
 the minimum curItem among them.

 I did this in various versions of the fast scan patch, but then I realized
 that the straightforward way of doing it is wrong, because a single
 GinScanEntry can be part of multiple GinScanKeys. If an entry's curItem is
 updated as part of advancing one key, and the entry is in a heap of another
 key, updating the curItem can violate the heap property of the other
 entry's heap.

 * Build a truth table (or cache) of consistent-function's results, and use
 that instead of calling consistent for every item.


Caching seems more appropriate to me. Intuition tells me that when the number
of entries is high, far from all consistent combinations will actually be
used. However, intuition might be wrong :-)
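
Such a cache might look roughly like this (purely hypothetical C with
invented names, assuming nkeys <= 32 so the boolean input vector packs into
a bitmask that serves as the cache key):

#include <stdbool.h>
#include <stdint.h>

#define CACHE_SLOTS 64

typedef struct
{
    uint32_t key;       /* packed check[] vector */
    bool     valid;
    bool     result;
} ConsistentCacheSlot;

static ConsistentCacheSlot cache[CACHE_SLOTS];

/* consistentFn stands in for the opclass consistent call; it takes the
 * packed vector here only to keep the sketch short. */
static bool
cached_consistent(bool (*consistentFn)(uint32_t), const bool *check, int nkeys)
{
    uint32_t key = 0;
    ConsistentCacheSlot *slot;

    for (int i = 0; i < nkeys; i++)
        if (check[i])
            key |= (uint32_t) 1 << i;

    slot = &cache[key % CACHE_SLOTS];
    if (!slot->valid || slot->key != key)
    {
        slot->key = key;
        slot->valid = true;
        slot->result = consistentFn(key);   /* the expensive call */
    }
    return slot->result;
}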

* Deduce AND or OR logic from the consistent function. Or have the opclass
 provide a tree of AND/OR/NOT nodes directly, instead of a consistent
 function. For example, if the query is foo & bar, we could avoid calling
 consistent function altogether, and only return items that match both.


I also had this idea. But this solution doesn't cover similarity queries.
If you have 20 entries and the consistent function returns true when at least
10 of 20 are present, then the representation in AND/OR/NOT nodes would be
too enormous (on the order of C(20,10) = 184756 disjuncts), so useless.

* Delay decoding segments during a scan. Currently, we decode all segments
 of a posting tree page into a single array at once. But with fast scan,
 we might be able to skip over all entries in some of the segments. So it
 would be better to copy the segments into backend-private memory in
 compressed format, and decode them one segment at a time (or maybe even one
 item at a time), when needed. That would avoid the unnecessary decoding of
 segments that can be skipped over, and would also reduce memory usage of a
 scan.


I would like to add another one as a continuation of fast scan:
* Skip scanning some entries entirely, forcing recheck instead. The correct
decision should be made based on costs. However, I'm not sure about the
design, because it's like a planning feature. How correct is it to do this
inside GIN?
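
As an illustration of the first idea quoted above, packing an ItemPointer
into a uint64 could look like this (a standalone sketch with invented names;
the 16-bit fields mirror ItemPointerData's bi_hi/bi_lo/ip_posid layout):

#include <stdint.h>

typedef struct
{
    uint16_t bi_hi;     /* high 16 bits of the block number */
    uint16_t bi_lo;     /* low 16 bits of the block number */
    uint16_t ip_posid;  /* offset within the page */
} ItemPointerSketch;

static inline uint64_t
itemptr_pack(ItemPointerSketch p)
{
    return ((uint64_t) p.bi_hi << 32) |
           ((uint64_t) p.bi_lo << 16) |
           (uint64_t) p.ip_posid;
}

static inline int
itemptr_cmp(ItemPointerSketch a, ItemPointerSketch b)
{
    uint64_t ua = itemptr_pack(a);
    uint64_t ub = itemptr_pack(b);

    return (ua > ub) - (ua < ub);   /* -1, 0 or +1 from one 64-bit compare */
}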

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] PoC: Partial sort

2014-02-09 Thread Alexander Korotkov
On Thu, Feb 6, 2014 at 12:39 PM, Marti Raudsepp ma...@juffo.org wrote:

 On Thu, Feb 6, 2014 at 5:31 AM, Robert Haas robertmh...@gmail.com wrote:
  Hmm, sounds a little steep.  Why is it so expensive?  I'm probably
  missing something here, because I would have thought that planner
  support for partial sorts would consist mostly of considering the same
  sorts we consider today, but with the costs reduced by the batching.

 I guess it's because the patch undoes some optimizations in the
 mergejoin planner wrt caching merge clauses and adds a whole lot of
 code to find_mergeclauses_for_pathkeys. In other code paths the
 overhead does seem to be negligible.

 Notice the removal of:
 /* Select the right mergeclauses, if we didn't already */
 /*
  * Avoid rebuilding clause list if we already made one;
  * saves memory in big join trees...
  */


This is not the only place that worries me about planning overhead.
See get_cheapest_fractional_path_for_pathkeys. I had to estimate the number
of groups for each sorting column in order to get the right fractional path.
For a partial sort path, the cost of the first batch should be included in
the initial cost.
If we don't do so, the optimizer can pick strange plans based on the
assumption that it needs only a few rows from the inner node. See the
example below.

create table test1 as (
select id,
(random()*100)::int as v1,
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create table test2 as (
select id,
(random()*100)::int as v1,
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create index test1_v1_idx on test1 (v1);

Plan without fraction estimation in
get_cheapest_fractional_path_for_pathkeys:

postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1
order by t1.v1, t1.id limit 10;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=198956893.20..198956913.33 rows=10 width=24)
   ->  Partial sort  (cost=198956893.20..19909637942.82 rows=9791031169
width=24)
         Sort Key: t1.v1, t1.id
         Presorted Key: t1.v1
         ->  Nested Loop  (cost=0.42..19883065506.84 rows=9791031169
width=24)
               Join Filter: (t1.v1 = t2.v1)
               ->  Index Scan using test1_v1_idx on test1 t1
 (cost=0.42..47600.84 rows=1000000 width=12)
               ->  Materialize  (cost=0.00..25289.00 rows=1000000 width=12)
                     ->  Seq Scan on test2 t2  (cost=0.00..15406.00
rows=1000000 width=12)
(9 rows)

Current version of the patch:

postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1
order by t1.v1, t1.id limit 10;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=3699913.43..3699913.60 rows=10 width=24)
   ->  Partial sort  (cost=3699913.43..173638549.67 rows=9791031169
width=24)
         Sort Key: t1.v1, t1.id
         Presorted Key: t1.v1
         ->  Merge Join  (cost=150444.79..147066113.70 rows=9791031169
width=24)
               Merge Cond: (t1.v1 = t2.v1)
               ->  Index Scan using test1_v1_idx on test1 t1
 (cost=0.42..47600.84 rows=1000000 width=12)
               ->  Materialize  (cost=149244.84..154244.84 rows=1000000
width=12)
                     ->  Sort  (cost=149244.84..151744.84 rows=1000000
width=12)
                           Sort Key: t2.v1
                           ->  Seq Scan on test2 t2  (cost=0.00..15406.00
rows=1000000 width=12)
(11 rows)

I don't compare actual execution times because I didn't wait for the first
plan's execution to finish :-)
But anyway, the costs are extraordinary, and the inner sequential scan of
1000000 rows is odd.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] trgm regex index peculiarity

2014-02-09 Thread Alexander Korotkov
On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  Revised version of patch with necessary comments.

 I looked at this patch a bit.  It seems like this:

 +  *BLANK_COLOR_SIZE - How much blank character is more frequent than
 +  *   other character in average
 + #define BLANK_COLOR_SIZE  32

 is a drastic overestimate.  Using the program Erik provided to generate
 some sample data, I find that blanks in trigrams are maybe about three
 times more common than other characters.  Possibly Erik's data isn't
 very representative of real text, but if that's the case we'd better
 get a more representative sample case before we start optimizing ...

 Now maybe the above number is okay for this purpose anyway, but if so
 it needs some different justification; maybe call it a penalty?
 But nonetheless it seems like a darn large penalty, particularly for  x
 type trigrams where the value would effectively be squared.  Compared to
 the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
 you might as well forget the sizing approach and just flat out reject
 trigrams containing COLOR_BLANK, because they'll never get through the
 size filter.


It seems to me the right decision to reject them when we have other trigrams
that can do the job. But there are cases when we can't.


  I'm inclined to think you need a larger MAX_TRGM_SIZE;
 and WISH_TRGM_SIZE seems mighty small as well, compared to what the
 code would have done before.


OK, I would like to give more reasoning for the penalty.
Let's consider a word of length n.
It contains n+1 trigrams, including:
1 __x trigram
1 _xx trigram
1 xx_ trigram
n - 2 xxx trigrams

Assume an alphabet of size m; those trigrams have the following average
frequencies:
__x 1/((n+1)*m)
_xx 1/((n+1)*m^2)
xx_ 1/((n+1)*m^2)
xxx (n-2)/((n+1)*m^3)

The ratios of these blank-containing frequencies to the frequency of xxx are:
__x m^2/(n-2)
_xx and xx_ m/(n-2)

In order to estimate n, I've analyzed some classics:
Ernest Hemingway, A Farewell to Arms - 3.8
Leo Tolstoy, War and Peace - 5.0

In English with alphabet size = 26 we have

__x m^2/(n-2) = 375
_xx and xx_ m/(n-2) = 14.4

In Russian with alphabet size = 33 we have

__x m^2/(n-2) = 363
_xx and xx_ m/(n-2) = 11

Given that these calculations are approximate, we can safely use the same
values for both languages.
Does such reasoning make sense?
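
Writing the ratios out compactly (LaTeX notation; same quantities as derived
above):

\[
\frac{f_{\_\_x}}{f_{xxx}} = \frac{1/((n+1)\,m)}{(n-2)/((n+1)\,m^{3})}
                          = \frac{m^{2}}{n-2},
\qquad
\frac{f_{\_xx}}{f_{xxx}} = \frac{f_{xx\_}}{f_{xxx}} = \frac{m}{n-2}.
\]

With m = 26 and n = 3.8 this gives 26^2/1.8 = 375.6 and 26/1.8 = 14.4; with
m = 33 and n = 5.0 it gives 363 and 11, matching the numbers above.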

Also, surely this bit:

 !   trgmNFA->totalTrgmCount = (int) totalTrgmSize;

 is flat out wrong?  expandColorTrigrams() expects that
 trgmNFA->totalTrgmCount is the truth, not some penalty-bloated
 abstraction.  The fact that the patch hasn't failed on you completely
 demonstrates that you've not tested any cases where blank-containing
 trigrams got through the filter, reinforcing my feeling that it's
 probably too strict now.


Oh, I wonder how I could leave such a weird bug in the patch :-(


 I wonder if it would be more appropriate to continue to measure the limit
 MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
 penalized size as the sort key while determining which color trigrams
 to discard first.


Agreed. But I would like to keep some equivalent of WISH_TRGM_SIZE.

 Another thought is that it's not clear that you should apply the same
 penalty to blanks in all three positions.  Because of the padding
 settings, any one word will normally lead to padded strings "  a",
 " ab", "yz ", so that spaces in the first position are about twice
 as common as spaces in the other positions.  (It's a little less
 than that because single-letter words produce only "  a" and " a ";
 but I'd think that such words aren't very common in text that trigrams
 are effective for.)  So I'm thinking the penalty ought to take that
 into account.

 I'm also inclined to think that it might be worth adding a size
 field to ColorTrgmInfo rather than repeatedly calculating the
 size estimate.  We only allow a thousand and some ColorTrgmInfos
 at most, so the extra space wouldn't be that large, and I'm concerned
 by the expense the current patch adds to the sort comparator.


Agreed.



 Another point is that this comment:

  * Note: per-color-trigram counts cannot overflow an int so long as
  * COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie about
  * 1290.  However, the grand total totalTrgmCount might conceivably
  * overflow an int, so we use int64 for that within this routine.

 is no longer valid, or at least it fails to demonstrate that the result
 of getColorTrigramSize() can't overflow an int; indeed I rather fear it
 can.
 The patch failed to even update the variable name in this comment, let
 alone address the substantive question.

 There are some other cosmetic things I didn't like, for instance
 colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
 rename it.  That's about it for substantive comments though.


Thanks, will be fixed.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part2: fast scan

2014-02-06 Thread Alexander Korotkov
On Thu, Feb 6, 2014 at 2:21 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

 On 02/05/2014 12:42 PM, Alexander Korotkov wrote:

 The attached patch is a light version of fast scan. It does extra consistent
 function calls only in startScanKey, with no extra calls during the scan of
 the index.
 It finds a subset of the rarest entries whose absence guarantees a false
 consistent function result.


 Sounds good. Since the extra consistent calls are only performed at
 startup, it's unlikely to cause any noticeable performance regression, even
 when it's not helping.


  I've run real-life tests based on postgresql.org logs (33318 queries). Here
 is a table with the total time of running the whole test case.

 =# select method, sum(time) from test_result group by method order by
 method;
         method        |     sum
 ----------------------+--------------
  fast-scan-11         | 126916.11199
  fast-scan-light      | 137321.211
  heikki               | 466751.693
  heikki-true-ternary  | 454113.62397
  master               | 446976.288
 (6 rows)

 where 'heikki' is gin-ternary-logic binary-heap
 preconsistent-only-on-new-page.patch and 'heikki-true-ternary' is the
 version with my catalog changes promoting the ternary consistent function
 to the opclass.


 Looks good.


  We can see fast-scan-light gives almost the same performance benefit on &
 queries. However, I tested only the CPU effect, not the IO effect.
 Any thoughts?


 This test doesn't handle lossy-pages correctly:

  /*
   * Check if all items marked as scanEntryUseForSkip are not present in
   * minItem. If so, we can correct advancePast.
   */
  if (ginCompareItemPointers(&minItem, &minItemSkip) < 0)
  {
      advancePast = minItemSkip;
      advancePast.ip_posid--;
      continue;
  }


 If minItemSkip is a lossy page, we skip all exact items on the same page.
 That's wrong, here's a test case to demonstrate:

 CREATE TABLE foo (ts tsvector);
 INSERT INTO foo SELECT to_tsvector('foo bar'||g) from generate_series(1,
 1000000) g;
 CREATE INDEX i_foo ON foo USING gin (ts);

 set work_mem='64'; -- to force lossy pages
 -- should return 111111
 select count(*) from foo where ts @@ to_tsquery('foo & bar4:*');

 After some fiddling (including fixing the above), I ended up with the
 attached version of your patch. I still left out the catalog changes, again
 not because I don't think we should have them, but to split this into
 smaller patches that can be reviewed separately. I extended the "both ways"
 shim function to work with more than one "maybe" input. Of course, that is
 O(n^2), where n is the number of "maybe" arguments, so that won't scale,
 but it's OK for testing purposes. And many if not most real world queries,
 too.

 I had a very hard time understanding what it really means for an entry to
 be in the scanEntryUseForSkip array. What does it mean to use an entry
 for skipping?

 So I refactored that logic, to hopefully make it more clear. In essence,
 the entries are divided into two groups, required and other items. If none
 of the entries in the required set are true, then we know that the overall
 qual cannot match. See the comments for a more detailed explanation. I'm
 not wedded to the term required here; one might think that *all* the
 entries in the required set must be TRUE for a match, while it's actually
 that at least one of them must be TRUE.

 In keyGetItem, we fetch the minimum item among all the entries in the
 requiredEntries set. That's the next item we're going to return, regardless
 of any items in otherEntries set. But before calling the consistent
 function, we advance the other entries up to that point, so that we know
 whether to pass TRUE or FALSE to the consistent function. IOW, otherEntries
 only provide extra information for the consistent function to decide if we
 have a match or not - they don't affect which items we return to the caller.

 I think this is pretty close to committable state now. I'm slightly
 disappointed that we can't do the skipping in more scenarios. For example,
 in foo  bar, we can currently skip bar entry up to the next foo, but
 we cannot skip foo up to the next bar. But this is fairly simple, and
 since we sort the entries by estimated frequency, should already cover all
 the pathological rare  frequent type queries. So that's OK.


Sorry for my scanty explanation. Your version of the patch looks good to me.
I've rerun my test case:

=# select method, sum(time) from test_result group by method order by
method;
         method         |     sum
------------------------+--------------
 fast-scan-11           | 126916.11199
 fast-scan-light        | 137321.211
 fast-scan-light-heikki | 138168.02801
 heikki                 | 466751.693
 heikki-true-ternary    | 454113.62397
 master                 | 446976.288
 tri-state

Re: [HACKERS] GIN improvements part2: fast scan

2014-02-05 Thread Alexander Korotkov
On Wed, Feb 5, 2014 at 1:23 AM, Alexander Korotkov aekorot...@gmail.com wrote:

 On Mon, Feb 3, 2014 at 6:31 PM, Alexander Korotkov 
 aekorot...@gmail.comwrote:

 On Mon, Jan 27, 2014 at 7:30 PM, Alexander Korotkov aekorot...@gmail.com
  wrote:

 On Mon, Jan 27, 2014 at 2:32 PM, Alexander Korotkov 
 aekorot...@gmail.com wrote:

 Every single change you did in fast scan seems to be reasonable, but
 testing shows that something went wrong. Simple test with 3 words of
 different selectivities.

 After applying your patches:

 # select count(*) from fts_test where fti @@ plainto_tsquery('english',
 'gin index select');
  count
 ───
627
 (1 row)

 Time: 21,252 ms

 In original fast-scan:

 # select count(*) from fts_test where fti @@ plainto_tsquery('english',
 'gin index select');
  count
 ───
627
 (1 row)

 Time: 3,382 ms

 I'm trying to get deeper into it.


  I had two guesses about why it's become so much slower than my original
 fast-scan:
 1) Not using native consistent function
 2) Not sorting entries

  I attach two patches which roll back these two features (sorry for the
  awful quality of the second). The native consistent function accelerates
  things significantly, as expected. It seems that sorting entries has
  almost no effect. However, it's still not as fast as the initial fast-scan:

 # select count(*) from fts_test where fti @@ plainto_tsquery('english',
 'gin index select');
  count
 ───
627
 (1 row)

 Time: 5,381 ms

 Tomas, could you rerun your tests with first and both these patches
 applied against patches by Heikki?


  I found my patch 0005-Ternary-consistent-implementation.patch to be
  completely wrong. It introduces the ternary consistent function to the
  opclass, but doesn't use it, because I forgot to include the ginlogic.c
  change in the patch. So, it shouldn't have any impact on performance.
  However, the testing results with that patch differ significantly. That
  makes me very uneasy. Can we now reproduce the exact same results?

  The right version of these two patches, combined into one against the
  current head, is attached. I've rerun the tests with it; the results are
  in /mnt/sas-raid10/gin-testing/queries/9.4-fast-scan-10. Could you rerun
  the postprocessing, including the graph drawing?

 Sometimes test cases are not what we expect. For example:

 =# explain SELECT id FROM messages WHERE body_tsvector @@
 to_tsquery('english','(5alpha1-initdb''d)');
                                   QUERY PLAN
 ─────────────────────────────────────────────────────────────────────────
   Bitmap Heap Scan on messages  (cost=84.00..88.01 rows=1 width=4)
     Recheck Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' &
  ''initdb'' & ''d'''::tsquery)
     ->  Bitmap Index Scan on messages_body_tsvector_idx  (cost=0.00..84.00
  rows=1 width=0)
           Index Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' &
  ''initdb'' & ''d'''::tsquery)
  Planning time: 0.257 ms
 (5 rows)

 5alpha1-initdb'd is 3 gin entries with different frequencies.

  Also, these patches are not intended to change relevance ordering speed.
  When the number of results is high, most of the time goes to relevance
  calculation and sorting. I propose to remove the ORDER BY clause from the
  test cases to see the scan speed more clearly.

  I've got a dump of postgresql.org search queries from Magnus. We can add
  them to our test case.


  It turns out that version 10 contained a bug in the ternary consistent
  function implementation for tsvector. Fixed in the attached version.


The attached patch is a light version of fast scan. It does extra consistent
function calls only in startScanKey, with no extra calls during the scan of
the index.
It finds a subset of the rarest entries whose absence guarantees a false
consistent function result.

I've run real-life tests based on postgresql.org logs (33318 queries). Here
is a table with the total time of running the whole test case.

=# select method, sum(time) from test_result group by method order by
method;
        method       |     sum
---------------------+--------------
 fast-scan-11        | 126916.11199
 fast-scan-light     | 137321.211
 heikki              | 466751.693
 heikki-true-ternary | 454113.62397
 master              | 446976.288
(6 rows)

where 'heikki' is gin-ternary-logic binary-heap
preconsistent-only-on-new-page.patch and 'heikki-true-ternary' is the version
with my catalog changes promoting the ternary consistent function to the
opclass.

We can see fast-scan-light gives almost the same performance benefit on &
queries. However, I tested only the CPU effect, not the IO effect.
Any thoughts?

--
With best regards,
Alexander Korotkov.




gin-fast-scan-light.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Fix picksplit with nan values

2014-02-05 Thread Alexander Korotkov
On Sat, Feb 1, 2014 at 7:50 AM, Bruce Momjian br...@momjian.us wrote:


 Where are we on this?


I found myself having an empty draft letter from November with a new version
of the patch attached. I'll return to this when we have some solution to the
GIN fast scan challenge.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part2: fast scan

2014-02-04 Thread Alexander Korotkov
On Mon, Feb 3, 2014 at 6:31 PM, Alexander Korotkov aekorot...@gmail.com wrote:

 On Mon, Jan 27, 2014 at 7:30 PM, Alexander Korotkov 
 aekorot...@gmail.comwrote:

 On Mon, Jan 27, 2014 at 2:32 PM, Alexander Korotkov aekorot...@gmail.com
  wrote:

 On Sun, Jan 26, 2014 at 8:14 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

 On 01/26/2014 08:24 AM, Tomas Vondra wrote:

 Hi!

 On 25.1.2014 22:21, Heikki Linnakangas wrote:

 Attached is a new version of the patch set, with those bugs fixed.


 I've done a bunch of tests with all the 4 patches applied, and it seems
 to work now. I've done tests with various conditions (AND/OR, number of
 words, number of conditions) and I so far I did not get any crashes,
 infinite loops or anything like that.

 I've also compared the results to 9.3 - by dumping the database and
 running the same set of queries on both machines, and indeed I got 100%
 match.

 I also did some performance tests, and that's when I started to worry.

 For example, I generated and ran 1000 queries that look like this:

SELECT id FROM messages
 WHERE body_tsvector @@ to_tsquery('english','(header & 53 & 32 &
 useful & dropped)')
 ORDER BY ts_rank(body_tsvector, to_tsquery('english','(header & 53 &
 32 & useful & dropped)')) DESC;

 i.e. in this case the query always was 5 words connected by AND. This
 query is a pretty common pattern for fulltext search - sort by a list
 of
 words and give me the best ranked results.

 On 9.3, the script was running for ~23 seconds, on patched HEAD it was
 ~40. It's perfectly reproducible, I've repeated the test several times
 with exactly the same results. The test is CPU bound, there's no I/O
 activity at all. I got the same results with more queries (~100k).

 Attached is a simple chart with x-axis used for durations measured on
 9.3.2, y-axis used for durations measured on patched HEAD. It's obvious
 a vast majority of queries is up to 2x slower - that's pretty obvious
 from the chart.

 Only about 50 queries are faster on HEAD, and 700 queries are more
 than
 50% slower on HEAD (i.e. if the query took 100ms on 9.3, it takes
 150ms
 on HEAD).

 Typically, the EXPLAIN ANALYZE looks something like this (on 9.3):

   http://explain.depesz.com/s/5tv

 and on HEAD (same query):

   http://explain.depesz.com/s/1lI

 Clearly the main difference is in the Bitmap Index Scan which takes
 60ms on 9.3 and 120ms on HEAD.

 On 9.3 the perf top looks like this:

  34.79%  postgres [.] gingetbitmap
  28.96%  postgres [.] ginCompareItemPointers
   9.36%  postgres [.] TS_execute
   5.36%  postgres [.] check_stack_depth
   3.57%  postgres [.] FunctionCall8Coll

 while on 9.4 it looks like this:

  28.20%  postgres [.] gingetbitmap
  21.17%  postgres [.] TS_execute
   8.08%  postgres [.] check_stack_depth
   7.11%  postgres [.] FunctionCall8Coll
   4.34%  postgres [.] shimTriConsistentFn

 Not sure how to interpret that, though. For example where did the
 ginCompareItemPointers go? I suspect it's thanks to inlining, and that
 it might be related to the performance decrease. Or maybe not.


 Yeah, inlining makes it disappear from the profile, and spreads that
 time to the functions calling it.

 The profile tells us that the consistent function is called a lot more
 than before. That is expected - with the fast scan feature, we're calling
 consistent not only for potential matches, but also to refute TIDs based on
 just a few entries matching. If that's effective, it allows us to skip many
 TIDs and avoid consistent calls, which compensates, but if it's not
 effective, it's just overhead.

 I would actually expect it to be fairly effective for that query, so
 that's a bit surprising. I added counters to see where the calls are coming
 from, and it seems that about 80% of the calls are actually coming from
 this little the feature I explained earlier:


  In addition to that, I'm using the ternary consistent function to check
 if minItem is a match, even if we haven't loaded all the entries yet.
 That's less important, but I think for something like rare1 | (rare2 
 frequent) it might be useful. It would allow us to skip fetching
 'frequent', when we already know that 'rare1' matches for the current
 item. I'm not sure if that's worth the cycles, but it seemed like an
 obvious thing to do, now that we have the ternary consistent function.


 So, that clearly isn't worth the cycles :-). At least not with an
 expensive consistent function; it might be worthwhile if we pre-build the
 truth-table, or cache the results of the consistent function.

 Attached is a quick patch to remove that, on top of all the other
 patches, if you want to test the effect.


 Every single change you did in fast scan seems to be reasonable, but
 testing shows that something went wrong

Re: [HACKERS] GIN improvements part2: fast scan

2014-02-03 Thread Alexander Korotkov
On Mon, Jan 27, 2014 at 7:30 PM, Alexander Korotkov aekorot...@gmail.com wrote:

 On Mon, Jan 27, 2014 at 2:32 PM, Alexander Korotkov 
 aekorot...@gmail.com wrote:

 On Sun, Jan 26, 2014 at 8:14 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

 On 01/26/2014 08:24 AM, Tomas Vondra wrote:

 [...]

 So, that clearly isn't worth the cycles :-). At least not with an
 expensive consistent function; it might be worthwhile if we pre-build the
 truth-table, or cache the results of the consistent function.

 Attached is a quick patch to remove that, on top of all the other
 patches, if you want to test the effect.


 Every single change you did in fast scan seems to be reasonable, but
 testing shows that something went wrong. Simple test with 3 words of
 different selectivities.

 After applying your patches

Re: [HACKERS] GIN improvements part2: fast scan

2014-02-03 Thread Alexander Korotkov
On Thu, Jan 30, 2014 at 8:38 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 01/30/2014 01:53 AM, Tomas Vondra wrote:

 (3) A file with explain plans for 4 queries suffering ~2x slowdown,
  and explain plans with 9.4 master and Heikki's patches is available
  here:

http://www.fuzzy.cz/tmp/gin/queries.txt

  All the queries have 6 common words, and the explain plans look
  just fine to me - exactly like the plans for other queries.

  Two things now caught my eye. First some of these queries actually
  have words repeated - either exactly like database & database or
  in negated form like !anything & anything. Second, while
  generating the queries, I use dumb frequency, where only exact
  matches count. I.e. write != written etc. But the actual number
  of hits may be much higher - for example write matches exactly
  just 5% documents, but using @@ it matches more than 20%.

  I don't know if that's the actual cause though.


 I tried these queries with the data set you posted here:
 http://www.postgresql.org/message-id/52e4141e.60...@fuzzy.cz. The reason
 for the slowdown is the extra consistent calls it causes. That's expected -
 the patch certainly does call consistent in situations where it didn't
 before, and if the pre-consistent checks are not able to eliminate many
 tuples, you lose.

 So, what can we do to mitigate that? Some ideas:

 1. Implement the catalog changes from Alexander's patch. That ought to
 remove the overhead, as you only need to call the consistent function once,
 not both ways. OTOH, currently we only call the consistent function if
 there is only one unknown column. If with the catalog changes, we always
 call the consistent function even if there are more unknown columns, we
 might end up calling it even more often.

 2. Cache the result of the consistent function.

 3. Make the consistent function cheaper. (How? Magic?)

 4. Use some kind of a heuristic, and stop doing the pre-consistent checks
 if they're not effective. Not sure what the heuristic would look like.


I'm fiddling around with the following idea for such a heuristic:
1) Sort entries in ascending order of their estimated TID counts. Assume the
number of entries is n.
2) Sequentially call the ternary consistent function with the first m entries
set to GIN_FALSE and the rest n-m set to GIN_MAYBE, until it returns
GIN_FALSE.
3) Decide whether to use fast scan depending on the number of TIDs in the
first m entries and the number of TIDs in the rest (n-m) entries. Something
like number_of_TIDs_in_first_m_entries * k < number_of_TIDs_in_rest_entries,
where k is some coefficient.
For sure, it's a rough method, but it should be fast. Without a native
implementation of the ternary consistent function the algorithm will be
slightly different: only values of m close to n should be checked.
I'm going to implement this heuristic against the last version of your patch.
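
In rough C, the proposed heuristic could be sketched as follows (invented
names throughout: GIN_FALSE/GIN_MAYBE sketched as an enum, triConsistent
standing in for the opclass's ternary consistent function, and tidCounts
assumed sorted ascending; a sketch only, not PostgreSQL code):

#include <stdbool.h>
#include <stdint.h>

typedef enum { GIN_FALSE = 0, GIN_TRUE = 1, GIN_MAYBE = 2 } GinTernaryValue;

#define MAX_ENTRIES 32

static bool
decide_fast_scan(GinTernaryValue (*triConsistent)(const GinTernaryValue *, int),
                 const int64_t *tidCounts,  /* ascending per-entry estimates */
                 int n, double k)
{
    GinTernaryValue check[MAX_ENTRIES];
    int64_t firstSum = 0;
    int64_t restSum = 0;
    int m;

    /* grow m until the absence of the first m (rarest) entries alone
     * refutes a potential match */
    for (m = 1; m <= n; m++)
    {
        for (int i = 0; i < n; i++)
            check[i] = (i < m) ? GIN_FALSE : GIN_MAYBE;
        if (triConsistent(check, n) == GIN_FALSE)
            break;
    }
    if (m > n)
        return false;   /* no refuting subset, fast scan cannot help */

    for (int i = 0; i < n; i++)
    {
        if (i < m)
            firstSum += tidCounts[i];
        else
            restSum += tidCounts[i];
    }

    /* worthwhile only when the refuting entries are much rarer */
    return firstSum * k < restSum;
}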

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] KNN-GiST with recheck

2014-02-03 Thread Alexander Korotkov
On Tue, Jan 28, 2014 at 9:32 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 01/28/2014 04:12 PM, Alexander Korotkov wrote:

 On Tue, Jan 28, 2014 at 5:54 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com

 wrote:


  4. (as you mentioned in the other thread: ) It's a modularity violation
 that you peek into the heap tuple from gist. I think the proper way to do
 this would be to extend the IndexScan executor node to perform the
 re-shuffling of tuples that come from the index in wrong order, or
 perhaps
 add a new node type for it.

 Of course that's exactly what your partial sort patch does :-). I haven't
 looked at that in detail, but I don't think the approach the partial sort
 patch takes will work here as is. In the KNN-GiST case, the index is
 returning tuples roughly in the right order, but a tuple that it returns
 might in reality belong somewhere later in the ordering. In the partial
 sort patch, the input stream of tuples is divided into non-overlapping
 groups, so that the tuples within the group are not sorted, but the
 groups
 are. I think the partial sort case is a special case of the KNN-GiST
 case,
 if you consider the lower bound of each tuple to be the leading keys that
 you don't need to sort.


  Yes. But, for instance, btree accesses the heap for unique checking. Is it
  really so criminal? :-)


 Well, it is generally considered an ugly hack in b-tree too. I'm not 100%
 opposed to doing such a hack in GiST, but would very much prefer not to.


  This is not only a question of a new node or extending an existing node.
  We need to teach the planner/executor that an access method can return the
  value of some expression which is a lower bound of another expression.
  AFAICS, an access method can currently return only original indexed datums
  and TIDs. So, I'm afraid that enormous infrastructure changes are required.
  And I can hardly imagine what they should look like.


 Yeah, I'm not sure either. Maybe a new field in IndexScanDesc, along with
 xs_itup. Or as an attribute of xs_itup itself.


This shouldn't look like a hack either. Otherwise I see no point in it: it's
better to have some isolated hack in an access method than a hack in the
planner/executor. So I see the following changes needed to implement this
the right way:
1) Implement a new relation between operators: operator1 is a lower bound of
operator2.
2) Extend the AM interface to let it return the values of operators.
3) Implement a new node for KNN sorting.
However, this requires a lot of changes to the PostgreSQL infrastructure and
could turn out to be not general enough anyway (we won't know until we have
another application).

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part2: fast scan

2014-02-03 Thread Alexander Korotkov
On Mon, Feb 3, 2014 at 7:24 PM, Tomas Vondra t...@fuzzy.cz wrote:

 On 3 Únor 2014, 15:31, Alexander Korotkov wrote:
 
  I found my patch 0005-Ternary-consistent-implementation.patch to be
  completely wrong. It introduces a ternary consistent function to the
  opclass, but doesn't use it, because I forgot to include the ginlogic.c
  change in the patch. So it shouldn't have any impact on performance.
  However, the testing results with that patch differ significantly. That
  makes me very uneasy. Can we now reproduce exactly the same results?

 Do I understand it correctly that the 0005 patch should give exactly the
 same performance as the 9.4-heikki branch (as it was applied on it, and
 effectively did no change). This wasn't exactly what I measured, although
 the differences were not that significant.


Do I understand correctly that it's 9.4-heikki and 9.4-alex-1 here:
http://www.fuzzy.cz/tmp/gin/#
For some queries the times differ. I wonder why.

I can rerun the tests, if that's what you're asking for. I'll improve the
 test a bit - e.g. I plan to average multiple runs, to filter out random
 noise (which might be significant for such short queries).

  The right version of these two patches, combined into one against current
  HEAD, is attached.
  I've rerun the tests with it; the results are in
  /mnt/sas-raid10/gin-testing/queries/9.4-fast-scan-10. Could you rerun the
  postprocessing, including the graph drawing?

 Yes, I'll do that. However I'll have to rerun the other tests too, because
 the
 previous runs were done on a different machine.

 I'm a bit confused right now. The previous patches (0005 + 0007) were
 supposed
 to be applied on top of the 4 from Heikki (0001-0004), right? AFAIK those
 were
 not committed yet, so why is this version against HEAD?

 To summarize, I know of these patch sets:

 9.4-heikki (old version)
 0001-Optimize-GIN-multi-key-queries.patch
 0002-Further-optimize-the-multi-key-GIN-searches.patch
 0003-Further-optimize-GIN-multi-key-searches.patch
 0004-Add-the-concept-of-a-ternary-consistent-check-and-us.patch

 9.4-alex-1 (based on 9.4-heikki)
 0005-Ternary-consistent-implementation.patch

 9.4-alex-1 (based on 9.4-alex-1)
 0006-Sort-entries.patch


From these patches I only need to compare 9.4-heikki (old version) and
9.4-alex-1 to resolve my doubts.

9.4-heikki (new version)
 gin-ternary-logic+binary-heap+preconsistent-only-on-new-page.patch

 9.4-alex-2 (new version)
 gin-fast-scan.10.patch.gz

 Or did I get that wrong?


Only, you mentioned 9.4-alex-1 twice; I'm afraid there's some mess in the
numbering.


  Sometimes test cases are not what we expect. For example:
 
  =# explain SELECT id FROM messages WHERE body_tsvector @@
  to_tsquery('english','(5alpha1-initdb''d)');
 QUERY PLAN
 
 
 
   Bitmap Heap Scan on messages  (cost=84.00..88.01 rows=1 width=4)
     Recheck Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' & ''initdb'' & ''d'''::tsquery)
     ->  Bitmap Index Scan on messages_body_tsvector_idx  (cost=0.00..84.00 rows=1 width=0)
           Index Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' & ''initdb'' & ''d'''::tsquery)
   Planning time: 0.257 ms
  (5 rows)
 
  5alpha1-initdb'd is 3 gin entries with different frequencies.

 Why do you find that strange? The way the query is formed or the way it's
 evaluated?

 The query generator certainly is not perfect, so it may produce some
 strange queries.


I just mean that in this case 3 words don't mean 3 GIN entries.

  Also, these patches are not intended to change relevance-ordering speed.
  When the number of results is high, most of the time goes to relevance
  calculation and sorting. I propose removing the ORDER BY clause from the
  test cases to see the scan speed more clearly.

 Sure, I can do that. Or maybe one set of queries with ORDER BY, the other
 one without it.


Good.

  I've got a dump of postgresql.org search queries from Magnus. We can add
  them to our test case.

 You mean search queries from the search for mailing list archives? Sure,
 we add that.


Yes. I'll transform them into tsqueries and send them to you privately.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part2: fast scan

2014-02-03 Thread Alexander Korotkov
On Mon, Feb 3, 2014 at 8:19 PM, Tomas Vondra t...@fuzzy.cz wrote:

Sometimes test cases are not what we expect. For example:
  
   =# explain SELECT id FROM messages WHERE body_tsvector @@
   to_tsquery('english','(5alpha1-initdb''d)');
  QUERY PLAN
  
  
 
 
    Bitmap Heap Scan on messages  (cost=84.00..88.01 rows=1 width=4)
      Recheck Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' & ''initdb'' & ''d'''::tsquery)
      ->  Bitmap Index Scan on messages_body_tsvector_idx  (cost=0.00..84.00 rows=1 width=0)
            Index Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' & ''initdb'' & ''d'''::tsquery)
Planning time: 0.257 ms
   (5 rows)
  
   5alpha1-initdb'd is 3 gin entries with different frequencies.
 
  Why do you find that strange? The way the query is formed or the way
  it's
  evaluated?
 
  The query generator certainly is not perfect, so it may produce some
  strange queries.
 
 
   I just mean that in this case 3 words don't mean 3 GIN entries.

 Isn't that expected? I mean, that's what to_tsquery may do, right?


Everything is absolutely correct. :-) It just may not be what you expect if
you don't get into the details.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] KNN-GiST with recheck

2014-01-28 Thread Alexander Korotkov
On Tue, Jan 28, 2014 at 5:54 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 01/13/2014 07:17 PM, Alexander Korotkov wrote:

 Here is a description of this patch, same as in the original thread.

 KNN-GiST provides the ability to get ordered results from an index, but
 this order is based only on index information. For instance, a GiST index
 contains bounding rectangles for polygons, so we can't get the exact
 distance to a polygon from the index (the situation is similar in
 PostGIS). In the attached patch, the GiST distance method can set a
 recheck flag (similar to the consistent method). This flag means that the
 distance method returned a lower bound of the distance, and that we should
 recheck it using the heap tuple.

 See an example.

 create table test as (select id, polygon(3+(random()*10)::int,
 circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
 generate_series(1,100) id);
 create index test_idx on test using gist (p);

 We can get results ordered by distance from polygon to point.

 postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
 point(0.5,0.5) limit 10;
 id   |   ?column?
 --------+----------------------
   755611 | 0.000405855808916853
   807562 | 0.000464123777564343
   437778 | 0.000738524708741959
   947860 |  0.00076250998760724
   389843 | 0.000886362723569568
17586 | 0.000981960100555216
   411329 |  0.00145338112316853
   894191 |  0.00149399559703506
   391907 |   0.0016647896049741
   235381 |  0.00167554614889509
 (10 rows)

 It's fast using just index scan.

  QUERY PLAN

 
 --
   Limit  (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
 rows=10 loops=1)
 ->  Index Scan using test_idx on test  (cost=0.29..157672.29
 rows=100 width=36) (actual time=0.179..0.228 rows=10 loops=1)
   Order By: (p <-> '(0.5,0.5)'::point)
   Total runtime: 0.305 ms
 (4 rows)


 Nice! Some thoughts:

 1. This patch introduces a new polygon <-> point operator. That seems
 useful on its own, with or without this patch.


Yeah, but exact-knn can't come without at least one implementation. Still,
it would be better for that operator to come as a separate patch.

2. I wonder how useful it really is to allow mixing exact and non-exact
 return values from the distance function. The distance function included in
 the patch always returns recheck=true. I have a feeling that all other
 distance functions will also always return either true or false.


For geometric datatypes, recheck variations in consistent methods are also
very rare (I can't remember any). But imagine an opclass for arrays where
the keys have different representations depending on the array length. For
such an opclass, with knn on similarity, the recheck flag could be useful.



 3. A binary heap would be a better data structure to buffer the rechecked
 values. A Red-Black tree allows random insertions and deletions, but in
 this case you need to insert arbitrary values but only remove the minimum
 item. That's exactly what a binary heap excels at. We have a nice binary
 heap implementation in the backend that you can use, see
 src/backend/lib/binaryheap.c.


Hmm. To my mind, a binary heap would be a better data structure for
KNN-GiST in general :-)
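
For illustration, here's a sketch of what buffering rechecked items in
src/backend/lib/binaryheap.c could look like. ReorderItem and the function
names are invented for this example; it is not the patch's code.

#include "postgres.h"
#include "lib/binaryheap.h"
#include "storage/itemptr.h"

typedef struct ReorderItem
{
    ItemPointerData tid;            /* heap tuple to return later */
    double          exact_distance; /* distance after heap recheck */
} ReorderItem;

/*
 * binaryheap.c keeps the "largest" element (per the comparator) at the
 * root, so return >0 when a has the *smaller* distance to get min-heap
 * behavior.
 */
static int
reorder_item_cmp(Datum a, Datum b, void *arg)
{
    ReorderItem *ia = (ReorderItem *) DatumGetPointer(a);
    ReorderItem *ib = (ReorderItem *) DatumGetPointer(b);

    if (ia->exact_distance < ib->exact_distance)
        return 1;
    if (ia->exact_distance > ib->exact_distance)
        return -1;
    return 0;
}

/* one queue per scan; note that these heaps have a fixed capacity */
static binaryheap *
create_reorder_queue(int capacity)
{
    return binaryheap_allocate(capacity, reorder_item_cmp, NULL);
}

Pushing is binaryheap_add(queue, PointerGetDatum(item)) after rechecking,
and the root is popped with binaryheap_remove_first() once the index's next
lower bound guarantees it can no longer be beaten.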


 4. (as you mentioned in the other thread: ) It's a modularity violation
 that you peek into the heap tuple from gist. I think the proper way to do
 this would be to extend the IndexScan executor node to perform the
 re-shuffling of tuples that come from the index in wrong order, or perhaps
 add a new node type for it.

 Of course that's exactly what your partial sort patch does :-). I haven't
 looked at that in detail, but I don't think the approach the partial sort
 patch takes will work here as is. In the KNN-GiST case, the index is
 returning tuples roughly in the right order, but a tuple that it returns
 might in reality belong somewhere later in the ordering. In the partial
 sort patch, the input stream of tuples is divided into non-overlapping
 groups, so that the tuples within the group are not sorted, but the groups
 are. I think the partial sort case is a special case of the KNN-GiST case,
 if you consider the lower bound of each tuple to be the leading keys that
 you don't need to sort.


Yes. But b-tree, for instance, accesses the heap for uniqueness checking.
Is it really so criminal? :-)
This is not only a question of a new node or of extending an existing node.
We need to teach the planner/executor that an access method can return the
value of some expression which is a lower bound of another expression.
AFAICS, an access method can currently return only the original indexed
datums and TIDs. So I'm afraid that enormous infrastructure changes are
required, and I can hardly imagine what they should look like.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part2: fast scan

2014-01-27 Thread Alexander Korotkov
 count(*) from fts_test where fti @@ plainto_tsquery('english',
'gin index select');
 count
───
   627
(1 row)

Time: 3,382 ms

I'm trying to get deeper into it.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part2: fast scan

2014-01-27 Thread Alexander Korotkov
On Sun, Jan 26, 2014 at 8:14 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 In addition to that, I'm using the ternary consistent function to check
 if minItem is a match, even if we haven't loaded all the entries yet.
 That's less important, but I think for something like rare1 | (rare2 &
 frequent) it might be useful. It would allow us to skip fetching
 'frequent', when we already know that 'rare1' matches for the current
 item. I'm not sure if that's worth the cycles, but it seemed like an
 obvious thing to do, now that we have the ternary consistent function.


 So, that clearly isn't worth the cycles :-). At least not with an
 expensive consistent function; it might be worthwhile if we pre-build the
 truth-table, or cache the results of the consistent function.


I believe caching consistent function results is essentially the same thing
as a lazy truth table. I think it's a good option to use with a two-state
consistent function. However, I don't think it's a reason to abandon the
three-state consistent function, because the number of entries could be
large.
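
To illustrate that equivalence, here is a standalone sketch of a lazily
filled truth table over a binary consistent function; all of the names are
invented for this example, and nkeys is assumed to be small.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef bool (*consistent_fn) (const bool *check, int nkeys, void *arg);

typedef struct LazyTruthTable
{
    consistent_fn consistent;
    void       *arg;
    int         nkeys;          /* assumed small: table has 2^nkeys slots */
    uint8_t    *memo;           /* 0 = not computed, 1 = false, 2 = true */
} LazyTruthTable;

static LazyTruthTable *
lazy_truth_table_create(consistent_fn fn, void *arg, int nkeys)
{
    LazyTruthTable *t = malloc(sizeof(LazyTruthTable));

    t->consistent = fn;
    t->arg = arg;
    t->nkeys = nkeys;
    t->memo = calloc((size_t) 1 << nkeys, 1);   /* all "not computed" */
    return t;
}

static bool
lazy_consistent(LazyTruthTable *t, const bool *check)
{
    uint32_t    mask = 0;
    int         i;

    for (i = 0; i < t->nkeys; i++)
    {
        if (check[i])
            mask |= (uint32_t) 1 << i;
    }

    /* each distinct combination of arguments is computed at most once */
    if (t->memo[mask] == 0)
        t->memo[mask] = t->consistent(check, t->nkeys, t->arg) ? 2 : 1;

    return t->memo[mask] == 2;
}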

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] PoC: Partial sort

2014-01-27 Thread Alexander Korotkov
Hi!

On Tue, Jan 21, 2014 at 3:24 AM, Marti Raudsepp ma...@juffo.org wrote:

 On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov
 aekorot...@gmail.com wrote:
 On Tue, Jan 14, 2014 at 12:54 AM, Marti Raudsepp ma...@juffo.org wrote:
  I've been trying it out in a few situations. I implemented a new
  enable_partialsort GUC to make it easier to turn on/off, this way it's
 a lot
  easier to test. The attached patch applies on top of
 partial-sort-5.patch
 
  I thought about such an option, generally not because of testing
  convenience, but because of planning overhead. The way you implemented it
  is quite naive :)

 I don't understand. I had another look at this and cost_sort still
 seems like the best place to implement this, since that's where the
 patch decides how many pre-sorted columns to use. Both mergejoin and
 simple order-by plans call into it. If enable_partialsort=false then I
 skip all pre-sorted options except full sort, making cost_sort behave
 pretty much like it did before the patch.

 I could change pathkeys_common to return 0, but that seems like a
 generic function that shouldn't be tied to partialsort. The old code
 paths called pathkeys_contained_in anyway, which has similar
 complexity. (Apart for initial_cost_mergejoin, but that doesn't seem
 special enough to make an exception for).

 Or should I use?:
   enable_partialsort ? pathkeys_common(...) : 0

  For instance, merge join relies on partial sort, which will be replaced
  with a simple sort.

 Are you saying that enable_partialsort=off should keep
 partialsort-based mergejoins enabled?

 Or are you saying that merge joins shouldn't use simple sort at all?
 But merge join was already able to use full Sort nodes before your
 patch.


Sorry that I didn't explain. In particular, I mean the following:
1) With enable_partialsort = off, all mergejoin logic should behave as it
did without the partial sort patch.
2) With the partial sort patch, the
get_cheapest_fractional_path_for_pathkeys function is much more expensive
to execute. With enable_partialsort = off it should be as cheap as without
the partial sort patch; see the sketch below.
I'll try to implement this option this week.
For now, I have attempted to fix the extra-columns-in-mergejoin problem. It
would be nice if you could test it.
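
For point 2, the kind of early exit I mean might look like this; a
hypothetical sketch only (enable_partialsort is Marti's GUC, the function
is the existing one in pathkeys.c, and the patch-specific costing is
elided):

Path *
get_cheapest_fractional_path_for_pathkeys(List *paths, List *pathkeys,
                                          Relids required_outer,
                                          double fraction)
{
    Path       *matched_path = NULL;
    ListCell   *l;

    foreach(l, paths)
    {
        Path       *path = (Path *) lfirst(l);

        if (!pathkeys_contained_in(pathkeys, path->pathkeys))
        {
            if (enable_partialsort)
            {
                /*
                 * Patch-only logic: estimate a partial sort on the common
                 * pathkey prefix and keep the path if that wins (omitted).
                 */
            }
            /* otherwise: pre-patch behavior, full match or nothing */
            continue;
        }

        if (!bms_is_subset(PATH_REQ_OUTER(path), required_outer))
            continue;

        if (matched_path == NULL ||
            compare_fractional_path_costs(matched_path, path, fraction) > 0)
            matched_path = path;
    }
    return matched_path;
}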

--
With best regards,
Alexander Korotkov.


partial-sort-7.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] PoC: Partial sort

2014-01-27 Thread Alexander Korotkov
On Tue, Jan 28, 2014 at 7:41 AM, Marti Raudsepp ma...@juffo.org wrote:

 On Mon, Jan 27, 2014 at 9:26 PM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  For now, I have attempted to fix the extra-columns-in-mergejoin problem.
  It would be nice if you could test it.

 Yes, it solves the test cases I was trying with, thanks.

  1) With enable_partialsort = off, all mergejoin logic should behave as
  it did without the partial sort patch.
  2) With the partial sort patch, the
  get_cheapest_fractional_path_for_pathkeys function is much more expensive
  to execute. With enable_partialsort = off it should be as cheap as
  without the partial sort patch.

 When it comes to planning time, I really don't think you should
 bother. The planner enable_* settings are meant for troubleshooting,
 debugging and learning about the planner. You should not expect people
 to disable them in a production setting. It's not worth complicating
 the code for that rare case.

 This is stated in the documentation
 (http://www.postgresql.org/docs/current/static/runtime-config-query.html)
 and repeatedly on the mailing lists.

 But some benchmarks of planning performance are certainly warranted.


I didn't test it, but I worry that the overhead might be high.
If that's true, then it could be like the constraint_exclusion option,
which is off by default because of planning overhead.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part 1: additional information

2014-01-24 Thread Alexander Korotkov
On Thu, Jan 23, 2014 at 9:27 AM, Alexander Korotkov aekorot...@gmail.com wrote:

 On Wed, Jan 22, 2014 at 9:28 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

 On 01/22/2014 02:17 PM, Alexander Korotkov wrote:

 We already spent a lot of time on compression. Now we need to decide what
 result we want to see. I spent quite a long time debugging the varbyte
 encoding without segments. Finally, it passed very many tests without any
 problems. Now it is just a piece of junk. I'm afraid that we will have to
 reimplement everything from scratch another two or three times, because
 the code doesn't look perfect. For sure, that's incompatible with getting
 something into 9.4.


 That's a bit harsh :-). But yes, I understand what you're saying. It's
 quite common for large patches like this to be rewritten several times
 before being committed; you just don't know what works best until you've
 tried many designs.


 Remember we also have the subsequent fast-scan patch, which is much needed
 for hstore and other applications.
 I propose to make the final decisions now and concentrate our forces on
 producing a committable patch with those decisions, and not to change them
 even if somebody has an idea of how to improve code readability 100 times
 and potential extensibility 1000 times.
 I propose the following decisions:
 1) Leave the uncompressed area, allowing duplicates in it.
 2) Don't introduce Items on the page.


 Well, I got the insertions to work now without the uncompressed area, so
 in the spirit of Just Getting This Damn Patch Committed, I'm going to go
 ahead with that. We can add the uncompressed area later if performance
 testing shows it to be necessary. And I agree about the Items on page idea
 - we can come back to that too still in 9.4 timeframe if necessary, but
 probably not.

 So, committed. It's the same design as in the most recent patch I posted,
 but with a bunch of bugs fixed, and cleaned up all over. I'm going to move
 to the fast scan patch now, but please do test and review the committed
 version to see if I missed something.


 Great! Thanks a lot!
 The assertion in dataPlaceToPageLeafRecompress is too strong: a page can
 contain up to GinDataLeafMaxContentSize bytes. A patch is attached.
 My test-suite doesn't run correctly; I'm debugging it now.


ISTM I found this bug. ginVacuumPostingTreeLeaf re-encodes only some
segments; others are not even re-palloc'd. They are moved left
in dataPlaceToPageLeafRecompress by memcpy. But it's incorrect to do that
with memcpy; the proper solution is memmove. Because the ranges overlap,
memcpy works only platform-dependently and can lead to page corruption.
Another solution is to re-palloc the segments in ginVacuumPostingTreeLeaf.
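
To spell out the hazard, a tiny standalone illustration (the names and
offsets are invented): when later segments slide left within the same page
buffer, the source and destination ranges overlap, which is undefined
behavior for memcpy but safe for memmove.

#include <string.h>

static void
slide_segments_left(char *page, size_t src_off, size_t len, size_t shrink_by)
{
    /*
     * WRONG: the ranges overlap whenever shrink_by < len, and overlapping
     * memcpy may appear to work on one platform/libc yet corrupt the page
     * on another:
     *
     *     memcpy(page + src_off - shrink_by, page + src_off, len);
     */
    memmove(page + src_off - shrink_by, page + src_off, len);
}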

--
With best regards,
Alexander Korotkov.


gin-memmove-fix.patch
Description: Binary data



Re: [HACKERS] GIN improvements part 1: additional information

2014-01-24 Thread Alexander Korotkov
On Fri, Jan 24, 2014 at 12:50 PM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:

 On 01/24/2014 10:03 AM, Alexander Korotkov wrote:

 ISTM I found this bug. ginVacuumPostingTreeLeaf re-encodes only some
 segments; others are not even re-palloc'd. They are moved left
 in dataPlaceToPageLeafRecompress by memcpy. But it's incorrect to do that
 with memcpy; the proper solution is memmove. Because the ranges overlap,
 memcpy works only platform-dependently and can lead to page corruption.
 Another solution is to re-palloc the segments in ginVacuumPostingTreeLeaf.


 Good catch. Thanks, committed, changing memcpy to memmove. Will have to
 switch to pallocing everything in the future, if we make leafRepackItems
 smarter, so that it doesn't rewrite all the segments after the first
 modified one.


OK. What about the previous fix to the assertion?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part 1: additional information

2014-01-24 Thread Alexander Korotkov
On Fri, Jan 24, 2014 at 1:19 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 01/24/2014 10:53 AM, Alexander Korotkov wrote:

 OK. What about the previous fix to the assertion?


 Ah right, fixed that too now.


Good, now my test-suite passes. The results are as follows.

Time of operations
 event | period
---+-
 index_build   | 00:01:45.709546
 index_build_recovery  | 00:00:09
 index_update  | 00:05:45.732214
 index_update_recovery | 00:01:17
 search_new| 00:24:02.191049
 search_updated| 00:26:54.122852
(6 rows)

Index sizes
 label |   size
---+---
 new   | 387547136
 after_updates | 610877440
(2 rows)

Before compression, the results were the following.

Time of operations
 event | period
---+-
 index_build   | 00:01:51.116722
 index_build_recovery  | 00:00:08
 index_update  | 00:05:12.124758
 index_update_recovery | 00:01:43
 search_new| 00:23:44.281424
 search_updated| 00:25:41.96251
(6 rows)

Index sizes
 label |size
---+
 new   |  884514816
 after_updates | 1595252736
(2 rows)

So, we can see a small regression in the timings, while the index is less
than half its former size. However, I'm not sure the regression is very
significant.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part2: fast scan

2014-01-24 Thread Alexander Korotkov
On Thu, Jan 23, 2014 at 8:22 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 01/14/2014 05:35 PM, Alexander Korotkov wrote:

 Attached version is rebased against last version of packed posting lists.


 Thanks!

 I think we're missing a trick with multi-key queries. We know that when
 multiple scan keys are used, they are ANDed together, so we can do the skip
 optimization even without the new tri-state consistent function.

 To get started, I propose the three attached patches. These only implement
 the optimization for the multi-key case, which doesn't require any changes
 to the consistent functions and hence no catalog changes. Admittedly this
 isn't anywhere near as useful in practice as the single key case, but let's
 go for the low-hanging fruit first. This nevertheless introduces some
 machinery that will be needed by the full patch anyway.

 I structured the code somewhat differently than your patch. There is no
 separate fast-path for the case where the optimization applies. Instead,
 I'm passing the advancePast variable all the way down to where the next
 batch of items are loaded from the posting tree. keyGetItem is now
 responsible for advancing the entry streams, and the logic in scanGetItem
 has been refactored so that it advances advancePast aggressively, as soon
 as one of the key streams lets us conclude that no items < a certain point
 can match.

 scanGetItem might yet need to be refactored when we get to the full
 preconsistent check stuff, but one step at a time.


 The first patch is the most interesting one, and contains the scanGetItem
 changes. The second patch allows seeking to the right segment in a posting
 tree page, and the third allows starting the posting tree scan from root,
 when skipping items (instead of just following the right-links).

 Here are some simple performance test results, demonstrating the effect of
 each of these patches. This is a best-case scenario. I don't think these
 patches has any adverse effects even in the worst-case scenario, although I
 haven't actually tried hard to measure that. The used this to create a test
 table:

 create table foo (intarr int[]);
 -- Every row contains 0 (frequent term), and a unique number.
 insert into foo select array[0,g] from generate_series(1, 1000) g;
 -- Add another tuple with 0, 1 combo physically to the end of the table.
 insert into foo values (array[0,1]);

 The query I used is this:

 postgres=# select count(*) from foo where intarr @> array[0] and intarr @>
 array[1];
  count
 ---
  2
 (1 row)

 I measured the time that query takes, and the number of pages hit, using
 explain (analyze, buffers true) 

 patches         time (ms)   buffers
 ------------------------------------
 unpatched       650         1316
 patch 1         0.52        1316
 patches 1+2     0.50        1316
 patches 1+2+3   0.13        15

 So, the second patch isn't doing much in this particular case. But it's
 trivial, and I think it will make a difference in other queries where you
 have the opportunity to skip, but return a lot of tuples overall.

 In summary, these are fairly small patches, and useful on their own, so I
 think these should be committed now. But please take a look and see if the
 logic in scanGetItem/keyGetItem looks correct to you. After this, I think
 the main fast scan logic will go into keyGetItem.


Good, thanks! Now I can reimplement fast scan based on these patches.

PS. I find it a bit surprising that in your patch, you're completely
 bailing out if there are any partial-match keys involved. Is there some
 fundamental reason for that, or just not implemented?


It's just not implemented. I think there are two possible approaches to
handle it:
1) Handle partial-match keys like an OR over the matching keys.
2) Implement keyAdvancePast for the bitmap.
The first approach seems useful with a low number of keys. Probably we
should implement automatic switching between the two.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part2: fast scan

2014-01-23 Thread Alexander Korotkov
On Fri, Jan 24, 2014 at 6:48 AM, Tomas Vondra t...@fuzzy.cz wrote:

 I plan to do more thorough testing over the weekend, but I'd like to
 make sure I understand what to expect. My understanding is that this
 patch should:

 - give the same results as the current code (e.g. the fulltext should
   not return different rows / change the ts_rank etc.)

 - improve the performance of fulltext queries

 Are there any obvious rules for which queries will benefit most from this?
 The queries generated by the tool I'm using for testing are mostly of
 this form:

   SELECT id FROM messages
     WHERE body_tsvector @@ plainto_tsquery('english', 'word1 word2 ...')
ORDER BY ts_rank(...) DESC LIMIT :n;

 with varying number of words and LIMIT values. During the testing today
 I haven't noticed any obvious performance difference, but I haven't
 spent much time on that.


These patches optimize only queries with multiple WHERE clauses. For
instance:

  SELECT id FROM messages
   WHERE body_tsvector @@ plainto_tsquery('english', 'word1')
     AND body_tsvector @@ plainto_tsquery('english', 'word2')
   ORDER BY ts_rank(...) DESC LIMIT :n;

Optimization inside a single clause will be provided as a separate patch.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part 1: additional information

2014-01-22 Thread Alexander Korotkov
On Wed, Jan 22, 2014 at 12:30 PM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:

 On 01/22/2014 09:25 AM, Alexander Korotkov wrote:

 On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas 
 hlinnakan...@vmware.com

 wrote:


  On 01/21/2014 11:35 AM, Heikki Linnakangas wrote:

  Oh, I see what's going on. I had assumed that there cannot be duplicate
 insertions into the posting tree, but that's dead wrong. The fast
 insertion mechanism depends on a duplicate insertion to do nothing.


 Ok, this turned out to be a much bigger change than I thought...

 In principle, it's not difficult to eliminate duplicates. However, to
 detect a duplicate, you have to check if the item is present in the
 uncompressed items array, or in the compressed lists. That requires
 decoding the segment where it should be.

 But if we decode the segment, what's the purpose of the uncompressed
 items
 array? Its purpose was to speed up insertions, by buffering them so that
 we
 don't need to decode and re-encode the page/segment on every inserted
 item.
 But if we're paying the price of decoding it anyway, we might as well
 include the new item and re-encode the segment. The uncompressed area
 saves
 some effort in WAL-logging, as the record of inserting an entry into the
 uncompressed area is much smaller than that of re-encoding part of the
 page, but if that really is a concern, we could track more carefully
 which
 parts of the page is modified, and only WAL-log the required parts. And
 hopefully, the fast-update lists buffer inserts so that you insert many
 items at a time to the posting tree, and the price of re-encoding is only
 paid once.

 So, now that I think about this once more, I don't think the uncompressed
 area in every leaf page is a good idea.


 I didn't get it: why not allow insertion of duplicate TIDs into the
 uncompressed area and eliminate them on re-encoding? It would be somewhat
 more work during updates, more work during scans, more WAL records. But is
 it really significant for real-life workloads?


 Hmm, so you would merrily insert duplicate TIDs into the uncompressed
 area, and weed them out when reading or recompressing the page? I had not
 thought of that. Yeah, it might be a good trade-off, duplicates are not
 expected to happen very often.


  I refactored the way the recompression routine works again. It is now more

 clearly a multi-step process. First, the existing page is disassembled
 into an in-memory struct, which is a linked list of all the segments.
 In-memory each segment can be represented as an array of item pointers,
 or
 in the compressed format. In the next phase, all the new items are added
 to
 the segments where they belong. This naturally can lead to overly large
 segments; in the third phase, the items are redistributed among the
 segments, and compressed back to the compressed format. Finally, the
 finished segments are written back to the page, or pages if it had to be
 split.

 The same subroutines to disassemble and recompress are also used in
 vacuum.

 Attached is what I've got now. This is again quite heavily-changed from
 the previous version - sorry for repeatedly rewriting this. I'll continue
 testing and re-reviewing this tomorrow.



 Let's clarify where we are. We had a well-debugged and tested version with
 hard-wired varbyte encoding. Now we're reimplementing and debugging a
 segmented version of varbyte encoding, in the hope that one day we can
 easily replace it with something better that meets our restrictions (but
 we haven't found that yet). Is that right?


 The segmentation was a sensible change on code-readability grounds alone.
 Yes, it made it easier to experiment with different encodings, and will
 make it easier to replace the encoding in the future, but that wasn't the
 only reason for doing it. It's nice to have the encoding/decoding stuff in
 ginpostinglists.c, so that the rest of the code just passes around opaque
 GinPostingList structs (previously known as PostingListSegment).

 One thing I have been pondering, though: Instead of storing the posting
 lists one after each other on the leaf page, it might be better to store
 them as Items on the page, with normal ItemIds pointing to each. So the
 page layout would be more standard, and you could use 
 PageAddItem/PageIndexMultiDelete
 to add/remove posting lists to page. The immediate advantage of that would
 be that it would make it possible to do a binary search of the segments, to
 quickly locate the right segment where a tuple belongs to. That might not
 be significant in practice - linearly scanning 32 items is not slow either.
 And it would add some overhead, one line pointer per segment, 4 * 32 = 128
 bytes per page with the current segment size of 256 bytes. But then again,
 it might be a good idea just because it would make the pages look more like
 any other page, which is generally a good thing.


We already spent a lot of time on compression. Now we need to decide what
result we want to see. I spent quite

Re: [HACKERS] GIN improvements part 1: additional information

2014-01-22 Thread Alexander Korotkov
On Wed, Jan 22, 2014 at 9:28 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 01/22/2014 02:17 PM, Alexander Korotkov wrote:

 We already spent a lot of time on compression. Now we need to decide what
 result we want to see. I spent quite a long time debugging the varbyte
 encoding without segments. Finally, it passed very many tests without any
 problems. Now it is just a piece of junk. I'm afraid that we will have to
 reimplement everything from scratch another two or three times, because
 the code doesn't look perfect. For sure, that's incompatible with getting
 something into 9.4.


 That's a bit harsh :-). But yes, I understand what you're saying. It's
 quite common for large patches like this to be rewritten several times
 before being committed; you just don't know what works best until you've
 tried many designs.


 Remember we also have the subsequent fast-scan patch, which is much needed
 for hstore and other applications.
 I propose to make the final decisions now and concentrate our forces on
 producing a committable patch with those decisions, and not to change them
 even if somebody has an idea of how to improve code readability 100 times
 and potential extensibility 1000 times.
 I propose the following decisions:
 1) Leave the uncompressed area, allowing duplicates in it.
 2) Don't introduce Items on the page.


 Well, I got the insertions to work now without the uncompressed area, so
 in the spirit of Just Getting This Damn Patch Committed, I'm going to go
 ahead with that. We can add the uncompressed area later if performance
 testing shows it to be necessary. And I agree about the Items on page idea
 - we can come back to that too still in 9.4 timeframe if necessary, but
 probably not.

 So, committed. It's the same design as in the most recent patch I posted,
 but with a bunch of bugs fixed, and cleaned up all over. I'm going to move
 to the fast scan patch now, but please do test and review the committed
 version to see if I missed something.


Great! Thanks a lot!
The assertion in dataPlaceToPageLeafRecompress is too strong: a page can
contain up to GinDataLeafMaxContentSize bytes. A patch is attached.
My test-suite doesn't run correctly; I'm debugging it now.

--
With best regards,
Alexander Korotkov.


gin-assert-fix.patch
Description: Binary data



Re: [HACKERS] GIN improvements part 1: additional information

2014-01-21 Thread Alexander Korotkov
On Mon, Jan 20, 2014 at 10:30 PM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:

 On 01/17/2014 08:49 PM, Alexander Korotkov wrote:

 On Fri, Jan 17, 2014 at 10:38 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

  On 01/17/2014 01:05 PM, Alexander Korotkov wrote:

  Seems to be fixed in the attached version of the patch.
  Another improvement in this version: only the leftmost modified segment
  and those after it are re-encoded; the left part of the page is left
  untouched.


 I'm looking into this now. A few quick notes:

 * Even when you don't re-encode the whole page, you still WAL-log the
 whole page. While correct, it'd be a pretty obvious optimization to only
 WAL-log the modified part.


 Oh, I overlooked it. I wrote code in ginRedoInsertData which finds the
 appropriate place for the data, but didn't notice that I still wrote the
 whole page to the xlog record.


 Yeah. I think ginRedoInsertData was too cute to be bug-free. I added an
 explicit unmodifiedsize field to the WAL record, so that ginRedoInsertData
 doesn't need to calculate it.


  * When new items are appended to the end of the page so that only the last
 existing compressed segment is re-encoded, and the page has to be split,
 the items aren't divided 50/50 on the pages. The loop that moves segments
 destined for the left page to the right won't move any existing,
 untouched,
 segments.


 I think this loop will move one segment. However, it's too small.


 Implemented this.

 I noticed that the gin vacuum redo routine is dead code, except for the
 data-leaf page handling, because we never remove entries or internal nodes
 (page deletion is a separate wal record type). And the data-leaf case is
 functionally equivalent to heap newpage records. I removed the dead code
 and made it more clear that it resembles heap newpage.

 Attached is a yet another version, with more bugs fixed and more comments
 added and updated. I would appreciate some heavy-testing of this patch now.
 If you could re-run the tests you've been using, that could be great. I've
 tested the WAL replay by replicating GIN operations over streaming
 replication. That doesn't guarantee it's correct, but it's a good smoke
 test.


I tried my test-suite, but it hangs in an infinite loop during an index
scan. I re-tried it on my laptop with -O0. I found it crashing on update
and vacuum in some random places, like:
Assert(GinPageIsData(page)); in xlogVacuumPage
Assert(ndecoded == totalpacked); in ginCompressPostingList
I'm trying to debug it.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part 1: additional information

2014-01-21 Thread Alexander Korotkov
On Tue, Jan 21, 2014 at 4:28 PM, Alexander Korotkov aekorot...@gmail.com wrote:

 I noticed that the gin vacuum redo routine is dead code, except for the
 data-leaf page handling, because we never remove entries or internal nodes
 (page deletion is a separate wal record type). And the data-leaf case is
 functionally equivalent to heap newpage records. I removed the dead code
 and made it more clear that it resembles heap newpage.

 Attached is a yet another version, with more bugs fixed and more comments
 added and updated. I would appreciate some heavy-testing of this patch now.
 If you could re-run the tests you've been using, that could be great. I've
 tested the WAL replay by replicating GIN operations over streaming
 replication. That doesn't guarantee it's correct, but it's a good smoke
 test.


 I tried my test-suite, but it hangs in an infinite loop during an index
 scan. I re-tried it on my laptop with -O0. I found it crashing on update
 and vacuum in some random places, like:
 Assert(GinPageIsData(page)); in xlogVacuumPage
 Assert(ndecoded == totalpacked); in ginCompressPostingList
 I'm trying to debug it.


Another question is about dataPlaceToPageLeaf:

while ((Pointer) seg < segend)
{
    if (ginCompareItemPointers(&minNewItem, &seg->first) < 0)
        break;

Shouldn't we adjust seg to the previous segment? If minNewItem is less than
seg->first, we should insert it into the previous segment, along the lines
of the hypothetical fragment below.
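
(A hypothetical fragment of what I mean, not the patch's actual code;
GinNextPostingListSegment is the segment-walking macro:)

GinPostingList *prev = NULL;

while ((Pointer) seg < segend)
{
    if (ginCompareItemPointers(&minNewItem, &seg->first) < 0)
    {
        if (prev != NULL)
            seg = prev;     /* minNewItem belongs to the previous segment */
        break;
    }
    prev = seg;
    seg = GinNextPostingListSegment(seg);
}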

--
With best regards,
Alexander Korotkov.

