Re: [HACKERS] KNN-GiST with recheck
On Tue, Dec 16, 2014 at 4:37 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

> Patch attached. It should be applied on top of my pairing heap patch at http://www.postgresql.org/message-id/548ffa2c.7060...@vmware.com. Some caveats:
>
> * The signature of the distance function is unchanged, it doesn't get a recheck argument. It is just assumed that if the consistent function sets the recheck flag, then the distance needs to be rechecked as well. We might want to add the recheck argument, like you Alexander did in your patch, but it's not important right now.

I didn't get how that is expected to work when we have only an ORDER BY qual without a filter qual. In that case the consistent function isn't called at all.

> * I used the distance term in the executor, although the ORDER BY expr machinery is more general than that. The value returned by the ORDER BY expression doesn't have to be a distance, although that's the only thing supported by GiST and the built-in opclasses.
>
> * I short-circuited the planner to assume that the ORDER BY expression always returns a float. That's true today for knn-GiST, but is obviously a bogus assumption in general.
>
> This needs some work to get into a committable state, but from a modularity point of view, this is much better than having the indexam peek into the heap.

Nice idea to put the reordering into the index scan node. It doesn't look like over-engineering to me. I'm going to bring it to a more committable state.

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] Commit fest 2014-12, let's begin!
On Mon, Dec 15, 2014 at 4:12 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

> On 12/15/2014 08:37 AM, Michael Paquier wrote:
>> On Mon, Dec 15, 2014 at 3:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>>> Michael Paquier michael.paqu...@gmail.com writes:
>>>> - Point to polygon distance operator
>>> I looked at that briefly during the last fest, but was unsure whether it was too entangled with the GiST patches that Heikki was looking at.
>> Recalling my memories of this morning, things are rather independent.
>
> Right. I also looked at it briefly, but I wasn't sure if we really want it. AFAICT, no-one has actually asked for that operator; it was written only to be an example of an operator that would benefit from the knn-gist with recheck patch. If there is some other, real, use for the knn-gist with recheck patch, then I'm OK with that, but otherwise it's dubious to add an operator just so that it can then be made faster by another patch. That said, it seems quite harmless, so we might as well commit it.

Lack of recheck is a major limitation of KNN-GiST now. People are not asking for it because they don't know that it is what's needed to implement exact KNN for PostGIS. For now they have to invent kludges like this:

WITH closest_candidates AS (
  SELECT
    streets.gid,
    streets.name,
    streets.geom
  FROM
    nyc_streets streets
  ORDER BY
    streets.geom <->
    'SRID=26918;POINT(583571.905921312 4506714.34119218)'::geometry
  LIMIT 100
)
SELECT gid, name
FROM closest_candidates
ORDER BY
  ST_Distance(
    geom,
    'SRID=26918;POINT(583571.905921312 4506714.34119218)'::geometry
  )
LIMIT 1;

See blog posts:
http://blog.light42.com/wordpress/?p=102
http://workshops.boundlessgeo.com/postgis-intro/knn.html

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] Commit fest 2014-12, let's begin!
On Mon, Dec 15, 2014 at 6:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:

> Alexander Korotkov aekorot...@gmail.com writes:
>> On Mon, Dec 15, 2014 at 4:12 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
>>> Right. I also looked at it briefly, but I wasn't sure if we really want it. AFAICT, no-one has actually asked for that operator, it was written only to be an example of an operator that would benefit from the knn-gist with recheck patch.
>> Lack of recheck is a major limitation of KNN-GiST now. People are not asking for it because they don't know that it is what's needed to implement exact KNN for PostGIS. For now they have to invent kludges like this: [ query using ORDER BY ST_Distance ]
>
> It's not apparent to me that the proposed operator is a replacement for ST_Distance. The underlying data in an example like this won't be either points or polygons, it'll be PostGIS datatypes. In short, I believe that PostGIS could use what you're talking about, but I agree with Heikki's objection that nobody has asked for this particular operator.

polygon <-> point is for sure not an ST_Distance replacement. I was making that argument about KNN-GiST with recheck itself. polygon <-> point is needed just as an in-core example of KNN-GiST with recheck.

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] Commit fest 2014-12, let's begin!
On Mon, Dec 15, 2014 at 7:47 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

> On 12/15/2014 05:22 PM, Alexander Korotkov wrote:
>> On Mon, Dec 15, 2014 at 6:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>>> Alexander Korotkov aekorot...@gmail.com writes:
>>>> On Mon, Dec 15, 2014 at 4:12 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
>>>>> Right. I also looked at it briefly, but I wasn't sure if we really want it. AFAICT, no-one has actually asked for that operator, it was written only to be an example of an operator that would benefit from the knn-gist with recheck patch.
>>>> Lack of recheck is a major limitation of KNN-GiST now. People are not asking for it because they don't know that it is what's needed to implement exact KNN for PostGIS. For now they have to invent kludges like this: [ query using ORDER BY ST_Distance ]
>>> It's not apparent to me that the proposed operator is a replacement for ST_Distance. The underlying data in an example like this won't be either points or polygons, it'll be PostGIS datatypes. In short, I believe that PostGIS could use what you're talking about, but I agree with Heikki's objection that nobody has asked for this particular operator.
>> polygon <-> point is for sure not an ST_Distance replacement. I was making that argument about KNN-GiST with recheck itself. polygon <-> point is needed just as an in-core example of KNN-GiST with recheck.
>
> Right. I don't think point <-> polygon is too useful by itself, but we need an example in core that can make use of the KNN-GiST recheck patch. We can't write a regression test for it otherwise, for starters. Actually, we probably could've used circle <-> polygon for that just as well...

Did you mean searching for circles or polygons in the last sentence?

--
With best regards,
Alexander Korotkov.
[HACKERS] Report search_path value back to the client.
Hi,

As of now postgres reports only really important variables, but among them there are also not so important ones, like 'application_name'. The only reason to report it was to help pgbouncer, and perhaps other connection poolers. The change was introduced by commit 59ed94ad0c9f74a3f057f359316c845cedc4461e.

This fact makes me wonder why the 'search_path' value is not reported back to the client. The use case is absolutely the same as with 'application_name', but a little bit more important. Our databases provide different versions of stored procedures, which are located in different schemas: 'api_version1', 'api_version2', 'api_version5', etc... When an application establishes a connection to the database, it sets the search_path value, for example, to api_version1. At the same time, a new version of the same application will set the search_path value to api_version2. Sometimes we have hundreds of instances of applications which may use different versions of stored procedures located in different schemas. It's really crazy to keep so many (hundreds of) connections to the database, and it would be much better to have something like pgbouncer in front of postgres. Right now that's not possible, because pgbouncer is not aware of search_path, and it's not really possible to fix pgbouncer, because there is no easy way to get the current value of search_path.

I would like to mark 'search_path' as GUC_REPORT:

--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2904,7 +2904,7 @@ static struct config_string ConfigureNamesString[] =
 		{"search_path", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Sets the schema search order for names that are not schema-qualified."),
 			NULL,
-			GUC_LIST_INPUT | GUC_LIST_QUOTE
+			GUC_LIST_INPUT | GUC_LIST_QUOTE | GUC_REPORT
 		},
 		&namespace_search_path,
 		"\"$user\",public",

What do you think?

Regards,
--
Alexander Kukushkin
Re: [HACKERS] Fillfactor for GIN indexes
Hi!

On Fri, Nov 21, 2014 at 8:12 AM, Michael Paquier michael.paqu...@gmail.com wrote:

> Please find attached a simple patch adding fillfactor as storage parameter for GIN indexes. The default value is the same as the one currently aka 100 to have the pages completely packed when a GIN index is created.

That's not true. Let us discuss it a little bit.

In the btree access method the index is created by sorting index tuples. Once they are sorted, it can write a tree with any fullness of the pages. With completely filled pages you would get a flood of index page splits on table updates or inserts. In order to avoid that, the default fillfactor is 90. Besides index creation, the btree access method tries to follow the given fillfactor when inserting ordered data: when the rightmost page splits, fillfactor percent of the data goes to the left page. The btree page life cycle runs between being freshly branched by a split (50% filled) and being fully packed (100% filled) and then split again. Thus we can roughly estimate the average fullness of a btree page to be 75%. This equilibrium state is reached after a large number of inserts into the btree index.

Fillfactor was then spread to other access methods, for instance GiST. In GiST the tree is always constructed by page splits, so a freshly created GiST index is already in its equilibrium state. Moreover, GiST doesn't split pages in halves: the split ratio depends on the opclass, so the equilibrium fullness of a particular GiST index is even less than 75%. What does fillfactor do with GiST? It makes GiST pages split at fillfactor fullness during index build, so GiST pages are even less full. But why? You don't get a flood of splits on a fresh GiST index anyway, because it's already in equilibrium. That's why I would rather remove fillfactor from GiST until we have some algorithm to construct GiST without page splits.

What is going on with GIN? GIN is, basically, a two-level nested btree, so fillfactor should be applicable here. But GIN is not constructed like btree...
GIN accumulates maintenance_work_mem of data and then inserts it into the tree. So a fresh GIN index is the result of inserting maintenance_work_mem-sized buckets, each of which is sorted. On a small index (or with a large maintenance_work_mem) there would be only one bucket.

GIN has two levels of btree: the entry tree and the posting tree. Currently the entry tree has no special logic to control fullness during index build. So with only one bucket the entry tree will be 50% filled; with many buckets it will be at the equilibrium state; and in some specific cases (about two buckets) it can be almost fully packed. Insertion into the posting tree is always ordered during index build, because the posting tree is a tree of TIDs. When a posting tree page splits, it leaves the left page fully packed during index build, and 75% packed during normal insertion (because it expects some sequential inserts).

The idea of the GIN build algorithm is that the entry tree is expected to be relatively small and to fit in cache, while insertions into the posting trees are ordered. Thus the algorithm should be IO-efficient and have low overhead. However, with a large entry tree this algorithm can perform much worse.

My summary is the following:
1) In order to have fully correct support of fillfactor in GIN, we would need to rewrite the GIN build algorithm.
2) Without rewriting the GIN build algorithm, not much can be done with the entry tree. However, you could implement some heuristics.
3) You definitely need to touch the code that selects the split ratio in dataPlaceToPageLeaf (starting with "if (!btree->isBuild)").
4) GIN data pages are always compressed, except in pg_upgraded indexes from pre-9.4. Take care of that in the following code:

 	if (GinPageIsCompressed(page))
 		freespace = GinDataLeafPageGetFreeSpace(page);
+	else if (btree->isBuild)
+		freespace = BLCKSZ * (100 - fillfactor) / 100;

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] KNN-GiST with recheck
On Sat, Nov 22, 2014 at 2:20 AM, Peter Geoghegan p...@heroku.com wrote:

> On Mon, Jan 13, 2014 at 9:17 AM, Alexander Korotkov aekorot...@gmail.com wrote:
>> This patch was split from the thread: http://www.postgresql.org/message-id/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=gon-...@mail.gmail.com
>> I've split it into a separate thread because it's related to partial sort only conceptually, not technically. I also renamed it from partial-knn to knn-gist-recheck as a more appropriate name. In the attached version the docs are updated. A possible weak point of this patch's design is that it fetches the heap tuple from the GiST scan. However, I didn't receive any notes about its design, so I'm going to put it into the commitfest.
>
> The partial sort thing is not in the current 2014-10 commitfest (although this patch is). Is that intentional?

It's not. I just didn't revise partial sort yet :(

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] WIP: Access method extendability
Hi, Heikki!

Thank you for summarizing. In general, I agree with your notes, with some exceptions.

On Mon, Nov 24, 2014 at 1:52 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

> On 11/10/2014 10:30 PM, Alexander Korotkov wrote:
>> Not allowing a CREATE ACCESS METHOD command seems problematic to me. How could it work with pg_upgrade? pg_dump wouldn't dump the extra pg_am records. So, pg_upgrade would break when creating operator classes on the new cluster. So, I agree with dropping the CREATE ACCESS METHOD command only if we let pg_dump dump the extra pg_am records...
>
> pg_dump would dump the CREATE EXTENSION command, and the extension's installation script inserts the row into pg_am. pg_dump doesn't dump objects that are part of an extension, so that's how this would work with the CREATE ACCESS METHOD command, too.

In binary upgrade mode pg_dump has to guarantee that all database objects keep the same OIDs. That's why in binary upgrade mode pg_dump dumps the extension contents instead of just a CREATE EXTENSION command.

> Backtracking a bit, to summarize the discussion so far:
>
> * It would be nice to have indexes that are not WAL-logged, but are automatically re-created after a crash. But it's a completely different and orthogonal feature, so there's no need to discuss that further in this thread.
>
> * If an extension is buggy, it can crash and corrupt the whole database. There isn't really anything we can do about that, and this patch doesn't make that any better or worse.
>
> * CREATE ACCESS METHOD command isn't worth it.

Taking into account my previous note: how can custom extensions survive pg_upgrade without a CREATE ACCESS METHOD command?

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] WIP: Access method extendability
Hi!

Thanks everybody for the discussion, and sorry for the delay. I'll try to address most of the questions raised in this discussion. In general, I agree with the concept of marking an index as invalid in certain cases.

I see the following possible consequences of a buggy WAL-logged custom AM:
1) Server crash during insert/update.
2) Server crash during select.
3) Invalid query answers.
4) Server crash during vacuum.
5) Server crash in recovery.

#5 is totally unacceptable. I've tried to design the generic WAL record so that it should always be possible to purely mechanically apply the record: it's always possible to move a piece of memory inside the page, and it's always possible to copy a piece of memory from the WAL record to the page. Buggy WAL could for sure cause index corruption, as could any other bug in an AM; WAL support doesn't bring anything special to this except the fact that WAL is quite hard to debug. However, in the current implementation I missed some hidden asserts about the page structure. A page could be so corrupted that we can't, for instance, safely call XLogReadBufferForRedo(). All these cases must be worked out: if we can't update a page during recovery, the index must be marked as invalid. It's some amount of work, but it seems very feasible.

#4 seems problematic too. If autovacuum crashes on some index, then postgres can enter a crash loop. So, if autovacuum crashes on an extension-provided AM, that index should be marked as invalid.

Consequences #1, #2 and #3 could just as easily be caused by a buggy opclass. The reason why we're not knee-deep in extension-provided bugs in GiST/GIN opclasses is not that writing correct GiST/GIN opclasses is easy; the actual reason is that we have only a few GiST/GIN opclasses which weren't written by the GiST/GIN authors. So, I don't expect PostgreSQL to be flooded with buggy AMs once we add AM extendability.

Our motivation behind this proposal is that we want to be able to develop custom extensions with AMs.
We want to be able to provide our AMs to our customers without having to push them into PostgreSQL core or fork PostgreSQL. Bugs in those AMs are our responsibility to our customers. Some commercial companies could implement patented AMs (for instance, fractal trees), and that is their responsibility to their customers.

Also, I think it's OK to leave adapting custom AMs to changes in the AM interface to the authors of those custom AMs. AM extensions have to be adapted to each new major release anyway. AFAIR, the interface of the RelationOpen() function was changed not too long ago. A custom AM would use many functions which we use to access relations, and their interfaces can change in the next release.

The PostGIS GiST opclass has bugs which are reproducible, known and not fixed. That is PostGIS's responsibility to their customers. We can feel sorry for PostGIS that they are so slow at fixing them, but we shouldn't feel sorry for GiST extendability; that would be ridiculous.

Some researchers could write their own extensions. We can hope that they are smart enough not to recommend them for production use. We can back that hope with a warning while installing an extension-provided AM, saying that any corruption caused by an extension-provided AM is up to the AM developer. This warning could make users beware of using extension-provided AMs in production unless they fully trust the extension developer (and have a support subscription if it's commercial).

PostgreSQL doesn't have any kind of safe extensions: every extension must be trusted. Every extension can break (almost) everything. When one types CREATE EXTENSION, he must completely trust the extension author. This applies to every extension.

I would be very careful with discouraging commercial AM extensions. We should always keep in mind how many PostgreSQL developers work for companies which own their own commercial PostgreSQL forks, and how big their contribution is. Patented extensions look scary, for sure.
But that's up to software patents, not up to PostgreSQL extendability...

The particular steps I'm going to take on these patches:
1) Make generic_redo never crash on broken pages.
2) Make the autovacuum launcher mark an index as invalid if the vacuum process crashed on a custom AM index. Since we can't write anything into the postgres cluster when a process has crashed, it seems to me autovacuum should have some separate place to put this information. Then, after a restart, postgres can read it and mark the index as invalid.

Not allowing a CREATE ACCESS METHOD command seems problematic to me. How could it work with pg_upgrade? pg_dump wouldn't dump the extra pg_am records. So, pg_upgrade would break when creating operator classes on the new cluster. So, I agree with dropping the CREATE ACCESS METHOD command only if we let pg_dump dump the extra pg_am records...

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] WIP: Access method extendability
On Tue, Oct 28, 2014 at 8:04 PM, Simon Riggs si...@2ndquadrant.com wrote:

> On 28 October 2014 14:22, Simon Riggs si...@2ndquadrant.com wrote:
>> Or put it another way, it will be easier to write new index AMs because we'll be able to skip the WAL part until we know we want it.
>
> To be clear: I am suggesting you do *less* work, not more. By allowing AMs to avoid writing WAL we get
> * higher performance unlogged indexes
> * fewer bugs in the early days of new AMs
> * writers of new AMs are OK to avoid the majority of hard work and hard testing
> So overall, we get new AMs working faster because we can skip writing the WAL code until we are certain the new AM code is useful and bug free. For example, if GIN had avoided implementing WAL it would have been easier to change the on-disk representation.

The major problem with changing the on-disk representation is that we have to support both binary formats because of pg_upgrade. WAL makes this problem even heavier, because the WAL record redo functions also have to support both formats. However, that is quite independent of WAL. Having access methods as extensions can significantly improve the situation here. Imagine GIN were an extension, and one day we decide to change its binary format. Then we can issue a new extension, GIN2 for instance, and a user can upgrade from GIN to GIN2 in the following steps:

1. CREATE EXTENSION gin2;
2. CREATE INDEX CONCURRENTLY [new_index] USING gin2 ([index_def]);
3. DROP INDEX CONCURRENTLY [old_index];
4. DROP EXTENSION gin;

There is no need to write and debug code which reads both the old and the new binary format. For sure, we would need to support the old GIN extension for some time, but we have to support old in-core versions too. Unfortunately, I didn't mention this in the first post, although I consider it a serious argument for extensible AMs.

Also, I'm not sure that many users would have enough courage to use unlogged AMs. The absence of durability is only half the trouble; the other half is the lack of streaming replication.
I think that if we had only an unlogged GIN, then external indexing engines would be used by the majority of users instead of GIN.

--
With best regards,
Alexander Korotkov.
[HACKERS] WIP: Access method extendability
Hackers,

Postgres was initially designed to support access method extendability, and this extendability lives on to the present day. However, it is mostly internal in-core extendability: one can quite easily add a new access method to the PostgreSQL core. But if one tries to implement an access method as an external module, he will be faced with the following difficulties:

1. One needs to insert directly into pg_am, because there is no CREATE ACCESS METHOD command. And there is no support for dependencies between AMs, opclasses etc.
2. A module can't define xlog records, so a new AM would not be WAL-logged.

The first problem is purely mechanical: nothing prevents us from implementing CREATE ACCESS METHOD and DROP ACCESS METHOD commands and supporting all the required dependencies. The problem of WAL is a bit more complex. According to previous discussions, we don't want to let extensions declare their own xlog records: if we let them, then the recovery process would depend on extensions, which badly violates reliability. The solution is to implement a generic xlog record which is able to represent the difference between blocks in some general manner.

3 patches are attached:
1. CREATE/DROP ACCESS METHOD commands, with support for dependencies.
2. A new generic xlog record type.
3. A Bloom contrib module as an example of the usage of the previous two features. This module was posted a few years ago by Teodor Sigaev; now it's wrapped as an extension and WAL-logged.

The patches are in a WIP state: no documentation and very few comments. However, I believe that it's enough to do a general concept review.

Some notes about the generic xlog format. A generic xlog record represents the difference between pages as a set of operations of two kinds:
1. Move memory inside the page. An optional flag requests zeroing the gap at the previous memory location.
2. Copy a fragment of memory from the xlog record to the page. As an option, bitwise logical operations are supported as well as plain copy.

Generic xlog can open a page in two modes:
1. Create mode: the page is zeroed independent of its LSN.
2.
Update mode: the page is updated only if its LSN is lower than the record's LSN.

Usually, the xlog record is filled in a critical section where memory allocation is prohibited. Thus, the user has to initialize it beforehand, with knowledge of the page count, the total operation count and the total length of data.

--
With best regards,
Alexander Korotkov.

Attachments:
create-am.1.patch.gz (GNU Zip compressed data)
generic-xlog.1.patch.gz (GNU Zip compressed data)
bloom-contrib.1.patch.gz (GNU Zip compressed data)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] KNN-GiST with recheck
On Mon, Sep 29, 2014 at 6:16 AM, Bruce Momjian br...@momjian.us wrote:

> On Fri, Sep 26, 2014 at 10:49:42AM +0400, Alexander Korotkov wrote:
>>> Does this also fix the identical PostGIS problem or is there something PostGIS needs to do?
>> This patch provides general infrastructure for recheck in KNN-GiST. PostGIS needs a corresponding change in its GiST opclass. Since PostGIS already defines the <-> and <#> operators as distance to the bounding box border and to the bounding box center, it can't change their behaviour; it has to support a new exact-distance operator in its opclass.
>
> Ah, OK, so they just need something that can be used for the recheck. I think they currently use ST_Distance() for that. Does it have to be an operator? If they defined an operator for ST_Distance(), would ST_Distance() work too for KNN-GiST?

Currently, ST_Distance is a function, but it's no problem to make it an operator with KNN-GiST support.

> In summary, you still create a normal GiST index on the column:
>
> http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html
>
> CREATE INDEX planet_osm_line_ref_index ON planet_osm_line(ref);
>
> which indexes by the bounding box. The new code will allow ordered index hits to be filtered by something like ST_Distance(), rather than having to do a LIMIT 50 in a CTE, then call ST_Distance(), like this:
>
> EXPLAIN ANALYZE WITH distance AS (
>     SELECT way AS road, ref AS route
>     FROM planet_osm_line
>     WHERE highway = 'secondary'
>     ORDER BY ST_GeomFromText('POLYGON((14239931.42 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08 3053984.84,14239931.42 3054117.72))', 900913) <#> way
>     LIMIT 50
> )
> SELECT ST_Distance(ST_GeomFromText('POLYGON((14239931.42 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08 3053984.84,14239931.42 3054117.72))', 900913), road) AS true_distance, route
> FROM distance
> ORDER BY true_distance
> LIMIT 1;

Yeah. In this query 50 is a purely empirical value; it could be either too low or too high. A too-low value can cause wrong query answers.
A too-high value can cause lower performance. With the patch, a simple KNN query will work like this query with an always-correct value in the LIMIT clause.

> Notice the CTE uses <#> (bounding box center), and then the outer query uses ST_Distance and LIMIT 1 to find the closest item. Excellent!

Thanks. The main question now is the design of this patch. Currently, it does all the work inside the access method. We already have some discussion of the pros and cons of this method, so I would like to clarify the alternatives now. I can see the following way:

1. Implement a new executor node which performs sorting using a priority queue; let's call it "Priority queue". I think it should be a separate node from the Sort node: despite Priority queue and Sort being essentially similar from the user's point of view, they would be completely different in implementation.
2. Implement some interface to transfer distance values from the access method to the Priority queue node.
3. Somehow tell the planner that it could use a Priority queue in the corresponding cases. I see two ways of doing this:
   - Add a flag to an operator in the opclass indicating that the index can only order by a lower bound of "col op value", not by "col op value" itself.
   - Define a new relation between operators: the value of one operator could be a lower bound for the value of another operator. Then the planner can insert a Priority queue node when lower-bound ordering is available from the index. An ALTER OPERATOR command would also be reasonable, so extensions could upgrade.

Besides the overhead, this way makes significant infrastructural changes, so it may be over-engineering. However, it's probably the cleaner and more beautiful solution. I would like to get some feedback from people familiar with KNN-GiST, like Heikki or Tom. What do you think about this? Any other ideas?

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] KNN-GiST with recheck
On Fri, Sep 26, 2014 at 5:18 AM, Bruce Momjian br...@momjian.us wrote:

> On Sun, Sep 14, 2014 at 11:34:26PM +0400, Alexander Korotkov wrote:
>> Cost estimation of GiST is a big problem anyway. It doesn't care (and can't care) about the amount of recheck for regular operators. In this patch, the same would hold for knn recheck. The problem is that touching the heap from the access
>
> This is very important work. While our existing KNN-GiST index code works fine for scalar values and point-to-point distance ordering, it doesn't work well for 2-dimensional objects because they are only indexed by their bounding boxes (a rectangle around the object). The indexed bounding box can't produce accurate distances to other objects. As an example, see this PostGIS blog post showing how to use LIMIT in a CTE to filter results and then compute the closest object (search for LIMIT 50):
>
> http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html
>
> This patch fixes our code for distances from a point to indexed 2-D objects. Does this also fix the identical PostGIS problem or is there something PostGIS needs to do?

This patch provides general infrastructure for recheck in KNN-GiST. PostGIS needs a corresponding change in its GiST opclass. Since PostGIS already defines the <-> and <#> operators as distance to the bounding box border and to the bounding box center, it can't change their behaviour; it has to support a new exact-distance operator in its opclass.

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] KNN-GiST with recheck
On Thu, Sep 25, 2014 at 9:00 PM, Emre Hasegeli e...@hasegeli.com wrote:

>> Fixed, thanks.
>
> Here are my questions and comments about the code.
>
> doc/src/sgml/gist.sgml:812:
>
>     be rechecked from heap tuple before tuple is returned.  If
>     <literal>recheck</literal> flag isn't set then it's true by default for
>     compatibility reasons.  The <literal>recheck</literal> flag can be used only
>
> The recheck flag is set to false in gistget.c, so I think it should say false by default. On the other hand, it is true by default for the consistent function. It is written as "the safest assumption" in the code comments. I don't know why the safest was chosen over the backwards-compatible for the consistent function.

Agreed. It should be clarified in the docs.

> src/backend/access/gist/gistget.c:505:
>
>     /* Recheck distance from heap tuple if needed */
>     if (GISTSearchItemIsHeap(*item) &&
>         searchTreeItemNeedDistanceRecheck(scan, so->curTreeItem))
>     {
>         searchTreeItemDistanceRecheck(scan, so->curTreeItem, item);
>         continue;
>     }
>
> Why is so->curTreeItem passed to these functions? They can use scan->opaque->curTreeItem.

I didn't get the difference. A few lines before:

    GISTScanOpaque so = (GISTScanOpaque) scan->opaque;

> src/backend/access/gist/gistscan.c:49:
>
>     /*
>      * When all distance values are the same, items without recheck
>      * can be immediately returned.  So they are placed first.
>      */
>     if (recheckCmp == 0 && distance_a.recheck != distance_b.recheck)
>         recheckCmp = distance_a.recheck ? 1 : -1;
>
> I don't understand why items without recheck can be immediately returned. Do you think it will work correctly when there is an operator class which will return recheck true and false for the items under the same page?

Yes, I believe so. An item with recheck can't decrease its distance; it can only increase it. In the corner case an item can have the same distance after recheck as it had before. And items whose distances are the same can be returned in any order anyway.
> src/backend/access/index/indexam.c:258:
>
>     /* Prepare data structures for getting original indexed values from heap */
>     scan->indexInfo = BuildIndexInfo(scan->indexRelation);
>     scan->estate = CreateExecutorState();
>     scan->slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));
>
> With the changes in indexam.c, heap access becomes legal for all index access methods. I think it is better than the previous version, but I am leaving the judgement to someone more experienced.
>
> I will try to summarize the pros and cons of sorting the rows in the GiST access method, as far as I understand them.
>
> Pros:
>
> * It does not require another queue. It should be effective to sort the rows inside the queue the GiST access method already has.
> * It does not complicate the index access method infrastructure.
>
> Cons:
>
> * It could be done without additional heap access.
> * Other access methods could make use of the sorting infrastructure one day.
> * It could be more transparent to the users. Sorting information could be shown in the explain output. It would also be nice to show some information about KNN itself.
> * A more suitable data structure like a binary heap could be used for the queue to sort the rows.

A binary heap seems to be a better data structure for the whole of KNN-GiST. But that's a subject for a separate patch: replacing the RB-tree with a heap in KNN-GiST. It's not related to the recheck stuff.

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] KNN-GiST with recheck
On Wed, Sep 17, 2014 at 12:30 PM, Emre Hasegeli e...@hasegeli.com wrote: I managed to break it again by ordering rows only by the second column of the index. Test script attached. I was confused. It is undefined behavior. Sorry for the noise. No problem. Thanks a lot for testing. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Collation-aware comparisons in GIN opclasses
On Mon, Sep 15, 2014 at 11:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Peter Geoghegan p...@heroku.com writes: On Mon, Sep 15, 2014 at 8:28 AM, Alexander Korotkov aekorot...@gmail.com wrote: Rename such opclasses and make them not default. Create new default opclasses with bitwise comparison functions. Write a recommendation to re-create indexes with the default opclasses into the documentation. I certainly think this should be fixed if at all possible, but I'm not sure about this plan. Can we really rename an opclass without consequence, including having that respected across pg_upgrade? No. And we don't know how to change the default opclass without breaking things, either. See previous discussions about how we might fix the totally-broken default gist opclass that btree_gist creates for the inet type [1]. The motivation for getting rid of that is *way* stronger than "it might be slow", but there's no apparent way to make something else be the default without creating havoc. I've read the thread about the gist opclass for the inet type. But that case is more difficult, because the conflict is between a built-in opclass and a contrib opclass. This case seems to be much simpler: we need to change a built-in opclass to a built-in opclass, and a contrib opclass to a contrib opclass. I realized that it's problematic to rename a built-in opclass due to pg_upgrade. However, it still seems possible to create a new opclass and make it the default. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Collation-aware comparisons in GIN opclasses
On Tue, Sep 16, 2014 at 11:29 AM, Emre Hasegeli e...@hasegeli.com wrote: Changing the default opclasses should work if we make pg_dump --binary-upgrade dump the default opclasses with indexes and exclusion constraints. I think it makes sense to do so in --binary-upgrade mode. I can try to come up with a patch for this. Can you explain it in a bit more detail? I didn't get it. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Collation-aware comparisons in GIN opclasses
On Tue, Sep 16, 2014 at 12:14 PM, Emre Hasegeli e...@hasegeli.com wrote: Changing the default opclasses should work if we make pg_dump --binary-upgrade dump the default opclasses with indexes and exclusion constraints. I think it makes sense to do so in --binary-upgrade mode. I can try to come up with a patch for this. Can you explain it in a bit more detail? I didn't get it. pg_upgrade uses pg_dump --binary-upgrade to dump the schema of the old database. Now, it generates CREATE INDEX statements without an explicit opclass if the opclass is the default. We can change pg_dump to generate the statements with the opclass even if the opclass is the default in --binary-upgrade mode. Thanks, I get it. I checked the pg_dump implementation. It appears to be not as easy as it could be. pg_dump doesn't form the index definition by itself; it calls the pg_get_indexdef function. This function has no option to dump the names of default opclasses. Since we can't change the behaviour of old postgres versions, we would have to make pg_dump form the index definition by itself. -- With best regards, Alexander Korotkov.
Re: [HACKERS] PoC: Partial sort
On Sun, Sep 14, 2014 at 7:39 AM, Peter Geoghegan p...@heroku.com wrote: On Fri, Sep 12, 2014 at 2:19 PM, Alexander Korotkov aekorot...@gmail.com wrote: Actually, higher cardinality skip columns are better. Sorting of smaller groups is faster than sorting larger groups of the same size. Also, with smaller groups you reach the limit more accurately (on average), i.e. you sort a smaller number of total rows. Higher cardinality leading key columns are better for what? Do you mean they're better in that those cases are more sympathetic to your patch, or better in the general sense that they'll perform better for the user? The first example query that you chose to demonstrate your patch had a leading, indexed column (column v1) with much lower cardinality/n_distinct than the column that had to be sorted on (column v2). That was what my comments were based on. When this feature is used, there will often be a significantly lower cardinality in the leading, indexed column (as in your example). Otherwise, the user might well have been happy to just order on the indexed/leading column alone. That isn't definitely true, but it's often true. I just meant that higher cardinality makes the partial sort cheaper. We may have misunderstood each other because the notions of "high" and "low" are relative. When the cardinality is 1, partial sort seems to be useless. When the cardinality is low, it could still be cheaper to do a sequential scan + sort rather than an index scan + partial sort. When the cardinality is close to the number of rows, then, as you mentioned, the user probably doesn't need to sort by the rest of the columns. At least one exception is when the user needs to force uniqueness of the order. So, we need to target something in the middle of these two corner cases. I'm not sure if that's worth it to more or less duplicate heap_tuple_attr_equals() to save a mere n expensive comparisons, but it's something to think about (actually, there are probably fewer than even n comparisons in practice because there'll be a limit). Not correct.
Smaller groups are not OK. Imagine that two representations of the same skip column value exist. The index may return them in any order, even alternating between them. In that case sorting on the other columns never takes place, while it should. I think I explained this badly - it wouldn't be okay to make the grouping smaller, but as you say we could tie-break with a proper B-Tree support function 1 comparison to make sure we really had reached the end of our grouping. FWIW I want all bttextcmp()/varstr_cmp() comparisons to try a memcmp() first, so the use of the equality operator with text in mind that you mention may soon not be useful at all. The evidence suggests that memcmp() is so much cheaper than special preparatory NUL-termination + strcoll() that we should always try it first when sorting text, even when we have no good reason to think it will work. The memcmp() is virtually free. And so, you see why it might be worth thinking about this when we already have reasonable confidence that many comparisons will indicate that datums are equal. Other datatypes will have expensive real comparators, not just text or numeric. When strings are not equal, bttextcmp still needs to use strcoll, while texteq doesn't. So there would still be an advantage to using the equality operator over the comparison function: the equality operator doesn't have to order unequal values. However, the real cost of this advantage depends on the cardinality of the presorted columns as well. -- With best regards, Alexander Korotkov.
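Peter's memcmp-before-strcoll idea can be sketched in isolation like this; plain strcmp() stands in for the expensive locale-aware path, and the function name is illustrative:

```c
#include <string.h>

/* Sketch of a collation-aware text comparison that tries a cheap
 * bytewise check first: when the lengths match and memcmp() finds the
 * bytes identical, the strings are equal under any collation, so the
 * expensive NUL-termination + strcoll() path is skipped entirely.
 * strcmp() stands in here for the locale-aware comparison. */
static int
text_cmp_sketch(const char *a, size_t lena, const char *b, size_t lenb)
{
    if (lena == lenb && memcmp(a, b, lena) == 0)
        return 0;               /* bytewise equal: no locale machinery */

    /* Bytewise unequal: fall back to the collation-aware compare. */
    return strcmp(a, b);
}
```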
Re: [HACKERS] PoC: Partial sort
On Sun, Sep 14, 2014 at 9:32 AM, Peter Geoghegan p...@heroku.com wrote: I think we might be better off if a tuplesort function was called shortly after tuplesort_begin_heap() is called. How top-n heap sorts work is something that largely lives in tuplesort's head. Today, we call tuplesort_set_bound() to hint to tuplesort: "by the way, this is a top-n heap sort applicable sort". I think that with this patch, we should then hint (where applicable): "by the way, you won't actually be required to sort those first n indexed attributes; rather, you can expect to scan those in logical order. You may work the rest out yourself, and may be clever about exploiting the sorted-ness of the first few columns". The idea of managing a bunch of tiny sorts from within ExecSort(), and calling the new function tuplesort_reset(), seems questionable. tuplesortstate is supposed to be private/opaque to nodeSort.c, and the current design strains that. I'd like to keep nodeSort.c simple. I think it's pretty clear that the guts of this do not belong within ExecSort(), in any case. Also, the additions there should be much better commented, wherever they finally end up. As I understand it, you propose to encapsulate the partial sort algorithm into tuplesort. However, in this case we still need a significant change to its interface: let tuplesort decide when it's able to return a tuple. Otherwise, we would miss a significant part of the LIMIT clause optimization. tuplesort_set_bound() can't solve all the cases. There could be other planner nodes between the partial sort and the LIMIT. -- With best regards, Alexander Korotkov.
[HACKERS] Triconsistent catalog declaration
Hi! 9.4 GIN introduces the new triconsistent method, which can return one of three values, in contrast to the plain consistent method. But in the catalog declaration triconsistent still returns the bool type. It doesn't affect anything, because nobody calls triconsistent from SQL. But I think it would be correct to declare it as returning int2. -- With best regards, Alexander Korotkov. tri-consistent-catalog.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Collation-aware comparisons in GIN opclasses
Hackers, some GIN opclasses use collation-aware comparisons even though they don't actually need collation-aware comparison. Examples are the text[] and hstore opclasses. Depending on the collation, this may make them much slower. See the example.

# show lc_collate;
 lc_collate
─────────────
 ru_RU.UTF-8
(1 row)

# create table test as (select array_agg(i::text) from generate_series(1,100) i group by (i-1)/10);
SELECT 10
# create index test_idx on test using gin(array_agg);
CREATE INDEX
Time: 26930,423 ms
# create index test_idx2 on test using gin(array_agg collate "C");
CREATE INDEX
Time: 5143,682 ms

Index creation with collation ru_RU.UTF-8 is 5 times slower, while the collation has absolutely no effect on index functionality. However, we can't just replace the comparison function for those opclasses, because that would break binary compatibility for pg_upgrade. I see the following solution: 1. Rename such opclasses and make them not default. 2. Create new default opclasses with bitwise comparison functions. 3. Write a recommendation to re-create indexes with the default opclasses into the documentation. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Collation-aware comparisons in GIN opclasses
On Mon, Sep 15, 2014 at 10:51 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Sep 15, 2014 at 8:28 AM, Alexander Korotkov aekorot...@gmail.com wrote: some GIN opclasses use collation-aware comparisons even though they don't actually need collation-aware comparison. Examples are the text[] and hstore opclasses. Depending on the collation this may make them much slower. I'm glad that I saw how pointless this was with the jsonb GIN default opclass during development. Rename such opclasses and make them not default. Create new default opclasses with bitwise comparison functions. Write a recommendation to re-create indexes with the default opclasses into the documentation. I certainly think this should be fixed if at all possible, but I'm not sure about this plan. Can we really rename an opclass without consequence, including having that respected across pg_upgrade? Just renaming doesn't seem safe. Since pg_upgrade uses pg_dump, all indexes are linked to opclasses by their names. Therefore existing indexes would become linked to the new opclasses. It's likely we would need to execute an SQL script renaming the opclasses before pg_upgrade. Another option is to not rename the old opclasses, and just create new default opclasses with new names. Bruce, what is your opinion about pg_upgrade? Contrib opclasses would be safe to rename using a migration script. -- With best regards, Alexander Korotkov.
Re: [HACKERS] KNN-GiST with recheck
On Sun, Sep 14, 2014 at 10:09 PM, Emre Hasegeli e...@hasegeli.com wrote: I added the point to polygon distance operator patch to the open CommitFest as ready for committer and added myself as reviewer to both of the patches. Thanks. Cost estimation of GiST is a big problem anyway. It doesn't care (and can't) about the amount of recheck for regular operators. In this patch, the same would apply to knn recheck. The problem is that touching the heap from the access method breaks encapsulation. One idea about this is to do the sorting in other nodes. However, I wonder if that would be overengineering and overhead. In the attached patch I propose a different approach: put the code touching the heap into a separate index_get_heap_values function. Also the new version of the patch includes regression tests and some cleanup. While looking at it I found a bug. It returns the second column in the wrong order when both of the distance functions return recheck = true. A test script to run on the regression database is attached. I tried to fix it but could not. The searchTreeItemDistanceRecheck function is not very easy to follow. I think it deserves more comments. Fixed, thanks. It was a logical error in the comparison function implementation. -- With best regards, Alexander Korotkov. knn-gist-recheck-4.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PoC: Partial sort
analyze select x,y,z from b where z=1 order by x,y limit 10;

                                   QUERY PLAN
─────────────────────────────────────────────────────────────────────────────
 Limit  (cost=169374.43..263471.04 rows=10 width=12) (actual time=2237.082..2237.083 rows=10 loops=1)
   ->  Partial sort  (cost=169374.43..338748.34 rows=18 width=12) (actual time=2237.082..2237.083 rows=10 loops=1)
         Sort Key: x, y
         Presorted Key: x
         Sort Method: quicksort  Memory: 25kB
         ->  Index Scan using b_x_idx on b  (cost=0.43..338748.13 rows=18 width=12) (actual time=0.047..2237.062 rows=10 loops=1)
               Filter: (z = 1)
               Rows Removed by Filter: 990
 Planning time: 0.089 ms
 Execution time: 2237.133 ms
(10 rows)

AFAICS, wrong selectivity estimations are a general problem that causes optimizer failures. But in your example with x+y=1: if an expression index on x+y existed, then statistics over x+y would be collected. So, in the case of an expression index, the estimation would be fine. -- With best regards, Alexander Korotkov. partial-sort-basic-2.patch Description: Binary data partial-sort-merge-2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] jsonb contains behaviour weirdness
Hi! Let's consider some examples.

# select '[1,2]'::jsonb @> '[1,2,2]'::jsonb;
 ?column?
----------
 f
(1 row)

One may think it's because the second jsonb array contains two 2s, i.e. that containment takes the count of equal elements into account.

# select '[1,1,2]'::jsonb @> '[1,2,2]'::jsonb;
 ?column?
----------
 t
(1 row)

But it doesn't. Jsonb containment only cares about the length of the array.

# select '[[1,2]]'::jsonb @> '[[1,2,2]]'::jsonb;
 ?column?
----------
 t
(1 row)

Even more weird :) The reason jsonb containment behaves this way is a check at the beginning of jsonb_contains. It makes a fast check of the jsonb type and element count before calling JsonbDeepContains.

if (JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) ||
    JB_ROOT_IS_OBJECT(val) != JB_ROOT_IS_OBJECT(tmpl))
    PG_RETURN_BOOL(false);

It's likely that JB_ROOT_COUNT(val) < JB_ROOT_COUNT(tmpl) should be checked only for objects, not arrays. Also, should JsonbDeepContains do the same fast check when it deals with nested objects? -- With best regards, Alexander Korotkov.
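The suggested fix — applying the element-count quick check only to objects — might look roughly like the following standalone sketch. The struct and function are illustrative stand-ins, not the actual jsonb code:

```c
#include <stdbool.h>

/* Minimal stand-in for a jsonb root: the element (or pair) count plus a
 * flag telling whether the root is an object rather than an array. */
typedef struct
{
    int     count;
    bool    is_object;
} JsonbRootSketch;

/* Quick-reject check for "val contains tmpl".  For objects, a container
 * with fewer pairs than the template can never contain it.  For arrays
 * the count check must be skipped: duplicate elements in the template
 * (as in '[1,2]' vs '[1,2,2]') make the counts incomparable. */
static bool
contains_quick_reject(JsonbRootSketch val, JsonbRootSketch tmpl)
{
    if (val.is_object != tmpl.is_object)
        return true;            /* object vs array can never match */
    if (val.is_object && val.count < tmpl.count)
        return true;            /* too few pairs to contain template */
    return false;               /* defer to the deep containment check */
}
```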
Re: [HACKERS] jsonb contains behaviour weirdness
On Fri, Sep 12, 2014 at 10:38 PM, Peter Geoghegan p...@heroku.com wrote: On Fri, Sep 12, 2014 at 11:21 AM, Josh Berkus j...@agliodbs.com wrote: # select '[1,1,2]'::jsonb @> '[1,2,2]'::jsonb; ?column? -- t (1 row) But it doesn't. Jsonb containment only cares about the length of the array. OK, now, that's messed up. To be clear: I don't think that this example is messed up (in isolation). I think it's the correct behavior. What I find objectionable is the inconsistency. I believe that this is Alexander's concern too. Alexander's first example exhibits broken behavior. Agree. I just tried to explain how the current behaviour could look to a user who sees it for the first time. -- With best regards, Alexander Korotkov.
Re: [HACKERS] PoC: Partial sort
On Sun, Jul 13, 2014 at 6:45 AM, Peter Geoghegan p...@heroku.com wrote: On Mon, Feb 10, 2014 at 10:59 AM, Alexander Korotkov aekorot...@gmail.com wrote: Done. Patch is splitted. I took a quick look at this. Have you thought about making your new cmpSortSkipCols() function not use real comparisons? Since in the circumstances in which this optimization is expected to be effective (e.g. your original example) we can also expect a relatively low cardinality for the first n indexed attributes (again, as in your original example), in general when cmpSortSkipCols() is called there is a high chance that it will return true. If any pair of tuples (logically adjacent tuples fed in to cmpSortSkipCols() by an index scan in logical order) are not fully equal (i.e. their leading, indexed attributes are not equal) then we don't care about the details -- we just know that a new sort grouping is required. Actually, higher cardinality skip columns are better. Sorting of smaller groups is faster than sorting larger groups of the same size. Also, with smaller groups you reach the limit more accurately (on average), i.e. you sort a smaller number of total rows. The idea here is that you can get away with simple binary equality comparisons, as we do when considering HOT-safety. Of course, you might find that two bitwise unequal values are equal according to their ordinary B-Tree support function 1 comparator (e.g. two numerics that differ only in their display scale). AFAICT this should be okay, since that just means that you have smaller sort groupings than strictly necessary. I'm not sure if that's worth it to more or less duplicate heap_tuple_attr_equals() to save a mere n expensive comparisons, but it's something to think about (actually, there are probably fewer than even n comparisons in practice because there'll be a limit). Not correct. Smaller groups are not OK. Imagine that two representations of the same skip column value exist. The index may return them in any order, even alternating between them.
In that case sorting on the other columns never takes place, while it should. But some optimizations are still possible: 1. Use bitwise comparison first, then recheck. But there is no guarantee that acceleration will be achieved. 2. Use an equality check instead of the btree comparison. For the text datatype it would be rather faster because no locale-aware comparison is needed. -- With best regards, Alexander Korotkov.
Re: [HACKERS] gist vacuum gist access
On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 09/07/2014 05:11 PM, Костя Кузнецов wrote: Hello. I recoded vacuum for the gist index. All tests are OK. I also tested vacuum on a table of 2 million rows; all is OK. On my machine the old vacuum takes about 9 seconds, this version about 6-7 seconds. Review please. If I'm reading this correctly, the patch changes gistbulkdelete to scan the index in physical order, while the old code starts from the root and scans the index from left to right, in logical order. Scanning the index in physical order is wrong, if any index pages are split while vacuum runs. A page split could move some tuples to a lower-numbered page, so that the vacuum will not scan those tuples. In the b-tree code, we solved that problem back in 2006, so it can be done but requires a bit more code. In b-tree, we solved it with a vacuum cycle ID number that's set on the page halves when a page is split. That allows VACUUM to identify pages that have been split concurrently when it sees them, and jump back to vacuum them too. See commit http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. It should be possible to do something similar in GiST, and in fact you might be able to reuse the NSN field that's already set on the page halves on split, instead of adding a new vacuum cycle ID. The idea is right. But in fact, does GiST ever recycle any page? It has an F_DELETED flag, but ISTM this flag is never set. So I think it's possible that this patch works correctly. However, GiST probably sometimes leaves a new page unused due to a server crash. Anyway, I'm not a fan of committing the patch in this shape. We need to let GiST recycle pages first, then implement VACUUM similarly to b-tree. -- With best regards, Alexander Korotkov.
Re: [HACKERS] gist vacuum gist access
On Mon, Sep 8, 2014 at 12:08 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 09/07/2014 05:11 PM, Костя Кузнецов wrote: Hello. I recoded vacuum for the gist index. All tests are OK. I also tested vacuum on a table of 2 million rows; all is OK. On my machine the old vacuum takes about 9 seconds, this version about 6-7 seconds. Review please. If I'm reading this correctly, the patch changes gistbulkdelete to scan the index in physical order, while the old code starts from the root and scans the index from left to right, in logical order. Scanning the index in physical order is wrong, if any index pages are split while vacuum runs. A page split could move some tuples to a lower-numbered page, so that the vacuum will not scan those tuples. In the b-tree code, we solved that problem back in 2006, so it can be done but requires a bit more code. In b-tree, we solved it with a vacuum cycle ID number that's set on the page halves when a page is split. That allows VACUUM to identify pages that have been split concurrently when it sees them, and jump back to vacuum them too. See commit http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. It should be possible to do something similar in GiST, and in fact you might be able to reuse the NSN field that's already set on the page halves on split, instead of adding a new vacuum cycle ID. The idea is right. But in fact, does GiST ever recycle any page? It has an F_DELETED flag, but ISTM this flag is never set. So I think it's possible that this patch works correctly. However, GiST probably sometimes leaves a new page unused due to a server crash. Anyway, I'm not a fan of committing the patch in this shape. We need to let GiST recycle pages first, then implement VACUUM similarly to b-tree. Another note.
Assuming we have the NSN, which can play the role of a vacuum cycle ID, can we implement a sequential (with possible jump-back) index scan for GiST? -- With best regards, Alexander Korotkov.
Re: [HACKERS] gist vacuum gist access
On Mon, Sep 8, 2014 at 12:51 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 09/08/2014 11:19 AM, Alexander Korotkov wrote: On Mon, Sep 8, 2014 at 12:08 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: In the b-tree code, we solved that problem back in 2006, so it can be done but requires a bit more code. In b-tree, we solved it with a vacuum cycle ID number that's set on the page halves when a page is split. That allows VACUUM to identify pages that have been split concurrently when it sees them, and jump back to vacuum them too. See commit http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. It should be possible to do something similar in GiST, and in fact you might be able to reuse the NSN field that's already set on the page halves on split, instead of adding a new vacuum cycle ID. ... Another note. Assuming we have the NSN, which can play the role of a vacuum cycle ID, can we implement a sequential (with possible jump-back) index scan for GiST? Yeah, I think it would work. It's pretty straightforward, the page split code already sets the NSN, just when we need it. Vacuum needs to memorize the current NSN when it begins, and whenever it sees a page with a higher NSN (or the FOLLOW_RIGHT flag is set), follow the right-link if it points to a lower-numbered page. I mean that a full index scan feature for SELECT queries might be implemented, as well as the sequential VACUUM. -- With best regards, Alexander Korotkov.
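The scan order Heikki describes — physical order, jumping back along the right-link whenever a page's NSN shows it was split after the scan started — can be modeled with a toy structure. This is purely illustrative, not the actual GiST page layout:

```c
#define NO_LINK (-1)

/* Toy page: its split sequence number (NSN) and a right-link to the
 * page that received tuples during a split (NO_LINK if never split). */
typedef struct
{
    unsigned    nsn;
    int         rightlink;
} ToyPage;

/* Scan pages in physical order; whenever a page's NSN shows it was
 * split after the scan began and its right-link points to an
 * already-passed (lower-numbered) page, jump back and revisit it.
 * Returns the total number of page visits. */
static int
vacuum_scan(const ToyPage *pages, int npages, unsigned start_nsn)
{
    int     visits = 0;

    for (int blkno = 0; blkno < npages; blkno++)
    {
        int     cur = blkno;

        visits++;
        while (pages[cur].nsn > start_nsn &&
               pages[cur].rightlink != NO_LINK &&
               pages[cur].rightlink < blkno)
        {
            cur = pages[cur].rightlink;     /* revisit the moved tuples */
            visits++;
        }
    }
    return visits;
}
```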
Re: [HACKERS] jsonb format is pessimal for toast compression
On Fri, Aug 8, 2014 at 9:14 PM, Teodor Sigaev teo...@sigaev.ru wrote: Curious idea: we could swap the JEntry array and the values: values at the beginning of the type will be caught by pg_lzcompress. But we will need to know the offset of the JEntry array, so the header will grow to 8 bytes (actually, it will be a varlena header!) Maybe I wasn't clear: the jsonb type will start with the string collection instead of the JEntry array, and the JEntry array will be placed at the end of the object/array. So pg_lzcompress will find repeatable 4-byte pieces in the first 1024 bytes of the jsonb. Another idea I have is that storing an offset in each JEntry is not necessary to get the benefit of binary search. Namely, what if we store offsets in every 8th JEntry and lengths in the others? The speed of binary search will be about the same: the only overhead is calculating offsets within an 8-entry chunk. But lengths will probably repeat. -- With best regards, Alexander Korotkov.
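The every-8th-JEntry idea can be sketched as follows: entries at indices divisible by 8 hold an absolute end offset, the rest hold lengths, and any entry's end offset is reconstructed by summing at most seven lengths. The names and the stride are illustrative, not the actual jsonb on-disk format:

```c
#define ENTRY_CHUNK 8

/* Entry array convention for this sketch: at indices divisible by
 * ENTRY_CHUNK an entry stores the absolute end offset of its value;
 * everywhere else it stores just the value's length.  Lengths repeat
 * far more often than absolute offsets, so the array compresses better. */
static int
entry_end_offset(const int *entries, int i)
{
    int     base = i - i % ENTRY_CHUNK; /* nearest stored absolute offset */
    int     off = entries[base];

    /* Add the lengths of the entries between base and i: at most
     * ENTRY_CHUNK - 1 additions, so binary search stays cheap. */
    for (int j = base + 1; j <= i; j++)
        off += entries[j];
    return off;
}
```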
Re: [HACKERS] KNN-GiST with recheck
On Sun, Aug 3, 2014 at 5:48 PM, Emre Hasegeli e...@hasegeli.com wrote: 1. This patch introduces a new polygon <-> point operator. That seems useful on its own, with or without this patch. Yeah, but exact-knn can't come without at least one implementation. But it would be better for it to come in a separate patch. I tried to split them. Separated patches are attached. I changed the order of the arguments to point <-> polygon, because point was the first argument in all the others. Its commutator was required for the index, so I added it in the second patch. I also added tests for the operator. I think it is ready for committer as a separate patch. We can add it to the open CommitFest. I have made some cosmetic changes to the patches. I hope they are useful. I added support for the point <-> circle operator with the same GiST distance function you added for polygon. I can change it, if it is not the right way. Great, thanks! 2. I wonder how useful it really is to allow mixing exact and non-exact return values from the distance function. The distance function included in the patch always returns recheck=true. I have a feeling that all other distance functions will also always return either true or false. For geometrical datatypes, recheck variations in consistent methods are also very rare (I can't remember any). But imagine an opclass for arrays where keys have a different representation depending on the array length. For such an opclass, with knn on similarity, the recheck flag could be useful. I also wonder how useful it is. Your example is convincing, but maybe setting it index-wide will make the decisions on the framework easier. For example, how hard would it be to decide in the planner whether further sorting is required or not? I think that for most use cases, just some operators require further sorting and some of them don't. But it could turn out one day that some index gives part of its knn answers exactly and part of them inexactly. The same happened with recheck of regular operators: initially the recheck flag was defined in the opclass.
But later recheck became a runtime flag. 4. (as you mentioned in the other thread: ) It's a modularity violation that you peek into the heap tuple from gist. I think the proper way to do this would be to extend the IndexScan executor node to perform the re-shuffling of tuples that come from the index in the wrong order, or perhaps add a new node type for it. Of course that's exactly what your partial sort patch does :-). I haven't looked at that in detail, but I don't think the approach the partial sort patch takes will work here as is. In the KNN-GiST case, the index is returning tuples roughly in the right order, but a tuple that it returns might in reality belong somewhere later in the ordering. In the partial sort patch, the input stream of tuples is divided into non-overlapping groups, so that the tuples within the group are not sorted, but the groups are. I think the partial sort case is a special case of the KNN-GiST case, if you consider the lower bound of each tuple to be the leading keys that you don't need to sort. Yes. But, for instance, btree accesses the heap for unique checking. Is it really so criminal? :-) This is not only a question of a new node or extending an existing node. We need to teach the planner/executor that an access method can return the value of some expression which is a lower bound of another expression. AFAICS, now an access method can return only original indexed datums and TIDs. So, I'm afraid that enormous infrastructure changes are required. And I can hardly imagine what they should look like. Unfortunately, I am not experienced enough to judge your implementation. As far as I understand, the problem is partially sorting rows in the index scan node. It can lead the planner to choose non-optimal plans, because it does not take the cost of sorting into account. Cost estimation of GiST is a big problem anyway. It doesn't care (and can't) about the amount of recheck for regular operators. In this patch, the same would apply to knn recheck.
The problem is that touching the heap from the access method breaks encapsulation. One idea about this is to do the sorting in other nodes. However, I wonder if that would be overengineering and overhead. In the attached patch I propose a different approach: put the code touching the heap into a separate index_get_heap_values function. Also the new version of the patch includes regression tests and some cleanup. -- With best regards, Alexander Korotkov. knn-gist-recheck-3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Distance from point to box
Hackers, while reading code written by my GSoC student for knn-spgist, I faced many questions about calculating the distance from a point to a box. 1) dist_pb calls close_pb, which calls on_pb, dist_ps_internal, close_ps and so on. So, it's a very complex way to calculate a very simple value. I see this way has some mathematical beauty, but is it acceptable in practice? 2) computeDistance in knn-gist uses its own quite simplified implementation of this calculation. But why does it have its own implementation instead of simplifying dist_pb? 3) The computeDistance implementation still looks very complex to me. The right (simple and fast) way to calculate the point-to-box distance seems to me to be this:

double dx = 0.0, dy = 0.0;

if (point->x < box->low.x)
    dx = box->low.x - point->x;
if (point->x > box->high.x)
    dx = point->x - box->high.x;
if (point->y < box->low.y)
    dy = box->low.y - point->y;
if (point->y > box->high.y)
    dy = point->y - box->high.y;

return HYPOT(dx, dy);

I feel quite tangled. Could anybody clarify it for me? Did I miss something? Thanks. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Distance from point to box
On Wed, Jul 30, 2014 at 4:06 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:

double dx = 0.0, dy = 0.0;

if (point->x < box->low.x)
    dx = box->low.x - point->x;
if (point->x > box->high.x)
    dx = point->x - box->high.x;
if (point->y < box->low.y)
    dy = box->low.y - point->y;
if (point->y > box->high.y)
    dy = point->y - box->high.y;

return HYPOT(dx, dy);

I feel quite tangled. Could anybody clarify it for me? Did I miss something? Thanks. ISTM that you miss the projection on the segment if dx=0 or dy=0. I don't need to find the projection itself, I need only the distance. When dx = 0, the nearest point is on a horizontal side of the box, so the distance to it is dy. Same when dy = 0. When both of them are 0, the point is inside the box. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Distance from point to box
On Wed, Jul 30, 2014 at 7:26 PM, Fabien COELHO coe...@cri.ensmp.fr wrote: ISTM that you miss the projection on the segment if dx=0 or dy=0. I don't need to find the projection itself, I need only the distance. When dx = 0, the nearest point is on a horizontal side of the box, so the distance to it is dy. Same when dy = 0. When both of them are 0, the point is inside the box. Indeed. I thought that the box sides were not parallel to the axes, but they are. So I do not see why it should be more complex. Maybe there is a general algorithm which works for polygons, and they use it for boxes? Yeah, this answers question #1, but not #2 and #3 :) -- With best regards, Alexander Korotkov.
Re: [HACKERS] 9.5 CF1
On Wed, Jul 2, 2014 at 10:08 AM, Abhijit Menon-Sen a...@2ndquadrant.com wrote: KNN-GiST with recheck Partial sort Some earlier discussion, but no conclusions, and as far as I know nobody is working on these patches at the moment. I'm now working on the next revision of KNN-GiST with recheck. Also, I think partial sort needs a look from somebody more familiar with the planner than me and Marti. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Negative impact of maintenance_work_mem to GIN size
On Tue, May 20, 2014 at 4:50 AM, Alexander Korotkov aekorot...@gmail.com wrote: I found that sometimes larger maintenance_work_mem leads to a larger GIN index. That is quite strange. ISTM that it's related to posting list compression but I can't figure out how exactly. It appears not to be related to posting list compression. I did reproduce it on 9.2.

create table test as (select array[(random()*1000)::int]::int[] as v from generate_series(1,1000000) g);
set maintenance_work_mem = '2GB';
create index test_idx1 on test using gin(v);
set maintenance_work_mem = '16MB';
create index test_idx2 on test using gin(v);

 Schema |   Name    | Type  | Owner  | Table |  Size  | Description
--------+-----------+-------+--------+-------+--------+-------------
 public | test_idx1 | index | smagen | test  | 392 MB |
 public | test_idx2 | index | smagen | test  | 268 MB |
(2 rows)

The reason is that we fill the entry tree by plain insertion without any special optimization. Large maintenance_work_mem gives us ascending order of insertion. Thus insertion is always performed into the rightmost page, leaving the rest of the pages half-empty. Small maintenance_work_mem gives us a more random order of insertion. Such insertions make pages about 75% full on average. Posting trees have a special optimization for this case while the entry tree doesn't. -- With best regards, Alexander Korotkov.
[HACKERS] Negative impact of maintenance_work_mem to GIN size
Hi! I found that sometimes larger maintenance_work_mem leads to a larger GIN index. That is quite strange. ISTM that it's related to posting list compression but I can't figure out how exactly. See the example on the delicious bookmarks dataset. http://mira.sai.msu.ru/~megera/tmp/js.copy.gz

set maintenance_work_mem = '1GB';
create index js_idx1 on js using gin(v jsonb_path_ops);
set maintenance_work_mem = '16MB';
create index js_idx2 on js using gin(v jsonb_path_ops);

                     List of relations
 Schema |  Name   | Type  | Owner  | Table |  Size  | Description
--------+---------+-------+--------+-------+--------+-------------
 public | js_idx1 | index | smagen | js    | 432 MB |
 public | js_idx2 | index | smagen | js    | 309 MB |
(2 rows)

-- With best regards, Alexander Korotkov.
[HACKERS] Lossy bitmap scan is broken in GIN
Hackers, GIN hangs on lossy bitmap scan. Here is a test case:

create extension btree_gin;
create table test as (select random() v from generate_series(1,1000000));
create index test_idx on test using gin(v);
set work_mem = '16MB';
select count(*) from test where v > 0.9;
 count
───────
 99916
(1 row)

Time: 63,142 ms

set work_mem = '64kB';
select count(*) from test where v > 0.9;

The last query hangs. I've debugged it. It's another bug in this cursed loop in entryGetItem. A fix is attached. -- With best regards, Alexander Korotkov. gin_lossy_bitmap_fix.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Clean up jsonb code.
On Thu, May 8, 2014 at 11:45 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 05/08/2014 02:25 AM, Peter Geoghegan wrote: findJsonbValueFromSuperHeader()'s lowbound argument previously served to establish a low bound for searching when searching for multiple keys (so the second and subsequent user-supplied key could skip much of the object). Got that. In the case of jsonb_exists_any(), say, if you only have a reasonable expectation that about 1 key exists, and that happens to be the last key that the user passed to the text[] argument (to the existence/? operator), then n - 1 calls to what is now findJsonbValueFromContainer() (which now does not accept a lowbound) are wasted. Check. That's elem_count - 1 top-level binary searches of the entire jsonb. Or elem_count such calls rather than 1 call (plus 1 sort of the supplied array) in the common case where jsonb_exists_any() will return false. Ok, but I don't see any big difference in that regard. It still called findJsonbValueFromContainer() elem_count times, before this commit. Each call was somewhat cheaper, because the lower bound of the binary search was initialized to where the previous search ended, but you still had to perform the search. Granted, that might not be that bad right now, given that it's only ever (say) elem_count or elem_count - 1 wasted binary searches through the *top* level, but that might not always be true. If we are ever to perform a deep search, I think we'll want to do much more to optimize that than just keep track of the lower bound. Like, start an iterator of tree and check for all of the keys in one go. And even today, sorting a presumably much smaller user-passed lookup array once has to be cheaper than searching through the entire jsonb perhaps elem_count times per call. Well, the quick testing I did suggested otherwise. It's not a big difference, but sorting has all kinds of overhead, like pallocing the array to sort, copying the elements around etc. 
For a small array, the startup cost of sorting trumps the savings in the binary searches. Possibly the way the sorting was done was not optimal, and you could reduce the copying and allocations involved in that, but it's hardly worth the trouble. With current head I can't load the delicious dataset into jsonb format. I get a segfault. It looks like memory corruption.

$ gzip -c -d js.copy.gz | psql postgres -c 'copy js from stdin;'
WARNING: problem in alloc set ExprContext: bogus aset link in block 0x14a846000, chunk 0x14a84d278
CONTEXT: COPY js, line 246766
WARNING: problem in alloc set ExprContext: bad single-chunk 0x14a804b18 in block 0x14a7e6000
CONTEXT: COPY js, line 1009820
WARNING: problem in alloc set ExprContext: bogus aset link in block 0x14a7e6000, chunk 0x14a804b18
CONTEXT: COPY js, line 1009820
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.

You can get the dataset here: http://mira.sai.msu.ru/~megera/tmp/js.copy.gz -- With best regards, Alexander Korotkov.
Re: [HACKERS] Sequential disk access during VACUUM for GiST/GIN
Hi! On Tue, Apr 29, 2014 at 2:34 PM, Костя Кузнецов chapae...@yandex.ru wrote: There is a task Sequential disk access during VACUUM for GiST/GIN in the GSOC14 list. Is nobody working on this task? I didn't hear that anybody is working on it. Do I understand this task correctly? I must recode gistbulkdelete. GistBDItem *stack must have items with sequential blkno as far as possible. Yes, make gistbulkdelete and ginbulkdelete access the disk sequentially, while currently the tree is traversed in logical order. So these functions need to be completely reworked: I'm not sure GistBDItem will survive :) The challenge is concurrency. Vacuum shouldn't block concurrent readers and writers. You can look at btbulkdelete, which supports sequential disk access now. I have a question: where is the access to disk in this function? ReadBufferExtended? Yes, this function reads a buffer into shared memory (if it isn't there already) and pins it. -- With best regards, Alexander Korotkov.
Re: [HACKERS] 9.4 Proposal: Initdb creates a single table
On Wed, Apr 23, 2014 at 10:11 AM, Simon Riggs si...@2ndquadrant.com wrote: We start with a database called Postgres and a schema called Public. Yet we don't start up with any usable tables. I propose we add a single table called Postgres when we Initdb CREATE TABLE Postgres (Id Integer, Data Jsonb); COMMENT ON TABLE Postgres IS 'Single table for quick start usage - design your database'; The purpose of this is to make the database immediately usable. By including this table in the default initdb it will mean that programs can rely on the existence of this table and begin working quicker. By now, some of you will be doubled over laughing as if this is an April fool joke. I don't mean it to be at all. I can propose a contrib PostgreNoSQL providing the following: 1) Table postgres as you proposed. 2) Functions: get_postgres(id integer) returns jsonb, set_postgres(id integer, data jsonb) returns void, search_postgres(query jsonb) returns setof postgres. search_postgres will have the semantics of the @> jsonb operator 3) Background workers which provide an HTTP wrapper over those functions. -- With best regards, Alexander Korotkov.
[HACKERS] Partial match fix for fast scan
Hi, GIN partial match appears to be broken after fast scan. The following simple test case raises an assertion failure.

create extension btree_gin;
create table test as (select id, random() as val from generate_series(1,100) id);
create index test_idx on test using gin (val);
vacuum test;
select * from test where val between 0.1 and 0.9;

The attached patch fixes bugs in the entryGetItem function. I would especially point out that continue; checks the while condition even when it's a postfix while (do/while loop). That's why I surrounded tbm_iterate with another while. -- With best regards, Alexander Korotkov. ginfastscanfix.patch Description: Binary data
Re: [HACKERS] Partial match fix for fast scan
On Thu, Apr 10, 2014 at 8:22 PM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote: On Thu, Apr 10, 2014 at 11:09 AM, Alexander Korotkov aekorot...@gmail.com wrote: Hi, GIN partial match appears to be broken after fast scan. Following simple test case raises assertion failure. create extension btree_gin; create table test as (select id, random() as val from generate_series(1,100) id); create index test_idx on test using gin (val); vacuum test; select * from test where val between 0.1 and 0.9; Attached patch fixes bugs in entryGetItem function. I would especially point that continue; checks while condition even if it's postfix while. That's why I surrounded tbm_iterate with another while. Interesting... to me (using current master) your test case doesn't fail... fabrizio=# select * from test where val between 0.1 and 0.9; id |val +--- 1 | 0.554413774050772 2 | 0.767866868525743 3 | 0.601187175605446 ... But fail if I change the values of between clause: fabrizio=# select * from test where val between 0.1 and 0.19; ERROR: tuple offset out of range: 8080 It must be compiled with --enable-cassert to fail on assertion. -- With best regards, Alexander Korotkov.
Re: default opclass for jsonb (was Re: [HACKERS] Call for GIST/GIN/SP-GIST opclass documentation)
On Wed, Apr 9, 2014 at 10:37 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The ship has clearly sailed to add parameterized opclasses to 9.4, but let's keep it in mind when we decide on the defaults. In the absence of parameterizable opclasses, it would be much more flexible to have opclasses that index keys, values, key-value pairs and paths separately, instead of the current json_ops and json_hash_ops opclasses which index all of those in the same index. That way, if you, e.g., only ever query on the existence of a key, you'd only need to index the keys. I don't understand how we ended up with the current dichotomy of json_ops and json_hash_ops... +1 for parameterizable opclasses -- With best regards, Alexander Korotkov.
Re: default opclass for jsonb (was Re: [HACKERS] Call for GIST/GIN/SP-GIST opclass documentation)
On Wed, Apr 9, 2014 at 12:40 PM, Peter Geoghegan p...@heroku.com wrote: On Wed, Apr 9, 2014 at 1:21 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I didn't say that. On the contrary, I think the shotgun approach jsonb_ops and jsonb_hash_ops take is too broad. It should be possible to specify what to index in a more detailed fashion. It is - use an expression index. That's by far the most important way to specify what to index in a more detailed fashion. There are others, but that's the major one. Beyond that, yes, it's necessary to carefully write your query predicate a certain way. However, a similar situation exists in MongoDB, where there is a distinction between indexes on embedded fields (which must be accessed using special dot notation) and indexes on subdocuments (which cannot be accessed using dot notation). It's late here, but I'm pretty sure that's a feature and not a limitation. I believe that a serious limitation we have now is that we actually specify the kind of index to be used in the SQL query. For example, you need to find objects with active = true. You can write: js @> {active: true} and then a GIN index on js can be used. Also you can write: js->'active' = true and then a btree expression index on (js->'active') can be used. For sure, one can write js @> {active: true} AND js->'active' = true This query can use either of the indexes, but it is: 1) A kludge 2) An excess recheck 3) If both indexes are present, an excess BitmapAnd. Having to choose the index in the SQL query makes our SQL more imperative and less declarative. Similar things can happen without json/hstore (a user has to rewrite SQL in order to use an expression index), but now it could become very common. My opinion is that we have to do something in the planner to make it understand at least these two kinds of queries to be equivalent. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GSoC 2014 proposal
On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The BIRCH algorithm as described in the paper describes building a tree in memory. If I understood correctly, you're suggesting to use a pre-built GiST index instead. Interesting idea! There are a couple of significant differences between the CF tree described in the paper and GiST: 1. In GiST, a leaf item always represents one heap tuple. In the CF tree, a leaf item represents a cluster, which consists of one or more tuples. So the CF tree doesn't store an entry for every input tuple, which makes it possible to keep it in memory. 2. In the CF tree, all entries in a leaf node must satisfy a threshold requirement, with respect to a threshold value T: the diameter (or radius) has to be less than T. GiST imposes no such restrictions. An item can legally be placed anywhere in the tree; placing it badly will just lead to degraded search performance, but it's still a legal GiST tree. 3. A GiST index, like any other index in PostgreSQL, holds entries also for deleted tuples, until the index is vacuumed. So you cannot just use information from a non-leaf node and use it in the result, as the information summarized at a non-leaf level includes noise from the dead tuples. Can you elaborate how you are planning to use a GiST index to implement BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more strict in where in the tree an item can be stored, and lets the operator class specify exactly when a node is split etc. Hmmm, it's likely I've imagined something quite outside of this paper, and even already suggested it to Ivan... :) I need a little time to rethink it. Using GiST we can implement BIRCH-like clustering like so: 1) Build a CF tree as a GiST index without the restriction of the T threshold value. 2) Scan the CF tree with threshold T using some auxiliary operator.
If the consistent method sees a CF entry whose diameter is greater than T then it returns true. Otherwise it returns false and puts this CF entry into the output area (which could be either in-memory or a temporary table). 3) Process the other steps of the algorithm as usual. This modification would have the following advantages: 1) A user can build the GiST index once and then try clustering with different parameters. The initial GiST index build would be the slowest operation, while the other steps are expected to be fast. 2) We use the GiST infrastructure and automatically get buffering build. The drawback is that building a GiST index is more expensive than building an in-memory CF tree with a given threshold T (assuming T is well chosen). Does it make any sense? -- With best regards, Alexander Korotkov.
Re: [HACKERS] GSoC 2014 proposal
On Thu, Apr 3, 2014 at 11:21 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 04/03/2014 04:15 PM, Alexander Korotkov wrote: On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The BIRCH algorithm as described in the paper describes building a tree in memory. If I understood correctly, you're suggesting to use a pre-built GiST index instead. Interesting idea! There are a couple of significant differences between the CF tree described in the paper and GiST: 1. In GiST, a leaf item always represents one heap tuple. In the CF tree, a leaf item represents a cluster, which consists of one or more tuples. So the CF tree doesn't store an entry for every input tuple, which makes it possible to keep it in memory. 2. In the CF tree, all entries in a leaf node must satisfy a threshold requirement, with respect to a threshold value T: the diameter (or radius) has to be less than T. GiST imposes no such restrictions. An item can legally be placed anywhere in the tree; placing it badly will just lead to degraded search performance, but it's still a legal GiST tree. 3. A GiST index, like any other index in PostgreSQL, holds entries also for deleted tuples, until the index is vacuumed. So you cannot just use information from a non-leaf node and use it in the result, as the information summarized at a non-leaf level includes noise from the dead tuples. Can you elaborate how you are planning to use a GiST index to implement BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more strict in where in the tree an item can be stored, and lets the operator class specify exactly when a node is split etc. Hmmm, it's likely I've imagined something quite outside of this paper, and even already suggested it to Ivan... :) I need a little time to rethink it.
Using GiST we can implement BIRCH-like clustering like so: 1) Build a CF tree as a GiST index without the restriction of the T threshold value. 2) Scan the CF tree with threshold T using some auxiliary operator. If the consistent method sees a CF entry whose diameter is greater than T then it returns true. Otherwise it returns false and puts this CF entry into the output area (which could be either in-memory or a temporary table). 3) Process the other steps of the algorithm as usual. I still don't understand how that would work. You can't trust the non-leaf entries, because their CF value can contain deleted entries. So you have to scan every tuple anyway. Once you do that, what's the point of the index? Yeah, it is a limitation of this idea. It's not going to be an auto-updatable CF tree. A user can build the index on a freshly vacuumed table and play with clustering for some time. Updates on the table during that time would be either an accepted clustering error or prohibited. Another potential solution is to let this index hold some snapshot of the data. But it seems not possible to do in an extension now. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GSoC 2014 proposal
On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The BIRCH algorithm as described in the paper describes building a tree in memory. If I understood correctly, you're suggesting to use a pre-built GiST index instead. Interesting idea! There are a couple of significant differences between the CF tree described in the paper and GiST: 1. In GiST, a leaf item always represents one heap tuple. In the CF tree, a leaf item represents a cluster, which consists of one or more tuples. So the CF tree doesn't store an entry for every input tuple, which makes it possible to keep it in memory. 2. In the CF tree, all entries in a leaf node must satisfy a threshold requirement, with respect to a threshold value T: the diameter (or radius) has to be less than T. GiST imposes no such restrictions. An item can legally be placed anywhere in the tree; placing it badly will just lead to degraded search performance, but it's still a legal GiST tree. 3. A GiST index, like any other index in PostgreSQL, holds entries also for deleted tuples, until the index is vacuumed. So you cannot just use information from a non-leaf node and use it in the result, as the information summarized at a non-leaf level includes noise from the dead tuples. Can you elaborate how you are planning to use a GiST index to implement BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more strict in where in the tree an item can be stored, and lets the operator class specify exactly when a node is split etc. Hmmm, it's likely I've imagined something quite outside of this paper, and even already suggested it to Ivan... :) I need a little time to rethink it. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Cube extension kNN support
On Mon, Mar 31, 2014 at 10:01 PM, Sergey Konoplev gray...@gmail.com wrote: On Thu, Mar 27, 2014 at 3:26 PM, Sergey Konoplev gray...@gmail.com wrote: On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich stas.kelv...@gmail.com wrote: Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances. What is the status of this patch? Referring to our private conversation with Alexander Korotkov, the patch is in WIP state currently, and, hopefully, will be ready by 9.5. I'm ready to actively participate in its testing on a real world production set of data. I'm not sure if it is doable at all, but is there any possibility to implement here, or, what would be just great, any ready/half-ready solutions of a Hamming distance based kNN search? Cube deals with float8 numbers. There is another patch making it work with other number types. But Hamming distance is for bit vectors, isn't it? With best regards, Alexander Korotkov.
[HACKERS] Proposal: fix range queries in btree_gin
Hackers, after reading Heikki Linnakangas' presentation about GIN from Nordic PGDay, I figured out that btree_gin became much more attractive. http://hlinnaka.iki.fi/presentations/NordicPGDay2014-GIN.pdf But it has one weird behaviour: when you have a range restriction like col > const1 AND col < const2, it's handled as an intersection of two separate partial matches col > const1 and col < const2. So, it's awfully slow :( The opclass can't handle it better because it only deals with col > const1 and col < const2 separately. These two restrictions are separately passed to gin_extract_query. This problem is known, but now I can propose a solution. We have range types, and a restriction col <@ range can be correctly handled by gin_extract_query, because it is passed there as a single restriction. This is a workaround itself, but it's weird to force users to express queries like this. We can add some logic to GIN. If it sees: 1) The query contains both col > const1 and col < const2 restrictions. 2) There is a range type for this type and comparison operator. 3) The opclass supports col <@ range then rewrite these two restrictions as col <@ range(const1, const2) -- With best regards, Alexander Korotkov.
Re: [HACKERS] gsoc knn spgist
On Tue, Mar 25, 2014 at 8:16 PM, Костя Кузнецов chapae...@yandex.ru wrote: Hello. I submit a proposal. But Heikki Linnakangas wrote comments that I don't have a plan of implementation. My project is knn for spgist. Can I ask you a few questions? 1. I researched the committed gist knn implementation. In the gist implementation, in the role of queue is used rtree (with distance comparator); in the spgist implementation this is a List. Can I use rtree in spgist? KNN-GiST uses an RB-tree for the queue. An RB-tree is very different from an R-tree. And yes, it can be used in SP-GiST. However, an alternative is a heap (an in-memory structure, not the table heap). I don't know why GiST doesn't use a heap instead of an RB-tree. With best regards, Alexander Korotkov.
Re: [HACKERS] jsonb and nested hstore
I've noticed two commits on github.

commit b8199ee3c2506ab81b47a0b440363fc90c0d6956
Author: Peter Geoghegan p...@heroku.com
Date: Wed Mar 19 02:02:16 2014 -0700

    For jsonb_hash_ops, hash less

    By limiting the GIN entries to the least-nested level, the delicious.com sample JSON dataset index shrinks in size from 382MB to 255MB without any apparent downside.

commit 2cea5213dba011625fc0d5c6b447e838080087b1
Author: Peter Geoghegan p...@heroku.com
Date: Wed Mar 19 02:13:42 2014 -0700

    Revert "For jsonb_hash_ops, hash less"

    This might be workable with another approach, but leave it for now. This reverts commit b8199ee3c2506ab81b47a0b440363fc90c0d6956.

Besides the implementation, what was the idea here? To me, it's impossible to skip any single element, because it's possible for a query to include only this element. If we skip that element, we can't answer the corresponding query any more. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GSoC proposal. Index-only scans for GIST
On Tue, Mar 18, 2014 at 5:12 PM, Anastasia Lubennikova lubennikov...@gmail.com wrote: 2. gistget algorithm (several parts omitted): Check the key: gistindex_keytest() - does this index tuple satisfy the scan key(s)? Scan all items on the GiST index page and insert them into the queue (or directly to output areas): plain scan - heap tuple TIDs are returned into so->pageData[]; ordered scan - heap tuple TIDs are pushed into individual search queue items. If the fetch() is specified by the developer, then using it, the algorithm can retrieve the data directly to output areas at this stage, without reference to the heap. I think the following changes need to be made to the GiST code: 1) When the next consistent IndexTuple is found, extract the original Datums using the fetch method. 2) Put those original Datums into the queue. 3) When returning the next ItemPointer from the queue, put the original Datums into IndexScanDesc (into xs_itup). 3. Create function gistcanreturn to check whether fetch() is specified by the user. Amcanreturn - function to check whether the index supports index-only scans, or zero if none. There is the question of supporting index-only scans for multicolumn indexes. Probably it would require extending the interface to add separate column checking. To solve this question I need the community's help. 4. Add details for index-only scans into the gistcostestimate function. Also, another part of the work to be mentioned is implementing the fetch function for all suitable opclasses. With best regards, Alexander Korotkov.
Re: [HACKERS] GSoC proposal. Index-only scans for GIST
Josh, Anastasia has already consulted me in person. It is not a big proposal. But for a newbie who is not familiar with the PostgreSQL code base, and especially GiST, it seems fair enough. With best regards, Alexander Korotkov. On Tue, Mar 18, 2014 at 9:16 PM, Josh Berkus j...@agliodbs.com wrote: Alexander, Can you comment on the below proposal? I'd like your opinion on how difficult it will be. On 03/18/2014 06:12 AM, Anastasia Lubennikova wrote: Hello! Here is the text of my proposal which I've applied to GSoC. (and link http://www.google-melange.com/gsoc/proposal/public/google/gsoc2014/lubennikovaav/5629499534213120 ) Any suggestions and comments are welcome. *Project name* Support for index-only scans for GIST index *Brief review* Currently GiST indexes don't have index-only scan functionality. Index-only scans are a major performance feature added to Postgres 9.2. They allow certain types of queries to be satisfied just by retrieving data from indexes, and not from tables. This feature for B-trees (implemented since PostgreSQL 9.2) allows doing certain types of queries significantly faster. This is achieved by a reduction in the amount of I/O necessary to satisfy queries. I think it will be useful to implement this feature for user-defined data types that use a GiST index. *Benefits to the PostgreSQL Community* Faster GiST index search would be relevant for many PostgreSQL applications (for example some GIS systems). *Quantifiable results* Acceleration of GiST index search. *Project details* 1. The GiST is a balanced tree structure like a B-tree, containing key, pointer pairs. A GiST key is a member of a user-defined class, and represents some property that is true of all data items reachable from the pointer associated with the key. The GiST provides a possibility to create custom data types with indexed access methods and an extensible set of queries. There are seven methods that an index operator class for GiST must provide, and an eighth that is optional.
-consistent -union -compress -decompress -penalty -picksplit -equal -distance (optional) I'm going to create a new optional method fetch() in addition to the existing ones. Thus the user can create a method of retrieving data from the index without loss. It will be used when performing search queries to speed up data retrieval. 2. gistget algorithm (several parts omitted): Check the key: gistindex_keytest() - does this index tuple satisfy the scan key(s)? Scan all items on the GiST index page and insert them into the queue (or directly to output areas): plain scan - heap tuple TIDs are returned into so->pageData[]; ordered scan - heap tuple TIDs are pushed into individual search queue items. If the fetch() is specified by the developer, then using it, the algorithm can retrieve the data directly to output areas at this stage, without reference to the heap. 3. Create function gistcanreturn to check whether fetch() is specified by the user. Amcanreturn - function to check whether the index supports index-only scans, or zero if none. There is the question of supporting index-only scans for multicolumn indexes. Probably it would require extending the interface to add separate column checking. To solve this question I need the community's help. 4. Add details for index-only scans into the gistcostestimate function *Links* 1) Hellerstein J. M., Naughton J. F., Pfeffer A. Generalized search trees for database systems. - September, 1995. 2) http://www.sai.msu.su/~megera/postgres/gist/ 3) PostgreSQL 9.3.3 Documentation: chapters 11. Indexes, 55. GiST Indexes. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 03/15/2014 08:40 PM, Fujii Masao wrote: Hi, I executed the following statements in HEAD and 9.3, and compared the size of WAL which were generated by data insertion in GIN index. - CREATE EXTENSION pg_trgm; CREATE TABLE hoge (col1 text); CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH (FASTUPDATE = off); CHECKPOINT; SELECT pg_switch_xlog(); SELECT pg_switch_xlog(); SELECT pg_current_xlog_location(); INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 100); SELECT pg_current_xlog_location(); - The results of WAL size are 960 MB (9.3) 2113 MB (HEAD) The WAL size in HEAD was more than two times bigger than that in 9.3. Recently the source code of GIN index has been changed dramatically. Is the increase in GIN-related WAL intentional or a bug? It was somewhat expected. Updating individual items on the new-format GIN pages requires decompressing and recompressing the page, and the recompressed posting lists need to be WAL-logged. Which generates much larger WAL records. That said, I didn't expect the difference to be quite that big when you're appending to the end of the table. When the new entries go to the end of the posting lists, you only need to recompress and WAL-log the last posting list, which is max 256 bytes long. But I guess that's still a lot more WAL than in the old format. That could be optimized, but I figured we can live with it, thanks to the fastupdate feature. Fastupdate allows amortizing that cost over several insertions. But of course, you explicitly disabled that... Let me know if you want me to write a patch addressing this issue. -- With best regards, Alexander Korotkov.
Re: [HACKERS] jsonb and nested hstore
On Thu, Mar 13, 2014 at 1:21 PM, Greg Stark st...@mit.edu wrote: Well these are just normal gin and gist indexes. If we want to come up with new index operator classes we can still do that and keep the old ones if necessary. Even that seems pretty unlikely from past experience. I'm actually pretty sanguine even about keeping the GIST opclass. If it has bugs then the bugs only affect people who use this non-default opclass and we can fix them. It doesn't risk questioning any basic design choices in the patch. It does sound like the main question here is which opclass should be the default. From the discussion there's a jsonb_hash_ops which works on all input values but supports fewer operators and a jsonb_ops which supports more operators but can't handle json with larger individual elements. Perhaps it's better to make jsonb_hash_ops the default so at least it's always safe to create a default gin index? A couple of thoughts from me: 1) We can evade the length limitation of the GIN index by truncating long values and setting the recheck flag. We can introduce some indicator of a truncated value, like a zero byte at the end. 2) jsonb_hash_ops can be extended to handle key queries too. We can reserve one bit in the hash as a flag indicating whether it's a hash of a key or a hash of a path to a value. For sure, such an index would be a bit larger. Also, jsonb_hash_ops can be split into two: with and without keys. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part2: fast scan
On Thu, Mar 13, 2014 at 8:58 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 03/12/2014 07:52 PM, Alexander Korotkov wrote: * I just noticed that the dummy trueTriConsistentFn returns GIN_MAYBE, rather than GIN_TRUE. The equivalent boolean version returns 'true' without recheck. Is that a typo, or was there some reason for the discrepancy? Actually, there is no difference in the current implementation, but I implemented it so that trueTriConsistentFn can correctly work with shimBoolConsistentFn. In that case it should return GIN_MAYBE when there is a GIN_MAYBE in the input (as an analogue of setting the recheck flag). So, it could return GIN_TRUE only after checking that the input contains no GIN_MAYBE. However, that checking would be just a waste of cycles, so I ended up with just GIN_MAYBE :-) I don't understand that. As it is, it's inconsistent with the boolean trueConsistent function. trueConsistent always returns TRUE with recheck=false. And in GIN_SEARCH_MODE_EVERYTHING mode, there are no regular scan keys. Ok, I see. I just messed it up. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part2: fast scan
On Wed, Mar 12, 2014 at 8:29 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 03/12/2014 12:09 AM, Tomas Vondra wrote: Hi all, a quick question that just occurred to me - do you plan to tweak the cost estimation for GIN indexes in this patch? IMHO it would be appropriate, given the improvements and gains, but it seems to me gincostestimate() was not touched by this patch. Good point. We have done two major changes to GIN in this release cycle: changed the data page format and made it possible to skip items without fetching all the keys (fast scan). gincostestimate doesn't know about either change. Adjusting gincostestimate for the more compact data page format seems easy. When I hacked on that, I assumed all along that gincostestimate doesn't need to be changed as the index will just be smaller, which will be taken into account automatically. But now that I look at gincostestimate, it assumes that the size of one item on a posting tree page is a constant 6 bytes (SizeOfIptrData), which is no longer true. I'll go fix that. Adjusting for the effects of skipping is harder. gincostestimate needs to do the same preparation steps as startScanKey: sort the query keys by frequency, and call the consistent function to split the keys into required and additional sets. And then model that the additional entries only need to be fetched when the other keys match. That's doable in principle, but requires a bunch of extra code. Alexander, any thoughts on that? It's getting awfully late to add new code for that, but it sure would be nice to somehow take fast scan into account. The preparation we do in startScanKey requires knowledge of the estimated sizes of the posting lists/trees. We obtain that estimate by descending to the leaf pages. I think gincostestimate is expected to be much cheaper, so we probably need some rougher estimate there, don't we? -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part2: fast scan
On Wed, Mar 12, 2014 at 8:02 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/26/2014 11:25 PM, Alexander Korotkov wrote: On Thu, Feb 27, 2014 at 1:07 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Thu, Feb 20, 2014 at 1:48 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/09/2014 12:11 PM, Alexander Korotkov wrote: I've rebased catalog changes with last master. Patch is attached. I've rerun my test suite with both last master ('committed') and attached patch ('ternary-consistent'). Thanks! method | sum +-- committed | 143491.71501 fast-scan-11 | 126916.11199 fast-scan-light| 137321.211 fast-scan-light-heikki | 138168.02801 master | 446976.288 ternary-consistent | 125923.514 I explain regression in last master by change of MAX_MAYBE_ENTRIES from 8 to 4. Yeah, probably. I set MAX_MAYBE_ENTRIES to 8 in initial versions to make sure we get similar behavior in Tomas' tests that used 6 search terms. But I always felt that it was too large for real queries, once we have the catalog changes, that's why I lowered to 4 when committing. If an opclass benefits greatly from fast scan, it should provide the ternary consistent function, and not rely on the shim implementation. I'm not sure about decision to reserve separate procedure number for ternary consistent. Probably, it would be better to add ginConfig method. It would be useful for post 9.4 improvements. Hmm, it might be useful for an opclass to provide both, a boolean and ternary consistent function, if the boolean version is significantly more efficient when all the arguments are TRUE/FALSE. OTOH, you could also do a quick check through the array to see if there are any MAYBE arguments, within the consistent function. But I'm inclined to keep the possibility to provide both versions. As long as we support the boolean version at all, there's not much difference in terms of the amount of code to support having them both for the same opclass. 
A ginConfig could be useful for many other things, but I don't think it's worth adding it now. What's the difference between returning GIN_MAYBE and GIN_TRUE+recheck? We discussed that earlier, but didn't reach any conclusion. That needs to be clarified in the docs. One possibility is to document that they're equivalent. Another is to forbid one of them. Yet another is to assign a different meaning to each. I've been thinking that it might be useful to define them so that a MAYBE result from the tri-consistent function means that it cannot decide if you have a match or not, because some of the inputs were MAYBE. And TRUE+recheck means that even if all the MAYBE inputs were passed as TRUE or FALSE, the result would be the same, TRUE+recheck. The practical difference would be that if the tri-consistent function returns TRUE+recheck, ginget.c wouldn't need to bother fetching the other entries, it could just return the entry with recheck=true immediately. While with MAYBE result, it would fetch the other entries and call tri-consistent again. ginget.c doesn't currently use the tri-consistent function that way - it always fetches all the entries for a potential match before calling tri-consistent, but it could. I had it do that in some of the patch versions, but Tomas' testing showed that it was a big loss on some queries, because the consistent function was called much more often. Still, something like that might be sensible in the future, so it might be good to distinguish those cases in the API now. Note that ginarrayproc is already using the return values like that: in GinContainedStrategy, it always returns TRUE+recheck regardless of the inputs, but in other cases it uses GIN_MAYBE. Next revision of patch is attached. In this version opclass should provide at least one consistent function: binary or ternary. It's expected to achieve best performance when opclass provide both of them. However, tests shows opposite :( I've to recheck it carefully. However, it's not! 
This revision of patch completes my test case in 123330 ms. This is slightly faster than previous revision. Great. Committed, I finally got around to it. Some minor changes: I reworded the docs and also updated the table of support functions in xindex.sgml. I refactored the query in opr_sanity.sql, to make it easier to read, and to check more carefully that the required support functions are present. I also added a runtime check to avoid crashing if neither is present. A few things we ought to still discuss: * I just noticed that the dummy trueTriConsistentFn returns GIN_MAYBE, rather than GIN_TRUE. The equivalent boolean version returns 'true' without recheck. Is that a typo, or was there some reason for the discrepancy? Actually
Re: [HACKERS] jsonb and nested hstore
On Tue, Mar 11, 2014 at 5:19 AM, Peter Geoghegan p...@heroku.com wrote: On Mon, Mar 10, 2014 at 4:19 AM, Alexander Korotkov aekorot...@gmail.com wrote: Here it is. So it looks like what you have here is analogous to the other problems that I fixed with both GiST and GIN. That isn't surprising, and this does fix my test-case. I'm not terribly happy about the lack of explanation for the hashing in that loop, though. Why use COMP_CRC32() at all, for one thing? Why do this for non-primitive jsonb hashing? COMP_CRC32(stack->hash_state, PATH_SEPARATOR, 1); Where PATH_SEPARATOR is: #define PATH_SEPARATOR ("\0") Actually, come to think of it, why not just use one hashing function everywhere? i.e., jsonb_hash(PG_FUNCTION_ARGS)? It's already very similar. Pretty much every hash operator support function 1 (i.e. a particular type's hash function) is implemented with hash_any(). Can't we just do the same here? In any case it isn't obvious why the requirements for those two things (the hashing mechanism used by the jsonb_hash_ops GIN opclass, and the hash operator class support function 1 hash function) cannot be the same thing. It's because the CRC32 interface allows incremental calculation, while hash_any requires a single chunk of memory. I don't think that unfolding everything into one chunk is a good idea. But we could implement an incremental interface for hash_any. -- With best regards, Alexander Korotkov.
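The incremental-vs-one-shot distinction is easy to see with zlib's CRC32, which, like PostgreSQL's COMP_CRC32 macro, threads a running state through successive calls:

```python
import zlib

def crc32_incremental(chunks):
    """Feed data piece by piece, as a recursive jsonb walk can do,
    without materializing the whole key in one buffer (which is
    what a hash_any-style interface would require)."""
    state = 0
    for chunk in chunks:
        state = zlib.crc32(chunk, state)  # previous state is the seed
    return state
```

The incremental result is identical to hashing the concatenated buffer in one call, which is exactly the property a single-chunk interface like hash_any lacks.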
Re: [HACKERS] jsonb and nested hstore
On Mon, Mar 10, 2014 at 1:18 PM, Peter Geoghegan p...@heroku.com wrote: * The jsonb_hash_ops non-default GIN opclass is broken. It has its own idiosyncratic notion of what constitutes containment, that sees it only return, say, jsonb arrays that have a matching string as their leftmost element (if we ask it if it contains within it another array with the same string). Because of the limited number of indexable operators (only @>), I'd put this opclass in the same category as GiST in terms of my willingness to forgo it for a release, even if it did receive a loud applause at pgConf.EU. Again, it might be some disparity between the operators as they existed in hstore2 at one time, and as they exist in the core code now, but I doubt it, not least since the regression tests didn't pick this up, and it's such a basic thing. Perhaps Oleg and Teodor just need to explain this to me. I didn't get the comment about the leftmost element. There is no distinguishing between array elements at all: all elements are extracted into the same keys independent of their indexes. This seems not to have changed since I wrote hstore_hash_ops. Could you share a test case to illustrate what you mean? -- With best regards, Alexander Korotkov.
Re: [HACKERS] jsonb and nested hstore
On Mon, Mar 10, 2014 at 2:19 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Mar 10, 2014 at 3:00 AM, Alexander Korotkov aekorot...@gmail.com wrote: I didn't get the comment about the leftmost element. There is no distinguishing between array elements at all: all elements are extracted into the same keys independent of their indexes. This seems not to have changed since I wrote hstore_hash_ops. Could you share a test case to illustrate what you mean? I don't have time to post that at the moment, but offhand I *think* your confusion may be due to the fact that the json_hash_ops opclass (as I call it) was previously consistent with the behavior of the other GIN opclass (the default). The problem is that they (well, at least the default GIN and GiST opclasses) were inconsistent with how the containment operator behaved in respect of jsonb array elements generally. Here is the commit on our feature branch where I fixed the problem for the default GIN opclass: https://github.com/feodor/postgres/commit/6f5e4fe9fc34f9512919b1c8b6a54952ab288640 If it doesn't explain the problem, you may still wish to comment on the correctness of this fix. I am still waiting on feedback from Oleg and Teodor. Apparently, there is a bug in the calculation of hashes. Array elements were hashed incrementally, while each of them should be hashed separately. That causes array elements to be distinguished by their indexes. Not sure when this bug was introduced. A fix is attached. -- With best regards, Alexander Korotkov. jsonb-hash-ops-fix.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
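A toy model of the bug, using CRC32 as a stand-in for the real hashing: when the hash state carries over from one array element to the next, the key extracted for an element depends on everything before it, so only a leftmost match can be found. Hashing each element from a fresh state fixes that:

```python
import zlib

def buggy_element_keys(arr):
    """Bug: the CRC state is carried across elements, so each key
    depends on the preceding elements (i.e. on the element's index)."""
    state, keys = 0, []
    for element in arr:
        state = zlib.crc32(element.encode(), state)  # state leaks across elements
        keys.append(state)
    return keys

def fixed_element_keys(arr):
    """Fix: each element is hashed separately, from a fresh state,
    so equal elements always produce equal index keys."""
    return [zlib.crc32(element.encode()) for element in arr]
```

With the buggy keys, a containment query for ["b"] cannot match the "b" stored inside ["a", "b"], because the stored key for "b" was contaminated by "a"; with the fixed keys it matches regardless of position.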
Re: [HACKERS] jsonb and nested hstore
On Mon, Mar 10, 2014 at 3:04 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Mar 10, 2014 at 3:47 AM, Alexander Korotkov aekorot...@gmail.com wrote: Fix is attached. Could you post a patch with regression tests, please? Here it is. -- With best regards, Alexander Korotkov. jsonb-hash-ops-fix-2.patch Description: Binary data
Re: [HACKERS] trgm regex index peculiarity
On Mon, Feb 10, 2014 at 1:01 AM, Tom Lane t...@sss.pgh.pa.us wrote: Alexander Korotkov aekorot...@gmail.com writes: On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane t...@sss.pgh.pa.us wrote: I looked at this patch a bit. It seems like this:

+ * BLANK_COLOR_SIZE - How much blank character is more frequent than
+ *                    other character in average
+ #define BLANK_COLOR_SIZE 32

is a drastic overestimate. OK, I would like to give more reasoning for the penalty. Let's consider a word of length n. It contains n+1 trigrams, including:

1      __x trigram
1      _xx trigram
1      xx_ trigram
n - 2  xxx trigrams

Assuming an alphabet of size m, those trigrams have the following average frequencies:

__x  1/((n+1)*m)
_xx  1/((n+1)*m^2)
xx_  1/((n+1)*m^2)
xxx  (n-2)/((n+1)*m^3)

Hmm, but you're assuming that the m letters are all equally frequent, which is surely not true in normal text. I didn't have a machine-readable copy of Hemingway or Tolstoy at hand, but I do have a text file with the collected works of H.P. Lovecraft, so I tried analyzing the trigrams in that. I find that:

* The average word length is 4.78 characters.
* There are 5652 distinct trigrams (discounting some containing digits), the most common one (' t') occurring 81222 times; the average occurrence count is 500.0.
* Considering only trigrams not containing any blanks, there are 4937 of them, the most common one ('the') occurring 45832 times, with an average count of 266.9.
* There are (unsurprisingly) exactly 26 trigrams of the form '  x', with an average count of 19598.5.
* There are 689 trigrams of the form ' xx' or 'xx ', the most common one (' th') occurring 58200 times, with an average count of 1450.1.

So, we've rediscovered the fact that 'the' is the most common word in English text ;-). But I think the significant conclusion for our purposes here is that single-space trigrams are on average about 1450.1/266.9 = 5.4 times more common than the average space-free trigram, and two-space trigrams about 19598.5/266.9 = 73.4 times more common.
So this leads me to the conclusion that we should be using a BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than a square-law rule for two-space trigrams). Even using your numbers, it shouldn't be 32. The next revision of the patch is attached. The changes are: 1) The notion of penalty is used instead of size. 2) We try to reduce the total penalty to WISH_TRGM_PENALTY, but the hard restriction is MAX_TRGM_COUNT on the total trigram count. 3) Penalties are assigned to particular color trigram classes, i.e. separate penalties for __a, _aa, _a_, and aa_. They are based on an analysis of trigram frequencies in Oscar Wilde's writings. We could end up with different numbers, but I don't think they will be dramatically different. -- With best regards, Alexander Korotkov. trgm-regex-optimize.3.patch Description: Binary data
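For anyone wanting to repeat this kind of corpus analysis, pg_trgm-style trigram extraction (two leading blanks and one trailing blank per word) is easy to model; this is an illustrative script, not the tool Tom used:

```python
import re
from collections import Counter

def word_trigrams(word):
    """pg_trgm pads each word with two leading and one trailing blank,
    then takes every 3-character window."""
    padded = "  " + word + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def trigram_counts(text):
    """Count trigram occurrences over all alphabetic words in a text."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        counts.update(word_trigrams(word))
    return counts
```

Splitting the counts into blank-containing and blank-free buckets and comparing the average counts per distinct trigram reproduces, on a large corpus, ratios of the kind quoted above (roughly 5x for single-blank and 70x for double-blank trigrams).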
Re: [HACKERS] GIN improvements part2: fast scan
On Thu, Feb 20, 2014 at 1:48 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/09/2014 12:11 PM, Alexander Korotkov wrote: I've rebased catalog changes with last master. Patch is attached. I've rerun my test suite with both last master ('committed') and attached patch ('ternary-consistent'). Thanks! method | sum +-- committed | 143491.71501 fast-scan-11 | 126916.11199 fast-scan-light| 137321.211 fast-scan-light-heikki | 138168.02801 master | 446976.288 ternary-consistent | 125923.514 I explain regression in last master by change of MAX_MAYBE_ENTRIES from 8 to 4. Yeah, probably. I set MAX_MAYBE_ENTRIES to 8 in initial versions to make sure we get similar behavior in Tomas' tests that used 6 search terms. But I always felt that it was too large for real queries, once we have the catalog changes, that's why I lowered to 4 when committing. If an opclass benefits greatly from fast scan, it should provide the ternary consistent function, and not rely on the shim implementation. I'm not sure about decision to reserve separate procedure number for ternary consistent. Probably, it would be better to add ginConfig method. It would be useful for post 9.4 improvements. Hmm, it might be useful for an opclass to provide both, a boolean and ternary consistent function, if the boolean version is significantly more efficient when all the arguments are TRUE/FALSE. OTOH, you could also do a quick check through the array to see if there are any MAYBE arguments, within the consistent function. But I'm inclined to keep the possibility to provide both versions. As long as we support the boolean version at all, there's not much difference in terms of the amount of code to support having them both for the same opclass. A ginConfig could be useful for many other things, but I don't think it's worth adding it now. What's the difference between returning GIN_MAYBE and GIN_TRUE+recheck? We discussed that earlier, but didn't reach any conclusion. 
That needs to be clarified in the docs. One possibility is to document that they're equivalent. Another is to forbid one of them. Yet another is to assign a different meaning to each. I've been thinking that it might be useful to define them so that a MAYBE result from the tri-consistent function means that it cannot decide if you have a match or not, because some of the inputs were MAYBE. And TRUE+recheck means that even if all the MAYBE inputs were passed as TRUE or FALSE, the result would be the same, TRUE+recheck. The practical difference would be that if the tri-consistent function returns TRUE+recheck, ginget.c wouldn't need to bother fetching the other entries, it could just return the entry with recheck=true immediately. While with MAYBE result, it would fetch the other entries and call tri-consistent again. ginget.c doesn't currently use the tri-consistent function that way - it always fetches all the entries for a potential match before calling tri-consistent, but it could. I had it do that in some of the patch versions, but Tomas' testing showed that it was a big loss on some queries, because the consistent function was called much more often. Still, something like that might be sensible in the future, so it might be good to distinguish those cases in the API now. Note that ginarrayproc is already using the return values like that: in GinContainedStrategy, it always returns TRUE+recheck regardless of the inputs, but in other cases it uses GIN_MAYBE. The next revision of the patch is attached. In this version an opclass should provide at least one consistent function: binary or ternary. It's expected to achieve the best performance when an opclass provides both of them. However, tests show the opposite :( I have to recheck it carefully. I've removed the recheck flag from the ternary consistent function; it has to return GIN_MAYBE instead. -- With best regards, Alexander Korotkov. 
gin-ternary-consistent-2.patch Description: Binary data
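The relationship between the boolean and ternary consistent functions that this thread keeps circling around can be modeled with a small sketch: a shim ternary function can be derived from a boolean one by trying every TRUE/FALSE substitution for the MAYBE inputs. (This is a simplified model; the real shim caps the number of MAYBE entries it will expand, per the MAX_MAYBE_ENTRIES discussion above.)

```python
from itertools import product

GIN_FALSE, GIN_TRUE, GIN_MAYBE = 0, 1, 2

def shim_tri_consistent(bool_consistent, args):
    """Derive a ternary answer from a boolean consistent function by
    evaluating it for every TRUE/FALSE assignment of the MAYBE inputs."""
    maybe_idx = [i for i, a in enumerate(args) if a == GIN_MAYBE]
    results = set()
    for bits in product((True, False), repeat=len(maybe_idx)):
        trial = [a == GIN_TRUE for a in args]
        for i, b in zip(maybe_idx, bits):
            trial[i] = b
        results.add(bool(bool_consistent(trial)))
    if results == {True}:
        return GIN_TRUE   # a match regardless of the unknown entries
    if results == {False}:
        return GIN_FALSE  # no match regardless of the unknown entries
    return GIN_MAYBE      # the MAYBE inputs actually matter
```

With an AND-style consistent function (`all`), [GIN_TRUE, GIN_MAYBE] yields GIN_MAYBE, while [GIN_FALSE, GIN_MAYBE] yields GIN_FALSE without ever needing the second entry - which is exactly the skipping opportunity fast scan exploits.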
Re: [HACKERS] GIN improvements part2: fast scan
On Thu, Feb 27, 2014 at 1:07 AM, Alexander Korotkov aekorot...@gmail.comwrote: On Thu, Feb 20, 2014 at 1:48 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/09/2014 12:11 PM, Alexander Korotkov wrote: I've rebased catalog changes with last master. Patch is attached. I've rerun my test suite with both last master ('committed') and attached patch ('ternary-consistent'). Thanks! method | sum +-- committed | 143491.71501 fast-scan-11 | 126916.11199 fast-scan-light| 137321.211 fast-scan-light-heikki | 138168.02801 master | 446976.288 ternary-consistent | 125923.514 I explain regression in last master by change of MAX_MAYBE_ENTRIES from 8 to 4. Yeah, probably. I set MAX_MAYBE_ENTRIES to 8 in initial versions to make sure we get similar behavior in Tomas' tests that used 6 search terms. But I always felt that it was too large for real queries, once we have the catalog changes, that's why I lowered to 4 when committing. If an opclass benefits greatly from fast scan, it should provide the ternary consistent function, and not rely on the shim implementation. I'm not sure about decision to reserve separate procedure number for ternary consistent. Probably, it would be better to add ginConfig method. It would be useful for post 9.4 improvements. Hmm, it might be useful for an opclass to provide both, a boolean and ternary consistent function, if the boolean version is significantly more efficient when all the arguments are TRUE/FALSE. OTOH, you could also do a quick check through the array to see if there are any MAYBE arguments, within the consistent function. But I'm inclined to keep the possibility to provide both versions. As long as we support the boolean version at all, there's not much difference in terms of the amount of code to support having them both for the same opclass. A ginConfig could be useful for many other things, but I don't think it's worth adding it now. What's the difference between returning GIN_MAYBE and GIN_TRUE+recheck? 
We discussed that earlier, but didn't reach any conclusion. That needs to be clarified in the docs. One possibility is to document that they're equivalent. Another is to forbid one of them. Yet another is to assign a different meaning to each. I've been thinking that it might be useful to define them so that a MAYBE result from the tri-consistent function means that it cannot decide if you have a match or not, because some of the inputs were MAYBE. And TRUE+recheck means that even if all the MAYBE inputs were passed as TRUE or FALSE, the result would be the same, TRUE+recheck. The practical difference would be that if the tri-consistent function returns TRUE+recheck, ginget.c wouldn't need to bother fetching the other entries, it could just return the entry with recheck=true immediately. While with MAYBE result, it would fetch the other entries and call tri-consistent again. ginget.c doesn't currently use the tri-consistent function that way - it always fetches all the entries for a potential match before calling tri-consistent, but it could. I had it do that in some of the patch versions, but Tomas' testing showed that it was a big loss on some queries, because the consistent function was called much more often. Still, something like that might be sensible in the future, so it might be good to distinguish those cases in the API now. Note that ginarrayproc is already using the return values like that: in GinContainedStrategy, it always returns TRUE+recheck regardless of the inputs, but in other cases it uses GIN_MAYBE. Next revision of patch is attached. In this version opclass should provide at least one consistent function: binary or ternary. It's expected to achieve best performance when opclass provide both of them. However, tests shows opposite :( I've to recheck it carefully. However, it's not! This revision of patch completes my test case in 123330 ms. This is slightly faster than previous revision. -- With best regards, Alexander Korotkov.
Re: [HACKERS] PoC: Partial sort
On Thu, Feb 13, 2014 at 1:54 AM, Marti Raudsepp ma...@juffo.org wrote: I think the 1st patch now has a bug in initial_cost_mergejoin; you still pass the presorted_keys argument to cost_sort, making it calculate a partial sort cost, but generated plans never use partial sort. I think 0 should be passed instead. Patch attached, needs to be applied on top of partial-sort-basic-1 and then reverse-applied on partial-sort-merge-1. It doesn't look so to me. Merge join doesn't look for partial sort specifically, but if a path with some presorted pathkeys happens to be selected, then partial sort will be used. See the create_mergejoin_plan function. So, I think this cost_sort call is consistent with create_mergejoin_plan. If we don't want partial sort to be used in such rare cases, then we should revert it in both places. However, I doubt that it adds any overhead, so we can leave it as is. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Small GIN optimizations (after 9.4)
On Wed, Feb 12, 2014 at 2:58 AM, Bruce Momjian br...@momjian.us wrote: On Sun, Feb 9, 2014 at 02:17:12PM +0400, Alexander Korotkov wrote: On Thu, Feb 6, 2014 at 8:31 PM, PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at wrote: i think there is one more thing which would be really good in GIN and which would solve a ton of issues. atm GIN entries are sorted by item pointer. if we could sort them by a column it would fix a couple of real work issues such as ... SELECT ... FROM foo WHERE tsearch_query ORDER BY price DESC LIMIT 10 ... or so. it many cases you want to search for a, say, product and find the cheapest / most expensive one. if the tsearch_query yields a high number of rows (which it often does) the subsequent sort will kill you. This is not intended to be a small change. However, some solution might be possible in post 9.4 gin improvements or in new secret indexing project which will be presented at PGCon :-) Would any of the listed changes cause backward-incompatible changes to the on-disk format, causing problems for pg_upgrade? None of them. -- With best regards, Alexander Korotkov.
Re: [HACKERS] PoC: Partial sort
On Mon, Feb 10, 2014 at 2:33 PM, Marti Raudsepp ma...@juffo.org wrote: On Sun, Feb 9, 2014 at 7:37 PM, Alexander Korotkov aekorot...@gmail.com wrote: This is not the only place that worries me about planning overhead. See get_cheapest_fractional_path_for_pathkeys. I had to estimate the number of groups for each sorting column in order to get the right fractional path. AFAICT this only happens once per plan and the overhead is O(n) to the number of pathkeys? I can't get worried about that, but I guess it's better to test anyway. PS: You didn't answer my questions about splitting the patch. I guess I'll have to do that anyway to run the tests. Done. The patch is split. -- With best regards, Alexander Korotkov. partial-sort-basic-1.patch.gz Description: GNU Zip compressed data partial-sort-merge-1.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] GSoC 2014 - mentors, students and admins
Hi! On Tue, Jan 28, 2014 at 9:34 PM, Thom Brown t...@linux.com wrote: And I'd be fine with being admin again this year, unless there's anyone else who would like to take up the mantle? Thanks for your work. I would like to see you as admin again this year. Who would be up for mentoring this year? And are there any project ideas folk would like to suggest? I would like to be a mentor. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part2: fast scan
On Fri, Feb 7, 2014 at 5:33 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/06/2014 01:22 PM, Alexander Korotkov wrote: Difference is very small. For me, it looks ready for commit. Great, committed! Now, to review the catalog changes... I've rebased the catalog changes onto the latest master. Patch is attached. I've rerun my test suite with both the latest master ('committed') and the attached patch ('ternary-consistent'):

 method                 |     sum
------------------------+---------------
 committed              | 143491.71501
 fast-scan-11           | 126916.11199
 fast-scan-light        | 137321.211
 fast-scan-light-heikki | 138168.02801
 master                 | 446976.288
 ternary-consistent     | 125923.514

I explain the regression in the latest master by the change of MAX_MAYBE_ENTRIES from 8 to 4. However, I'm not sure why ternary-consistent shows such good results. Probably it's because of some optimizations you did in the committed version, which weren't visible before because of the change of MAX_MAYBE_ENTRIES. I'm not sure about the decision to reserve a separate procedure number for the ternary consistent function. Probably, it would be better to add a ginConfig method. It would be useful for post-9.4 improvements. -- With best regards, Alexander Korotkov. gin-ternary-consistent.patch Description: Binary data
Re: [HACKERS] Small GIN optimizations (after 9.4)
On Thu, Feb 6, 2014 at 8:31 PM, PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at wrote: i think there is one more thing which would be really good in GIN and which would solve a ton of issues. atm GIN entries are sorted by item pointer. if we could sort them by a column it would fix a couple of real work issues such as ... SELECT ... FROM foo WHERE tsearch_query ORDER BY price DESC LIMIT 10 ... or so. it many cases you want to search for a, say, product and find the cheapest / most expensive one. if the tsearch_query yields a high number of rows (which it often does) the subsequent sort will kill you. This is not intended to be a small change. However, some solution might be possible in post 9.4 gin improvements or in new secret indexing project which will be presented at PGCon :-) -- With best regards, Alexander Korotkov.
Re: [HACKERS] Small GIN optimizations (after 9.4)
On Thu, Feb 6, 2014 at 3:36 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: While hacking on the GIN patches, I've come up with a few different ideas for improving performance. It's too late for 9.4, but I'll list them here if someone wants to work on them later: * Represent ItemPointers as uint64's, to speed up comparisons. ginCompareItemPointers is inlined into only a few instructions, but it's still more expensive than a single-instruction 64-bit comparison. ginCompareItemPointers is called very heavily in a GIN scan, so even a small improvement there would make for a noticeable speedup. It might be an improvement in code clarity, too. * Keep the entry streams of a GinScanKey in a binary heap, to quickly find the minimum curItem among them. I did this in various versions of the fast scan patch, but then I realized that the straightforward way of doing it is wrong, because a single GinScanEntry can be part of multiple GinScanKeys. If an entry's curItem is updated as part of advancing one key, and the entry is in a heap of another key, updating the curItem can violate the heap property of the other entry's heap. * Build a truth table (or cache) of consistent-function's results, and use that instead of calling consistent for every item. Caching seems more appropriate to me. Intuition tells me that when the number of entries is high, far from all consistent combinations will actually be used. However, intuition might be false :-) * Deduce AND or OR logic from the consistent function. Or have the opclass provide a tree of AND/OR/NOT nodes directly, instead of a consistent function. For example, if the query is foo & bar, we could avoid calling the consistent function altogether, and only return items that match both. I also had this idea, but this solution doesn't cover similarity queries: if you have 20 entries and the consistent function returns true when at least 10 of 20 are present, then the representation in AND/OR/NOT nodes would be enormous, and so useless. 
* Delay decoding segments during a scan. Currently, we decode all segments of a posting tree page into a single array at once. But with fast scan, we might be able to skip over all entries in some of the segments. So it would be better to copy the segments into backend-private memory in compressed format, and decode them one segment at a time (or maybe even one item at a time), when needed. That would avoid the unnecessary decoding of segments that can be skipped over, and would also reduce the memory usage of a scan. I would like to add another one as a continuation of fast scan: * Skip scanning some entries entirely, forcing recheck instead. The right decision should be made based on costs. However, I'm not sure about the design, because it's like a planning feature. How appropriate is it to do this inside GIN? -- With best regards, Alexander Korotkov.
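The truth-table/cache idea from the list above can be sketched like this: memoize the consistent function on the tuple of per-entry match flags, so a combination that has been seen before never pays the function-call cost again. A simplified model, keyed on boolean flags only (the real thing would live in ginget.c and would also need to handle MAYBE states):

```python
def make_cached_consistent(consistent):
    """Wrap a consistent function with a cache keyed on the tuple of
    entry-match flags; with k query entries there are at most 2^k keys."""
    cache = {}
    def cached(check):
        key = tuple(check)
        if key not in cache:
            cache[key] = consistent(key)
        return cache[key]
    return cached
```

This matches the intuition stated above: the cache only wins if the number of distinct flag combinations actually seen during the scan is much smaller than the number of items, which is plausible when the entry count is high.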
Re: [HACKERS] PoC: Partial sort
On Thu, Feb 6, 2014 at 12:39 PM, Marti Raudsepp ma...@juffo.org wrote: On Thu, Feb 6, 2014 at 5:31 AM, Robert Haas robertmh...@gmail.com wrote: Hmm, sounds a little steep. Why is it so expensive? I'm probably missing something here, because I would have thought that planner support for partial sorts would consist mostly of considering the same sorts we consider today, but with the costs reduced by the batching. I guess it's because the patch undoes some optimizations in the mergejoin planner wrt caching merge clauses and adds a whole lot of code to find_mergeclauses_for_pathkeys. In other code paths the overhead does seem to be negligible. Notice the removal of:
/* Select the right mergeclauses, if we didn't already */
/*
 * Avoid rebuilding clause list if we already made one;
 * saves memory in big join trees...
 */
This is not the only place that worries me about planning overhead. See get_cheapest_fractional_path_for_pathkeys. I had to estimate the number of groups for each sorting column in order to get the right fractional path. For a partial sort path, the cost of the first batch should be included in the startup cost. If we don't do so, the optimizer can pick strange plans based on the assumption that it needs only a few rows from the inner node. See an example.
create table test1 as (
    select id, (random()*100)::int as v1, (random()*1)::int as v2
    from generate_series(1,100) id);
create table test2 as (
    select id, (random()*100)::int as v1, (random()*1)::int as v2
    from generate_series(1,100) id);
create index test1_v1_idx on test1 (v1);

Plan without fraction estimation in get_cheapest_fractional_path_for_pathkeys:

postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1 order by t1.v1, t1.id limit 10;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Limit  (cost=198956893.20..198956913.33 rows=10 width=24)
   ->  Partial sort  (cost=198956893.20..19909637942.82 rows=9791031169 width=24)
         Sort Key: t1.v1, t1.id
         Presorted Key: t1.v1
         ->  Nested Loop  (cost=0.42..19883065506.84 rows=9791031169 width=24)
               Join Filter: (t1.v1 = t2.v1)
               ->  Index Scan using test1_v1_idx on test1 t1  (cost=0.42..47600.84 rows=100 width=12)
               ->  Materialize  (cost=0.00..25289.00 rows=100 width=12)
                     ->  Seq Scan on test2 t2  (cost=0.00..15406.00 rows=100 width=12)
(9 rows)

Current version of patch:

postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1 order by t1.v1, t1.id limit 10;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Limit  (cost=3699913.43..3699913.60 rows=10 width=24)
   ->  Partial sort  (cost=3699913.43..173638549.67 rows=9791031169 width=24)
         Sort Key: t1.v1, t1.id
         Presorted Key: t1.v1
         ->  Merge Join  (cost=150444.79..147066113.70 rows=9791031169 width=24)
               Merge Cond: (t1.v1 = t2.v1)
               ->  Index Scan using test1_v1_idx on test1 t1  (cost=0.42..47600.84 rows=100 width=12)
               ->  Materialize  (cost=149244.84..154244.84 rows=100 width=12)
                     ->  Sort  (cost=149244.84..151744.84 rows=100 width=12)
                           Sort Key: t2.v1
                           ->  Seq Scan on test2 t2  (cost=0.00..15406.00 rows=100 width=12)
(11 rows)

I don't compare actual execution times because I didn't wait for the first plan to finish executing :-) But anyway the costs are extraordinary, and the inner sequential scan of 100 rows is odd. -- With best regards, Alexander Korotkov.
Re: [HACKERS] trgm regex index peculiarity
On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane t...@sss.pgh.pa.us wrote: Alexander Korotkov aekorot...@gmail.com writes: Revised version of patch with necessary comments. I looked at this patch a bit. It seems like this:
+ * BLANK_COLOR_SIZE - How much blank character is more frequent than
+ * other character in average
+ #define BLANK_COLOR_SIZE 32
is a drastic overestimate. Using the program Erik provided to generate some sample data, I find that blanks in trigrams are maybe about three times more common than other characters. Possibly Erik's data isn't very representative of real text, but if that's the case we'd better get a more representative sample case before we start optimizing ... Now maybe the above number is okay for this purpose anyway, but if so it needs some different justification; maybe call it a penalty? But nonetheless it seems like a darn large penalty, particularly for __x type trigrams where the value would effectively be squared. Compared to the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like you might as well forget the sizing approach and just flat out reject trigrams containing COLOR_BLANK, because they'll never get through the size filter. That seems like the right decision to me when we have other trigrams that can reject them. But there are cases when we can't. I'm inclined to think you need a larger MAX_TRGM_SIZE; and WISH_TRGM_SIZE seems mighty small as well, compared to what the code would have done before. OK, let me give more detailed reasoning for the penalty. Let's consider a word of length n.
It contains n+1 trigrams, including:
  1 __x trigram
  1 _xx trigram
  1 xx_ trigram
  n-2 xxx trigrams
Assuming an alphabet of size m, those trigrams have the following average frequencies:
  __x: 1/((n+1)*m)
  _xx: 1/((n+1)*m^2)
  xx_: 1/((n+1)*m^2)
  xxx: (n-2)/((n+1)*m^3)
The ratios of these blank-containing frequencies to the frequency of xxx are:
  __x: m^2/(n-2)
  _xx and xx_: m/(n-2)
In order to estimate n, I've analyzed some classics (average word length):
  Ernest Hemingway, A Farewell to Arms - 3.8
  Leo Tolstoy, War and Peace - 5.0
In English, with alphabet size = 26, we have:
  __x: m^2/(n-2) = 375
  _xx and xx_: m/(n-2) = 14.4
In Russian, with alphabet size = 33, we have:
  __x: m^2/(n-2) = 363
  _xx and xx_: m/(n-2) = 11
Since these calculations are approximate, we can safely use the same values for both languages. Does such reasoning make sense? Also, surely this bit: ! trgmNFA->totalTrgmCount = (int) totalTrgmSize; is flat out wrong? expandColorTrigrams() expects that trgmNFA->totalTrgmCount is the truth, not some penalty-bloated abstraction. The fact that the patch hasn't failed on you completely demonstrates that you've not tested any cases where blank-containing trigrams got through the filter, reinforcing my feeling that it's probably too strict now. Oh, I wonder how I could leave such a weird bug in the patch :-( I wonder if it would be more appropriate to continue to measure the limit MAX_TRGM_COUNT in terms of actual trigram counts, and just use the penalized size as the sort key while determining which color trigrams to discard first. Agreed. But I would like to keep some equivalent of WISH_TRGM_SIZE. Another thought is that it's not clear that you should apply the same penalty to blanks in all three positions. Because of the padding settings, any one word will normally lead to padded strings "  a", " ab", "yz ", so that spaces in the first position are about twice as common as spaces in the other positions.
(It's a little less than that because single-letter words produce only "  a" and " a "; but I'd think that such words aren't very common in text that trigrams are effective for.) So I'm thinking the penalty ought to take that into account. I'm also inclined to think that it might be worth adding a size field to ColorTrgmInfo rather than repeatedly calculating the size estimate. We only allow a thousand and some ColorTrgmInfos at most, so the extra space wouldn't be that large, and I'm concerned by the expense the current patch adds to the sort comparator. Agreed. Another point is that this comment:
 * Note: per-color-trigram counts cannot overflow an int so long as
 * COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie about
 * 1290.  However, the grand total totalTrgmCount might conceivably
 * overflow an int, so we use int64 for that within this routine.
is no longer valid, or at least it fails to demonstrate that the result of getColorTrigramSize() can't overflow an int; indeed I rather fear it can. The patch failed to even update the variable name in this comment, let alone address the substantive question. There are some other cosmetic things I didn't like, for instance colorTrgmInfoCountCmp() is no longer comparing counts but you didn't rename it. That's about it for substantive comments though. Thanks, will be fixed. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part2: fast scan
On Thu, Feb 6, 2014 at 2:21 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02/05/2014 12:42 PM, Alexander Korotkov wrote: Attached patch is a light version of fast scan. It does extra consistent function calls only on startScanKey, no extra calls during the scan of the index. It finds a subset of the rarest entries whose absence guarantees a false consistent-function result. Sounds good. Since the extra consistent calls are only performed at startup, it's unlikely to cause any noticeable performance regression, even when it's not helping. I've run real-life tests based on postgresql.org logs (33318 queries). Here is a table with the total time of running the whole test case.
=# select method, sum(time) from test_result group by method order by method;
       method        |     sum
---------------------+--------------
 fast-scan-11        | 126916.11199
 fast-scan-light     | 137321.211
 heikki              | 466751.693
 heikki-true-ternary | 454113.62397
 master              | 446976.288
(6 rows)
where 'heikki' is gin-ternary-logic binary-heap preconsistent-only-on-new-page.patch and 'heikki-true-ternary' is the version with my catalog changes promoting the ternary consistent function to the opclass. Looks good. We can see fast-scan-light gives almost the same performance benefit on these queries. However, I tested only the CPU effect, not the IO effect. Any thoughts? This test doesn't handle lossy pages correctly:
/*
 * Check if all items marked as scanEntryUseForSkip are not present in
 * minItem. If so, we can correct advancePast.
 */
if (ginCompareItemPointers(minItem, minItemSkip) < 0)
{
    advancePast = minItemSkip;
    advancePast.ip_posid--;
    continue;
}
If minItemSkip is a lossy page, we skip all exact items on the same page.
That's wrong, here's a test case to demonstrate:
CREATE TABLE foo (ts tsvector);
INSERT INTO foo SELECT to_tsvector('foo bar'||g) from generate_series(1, 100) g;
CREATE INDEX i_foo ON foo USING gin (ts);
set work_mem='64'; -- to force lossy pages
-- should return 11
select count(*) from foo where ts @@ to_tsquery('foo & bar4:*');
After some fiddling (including fixing the above), I ended up with the attached version of your patch. I still left out the catalog changes, again not because I don't think we should have them, but to split this into smaller patches that can be reviewed separately. I extended the "both ways" shim function to work with more than one "maybe" input. Of course, that is O(n^2), where n is the number of "maybe" arguments, so that won't scale, but it's OK for testing purposes. And many if not most real world queries, too. I had a very hard time understanding what it really means for an entry to be in the scanEntryUseForSkip array. What does it mean to "use an entry for skipping"? So I refactored that logic, to hopefully make it more clear. In essence, the entries are divided into two groups, required and other items. If none of the entries in the required set are true, then we know that the overall qual cannot match. See the comments for a more detailed explanation. I'm not wedded to the term "required" here; one might think that *all* the entries in the required set must be TRUE for a match, while it's actually that at least one of them must be TRUE. In keyGetItem, we fetch the minimum item among all the entries in the requiredEntries set. That's the next item we're going to return, regardless of any items in the otherEntries set. But before calling the consistent function, we advance the other entries up to that point, so that we know whether to pass TRUE or FALSE to the consistent function. IOW, otherEntries only provide extra information for the consistent function to decide if we have a match or not - they don't affect which items we return to the caller.
I think this is pretty close to committable state now. I'm slightly disappointed that we can't do the skipping in more scenarios. For example, in "foo & bar", we can currently skip the "bar" entry up to the next "foo", but we cannot skip "foo" up to the next "bar". But this is fairly simple, and since we sort the entries by estimated frequency, it should already cover all the pathological "rare & frequent" type queries. So that's OK. Sorry for my scanty explanation. Your version of the patch looks good to me. I've rerun my test case:
=# select method, sum(time) from test_result group by method order by method;
         method         |     sum
------------------------+--------------
 fast-scan-11           | 126916.11199
 fast-scan-light        | 137321.211
 fast-scan-light-heikki | 138168.02801
 heikki                 | 466751.693
 heikki-tru-ternary     | 454113.62397
 master                 | 446976.288
 tri-state
Re: [HACKERS] GIN improvements part2: fast scan
On Wed, Feb 5, 2014 at 1:23 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Feb 3, 2014 at 6:31 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Jan 27, 2014 at 7:30 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Jan 27, 2014 at 2:32 PM, Alexander Korotkov aekorot...@gmail.com wrote: Every single change you did in fast scan seems to be reasonable, but testing shows that something went wrong. Simple test with 3 words of different selectivities. After applying your patches:
# select count(*) from fts_test where fti @@ plainto_tsquery('english', 'gin index select');
 count
───────
   627
(1 row)
Time: 21,252 ms
In the original fast scan:
# select count(*) from fts_test where fti @@ plainto_tsquery('english', 'gin index select');
 count
───────
   627
(1 row)
Time: 3,382 ms
I'm trying to get deeper into it. I had two guesses about why it has become so much slower than my original fast scan: 1) Not using the native consistent function 2) Not sorting entries. I attach two patches which roll back these two features (sorry for the awful quality of the second). The native consistent function accelerates things significantly, as expected. It seems that sorting entries has almost no effect. However, it's still not as fast as the initial fast scan:
# select count(*) from fts_test where fti @@ plainto_tsquery('english', 'gin index select');
 count
───────
   627
(1 row)
Time: 5,381 ms
Tomas, could you rerun your tests with the first and both of these patches applied against the patches by Heikki? I found my patch 0005-Ternary-consistent-implementation.patch to be completely wrong. It introduces a ternary consistent function to the opclass but doesn't use it, because I forgot to include the ginlogic.c change in the patch. So, it shouldn't make any impact on performance. However, the testing results with that patch differ significantly. That makes me very uneasy. Can we now reproduce the exact same results? The right version of these two patches, combined into one against current head, is attached.
I've rerun tests with it; the results are in /mnt/sas-raid10/gin-testing/queries/9.4-fast-scan-10. Could you rerun the postprocessing, including graph drawing? Sometimes test cases are not what we expect. For example:
=# explain SELECT id FROM messages WHERE body_tsvector @@ to_tsquery('english','(5alpha1-initdb''d)');
                                            QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────
 Bitmap Heap Scan on messages  (cost=84.00..88.01 rows=1 width=4)
   Recheck Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' & ''initdb'' & ''d'''::tsquery)
   ->  Bitmap Index Scan on messages_body_tsvector_idx  (cost=0.00..84.00 rows=1 width=0)
         Index Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' & ''initdb'' & ''d'''::tsquery)
 Planning time: 0.257 ms
(5 rows)
5alpha1-initdb'd is 3 GIN entries with different frequencies. Also, these patches are not intended to change the speed of relevance ordering. When the number of results is high, most of the time is spent on relevance calculation and sorting. I propose removing the ORDER BY clause from the test cases to see the scan speed more clearly. I've got a dump of postgresql.org search queries from Magnus. We can add them to our test case. It turns out that version 10 contained a bug in the ternary consistent function implementation for tsvector. Fixed in the attached version. The attached patch is a light version of fast scan. It does extra consistent function calls only on startScanKey, no extra calls during the scan of the index. It finds a subset of the rarest entries whose absence guarantees a false consistent-function result. I've run real-life tests based on postgresql.org logs (33318 queries). Here is a table with the total time of running the whole test case.
=# select method, sum(time) from test_result group by method order by method;
       method        |     sum
---------------------+--------------
 fast-scan-11        | 126916.11199
 fast-scan-light     | 137321.211
 heikki              | 466751.693
 heikki-true-ternary | 454113.62397
 master              | 446976.288
(6 rows)
where 'heikki' is gin-ternary-logic binary-heap preconsistent-only-on-new-page.patch and 'heikki-true-ternary' is the version with my catalog changes promoting the ternary consistent function to the opclass. We can see fast-scan-light gives almost the same performance benefit on these queries. However, I tested only the CPU effect, not the IO effect. Any thoughts? -- With best regards, Alexander Korotkov.
gin-fast-scan-light.patch.gz
Description: GNU Zip compressed data
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix picksplit with nan values
On Sat, Feb 1, 2014 at 7:50 AM, Bruce Momjian br...@momjian.us wrote: Where are we on this? I found that I have an empty draft letter from November with a new version of the patch attached. I'll return to this once we have some resolution of the GIN fast scan challenge. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part2: fast scan
On Mon, Feb 3, 2014 at 6:31 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Jan 27, 2014 at 7:30 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Jan 27, 2014 at 2:32 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Sun, Jan 26, 2014 at 8:14 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/26/2014 08:24 AM, Tomas Vondra wrote: Hi! On 25.1.2014 22:21, Heikki Linnakangas wrote: Attached is a new version of the patch set, with those bugs fixed. I've done a bunch of tests with all the 4 patches applied, and it seems to work now. I've done tests with various conditions (AND/OR, number of words, number of conditions) and so far I did not get any crashes, infinite loops or anything like that. I've also compared the results to 9.3 - by dumping the database and running the same set of queries on both machines, and indeed I got 100% match. I also did some performance tests, and that's when I started to worry. For example, I generated and ran 1000 queries that look like this: SELECT id FROM messages WHERE body_tsvector @@ to_tsquery('english','(header & 53 & 32 & useful & dropped)') ORDER BY ts_rank(body_tsvector, to_tsquery('english','(header & 53 & 32 & useful & dropped)')) DESC; i.e. in this case the query always was 5 words connected by AND. This query is a pretty common pattern for fulltext search - sort by a list of words and give me the best ranked results. On 9.3, the script was running for ~23 seconds, on patched HEAD it was ~40. It's perfectly reproducible, I've repeated the test several times with exactly the same results. The test is CPU bound, there's no I/O activity at all. I got the same results with more queries (~100k). Attached is a simple chart with x-axis used for durations measured on 9.3.2, y-axis used for durations measured on patched HEAD. It's obvious a vast majority of queries is up to 2x slower - that's pretty obvious from the chart. Only about 50 queries are faster on HEAD, and 700 queries are more than 50% slower on HEAD (i.e.
if the query took 100ms on 9.3, it takes 150ms on HEAD). Typically, the EXPLAIN ANALYZE looks something like this (on 9.3): http://explain.depesz.com/s/5tv and on HEAD (same query): http://explain.depesz.com/s/1lI Clearly the main difference is in the Bitmap Index Scan which takes 60ms on 9.3 and 120ms on HEAD. On 9.3 the perf top looks like this:
34.79% postgres [.] gingetbitmap
28.96% postgres [.] ginCompareItemPointers
 9.36% postgres [.] TS_execute
 5.36% postgres [.] check_stack_depth
 3.57% postgres [.] FunctionCall8Coll
while on 9.4 it looks like this:
28.20% postgres [.] gingetbitmap
21.17% postgres [.] TS_execute
 8.08% postgres [.] check_stack_depth
 7.11% postgres [.] FunctionCall8Coll
 4.34% postgres [.] shimTriConsistentFn
Not sure how to interpret that, though. For example, where did the ginCompareItemPointers go? I suspect it's thanks to inlining, and that it might be related to the performance decrease. Or maybe not. Yeah, inlining makes it disappear from the profile, and spreads that time to the functions calling it. The profile tells us that the consistent function is called a lot more than before. That is expected - with the fast scan feature, we're calling consistent not only for potential matches, but also to refute TIDs based on just a few entries matching. If that's effective, it allows us to skip many TIDs and avoid consistent calls, which compensates, but if it's not effective, it's just overhead. I would actually expect it to be fairly effective for that query, so that's a bit surprising. I added counters to see where the calls are coming from, and it seems that about 80% of the calls are actually coming from this little feature I explained earlier: In addition to that, I'm using the ternary consistent function to check if minItem is a match, even if we haven't loaded all the entries yet. That's less important, but I think for something like rare1 | (rare2 & frequent) it might be useful.
It would allow us to skip fetching 'frequent', when we already know that 'rare1' matches for the current item. I'm not sure if that's worth the cycles, but it seemed like an obvious thing to do, now that we have the ternary consistent function. So, that clearly isn't worth the cycles :-). At least not with an expensive consistent function; it might be worthwhile if we pre-build the truth-table, or cache the results of the consistent function. Attached is a quick patch to remove that, on top of all the other patches, if you want to test the effect. Every single change you did in fast scan seems to be reasonable, but testing shows that something went wrong
Re: [HACKERS] GIN improvements part2: fast scan
On Mon, Jan 27, 2014 at 7:30 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Jan 27, 2014 at 2:32 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Sun, Jan 26, 2014 at 8:14 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/26/2014 08:24 AM, Tomas Vondra wrote: Hi! On 25.1.2014 22:21, Heikki Linnakangas wrote: Attached is a new version of the patch set, with those bugs fixed. I've done a bunch of tests with all the 4 patches applied, and it seems to work now. I've done tests with various conditions (AND/OR, number of words, number of conditions) and so far I did not get any crashes, infinite loops or anything like that. I've also compared the results to 9.3 - by dumping the database and running the same set of queries on both machines, and indeed I got 100% match. I also did some performance tests, and that's when I started to worry. For example, I generated and ran 1000 queries that look like this: SELECT id FROM messages WHERE body_tsvector @@ to_tsquery('english','(header & 53 & 32 & useful & dropped)') ORDER BY ts_rank(body_tsvector, to_tsquery('english','(header & 53 & 32 & useful & dropped)')) DESC; i.e. in this case the query always was 5 words connected by AND. This query is a pretty common pattern for fulltext search - sort by a list of words and give me the best ranked results. On 9.3, the script was running for ~23 seconds, on patched HEAD it was ~40. It's perfectly reproducible, I've repeated the test several times with exactly the same results. The test is CPU bound, there's no I/O activity at all. I got the same results with more queries (~100k). Attached is a simple chart with x-axis used for durations measured on 9.3.2, y-axis used for durations measured on patched HEAD. It's obvious a vast majority of queries is up to 2x slower - that's pretty obvious from the chart. Only about 50 queries are faster on HEAD, and 700 queries are more than 50% slower on HEAD (i.e. if the query took 100ms on 9.3, it takes 150ms on HEAD).
Typically, the EXPLAIN ANALYZE looks something like this (on 9.3): http://explain.depesz.com/s/5tv and on HEAD (same query): http://explain.depesz.com/s/1lI Clearly the main difference is in the Bitmap Index Scan which takes 60ms on 9.3 and 120ms on HEAD. On 9.3 the perf top looks like this:
34.79% postgres [.] gingetbitmap
28.96% postgres [.] ginCompareItemPointers
 9.36% postgres [.] TS_execute
 5.36% postgres [.] check_stack_depth
 3.57% postgres [.] FunctionCall8Coll
while on 9.4 it looks like this:
28.20% postgres [.] gingetbitmap
21.17% postgres [.] TS_execute
 8.08% postgres [.] check_stack_depth
 7.11% postgres [.] FunctionCall8Coll
 4.34% postgres [.] shimTriConsistentFn
Not sure how to interpret that, though. For example, where did the ginCompareItemPointers go? I suspect it's thanks to inlining, and that it might be related to the performance decrease. Or maybe not. Yeah, inlining makes it disappear from the profile, and spreads that time to the functions calling it. The profile tells us that the consistent function is called a lot more than before. That is expected - with the fast scan feature, we're calling consistent not only for potential matches, but also to refute TIDs based on just a few entries matching. If that's effective, it allows us to skip many TIDs and avoid consistent calls, which compensates, but if it's not effective, it's just overhead. I would actually expect it to be fairly effective for that query, so that's a bit surprising. I added counters to see where the calls are coming from, and it seems that about 80% of the calls are actually coming from this little feature I explained earlier: In addition to that, I'm using the ternary consistent function to check if minItem is a match, even if we haven't loaded all the entries yet. That's less important, but I think for something like rare1 | (rare2 & frequent) it might be useful. It would allow us to skip fetching 'frequent', when we already know that 'rare1' matches for the current item.
I'm not sure if that's worth the cycles, but it seemed like an obvious thing to do, now that we have the ternary consistent function. So, that clearly isn't worth the cycles :-). At least not with an expensive consistent function; it might be worthwhile if we pre-build the truth-table, or cache the results of the consistent function. Attached is a quick patch to remove that, on top of all the other patches, if you want to test the effect. Every single change you did in fast scan seems to be reasonable, but testing shows that something went wrong. Simple test with 3 words of different selectivities. After applying your patches
Re: [HACKERS] GIN improvements part2: fast scan
On Thu, Jan 30, 2014 at 8:38 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/30/2014 01:53 AM, Tomas Vondra wrote: (3) A file with explain plans for 4 queries suffering ~2x slowdown, and explain plans with 9.4 master and Heikki's patches is available here: http://www.fuzzy.cz/tmp/gin/queries.txt All the queries have 6 common words, and the explain plans look just fine to me - exactly like the plans for other queries. Two things now caught my eye. First, some of these queries actually have words repeated - either exactly, like "database & database", or in negated form, like "!anything & anything". Second, while generating the queries, I use dumb frequency, where only exact matches count. I.e. write != written etc. But the actual number of hits may be much higher - for example "write" matches exactly just 5% of documents, but using @@ it matches more than 20%. I don't know if that's the actual cause though. I tried these queries with the data set you posted here: http://www.postgresql.org/message-id/52e4141e.60...@fuzzy.cz. The reason for the slowdown is the extra consistent calls it causes. That's expected - the patch certainly does call consistent in situations where it didn't before, and if the pre-consistent checks are not able to eliminate many tuples, you lose. So, what can we do to mitigate that? Some ideas:
1. Implement the catalog changes from Alexander's patch. That ought to remove the overhead, as you only need to call the consistent function once, not both ways. OTOH, currently we only call the consistent function if there is only one unknown column. If, with the catalog changes, we always call the consistent function even if there are more unknown columns, we might end up calling it even more often.
2. Cache the result of the consistent function.
3. Make the consistent function cheaper. (How? Magic?)
4. Use some kind of a heuristic, and stop doing the pre-consistent checks if they're not effective. Not sure what the heuristic would look like.
I'm fiddling around with the following idea for such a heuristic:
1) Sort entries in ascending order of their TID-count estimates. Assume the number of entries to be n.
2) Sequentially call the ternary consistent function with the first m entries set to GIN_FALSE and the rest n-m set to GIN_MAYBE, until it returns GIN_FALSE.
3) Decide whether to use fast scan depending on the number of TIDs in the first m entries and the number of TIDs in the rest n-m entries. Something like
   number_of_TIDs_in_first_m_entries * k < number_of_TIDs_in_rest_n-m_entries
where k is some quotient. For sure, it's a rough method, but it should be fast. Without a native implementation of the ternary consistent function, the algorithm will be slightly different: only m values close to n should be checked. I'm going to implement this heuristic against the last version of your patch. -- With best regards, Alexander Korotkov.
Re: [HACKERS] KNN-GiST with recheck
On Tue, Jan 28, 2014 at 9:32 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/28/2014 04:12 PM, Alexander Korotkov wrote: On Tue, Jan 28, 2014 at 5:54 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: 4. (as you mentioned in the other thread: ) It's a modularity violation that you peek into the heap tuple from gist. I think the proper way to do this would be to extend the IndexScan executor node to perform the re-shuffling of tuples that come from the index in wrong order, or perhaps add a new node type for it. Of course that's exactly what your partial sort patch does :-). I haven't looked at that in detail, but I don't think the approach the partial sort patch takes will work here as is. In the KNN-GiST case, the index is returning tuples roughly in the right order, but a tuple that it returns might in reality belong somewhere later in the ordering. In the partial sort patch, the input stream of tuples is divided into non-overlapping groups, so that the tuples within the group are not sorted, but the groups are. I think the partial sort case is a special case of the KNN-GiST case, if you consider the lower bound of each tuple to be the leading keys that you don't need to sort. Yes. But, for instance, btree accesses the heap for unique checking. Is it really so criminal? :-) Well, it is generally considered an ugly hack in b-tree too. I'm not 100% opposed to doing such a hack in GiST, but would very much prefer not to. This is not only a question of a new node or of extending an existing node. We need to teach the planner/executor that an access method can return the value of some expression which is a lower bound of another expression. AFAICS, an access method can now return only the original indexed datums and TIDs. So, I'm afraid that enormous infrastructure changes are required. And I can hardly imagine what they should look like. Yeah, I'm not sure either. Maybe a new field in IndexScanDesc, along with xs_itup. Or as an attribute of xs_itup itself. This shouldn't look like a hack either.
Otherwise I see no point in it: it's better to have an isolated hack in the access method than a hack in the planner/executor. So I see the following changes as needed to implement this the right way: 1) Implement a new relation between operators: operator1 is a lower bound of operator2. 2) Extend the am interface to let it return values of operators. 3) Implement a new node for knn-sorting. However, this requires a lot of changes to PostgreSQL infrastructure and may turn out not to be general enough either (we won't know until we have another application). -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part2: fast scan
On Mon, Feb 3, 2014 at 7:24 PM, Tomas Vondra t...@fuzzy.cz wrote: On 3 Únor 2014, 15:31, Alexander Korotkov wrote: I found my patch 0005-Ternary-consistent-implementation.patch to be completely wrong. It introduces the ternary consistent function to the opclass, but doesn't use it, because I forgot to include the ginlogic.c change in the patch. So it shouldn't make any impact on performance. However, the testing results with that patch differ significantly. That makes me very uneasy. Can we now reproduce the exact same results? Do I understand it correctly that the 0005 patch should give exactly the same performance as the 9.4-heikki branch (as it was applied on top of it, and effectively did no change)? This wasn't exactly what I measured, although the differences were not that significant. Do I understand correctly that it's 9.4-heikki and 9.4-alex-1 here: http://www.fuzzy.cz/tmp/gin/# For some queries the times differ. I wonder why. I can rerun the tests, if that's what you're asking for. I'll improve the test a bit - e.g. I plan to average multiple runs, to filter out random noise (which might be significant for such short queries). The right version of these two patches, combined into one against current head, is attached. I've rerun tests with it; results are in /mnt/sas-raid10/gin-testing/queries/9.4-fast-scan-10. Could you rerun postprocessing, including graph drawing? Yes, I'll do that. However I'll have to rerun the other tests too, because the previous runs were done on a different machine. I'm a bit confused right now. The previous patches (0005 + 0007) were supposed to be applied on top of the 4 from Heikki (0001-0004), right? AFAIK those were not committed yet, so why is this version against HEAD?
To summarize, I know of these patch sets:

9.4-heikki (old version)
  0001-Optimize-GIN-multi-key-queries.patch
  0002-Further-optimize-the-multi-key-GIN-searches.patch
  0003-Further-optimize-GIN-multi-key-searches.patch
  0004-Add-the-concept-of-a-ternary-consistent-check-and-us.patch
9.4-alex-1 (based on 9.4-heikki)
  0005-Ternary-consistent-implementation.patch
9.4-alex-1 (based on 9.4-alex-1)
  0006-Sort-entries.patch

From these patches I only need to compare 9.4-heikki (old version) and 9.4-alex-1 to resolve my doubts.

9.4-heikki (new version)
  gin-ternary-logic+binary-heap+preconsistent-only-on-new-page.patch
9.4-alex-2 (new version)
  gin-fast-scan.10.patch.gz

Or did I get that wrong? Only, you mentioned 9.4-alex-1 twice; I'm afraid there's some mess in the numbering. Sometimes test cases are not what we expect. For example:

=# explain SELECT id FROM messages WHERE body_tsvector @@ to_tsquery('english','(5alpha1-initdb''d)');
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on messages  (cost=84.00..88.01 rows=1 width=4)
   Recheck Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' & ''initdb'' & ''d'''::tsquery)
   ->  Bitmap Index Scan on messages_body_tsvector_idx  (cost=0.00..84.00 rows=1 width=0)
         Index Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' & ''initdb'' & ''d'''::tsquery)
 Planning time: 0.257 ms
(5 rows)

5alpha1-initdb'd is 3 gin entries with different frequencies. Why do you find that strange? The way the query is formed or the way it's evaluated? The query generator certainly is not perfect, so it may produce some strange queries. I just mean that in this case 3 words doesn't mean 3 gin entries. Also, these patches are not intended to change relevance-ordering speed. When the number of results is high, most of the time goes to relevance calculation and sorting. I propose to remove the ORDER BY clause from the test cases to see the scan speed more clearly. Sure, I can do that. Or maybe one set of queries with ORDER BY, the other one without it. Good. I've got a dump of postgresql.org search queries from Magnus.
We can add them to our test case. You mean the search queries from the mailing list archives search? Sure, we can add that. Yes. I'll transform it into tsquery and send it to you privately. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part2: fast scan
On Mon, Feb 3, 2014 at 8:19 PM, Tomas Vondra t...@fuzzy.cz wrote: Sometimes test cases are not what we expect. For example:

=# explain SELECT id FROM messages WHERE body_tsvector @@ to_tsquery('english','(5alpha1-initdb''d)');
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on messages  (cost=84.00..88.01 rows=1 width=4)
   Recheck Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' & ''initdb'' & ''d'''::tsquery)
   ->  Bitmap Index Scan on messages_body_tsvector_idx  (cost=0.00..84.00 rows=1 width=0)
         Index Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' & ''initdb'' & ''d'''::tsquery)
 Planning time: 0.257 ms
(5 rows)

5alpha1-initdb'd is 3 gin entries with different frequencies. Why do you find that strange? The way the query is formed or the way it's evaluated? The query generator certainly is not perfect, so it may produce some strange queries. I just mean that in this case 3 words doesn't mean 3 gin entries. Isn't that expected? I mean, that's what to_tsquery may do, right? Everything is absolutely correct. :-) It just may not be what you expect if you don't get into the details. -- With best regards, Alexander Korotkov.
Re: [HACKERS] KNN-GiST with recheck
On Tue, Jan 28, 2014 at 5:54 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/13/2014 07:17 PM, Alexander Korotkov wrote: Here goes a description of this patch, the same as in the original thread. KNN-GiST provides the ability to get ordered results from an index, but this order is based only on index information. For instance, a GiST index contains bounding rectangles for polygons, and we can't get the exact distance to a polygon from the index (the situation is similar in PostGIS). In the attached patch, the GiST distance method can set a recheck flag (similar to the consistent method). This flag means that the distance method returned a lower bound of the distance, and we should recheck it from the heap. See an example.

create table test as (select id, polygon(3+(random()*10)::int, circle(point(random(), random()), 0.0003 + random()*0.001)) as p from generate_series(1,100) id);
create index test_idx on test using gist (p);

We can get results ordered by distance from polygon to point.

postgres=# select id, p <-> point(0.5,0.5) from test order by p <-> point(0.5,0.5) limit 10;
   id   |       ?column?
--------+----------------------
 755611 | 0.000405855808916853
 807562 | 0.000464123777564343
 437778 | 0.000738524708741959
 947860 |  0.00076250998760724
 389843 | 0.000886362723569568
  17586 | 0.000981960100555216
 411329 |  0.00145338112316853
 894191 |  0.00149399559703506
 391907 |   0.0016647896049741
 235381 |  0.00167554614889509
(10 rows)

It's fast, using just an index scan.

                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230 rows=10 loops=1)
   ->  Index Scan using test_idx on test  (cost=0.29..157672.29 rows=100 width=36) (actual time=0.179..0.228 rows=10 loops=1)
         Order By: (p <-> '(0.5,0.5)'::point)
 Total runtime: 0.305 ms
(4 rows)

Nice! Some thoughts: 1. This patch introduces a new polygon <-> point operator. That seems useful on its own, with or without this patch. Yeah, but exact-knn can't come without at least one implementation. But it would better come as a separate patch. 2.
I wonder how useful it really is to allow mixing exact and non-exact return values from the distance function. The distance function included in the patch always returns recheck=true. I have a feeling that all other distance functions will also always return either true or false. For geometrical datatypes, recheck variations in consistent methods are also very rare (I can't remember any). But imagine an opclass for arrays where keys have different representations depending on array length. For such an opclass, with KNN on similarity, the recheck flag could be useful. 3. A binary heap would be a better data structure to buffer the rechecked values. A Red-Black tree allows random insertions and deletions, but in this case you need to insert arbitrary values but only remove the minimum item. That's exactly what a binary heap excels at. We have a nice binary heap implementation in the backend that you can use, see src/backend/lib/binaryheap.c. Hmm. For me, a binary heap would be a better data structure for KNN-GiST in general :-) 4. (as you mentioned in the other thread: ) It's a modularity violation that you peek into the heap tuple from gist. I think the proper way to do this would be to extend the IndexScan executor node to perform the re-shuffling of tuples that come from the index in the wrong order, or perhaps add a new node type for it. Of course that's exactly what your partial sort patch does :-). I haven't looked at that in detail, but I don't think the approach the partial sort patch takes will work here as is. In the KNN-GiST case, the index is returning tuples roughly in the right order, but a tuple that it returns might in reality belong somewhere later in the ordering. In the partial sort patch, the input stream of tuples is divided into non-overlapping groups, so that the tuples within a group are not sorted, but the groups are.
I think the partial sort case is a special case of the KNN-GiST case, if you consider the lower bound of each tuple to be the leading keys that you don't need to sort. Yes. But, for instance, btree accesses the heap for unique checking. Is it really so criminal? :-) This is not only a question of a new node or extending an existing node. We need to teach the planner/executor that an access method can return the value of some expression which is a lower bound of another expression. AFAICS, right now an access method can return only the original indexed datums and TIDs. So I'm afraid that enormous infrastructure changes are required, and I can hardly imagine what they should look like. -- With best regards, Alexander Korotkov.
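The invariant the patch relies on - the distance computed from the GiST key (a bounding rectangle) can never exceed the exact distance to the indexed polygon, so it is a valid lower bound for the recheck - can be demonstrated standalone. This is a simplified flat-geometry sketch, not the PostgreSQL/PostGIS code, and it compares squared distances to stay dependency-free:

```c
#include <assert.h>

typedef struct { double x, y; } Pt;
typedef struct { double xmin, ymin, xmax, ymax; } Box;

/* Squared distance from a point to an axis-aligned bounding box: this
 * is what the index can compute from the GiST key, and it is a lower
 * bound of the true distance to the indexed polygon. */
static double
box_dist2(Box b, Pt p)
{
    double dx = 0.0, dy = 0.0;

    if (p.x < b.xmin) dx = b.xmin - p.x;
    else if (p.x > b.xmax) dx = p.x - b.xmax;
    if (p.y < b.ymin) dy = b.ymin - p.y;
    else if (p.y > b.ymax) dy = p.y - b.ymax;
    return dx * dx + dy * dy;
}

/* Squared distance from a point to a segment (used by the recheck). */
static double
seg_dist2(Pt a, Pt b, Pt p)
{
    double vx = b.x - a.x, vy = b.y - a.y;
    double len2 = vx * vx + vy * vy;
    double t = len2 > 0.0 ? ((p.x - a.x) * vx + (p.y - a.y) * vy) / len2 : 0.0;
    double ex, ey;

    if (t < 0.0) t = 0.0;
    if (t > 1.0) t = 1.0;
    ex = a.x + t * vx - p.x;
    ey = a.y + t * vy - p.y;
    return ex * ex + ey * ey;
}

/* Exact squared distance from a point to a polygon boundary. */
static double
poly_dist2(const Pt *verts, int n, Pt p)
{
    double d = seg_dist2(verts[n - 1], verts[0], p);

    for (int i = 1; i < n; i++)
    {
        double di = seg_dist2(verts[i - 1], verts[i], p);

        if (di < d) d = di;
    }
    return d;
}
```

Because box_dist2 <= poly_dist2 always holds, the scan can order by the index value and recheck only when the flag says the value is merely a lower bound.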
Re: [HACKERS] GIN improvements part2: fast scan
count(*) from fts_test where fti @@ plainto_tsquery('english', 'gin index select');
 count
───────
   627
(1 row)

Time: 3,382 ms

I'm trying to get deeper into it. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part2: fast scan
On Sun, Jan 26, 2014 at 8:14 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: In addition to that, I'm using the ternary consistent function to check if minItem is a match, even if we haven't loaded all the entries yet. That's less important, but I think for something like rare1 | (rare2 & frequent) it might be useful. It would allow us to skip fetching 'frequent', when we already know that 'rare1' matches for the current item. I'm not sure if that's worth the cycles, but it seemed like an obvious thing to do, now that we have the ternary consistent function. So, that clearly isn't worth the cycles :-). At least not with an expensive consistent function; it might be worthwhile if we pre-build the truth-table, or cache the results of the consistent function. I believe caching consistent function results is much the same as a lazy truth-table. I think it's a good option to use with the two-state consistent function. However, I don't think it's a reason to give up the three-state consistent function, because the number of entries could be large. -- With best regards, Alexander Korotkov.
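For readers following along, the three-valued (Kleene) logic behind the ternary consistent function can be modeled like this. This is a toy model of the idea, not the actual GIN API; entries not yet loaded are passed as MAYBE, and Heikki's example query rare1 | (rare2 & frequent) resolves to TRUE as soon as rare1 is known to match:

```c
#include <assert.h>

typedef enum { GIN_FALSE, GIN_TRUE, GIN_MAYBE } GinTernaryValue;

/* Kleene three-valued AND: FALSE dominates, then MAYBE. */
static GinTernaryValue
tern_and(GinTernaryValue a, GinTernaryValue b)
{
    if (a == GIN_FALSE || b == GIN_FALSE) return GIN_FALSE;
    if (a == GIN_MAYBE || b == GIN_MAYBE) return GIN_MAYBE;
    return GIN_TRUE;
}

/* Kleene three-valued OR: TRUE dominates, then MAYBE. */
static GinTernaryValue
tern_or(GinTernaryValue a, GinTernaryValue b)
{
    if (a == GIN_TRUE || b == GIN_TRUE) return GIN_TRUE;
    if (a == GIN_MAYBE || b == GIN_MAYBE) return GIN_MAYBE;
    return GIN_FALSE;
}

/* Ternary "consistent" for the query  rare1 | (rare2 & frequent).
 * Entries we have not loaded yet are passed as GIN_MAYBE. */
static GinTernaryValue
consistent(GinTernaryValue rare1, GinTernaryValue rare2,
           GinTernaryValue frequent)
{
    return tern_or(rare1, tern_and(rare2, frequent));
}
```

If rare1 is TRUE, the result is TRUE regardless of the MAYBE inputs, so the expensive 'frequent' entry need not be fetched; if everything that matters is FALSE, the item can be skipped entirely.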
Re: [HACKERS] PoC: Partial sort
Hi! On Tue, Jan 21, 2014 at 3:24 AM, Marti Raudsepp ma...@juffo.org wrote: On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Tue, Jan 14, 2014 at 12:54 AM, Marti Raudsepp ma...@juffo.org wrote: I've been trying it out in a few situations. I implemented a new enable_partialsort GUC to make it easier to turn on/off, this way it's a lot easier to test. The attached patch applies on top of partial-sort-5.patch I thought about such an option - generally not because of testing convenience, but because of planning overhead. The way you implemented it is quite naive :) I don't understand. I had another look at this and cost_sort still seems like the best place to implement this, since that's where the patch decides how many pre-sorted columns to use. Both mergejoin and simple order-by plans call into it. If enable_partialsort=false then I skip all pre-sorted options except full sort, making cost_sort behave pretty much like it did before the patch. I could change pathkeys_common to return 0, but that seems like a generic function that shouldn't be tied to partialsort. The old code paths called pathkeys_contained_in anyway, which has similar complexity. (Apart from initial_cost_mergejoin, but that doesn't seem special enough to make an exception for). Or should I use: enable_partialsort ? pathkeys_common(...) : 0 ? For instance, merge join relies on partial sort, which will be replaced with a simple sort. Are you saying that enable_partialsort=off should keep partialsort-based mergejoins enabled? Or are you saying that merge joins shouldn't use simple sort at all? But merge join was already able to use full Sort nodes before your patch. Sorry that I didn't explain it. In particular, I mean the following: 1) With enable_partialsort = off, all mergejoin logic should behave as it did without the partial sort patch. 2) With the partial sort patch, the get_cheapest_fractional_path_for_pathkeys function is much more expensive to execute.
With enable_partialsort = off it should be as cheap as without the partial sort patch. I'll try to implement this option this week. For now, I have attempted to fix the extra-columns-in-mergejoin problem. It would be nice if you could test it. -- With best regards, Alexander Korotkov. partial-sort-7.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
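As context for this thread: the core of partial sort is that the input already arrives sorted on the leading key(s), so only runs of rows with equal leading keys need sorting. A toy model of that idea (illustrative only, not the patch's executor code):

```c
#include <stdlib.h>
#include <assert.h>

typedef struct { int a; int b; } Row;   /* input is presorted by a */

/* Comparator for the trailing key b. */
static int
cmp_b(const void *x, const void *y)
{
    const Row *rx = x, *ry = y;

    return (rx->b > ry->b) - (rx->b < ry->b);
}

/* Sort rows by (a, b), exploiting that they already arrive sorted by
 * a: find each run of equal a and sort only within the run.  Each run
 * can be sorted (and emitted) without seeing the rest of the input. */
static void
partial_sort(Row *rows, int n)
{
    int     start = 0;

    for (int i = 1; i <= n; i++)
    {
        if (i == n || rows[i].a != rows[start].a)
        {
            qsort(rows + start, i - start, sizeof(Row), cmp_b);
            start = i;
        }
    }
}
```

This also shows why the KNN-GiST case is harder: here the group boundaries are exact (runs of equal leading keys), whereas with a lower-bound distance a tuple may still be overtaken by a later one, so a heap must be kept instead of discrete groups.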
Re: [HACKERS] PoC: Partial sort
On Tue, Jan 28, 2014 at 7:41 AM, Marti Raudsepp ma...@juffo.org wrote: On Mon, Jan 27, 2014 at 9:26 PM, Alexander Korotkov aekorot...@gmail.com wrote: For now, I have attempted to fix the extra-columns-in-mergejoin problem. It would be nice if you could test it. Yes, it solves the test cases I was trying with, thanks. 1) With enable_partialsort = off, all mergejoin logic should behave as it did without the partial sort patch. 2) With the partial sort patch, the get_cheapest_fractional_path_for_pathkeys function is much more expensive to execute. With enable_partialsort = off it should be as cheap as without the partial sort patch. When it comes to planning time, I really don't think you should bother. The planner enable_* settings are meant for troubleshooting, debugging and learning about the planner. You should not expect people to disable them in a production setting. It's not worth complicating the code for that rare case. This is stated in the documentation (http://www.postgresql.org/docs/current/static/runtime-config-query.html) and repeatedly on the mailing lists. But some benchmarks of planning performance are certainly warranted. I didn't test it, but I worry that the overhead might be high. If that's true, then it could be like the constraint_exclusion option, which is off by default because of planning overhead. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On Thu, Jan 23, 2014 at 9:27 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Wed, Jan 22, 2014 at 9:28 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 02:17 PM, Alexander Korotkov wrote: We already spent a lot of time with compression. Now we need to figure out the result we want to see. I spent quite a long time debugging the varbyte encoding without segments. Finally, it passed very many tests without any problems. Now it is just a piece of junk. I'm afraid that we will have to reimplement everything from scratch another two or three times because the code doesn't look perfect. For sure, that's incompatible with getting something into 9.4. That's a bit harsh :-). But yes, I understand what you're saying. It's quite common for large patches like this to be rewritten several times before being committed; you just don't know what works best until you've tried many designs. Remember we also have the subsequent fast-scan, which is badly needed for hstore and other applications. I propose to make final decisions now and concentrate our forces on making a committable patch with these decisions. And don't change these decisions even if somebody has an idea how to improve code readability 100 times and potential extendability 1000 times. I propose the following decisions: 1) Leave the uncompressed area, allow duplicates in it 2) Don't introduce Items on the page. Well, I got the insertions to work now without the uncompressed area, so in the spirit of Just Getting This Damn Patch Committed, I'm going to go ahead with that. We can add the uncompressed area later if performance testing shows it to be necessary. And I agree about the Items on page idea - we can come back to that too still in the 9.4 timeframe if necessary, but probably not. So, committed. It's the same design as in the most recent patch I posted, but with a bunch of bugs fixed, and cleaned up all over.
I'm going to move to the fast scan patch now, but please do test and review the committed version to see if I missed something. Great! Thanks a lot! The assertion in dataPlaceToPageLeafRecompress is too strong: a page can contain GinDataLeafMaxContentSize bytes. A patch is attached. My test-suite doesn't run correctly; I'm debugging now. It seems I found this bug: ginVacuumPostingTreeLeaf re-encodes only some segments. Others are not even re-palloced; they are moved left in dataPlaceToPageLeafRecompress by memcpy. But it's incorrect to do that with memcpy; the proper solution is memmove. Using memcpy there can lead to platform-dependent page corruption. Another solution is to re-palloc the segments in ginVacuumPostingTreeLeaf. -- With best regards, Alexander Korotkov. gin-memmove-fix.patch Description: Binary data
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Jan 24, 2014 at 12:50 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/24/2014 10:03 AM, Alexander Korotkov wrote: It seems I found this bug: ginVacuumPostingTreeLeaf re-encodes only some segments. Others are not even re-palloced; they are moved left in dataPlaceToPageLeafRecompress by memcpy. But it's incorrect to do that with memcpy; the proper solution is memmove. Using memcpy there can lead to platform-dependent page corruption. Another solution is to re-palloc the segments in ginVacuumPostingTreeLeaf. Good catch. Thanks, committed, changing memcpy to memmove. Will have to switch to pallocing everything in the future, if we make leafRepackItems smarter, so that it doesn't rewrite all the segments after the first modified one. OK. What about the previous fix to the assertion? -- With best regards, Alexander Korotkov.
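The committed fix can be illustrated in isolation: when segments are shifted left within the same page buffer, source and destination overlap, and memcpy has undefined behavior for overlapping regions while memmove is defined to work. A standalone demonstration, not the GIN code:

```c
#include <string.h>
#include <assert.h>

/* Shift the tail of a buffer left by 'gap' bytes, in place.  The
 * source (buf + gap) and destination (buf) overlap whenever gap < len,
 * so memmove is required: it copies as if through a temporary buffer.
 * memcpy has undefined behavior when the regions overlap, which may
 * happen to "work" on one platform and corrupt data on another. */
static void
shift_left(char *buf, size_t gap, size_t len)
{
    memmove(buf, buf + gap, len);
    /* memcpy(buf, buf + gap, len);   -- undefined behavior here */
}
```

This is exactly the situation in dataPlaceToPageLeafRecompress: untouched segments after a re-encoded one are slid toward the start of the page, within the same buffer.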
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Jan 24, 2014 at 1:19 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/24/2014 10:53 AM, Alexander Korotkov wrote: OK. What about the previous fix to the assertion? Ah right, fixed that too now. Good, now my test-suite passes. The results are as follows.

Time of operations
         event         |     period
-----------------------+-----------------
 index_build           | 00:01:45.709546
 index_build_recovery  | 00:00:09
 index_update          | 00:05:45.732214
 index_update_recovery | 00:01:17
 search_new            | 00:24:02.191049
 search_updated        | 00:26:54.122852
(6 rows)

Index sizes
     label     |   size
---------------+-----------
 new           | 387547136
 after_updates | 610877440
(2 rows)

Before compression the results were the following.

Time of operations
         event         |     period
-----------------------+-----------------
 index_build           | 00:01:51.116722
 index_build_recovery  | 00:00:08
 index_update          | 00:05:12.124758
 index_update_recovery | 00:01:43
 search_new            | 00:23:44.281424
 search_updated        | 00:25:41.96251
(6 rows)

Index sizes
     label     |    size
---------------+------------
 new           | 884514816
 after_updates | 1595252736
(2 rows)

So, we can see a small regression in speed. However, I'm not sure it's very significant. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part2: fast scan
On Thu, Jan 23, 2014 at 8:22 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/14/2014 05:35 PM, Alexander Korotkov wrote: The attached version is rebased against the last version of packed posting lists. Thanks! I think we're missing a trick with multi-key queries. We know that when multiple scan keys are used, they are ANDed together, so we can do the skip optimization even without the new tri-state consistent function. To get started, I propose the three attached patches. These only implement the optimization for the multi-key case, which doesn't require any changes to the consistent functions and hence no catalog changes. Admittedly this isn't anywhere near as useful in practice as the single key case, but let's go for the low-hanging fruit first. This nevertheless introduces some machinery that will be needed by the full patch anyway. I structured the code somewhat differently than your patch. There is no separate fast-path for the case where the optimization applies. Instead, I'm passing the advancePast variable all the way down to where the next batch of items are loaded from the posting tree. keyGetItem is now responsible for advancing the entry streams, and the logic in scanGetItem has been refactored so that it advances advancePast aggressively, as soon as one of the key streams lets us conclude that no items before a certain point can match. scanGetItem might yet need to be refactored when we get to the full preconsistent check stuff, but one step at a time. The first patch is the most interesting one, and contains the scanGetItem changes. The second patch allows seeking to the right segment in a posting tree page, and the third allows starting the posting tree scan from the root, when skipping items (instead of just following the right-links). Here are some simple performance test results, demonstrating the effect of each of these patches. This is a best-case scenario.
I don't think these patches have any adverse effects even in the worst-case scenario, although I haven't actually tried hard to measure that. I used this to create a test table:

create table foo (intarr int[]);
-- Every row contains 0 (frequent term), and a unique number.
insert into foo select array[0,g] from generate_series(1, 1000) g;
-- Add another tuple with 0, 1 combo physically to the end of the table.
insert into foo values (array[0,1]);

The query I used is this:

postgres=# select count(*) from foo where intarr @> array[0] and intarr @> array[1];
 count
-------
     2
(1 row)

I measured the time that query takes, and the number of pages hit, using explain (analyze, buffers true):

 patches       | time (ms) | buffers
---------------+-----------+---------
 unpatched     | 650       | 1316
 patch 1       | 0.52      | 1316
 patches 1+2   | 0.50      | 1316
 patches 1+2+3 | 0.13      | 15

So, the second patch isn't doing much in this particular case. But it's trivial, and I think it will make a difference in other queries where you have the opportunity to skip, but return a lot of tuples overall. In summary, these are fairly small patches, and useful on their own, so I think these should be committed now. But please take a look and see if the logic in scanGetItem/keyGetItem looks correct to you. After this, I think the main fast scan logic will go into keyGetItem. Good, thanks! Now I can reimplement fast scan based on these patches. PS. I find it a bit surprising that in your patch, you're completely bailing out if there are any partial-match keys involved. Is there some fundamental reason for that, or is it just not implemented? Just not implemented. I think there are two possible approaches to handle it: 1) Handle partial-match keys like OR on matching keys. 2) Implement keyAdvancePast for the bitmap. The first approach seems useful with a low number of keys. Probably, we should implement automatic switching between them. -- With best regards, Alexander Korotkov.
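The skip optimization for ANDed keys can be modeled with plain sorted arrays: instead of advancing one item at a time, each stream is advanced past the other's current candidate, analogous to the advancePast variable described above. An illustrative sketch with invented names, using galloping plus binary search for the skip:

```c
#include <assert.h>

/* Advance within a sorted list from 'pos' to the first index whose
 * value is >= target.  Gallop (double the step) until we overshoot,
 * then binary-search inside the last bracket. */
static int
advance_past(const int *list, int n, int pos, int target)
{
    int     step = 1, lo = pos, hi = pos;

    while (hi < n && list[hi] < target)
    {
        lo = hi;
        hi += step;
        step *= 2;
    }
    if (hi > n) hi = n;
    while (lo < hi)
    {
        int     mid = (lo + hi) / 2;

        if (list[mid] < target) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/* AND (intersect) two sorted posting lists.  The rarer stream drives,
 * and the frequent one is skipped over in large strides rather than
 * being scanned item by item. */
static int
intersect(const int *a, int na, const int *b, int nb, int *out)
{
    int     i = 0, j = 0, nout = 0;

    while (i < na && j < nb)
    {
        if (a[i] == b[j]) { out[nout++] = a[i]; i++; j++; }
        else if (a[i] < b[j]) i = advance_past(a, na, i, b[j]);
        else j = advance_past(b, nb, j, a[i]);
    }
    return nout;
}
```

In the real patches the "skip" is even cheaper, because the posting tree lets the scan seek to the right segment or restart from the root instead of touching the intervening pages at all.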
Re: [HACKERS] GIN improvements part2: fast scan
On Fri, Jan 24, 2014 at 6:48 AM, Tomas Vondra t...@fuzzy.cz wrote: I plan to do more thorough testing over the weekend, but I'd like to make sure I understand what to expect. My understanding is that this patch should: - give the same results as the current code (e.g. the fulltext search should not return different rows / change the ts_rank etc.) - improve the performance of fulltext queries Are there any obvious rules about which queries will benefit most from this? The queries generated by the tool I'm using for testing are mostly of this form: SELECT id FROM messages WHERE body_tsvector @@ plainto_tsquery('english', 'word1 word2 ...') ORDER BY ts_rank(...) DESC LIMIT :n; with varying numbers of words and LIMIT values. During the testing today I haven't noticed any obvious performance difference, but I haven't spent much time on that. These patches optimize only queries with multiple WHERE clauses. For instance: SELECT id FROM messages WHERE body_tsvector @@ plainto_tsquery('english', 'word1') AND body_tsvector @@ plainto_tsquery('english', 'word2') ORDER BY ts_rank(...) DESC LIMIT :n; Optimizations inside a single clause will be provided as a separate patch. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Jan 22, 2014 at 12:30 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 09:25 AM, Alexander Korotkov wrote: On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/21/2014 11:35 AM, Heikki Linnakangas wrote: Oh, I see what's going on. I had assumed that there cannot be duplicate insertions into the posting tree, but that's dead wrong. The fast insertion mechanism depends on a duplicate insertion to do nothing. Ok, this turned out to be a much bigger change than I thought... In principle, it's not difficult to eliminate duplicates. However, to detect a duplicate, you have to check if the item is present in the uncompressed items array, or in the compressed lists. That requires decoding the segment where it should be. But if we decode the segment, what's the purpose of the uncompressed items array? Its purpose was to speed up insertions, by buffering them so that we don't need to decode and re-encode the page/segment on every inserted item. But if we're paying the price of decoding it anyway, we might as well include the new item and re-encode the segment. The uncompressed area saves some effort in WAL-logging, as the record of inserting an entry into the uncompressed area is much smaller than that of re-encoding part of the page, but if that really is a concern, we could track more carefully which parts of the page are modified, and only WAL-log the required parts. And hopefully, the fast-update lists buffer inserts so that you insert many items at a time to the posting tree, and the price of re-encoding is only paid once. So, now that I think about this once more, I don't think the uncompressed area in every leaf page is a good idea. I didn't get it: why not insert duplicate TIDs into the uncompressed area and eliminate them during re-encoding? It would be somewhat more work during updates, more work during scans, and more WAL records. But is it really significant for real-life workloads?
Hmm, so you would merrily insert duplicate TIDs into the uncompressed area, and weed them out when reading or recompressing the page? I had not thought of that. Yeah, it might be a good trade-off; duplicates are not expected to happen very often. I refactored the way the recompression routine works again. It is now more clearly a multi-step process. First, the existing page is disassembled into an in-memory struct, which is a linked list of all the segments. In memory, each segment can be represented as an array of item pointers, or in the compressed format. In the next phase, all the new items are added to the segments where they belong. This naturally can lead to overly large segments; in the third phase, the items are redistributed among the segments, and compressed back to the compressed format. Finally, the finished segments are written back to the page, or pages if it had to be split. The same subroutines to disassemble and recompress are also used in vacuum. Attached is what I've got now. This is again quite heavily changed from the previous version - sorry for repeatedly rewriting this. I'll continue testing and re-reviewing this tomorrow. Let's clarify where we are. We had a well-debugged and tested version with hard-wired varbyte encoding. Now we're reimplementing and debugging a segmented version of varbyte encoding, in the hope that one day we can easily replace it with something better that meets our restrictions (but we haven't found it yet). Is that right? The segmentation was a sensible change on code-readability grounds alone. Yes, it made it easier to experiment with different encodings, and will make it easier to replace the encoding in the future, but that wasn't the only reason for doing it. It's nice to have the encoding/decoding stuff in ginpostinglists.c, so that the rest of the code just passes around opaque GinPostingList structs (previously known as PostingListSegment).
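For readers following the encoding discussion: the varbyte scheme stores each item as a delta from its predecessor, with seven payload bits per byte and the high bit as a continuation flag. A simplified standalone sketch (the real code in ginpostinglists.c packs 6-byte ItemPointers; plain integers are used here):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Encode an ascending sequence as varbyte-coded deltas.  Returns the
 * number of bytes written to 'out'. */
static size_t
encode_deltas(const uint64_t *items, int n, unsigned char *out)
{
    size_t  len = 0;
    uint64_t prev = 0;

    for (int i = 0; i < n; i++)
    {
        uint64_t delta = items[i] - prev;   /* items must be ascending */

        prev = items[i];
        while (delta >= 0x80)
        {
            out[len++] = (unsigned char) (delta | 0x80); /* continuation */
            delta >>= 7;
        }
        out[len++] = (unsigned char) delta;              /* final byte */
    }
    return len;
}

/* Decode back into absolute values.  Returns the item count. */
static int
decode_deltas(const unsigned char *in, size_t len, uint64_t *out)
{
    int     n = 0;
    uint64_t val = 0, cur = 0;
    int     shift = 0;

    for (size_t i = 0; i < len; i++)
    {
        cur |= (uint64_t) (in[i] & 0x7F) << shift;
        if (in[i] & 0x80)
            shift += 7;             /* more bytes of this delta follow */
        else
        {
            val += cur;             /* delta complete: accumulate */
            out[n++] = val;
            cur = 0;
            shift = 0;
        }
    }
    return n;
}
```

Small deltas (dense posting lists) compress to one byte per item, which is why the encoding pays off; the segmenting discussed above bounds how much of this stream must be re-encoded on an insert.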
One thing I have been pondering, though: Instead of storing the posting lists one after each other on the leaf page, it might be better to store them as Items on the page, with normal ItemIds pointing to each. So the page layout would be more standard, and you could use PageAddItem/PageIndexMultiDelete to add/remove posting lists to a page. The immediate advantage of that would be that it would make it possible to do a binary search of the segments, to quickly locate the right segment where a tuple belongs. That might not be significant in practice - linearly scanning 32 items is not slow either. And it would add some overhead, one line pointer per segment, 4 * 32 = 128 bytes per page with the current segment size of 256 bytes. But then again, it might be a good idea just because it would make the pages look more like any other page, which is generally a good thing. We already spent a lot of time with compression. Now we need to figure out the result we want to see. I spent quite
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Jan 22, 2014 at 9:28 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 02:17 PM, Alexander Korotkov wrote: We already spent a lot of time with compression. Now we need to figure out the result we want to see. I spent quite a long time debugging the varbyte encoding without segments. Finally, it passed very many tests without any problems. Now it is just a piece of junk. I'm afraid that we will have to reimplement everything from scratch another two or three times because the code doesn't look perfect. For sure, that's incompatible with getting something into 9.4. That's a bit harsh :-). But yes, I understand what you're saying. It's quite common for large patches like this to be rewritten several times before being committed; you just don't know what works best until you've tried many designs. Remember we also have the subsequent fast-scan, which is badly needed for hstore and other applications. I propose to make final decisions now and concentrate our forces on making a committable patch with these decisions. And don't change these decisions even if somebody has an idea how to improve code readability 100 times and potential extendability 1000 times. I propose the following decisions: 1) Leave the uncompressed area, allow duplicates in it 2) Don't introduce Items on the page. Well, I got the insertions to work now without the uncompressed area, so in the spirit of Just Getting This Damn Patch Committed, I'm going to go ahead with that. We can add the uncompressed area later if performance testing shows it to be necessary. And I agree about the Items on page idea - we can come back to that too still in the 9.4 timeframe if necessary, but probably not. So, committed. It's the same design as in the most recent patch I posted, but with a bunch of bugs fixed, and cleaned up all over. I'm going to move to the fast scan patch now, but please do test and review the committed version to see if I missed something. Great! Thanks a lot!
The assertion in dataPlaceToPageLeafRecompress is too strong: a page can contain GinDataLeafMaxContentSize bytes. A patch is attached. My test-suite doesn't run correctly; I'm debugging now. -- With best regards, Alexander Korotkov. gin-assert-fix.patch Description: Binary data
Re: [HACKERS] GIN improvements part 1: additional information
On Mon, Jan 20, 2014 at 10:30 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 01/17/2014 08:49 PM, Alexander Korotkov wrote: On Fri, Jan 17, 2014 at 10:38 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/17/2014 01:05 PM, Alexander Korotkov wrote: Seems to be fixed in the attached version of the patch. Another improvement in this version: only the modified segments and those after them are re-encoded; the left part of the page is left untouched.

I'm looking into this now. A few quick notes:

* Even when you don't re-encode the whole page, you still WAL-log the whole page. While correct, it'd be a pretty obvious optimization to only WAL-log the modified part.

Oh, I overlooked that. I wrote code in ginRedoInsertData which finds the appropriate place for the data, but didn't notice that I still wrote the whole page to the xlog record.

Yeah. I think ginRedoInsertData was too cute to be bug-free. I added an explicit unmodifiedsize field to the WAL record, so that ginRedoInsertData doesn't need to calculate it.

* When new items are appended to the end of the page so that only the last existing compressed segment is re-encoded, and the page has to be split, the items aren't divided 50/50 between the pages. The loop that moves segments destined for the left page to the right won't move any existing, untouched segments.

I think this loop will move one segment. However, that's too little.

Implemented this.

I noticed that the gin vacuum redo routine is dead code, except for the data-leaf page handling, because we never remove entries or internal nodes (page deletion is a separate WAL record type). And the data-leaf case is functionally equivalent to heap newpage records. I removed the dead code and made it clearer that it resembles heap newpage.

Attached is yet another version, with more bugs fixed and more comments added and updated. I would appreciate some heavy testing of this patch now. If you could re-run the tests you've been using, that would be great.
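The unmodifiedsize optimization discussed above — WAL-logging only the modified tail of the page — can be sketched like this. This is a simplified model with invented names, not the actual GIN WAL record format:

```python
def make_wal_record(old_page: bytes, new_page: bytes):
    """Log only the modified tail of the page: measure the length of the
    unchanged prefix (unmodifiedsize) and store just the bytes after it."""
    unmodified = 0
    limit = min(len(old_page), len(new_page))
    while unmodified < limit and old_page[unmodified] == new_page[unmodified]:
        unmodified += 1
    return {"unmodifiedsize": unmodified, "tail": new_page[unmodified:]}

def redo(old_page: bytes, record) -> bytes:
    """Replay: keep the unmodified prefix of the on-disk page and append
    the logged tail, reconstructing the new page image."""
    return old_page[:record["unmodifiedsize"]] + record["tail"]
```

Carrying unmodifiedsize explicitly in the record means the redo routine never has to re-derive which part of the page changed, which is exactly the class of "too cute" calculation the message above is warning about.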
I've tested the WAL replay by replicating GIN operations over streaming replication. That doesn't guarantee it's correct, but it's a good smoke test.

I tried my test-suite, but it hangs in an infinite loop on an index scan. I re-tried it on my laptop with -O0. I found it to crash on update and vacuum in some random places, like:

Assert(GinPageIsData(page)); in xlogVacuumPage
Assert(ndecoded == totalpacked); in ginCompressPostingList

Trying to debug it. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GIN improvements part 1: additional information
On Tue, Jan 21, 2014 at 4:28 PM, Alexander Korotkov aekorot...@gmail.com wrote:

I noticed that the gin vacuum redo routine is dead code, except for the data-leaf page handling, because we never remove entries or internal nodes (page deletion is a separate WAL record type). And the data-leaf case is functionally equivalent to heap newpage records. I removed the dead code and made it clearer that it resembles heap newpage. Attached is yet another version, with more bugs fixed and more comments added and updated. I would appreciate some heavy testing of this patch now. If you could re-run the tests you've been using, that would be great.

I've tested the WAL replay by replicating GIN operations over streaming replication. That doesn't guarantee it's correct, but it's a good smoke test. I tried my test-suite, but it hangs in an infinite loop on an index scan. I re-tried it on my laptop with -O0. I found it to crash on update and vacuum in some random places, like: Assert(GinPageIsData(page)); in xlogVacuumPage Assert(ndecoded == totalpacked); in ginCompressPostingList Trying to debug it.

Another question is about dataPlaceToPageLeaf:

    while ((Pointer) seg < segend)
    {
        if (ginCompareItemPointers(minNewItem, seg->first) < 0)
            break;

Shouldn't we adjust seg to the previous segment? If minNewItem is less than seg->first, we should insert it into the previous segment. -- With best regards, Alexander Korotkov.
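The question above — stepping back to the previous segment when minNewItem sorts before the current segment's first item — amounts to choosing the last segment whose first item does not exceed the new item. A simplified illustration of that selection logic (not the actual dataPlaceToPageLeaf code; the function name is invented):

```python
def find_target_segment(segment_firsts, new_item):
    """Choose the segment a new item belongs to: the last segment whose
    first item is <= new_item. Items that sort before every segment's
    first item fall into segment 0.
    segment_firsts: first item of each segment, in ascending order."""
    target = 0
    for i, first in enumerate(segment_firsts):
        if new_item < first:
            break          # new_item precedes this segment's range...
        target = i         # ...so it belongs to the previous one
    return target
```

The subtlety the email points at: stopping the scan at the first segment whose first item exceeds the new item is not enough by itself — the insertion target is the segment *before* the one the loop broke on.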