[HACKERS] GIN - Generalized Inverted iNdex. Try 2.
We (me and Oleg) are glad to present GIN to PostgreSQL. If community will agree, we will commit it to HEAD branch. http://www.sigaev.ru/gin/gin.gz http://www.sigaev.ru/gin/README.txt Install: % cd pgsql % zcat gin.gz | patch -p0 make and initdb, install tsearch2 Changes from previous patch: * add support for tsearch2 * add 'fuzzy' limit * fixes README: Gin for PostgreSQL Gin was sponsored by jfg://networks (http://www.jfg-networks.com/) Gin stands for Generalized Inverted Index and should be considered as a genie, not a drink. Generalized means that index doesn't know what operation it accelerates, it works with strategies, defined for specific data type (read Index Method Strategies chapter in PostgreSQL documentation). In that sense, gin is similar to the GiST and differs from btree index, which has predefined comparison based operations. Inverted index is an index structure storing a (key,posting list) pairs, where posting list is a set of documents in which key occurs (document, usually, contains many keys). The primary goal of the Gin index is a support for scalable full text search in PostgreSQL. Gin is consists of a B-tree constructed over entries (ET, entries tree), where entry is an element of indexed value ( element of array, lexeme for tsvector) and where each tuple in the leaf pages is either pointer to B-tree over item pointers (PT, posting tree), or list of item pointers (PL, posting list) if tuple is small enough. Notes: There is no delete operation for ET. The reason for this, is that from our experience, a set of unique words of a whole collection changed very rare. This greatly simplify code and concurrency algorithm. Gin comes with built-in support for one dimensional arrays ( integer[], text[], no support for NULL elements) and following operations: * contains : value_array @ query_array * overlap : value_array query_array * contained: value_array ~ query_array Synopsis =# create index txt_idx on aa using gin(a); Features * Concurrency * WAL-logging (recoverable) * user-defined opclass, the scheme is similar to GiST * optimized index creation (make use of maintenance_work_mem to accumulate postings in memory) * tsearch2 opclass * soft upper limit of the returned results set using GUC variable gin_fuzzy_search_limit Gin fuzzy limit There are often situations, when full text search returns a very big set of results, which is very difficult to manage, since reading tuples from disk and their ordering could takes a lot of time, which is unacceptable for the production (notice, that search itself is very fast). Such queries are usually contain very frequent lexemes, so results are not very helpful. To facilitate execution of such queries we introduced a configurable soft upper limit of the size of the returned set - GUC variable gin_fuzzy_search_limit, which is 0 on default (no limitation). This subset randomly chosen from the whole result set. Soft means that the actual number of returned results could slightly differs from this limit, depending on the query and the quality of system random generator. From our experience, we found that value about several thousands (5000-2) is ok, i.e., gin fuzzy limit will have no effects for queries returning result set lesser than this number. Limitations * no support for multicolumn indices * Gin doesn't uses scan-kill_prior_tuple scan-ignore_killed_tuples * Gin searches entry only by equality matching, this may be improved in future * Gin doesn't supports full scan of index * Gin doesn't index NULL value Gin interface OpClass interface (pseudocode). Example for opclas is in ginarayproc.c. Datum* extractValue(Datum inputValue, uint32* nentries) Returns array of Datum of entries of value to be indexed, nentries should contains number of returning entries int compareEntry( Datum a, Datum b ) Compares two entries (not the indexing values!) Datum* extractQuery(Datum query, uint32* nentries, StrategyNumber n) Returns array of Datum of entries of query to be searched, n contains Strategy number of operation. bool consistent( bool[] check, StrategyNumber n, Datum query) The size of check array is the same as sizeof of array returned by extractQuery. Each element of check array is true if indexed value has corresponding entry in the query, i.e. if check[i] == TRUE then i-th entry of query is presented in indexed value. Function should returns true if indexed value matches by StrategyNumber and query. Open items We appreciate any comments, help and recommendations. * teach optimizer/executor, that GIN is intrinsically clustered, i.e., it always returns ItemPointer in ascending order. * tweak gincostestimate * GIN stores several ItemPointer to heap tuple, so vacuum full produces warning message: WARNING: index idx contains 88395 row versions, but table contains 51812 row versions HINT: Rebuild the index with REINDEX.
Re: [HACKERS] GIN - Generalized Inverted iNdex. Try 2.
On Wed, 26 Apr 2006, Teodor Sigaev wrote: We (me and Oleg) are glad to present GIN to PostgreSQL. If community will agree, we will commit it to HEAD branch. http://www.sigaev.ru/gin/gin.gz http://www.sigaev.ru/gin/README.txt Install: % cd pgsql % zcat gin.gz | patch -p0 make and initdb, install tsearch2 I just built this, and noticed that the regression test for opr_sanity fails with your patch. I attached the regression.diffs. -- BOFH excuse #85: Windows 95 undocumented feature *** ./expected/opr_sanity.out Wed Jan 25 18:35:51 2006 --- ./results/opr_sanity.outWed Apr 26 08:31:13 2006 *** *** 778,785 WHERE p4.amopclaid = p2.oid AND p4.amopsubtype = p3.amopsubtype); oid | amname | oid | opcname | amopsubtype ! -++-+-+- ! (0 rows) -- Check that amopopr points at a reasonable-looking operator, ie a binary -- operator yielding boolean. --- 778,791 WHERE p4.amopclaid = p2.oid AND p4.amopsubtype = p3.amopsubtype); oid | amname | oid | opcname | amopsubtype ! --++--+---+- ! 2742 | gin| 2745 | _int4_ops | 0 ! 2742 | gin| 2745 | _int4_ops | 0 ! 2742 | gin| 2745 | _int4_ops | 0 ! 2742 | gin| 2746 | _text_ops | 0 ! 2742 | gin| 2746 | _text_ops | 0 ! 2742 | gin| 2746 | _text_ops | 0 ! (6 rows) -- Check that amopopr points at a reasonable-looking operator, ie a binary -- operator yielding boolean. *** *** 825,831 783 | 10 | | 783 | 11 | | 783 | 12 | | ! (24 rows) -- Check that all operators linked to by opclass entries have selectivity -- estimators. This is not absolutely required, but it seems a reasonable --- 831,840 783 | 10 | | 783 | 11 | | 783 | 12 | | ! 2742 |1 | ! 2742 |2 | @ ! 2742 |3 | ~ ! (27 rows) -- Check that all operators linked to by opclass entries have selectivity -- estimators. This is not absolutely required, but it seems a reasonable *** *** 847,854 WHERE p1.amopopr = p2.oid AND p1.amopclaid = p3.oid AND NOT binary_coercible(p3.opcintype, p2.oprleft); amopclaid | amopopr | oid | oprname | opcname ! ---+-+-+-+- ! (0 rows) SELECT p1.amopclaid, p1.amopopr, p2.oid, p2.oprname, p3.opcname FROM pg_amop AS p1, pg_operator AS p2, pg_opclass AS p3 --- 856,869 WHERE p1.amopopr = p2.oid AND p1.amopclaid = p3.oid AND NOT binary_coercible(p3.opcintype, p2.oprleft); amopclaid | amopopr | oid | oprname | opcname ! ---+-+--+-+--- ! 2746 |2750 | 2750 | | _text_ops ! 2745 |2750 | 2750 | | _int4_ops ! 2746 |2751 | 2751 | @ | _text_ops ! 2745 |2751 | 2751 | @ | _int4_ops ! 2746 |2752 | 2752 | ~ | _text_ops ! 2745 |2752 | 2752 | ~ | _int4_ops ! (6 rows) SELECT p1.amopclaid, p1.amopopr, p2.oid, p2.oprname, p3.opcname FROM pg_amop AS p1, pg_operator AS p2, pg_opclass AS p3 == ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] GIN - Generalized Inverted iNdex. Try 2.
I just built this, and noticed that the regression test for opr_sanity fails with your patch. I attached the regression.diffs. Sorry, this part isn't done yet, because we are waiting of community decision.. We don't add regression test yet. If community don't like to include GIN in core, we make a contrib/gin, but in this case GIN can't use WAL feature because of WAL interface can't call user-defined function. The reason for first diff is a hardcoded 'gist' index: -- We have to exclude GiST, unfortunately, since it hasn't got any fixed -- requirements about strategy operators. SELECT p1.oid, p1.amname, p2.oid, p2.opcname, p3.amopsubtype FROM pg_am AS p1, pg_opclass AS p2, pg_amop AS p3 WHERE p2.opcamid = p1.oid AND p3.amopclaid = p2.oid AND p1.amname != 'gist' AND p1.amstrategies != (SELECT count(*) FROM pg_amop AS p4 WHERE p4.amopclaid = p2.oid AND p4.amopsubtype = p3.amopsubtype); Second is right diff. For the thread one reason is that operations , ~, @ defined for anyarray, but used for particular types. *** ./expected/opr_sanity.out Wed Jan 25 18:35:51 2006 --- ./results/opr_sanity.outWed Apr 26 08:31:13 2006 *** *** 778,785 WHERE p4.amopclaid = p2.oid AND p4.amopsubtype = p3.amopsubtype); oid | amname | oid | opcname | amopsubtype ! -++-+-+- ! (0 rows) -- Check that amopopr points at a reasonable-looking operator, ie a binary -- operator yielding boolean. --- 778,791 WHERE p4.amopclaid = p2.oid AND p4.amopsubtype = p3.amopsubtype); oid | amname | oid | opcname | amopsubtype ! --++--+---+- ! 2742 | gin| 2745 | _int4_ops | 0 ! 2742 | gin| 2745 | _int4_ops | 0 ! 2742 | gin| 2745 | _int4_ops | 0 ! 2742 | gin| 2746 | _text_ops | 0 ! 2742 | gin| 2746 | _text_ops | 0 ! 2742 | gin| 2746 | _text_ops | 0 ! (6 rows) -- Check that amopopr points at a reasonable-looking operator, ie a binary -- operator yielding boolean. *** *** 825,831 783 | 10 | | 783 | 11 | | 783 | 12 | | ! (24 rows) -- Check that all operators linked to by opclass entries have selectivity -- estimators. This is not absolutely required, but it seems a reasonable --- 831,840 783 | 10 | | 783 | 11 | | 783 | 12 | | ! 2742 |1 | ! 2742 |2 | @ ! 2742 |3 | ~ ! (27 rows) -- Check that all operators linked to by opclass entries have selectivity -- estimators. This is not absolutely required, but it seems a reasonable *** *** 847,854 WHERE p1.amopopr = p2.oid AND p1.amopclaid = p3.oid AND NOT binary_coercible(p3.opcintype, p2.oprleft); amopclaid | amopopr | oid | oprname | opcname ! ---+-+-+-+- ! (0 rows) SELECT p1.amopclaid, p1.amopopr, p2.oid, p2.oprname, p3.opcname FROM pg_amop AS p1, pg_operator AS p2, pg_opclass AS p3 --- 856,869 WHERE p1.amopopr = p2.oid AND p1.amopclaid = p3.oid AND NOT binary_coercible(p3.opcintype, p2.oprleft); amopclaid | amopopr | oid | oprname | opcname ! ---+-+--+-+--- ! 2746 |2750 | 2750 | | _text_ops ! 2745 |2750 | 2750 | | _int4_ops ! 2746 |2751 | 2751 | @ | _text_ops ! 2745 |2751 | 2751 | @ | _int4_ops ! 2746 |2752 | 2752 | ~ | _text_ops ! 2745 |2752 | 2752 | ~ | _int4_ops ! (6 rows) SELECT p1.amopclaid, p1.amopopr, p2.oid, p2.oprname, p3.opcname FROM pg_amop AS p1, pg_operator AS p2, pg_opclass AS p3 == ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] GIN - Generalized Inverted iNdex. Try 2.
What changed between Try 1 and Try 2? Teodor Sigaev wrote: We (me and Oleg) are glad to present GIN to PostgreSQL. If community will agree, we will commit it to HEAD branch. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] GIN - Generalized Inverted iNdex. Try 2.
Oh I can't read - ignore me :) Teodor Sigaev wrote: Changes from previous patch: * add support for tsearch2 * add 'fuzzy' limit * fixes ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org