Le 03 mars 2017 à 14:08, Artur Zakirov écrivait : > On 03.03.2017 15:49, Nicolas Paris wrote: > > > >Hi Oleg, > > > >Thanks. I thought pgtrgm was not able to index my long texts because of > >limitation of 8191 bytes per index row for btree. > > > >Then I found out it is possible to use pgtrgm over a GIN/GIST index. > >My final use case is phrase mining in texts. > > > >I want my application returns texts that contains approximatly the user > >entry: > > > >Eg: user search "Hello Word" > >a text containing "blah blah blah hello world blah blah blah" would be > >returned. > > > >Test: > >postgres=# CREATE table test_trgm (texts text); > >CREATE TABLE > >postgres=# CREATE INDEX ON test_trgm USING GIN(texts gin_trgm_ops); > >CREATE INDEX > >postgres=# SET enable_seqscan = OFF; > >SET > >postgres=# insert into test_trgm VALUES ('blah blah blah hello world blah > >blah blah'); > >INSERT 0 1 > >postgres=# insert into test_trgm VALUES ('blah blah blah hello word blah > >blah blah'); > >INSERT 0 1 > >postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm > >WHERE texts % 'hello word'; > > texts | similarity > >-------------------------------------------+------------ > > blah blah blah hello world blah blah blah | 0.473684 > > blah blah blah hello word blah blah blah | 0.6875 > >(2 rows) > > > >postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM > >test_trgm WHERE texts % 'hello word'; > > QUERY PLAN > >----------------------------------------------------------------------------------- > > Bitmap Heap Scan on test_trgm (cost=52.01..56.03 rows=1 width=32) > > Recheck Cond: (texts % 'hello word'::text) > > -> Bitmap Index Scan on test_trgm_texts_idx (cost=0.00..52.01 rows=1 > > width=0) > > Index Cond: (texts % 'hello word'::text) > >(4 rows) > > > >Conclusion: If I d'say 0.4 is my threshold, would this methodology meet > >my requirements ? > > > >Thanks for the help ! > > > > Hello, > > If you use PostgreSQL 9.6, then word_similarity() can help you [1]. For > example: > > postgres=# SELECT texts, word_similarity('hello word', texts) FROM test_trgm > WHERE 'hello word' <% texts; > texts | word_similarity > -------------------------------------------+----------------- > blah blah blah hello world blah blah blah | 0.818182 > blah blah blah hello word blah blah blah | 1 > (2 rows) > > 1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html >
Nice ! I do have 9.6 version. Would this kind of index could handle more than 20M large texts ? The recheck condition looks ressource consuming. The full text index + phrase search + synonym dictionnary is the only other alternativ to deal with typo-phrase mining ? Is there any possibility in the future to add typo in the full text road-map ? Thanks, > -- > Artur Zakirov > Postgres Professional: http://www.postgrespro.com > Russian Postgres Company -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general