On Fri, Nov 15, 2013 at 6:57 PM, Rod Taylor <p...@rbt.ca> wrote: > I tried again this morning using gin-packed-postinglists-16.patch and > gin-fast-scan.6.patch. No crashes. > > It is about a 0.1% random sample of production data (10,000,000 records) > with the below structure. Pg was compiled with debug enabled in both cases. > > Table "public.kp" > Column | Type | Modifiers > --------+---------+----------- > id | bigint | not null > string | text | not null > score1 | integer | > score2 | integer | > score3 | integer | > score4 | integer | > Indexes: > "kp_pkey" PRIMARY KEY, btree (id) > "kp_string_key" UNIQUE CONSTRAINT, btree (string) > "textsearch_gin_idx" gin (to_tsvector('simple'::regconfig, string)) > WHERE score1 IS NOT NULL > > > > This is a query tested. All data is in Pg buffer cache for these timings. > Words like "the" and "and" are very common (~9% of entries, each) and a > word like "hotel" is much less common (~0.2% of entries). > > SELECT id,string > FROM kp > WHERE score1 IS NOT NULL > AND to_tsvector('simple', string) @@ to_tsquery('simple', ?) > -- ? is substituted with the query strings > ORDER BY score1 DESC, score2 ASC > LIMIT 1000; > > Limit (cost=56.04..56.04 rows=1 width=37) (actual time=250.010..250.032 > rows=142 loops=1) > -> Sort (cost=56.04..56.04 rows=1 width=37) (actual > time=250.008..250.017 rows=142 loops=1) > Sort Key: score1, score2 > Sort Method: quicksort Memory: 36kB > -> Bitmap Heap Scan on kp (cost=52.01..56.03 rows=1 width=37) > (actual time=249.711..249.945 rows=142 loops=1) > Recheck Cond: ((to_tsvector('simple'::regconfig, string) @@ > '''hotel'' & ''and'' & ''the'''::tsquery) AND (score1 IS NOT NULL)) > -> Bitmap Index Scan on textsearch_gin_idx > (cost=0.00..52.01 rows=1 width=0) (actual time=249.681..249.681 rows=142 > loops=1) > Index Cond: (to_tsvector('simple'::regconfig, string) > @@ '''hotel'' & ''and'' & ''the'''::tsquery) > Total runtime: 250.096 ms > > > > Times are from \timing on. > > MASTER > ======= > the: 888.436 ms 926.609 ms 885.502 ms > and: 944.052 ms 937.732 ms 920.050 ms > hotel: 53.992 ms 57.039 ms 65.581 ms > and & the & hotel: 260.308 ms 248.275 ms 248.098 ms > > These numbers roughly match what we get with Pg 9.2. The time savings > between 'the' and 'and & the & hotel' is mostly heap lookups for the score > and the final sort. > > > > The size of the index on disk is about 2% smaller in the patched version. > > PATCHED > ======= > the: 1055.169 ms 1081.976 ms 1083.021 ms > and: 912.173 ms 949.364 ms 965.261 ms > hotel: 62.591 ms 64.341 ms 62.923 ms > and & the & hotel: 268.577 ms 259.293 ms 257.408 ms > hotel & and & the: 253.574 ms 258.071 ms 250.280 ms > > I was hoping that the 'and & the & hotel' case would improve with this > patch to be closer to the 'hotel' search, as I thought that was the kind of > thing it targeted. Unfortunately, it did not. I actually applied the > patches, compiled, initdb/load data, and ran it again thinking I made a > mistake. > > Reordering the terms 'hotel & and & the' doesn't change the result. >
Oh, in this path new consistent method isn't implemented for tsvector opclass, for array only. Will be fixed soon. BTW, was index 2% smaller or 2 times smaller? If it's 2% smaller than I need to know more about your dataset :) ------ With best regards, Alexander Korotkov.