Re: [GENERAL] like & optimization

2013-10-19 Thread Scott Ribe
Thank you all. Both the double index & pg_trgm would be good solutions.

On Oct 14, 2013, at 3:40 PM, Merlin Moncure  wrote:

> On Sat, Oct 12, 2013 at 4:28 PM, Torsten Förtsch
>  wrote:
>> On 12/10/13 20:08, Scott Ribe wrote:
>>> select * from test where tz >= start and tz < end and colb like '%foobar%'
>> 
>> I think you can use an index only for wildcard expressions that are
>> anchored at the beginning. So,
>> 
>>  select * from test where tz >= start and tz < end
>> and colb like 'foobar%'
>> 
>> can use an index on colb.
>> 
>> You could perhaps
>> 
>>  select * from test where tz >= start and tz < end
>> and colb like 'foobar%'
>>  union all
>>  select * from test where tz >= start and tz < end
>> and reverse(colb) like 'raboof%'
>> 
>> Then you need 2 indexes, one on colb the other on reverse(colb).
>> 
>> You can have duplicates in the result set if the table contains rows
>> where colb='foobar'. If that's a problem, use union distinct.
>> 
>> Alternatively, if foobar is kind of a word (with boundaries), you could
>> consider full-text search.
> 
> pg_trgm module optimizes 'like with wildcards' without those
> restrictions.  It's very fast for what it does.  Because of the
> GIST/GIN dependency index only scans are not going to be used through
> pg_tgrm though.
> 
> merlin
> 


-- 
Scott Ribe
scott_r...@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice






-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] like & optimization

2013-10-14 Thread Merlin Moncure
On Sat, Oct 12, 2013 at 4:28 PM, Torsten Förtsch
 wrote:
> On 12/10/13 20:08, Scott Ribe wrote:
>> select * from test where tz >= start and tz < end and colb like '%foobar%'
>
> I think you can use an index only for wildcard expressions that are
> anchored at the beginning. So,
>
>   select * from test where tz >= start and tz < end
>  and colb like 'foobar%'
>
> can use an index on colb.
>
> You could perhaps
>
>   select * from test where tz >= start and tz < end
>  and colb like 'foobar%'
>   union all
>   select * from test where tz >= start and tz < end
>  and reverse(colb) like 'raboof%'
>
> Then you need 2 indexes, one on colb the other on reverse(colb).
>
> You can have duplicates in the result set if the table contains rows
> where colb='foobar'. If that's a problem, use union distinct.
>
> Alternatively, if foobar is kind of a word (with boundaries), you could
> consider full-text search.

pg_trgm module optimizes 'like with wildcards' without those
restrictions.  It's very fast for what it does.  Because of the
GIST/GIN dependency index only scans are not going to be used through
pg_tgrm though.

merlin


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] like & optimization

2013-10-12 Thread Scott Ribe
On Oct 12, 2013, at 4:21 PM, Tom Lane  wrote:

> The reason you're losing on this is that the "select *" command eliminates
> the possibility of an index-only scan (I'm assuming that that selects some
> columns that aren't in the index).  Given that a plain indexscan will
> always involve fetching each heap row that satisfies the indexable
> condition (the one on tz), the planner figures it might as well use the
> physically-smaller index.

OK, that logic makes sense. In the particular case I'm looking at, the 
comparison to colb will match such a tiny fraction that I think it should be 
faster to use the index first before fetching heap rows. (It most certainly 
would be faster if the rows to be evaluated for the colb match were randomly 
dispersed, but because they tend to be naturally clustered on tz anyway, and 
the rows are pretty small, there's some chance an index scan might not save 
enough heap row I/O to offset it's own I/O.)

> It's true that in principle we could use the index-only-scan index AM
> machinery to retrieve colb from the index, and then check the LIKE
> predicate on that value before we go to the heap to get the other values;
> but the code isn't factored that way at the moment.  I'm not entirely sure
> that such cases arise often enough to be worth making it happen.  I think
> there was discussion of this point back when the index-only-scan patch was
> being written, and we decided it didn't seem worth pursuing at the time.

It's not a common-enough case for me to worry about. This is a very rare query 
in this application--I just wanted to know if I was missing something wrt 
indexes or whatever. It took me a long time to even find varchar_pattern_ops. 
(This is one particular question where the top results from google searches are 
dominated by incorrect assertions. Yes, Virginia, it *IS* possible to use an 
index in evaluating a like '%whatever' condition--whether or not it helps in a 
particular query is an open question, but it most certainly is possible.)

Besides, you've given me the hint, if I really care about this I can try a 
covering index ;-)

-- 
Scott Ribe
scott_r...@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice






-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] like & optimization

2013-10-12 Thread Tom Lane
Scott Ribe  writes:
> PG 9.3, consider a table test like:
> tz timestamp not null,
> cola varchar not null,
> colb varchar not null

> 2 compound indexes:

> tz_cola on (tz, cola)
> tz_colb on (tz, colb varchar_pattern_ops)

> now a query, for some start & end timestamps:

> select * from test where tz >= start and tz < end and colb like '%foobar%'

> Assume that the tz restriction is somewhat selective, say 1% of the table, 
> and the colb restriction is extremely selective, say less than 0.1%.

> It seems to me that the fastest way to resolve this query is to use the 
> tz_colb index directly, scanning the range between tz >= start and tz < end 
> for the colb condition.

> But pg wants to use the pg_cola index to find all rows in the time range, 
> then filter those rows for the colb condition. (FYI, cola contains only very 
> small values, while colb's values are typically several times longer.)

The reason you're losing on this is that the "select *" command eliminates
the possibility of an index-only scan (I'm assuming that that selects some
columns that aren't in the index).  Given that a plain indexscan will
always involve fetching each heap row that satisfies the indexable
condition (the one on tz), the planner figures it might as well use the
physically-smaller index.

It's true that in principle we could use the index-only-scan index AM
machinery to retrieve colb from the index, and then check the LIKE
predicate on that value before we go to the heap to get the other values;
but the code isn't factored that way at the moment.  I'm not entirely sure
that such cases arise often enough to be worth making it happen.  I think
there was discussion of this point back when the index-only-scan patch was
being written, and we decided it didn't seem worth pursuing at the time.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] like & optimization

2013-10-12 Thread Sergey Konoplev
On Sat, Oct 12, 2013 at 11:08 AM, Scott Ribe
 wrote:
[skipped]

> select * from test where tz >= start and tz < end and colb like '%foobar%'
>
> Assume that the tz restriction is somewhat selective, say 1% of the table, 
> and the colb restriction is extremely selective, say less than 0.1%.

[skipped]

> Here's a sample query plan:
>
>  Index Scan using tz_cola on test  (cost=0.56..355622.52 rows=23 width=106) 
> (actual time=61.403..230.649 rows=4 loops=1)
>Index Cond: ((tz >= '2013-04-01 06:00:00-05'::timestamp with time zone) 
> AND (tz <= '2013-04-30 06:00:00-05'::timestamp with time zone))
>Filter: ((colb)::text ~~ '%foobar%'::text)
>Rows Removed by Filter: 261725
>  Total runtime: 230.689 ms

You need to create the pg_trgm extension
(http://www.postgresql.org/docs/9.3/static/pgtrgm.html) if you need
things like LIKE '%foobar%' to use indexes.

First create 2 separate indexes for your columns ... USING btree (tz)
and ... USING gin (colb gin_trgm_ops). And to EXPLAIN ANALYZE your
query. It could probably be the best option because BitmapAnd might be
very effective.

You could also try to create a multi-column index, but you will need
to create a btree_gin extension first
(http://www.postgresql.org/docs/9.3/static/btree-gin.html) to be able
to combine GIN and btree in one index ... gin (tz, colb gin_trgm_ops).
Then EXPLAIN ANALYZE the query and compare with the above.

Note, if you have intensive writes on the table you would probably
want to set FASTUPDATE to off on the GIN index, because it might lead
to unpredictable stalls
(http://www.postgresql.org/docs/9.3/static/gin-implementation.html#GIN-FAST-UPDATE).

-- 
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray...@gmail.com


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] like & optimization

2013-10-12 Thread Torsten Förtsch
On 12/10/13 20:08, Scott Ribe wrote:
> select * from test where tz >= start and tz < end and colb like '%foobar%'

I think you can use an index only for wildcard expressions that are
anchored at the beginning. So,

  select * from test where tz >= start and tz < end
 and colb like 'foobar%'

can use an index on colb.

You could perhaps

  select * from test where tz >= start and tz < end
 and colb like 'foobar%'
  union all
  select * from test where tz >= start and tz < end
 and reverse(colb) like 'raboof%'

Then you need 2 indexes, one on colb the other on reverse(colb).

You can have duplicates in the result set if the table contains rows
where colb='foobar'. If that's a problem, use union distinct.

Alternatively, if foobar is kind of a word (with boundaries), you could
consider full-text search.

Just my 2¢,
Torsten


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] like & optimization

2013-10-12 Thread Scott Ribe
PG 9.3, consider a table test like:

tz timestamp not null,
cola varchar not null,
colb varchar not null

2 compound indexes:

tz_cola on (tz, cola)
tz_colb on (tz, colb varchar_pattern_ops)

now a query, for some start & end timestamps:

select * from test where tz >= start and tz < end and colb like '%foobar%'

Assume that the tz restriction is somewhat selective, say 1% of the table, and 
the colb restriction is extremely selective, say less than 0.1%.

It seems to me that the fastest way to resolve this query is to use the tz_colb 
index directly, scanning the range between tz >= start and tz < end for the 
colb condition.

But pg wants to use the pg_cola index to find all rows in the time range, then 
filter those rows for the colb condition. (FYI, cola contains only very small 
values, while colb's values are typically several times longer.)

Now if I tweak the time range, I can get it to seq scan the table for all 
conditions, or bitmap heap scan + re-check cond tz + filter colb + bitmap index 
scan tz_cola, but never use the tz_colb index...

Am I right about the fastest way to perform the search? Is there some way to 
get pg to do this, or would this require an enhancement?

Here's a sample query plan:

 Index Scan using tz_cola on test  (cost=0.56..355622.52 rows=23 width=106) 
(actual time=61.403..230.649 rows=4 loops=1)
   Index Cond: ((tz >= '2013-04-01 06:00:00-05'::timestamp with time zone) AND 
(tz <= '2013-04-30 06:00:00-05'::timestamp with time zone))
   Filter: ((colb)::text ~~ '%foobar%'::text)
   Rows Removed by Filter: 261725
 Total runtime: 230.689 ms

-- 
Scott Ribe
scott_r...@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice






-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general