Re: [PERFORM] indexing for distinct search in timestamp based table

2008-09-07 Thread Rainer Mager
Thanks for the suggestion. This seems to work pretty well on 8.3, but not so
well on 8.2. We were planning on upgrading to 8.3 soon anyway; we just have
to move up our schedule a bit.

I think that this type of algorithm would make sense in core. I suspect that,
with it in core, some further optimizations could be done that pl/pgsql
can't do.

--Rainer

From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Vladimir
Sitnikov
Sent: Saturday, September 06, 2008 12:11 AM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] indexing for distinct search in timestamp based table

You might get great improvement for '%' cases using an index on
channel_name(field, start_time) and a little bit of pl/pgsql.

Basically, you need to implement the following algorithm:

 1) curr_field = ( select min(field) from ad_log )

 2) record_exists = ( select 1 from ad_log where field = curr_field and
_all_other_conditions limit 1 )

 3) if record_exists == 1 then add curr_field to the results

 4) curr_field = ( select min(field) from ad_log where field >
curr_field )

 5) if curr_field is not null then goto 2
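
A minimal PL/pgSQL sketch of that loop, assuming the distinct values are
wanted for channel_name (the function name and the date-range parameters
here are illustrative only, not part of the original suggestion):

CREATE OR REPLACE FUNCTION distinct_channels(from_ts timestamp, to_ts timestamp)
RETURNS SETOF text AS $$
DECLARE
    curr text;
BEGIN
    -- seed with the smallest value: one quick index lookup
    SELECT min(channel_name) INTO curr FROM ad_log;
    WHILE curr IS NOT NULL LOOP
        -- keep the value only if some row also matches the other conditions
        IF EXISTS (SELECT 1 FROM ad_log
                   WHERE channel_name = curr
                     AND start_time BETWEEN from_ts AND to_ts) THEN
            RETURN NEXT curr;
        END IF;
        -- jump to the next distinct value: another single index lookup
        SELECT min(channel_name) INTO curr
          FROM ad_log
         WHERE channel_name > curr;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

-- e.g. SELECT * FROM distinct_channels('2008-07-01', '2008-07-29');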

I believe it might make sense to implement this approach in the core (I
would call it an "index distinct scan").

That could dramatically improve "select distinct column from table" and
"select column from table group by column" kinds of queries when there
exists an index on the column and the column has a very small number of
distinct values.

For instance: say a table has 10'000'000 rows, while the column of interest
has only 20 distinct values. In that case, the database would be able to get
each of those 20 values in virtually 20 index lookups.

What does the community think about that?



Re: [PERFORM] indexing for distinct search in timestamp based table

2008-09-05 Thread Vladimir Sitnikov
You might get great improvement for '%' cases using an index on
channel_name(field, start_time) and a little bit of pl/pgsql.

Basically, you need to implement the following algorithm:
 1) curr_field = ( select min(field) from ad_log )
 2) record_exists = ( select 1 from ad_log where field = curr_field and
_all_other_conditions limit 1 )
 3) if record_exists == 1 then add curr_field to the results
 4) curr_field = ( select min(field) from ad_log where field >
 curr_field )
 5) if curr_field is not null then goto 2


I believe it might make sense to implement this approach in the core (I
would call it an "index distinct scan").

That could dramatically improve "select distinct column from table" and
"select column from table group by column" kinds of queries when there
exists an index on the column and the column has a very small number of
distinct values.

For instance: say a table has 10'000'000 rows, while the column of interest
has only 20 distinct values. In that case, the database would be able to get
each of those 20 values in virtually 20 index lookups.

What does the community think about that?


Re: [PERFORM] indexing for distinct search in timestamp based table

2008-08-29 Thread Rainer Mager
Thanks for the suggestions.

David's assumption is correct in that this is a log table with no updates or
deletes. I tried making it CLUSTERED in our test environment, but it didn't
seem to make any difference for my query. It did take about 6 hours to do
the transformation, so it would be difficult to find the time to do it in
production, but I'm sure we could work something out if it really looked
beneficial. Unfortunately, as I said, initial tests don't seem to indicate
any benefit.

I believe that my performance difficulty comes from the need for DISTINCT
(or GROUP BY) data. That is, the normal start_time index seems fine for
limiting the date range, but when I need to select DISTINCT data from the
date range it seems that Postgres still needs to scan the entire limited
date range.

Unfortunately, we support arbitrary date range queries on this table, so I
don't think the partitioning idea is an option for us.

What I'm playing with now is creating separate tables to hold the
channel_name, ad_name, and player_name data with PRIMARY KEY ids. Since
there are very few of these compared to the number of rows in the main
table, this will give me a quick way to get the DISTINCT values over the
entire data set. My problem then will be reducing that to the DISTINCT
values for a limited date range.
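
For illustration, a minimal sketch of what that normalization might look
like (the table and column names here are hypothetical):

CREATE TABLE channel (
    channel_id   serial PRIMARY KEY,
    channel_name text NOT NULL UNIQUE
);
-- ad and player would get the same treatment, and ad_log would then carry
-- channel_id integer NOT NULL REFERENCES channel, etc., instead of the text.

-- The distinct values over the entire data set then come from a tiny scan:
SELECT channel_name FROM channel ORDER BY channel_name;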

As a side benefit of this I expect the database to shrink considerably,
as these text fields, although not that long (roughly 20 to 50 characters),
are certainly longer than a simple foreign key reference.

--Rainer

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Scott Carey
Sent: Friday, August 29, 2008 8:02 AM
To: David Rowley
Cc: Rainer Mager; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] indexing for distinct search in timestamp based table

Another suggestion is to partition the table by date ranges.  If most of the
range queries occur on particular batches of time, this will make all
queries more efficient, and improve locality and efficiency of all indexes
on the table.
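
For illustration, a sketch of what monthly range partitioning looks like on
8.x, using inheritance plus CHECK constraints (the child table name and date
range here are hypothetical):

CREATE TABLE ad_log_2008_07 (
    CHECK (start_time >= '2008-07-01' AND start_time < '2008-08-01')
) INHERITS (ad_log);

CREATE INDEX ad_log_2008_07_start_time ON ad_log_2008_07 (start_time);

-- With constraint_exclusion = on, a WHERE clause on start_time lets the
-- planner skip every partition whose CHECK constraint rules it out.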

This is more work than simply a table CLUSTER, especially in maintenance
overhead, but it will generally help a lot in cases like these.
Additionally, if these don't change much after some period of time, the
tables older than the modification window can be vacuumed, clustered, and
reindexed if needed to make them as efficient as possible and maintenance-
free after that point (other than backups and archives).

Another benefit of partitioning is in backup / restore.  You can
incrementally back up only the partitions that have changed -- for large
databases this reduces pg_dump and pg_restore times substantially. To do
this you combine regular expressions with the pg_dump exclude-tables or
include-tables flags.
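
For example, something like this (the partition names follow the
hypothetical monthly naming scheme sketched above; pg_dump's -t/-T accept
wildcard patterns as of 8.2):

pg_dump -t 'ad_log_2008_*' mydb > ad_log_2008.sql    # only recent partitions
pg_dump -T 'ad_log_2007_*' mydb > without_frozen.sql # skip archived partitions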



On Thu, Aug 28, 2008 at 3:48 PM, David Rowley [EMAIL PROTECTED] wrote:

I once also had a similar performance problem when looking for all matching
rows between two timestamps. In fact that's why I'm here today. The problem
was with MySQL. I had some tables of around 10 million rows and all my
searching was timestamp based. MySQL didn't do what I wanted. I found
using a CLUSTERED index with postgresql to be lightning quick, yet mostly
the matching rows I was working with were not much over the 100k mark. I'm
wondering if clustering the table on ad_log_start_time will help cut down on
random reads.

That's if you can afford to block the users while postgresql clusters the
table.
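
For reference, a sketch of the commands involved, using the existing
start_time index (this is the syntax accepted by both 8.2 and 8.3):

CLUSTER ad_log_start_time ON ad_log;  -- rewrites the table in index order
ANALYZE ad_log;                       -- refresh statistics afterwards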

If you're inserting in order of the start_time column (which I was) then the
cluster should almost maintain itself (I think), providing you're not
updating or deleting; anyway, I'd assume that's the case since it looks like
a log table.

David.



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Rainer Mager
Sent: 28 August 2008 09:06
To: pgsql-performance@postgresql.org
Subject: [PERFORM] indexing for distinct search in timestamp based table

I'm looking for some help in speeding up searches. My table is pretty simple
(see below), but somewhat large, and continuously growing. Currently it has
about 50 million rows.

The table is (I know I have excessive indexes, I'm trying to get the
appropriate ones and drop the extras):
                                    Table "public.ad_log"
    Column    |            Type             |                         Modifiers
--------------+-----------------------------+------------------------------------------------------------
 ad_log_id    | integer                     | not null default nextval('ad_log_ad_log_id_seq'::regclass)
 channel_name | text                        | not null
 player_name  | text                        | not null
 ad_name      | text                        | not null
 start_time   | timestamp without time zone | not null
 end_time     | timestamp without time zone | not null
Indexes:
    "ad_log_pkey" PRIMARY KEY, btree (ad_log_id)
    "ad_log_channel_name_key" UNIQUE, btree (channel_name, player_name,
ad_name, start_time, end_time)

[PERFORM] indexing for distinct search in timestamp based table

2008-08-28 Thread Rainer Mager
I'm looking for some help in speeding up searches. My table is pretty simple
(see below), but somewhat large, and continuously growing. Currently it has
about 50 million rows.

The table is (I know I have excessive indexes, I'm trying to get the
appropriate ones and drop the extras):
                                    Table "public.ad_log"
    Column    |            Type             |                         Modifiers
--------------+-----------------------------+------------------------------------------------------------
 ad_log_id    | integer                     | not null default nextval('ad_log_ad_log_id_seq'::regclass)
 channel_name | text                        | not null
 player_name  | text                        | not null
 ad_name      | text                        | not null
 start_time   | timestamp without time zone | not null
 end_time     | timestamp without time zone | not null
Indexes:
    "ad_log_pkey" PRIMARY KEY, btree (ad_log_id)
    "ad_log_channel_name_key" UNIQUE, btree (channel_name, player_name,
ad_name, start_time, end_time)
    "ad_log_ad_and_start" btree (ad_name, start_time)
    "ad_log_ad_name" btree (ad_name)
    "ad_log_all" btree (channel_name, player_name, start_time, ad_name)
    "ad_log_channel_name" btree (channel_name)
    "ad_log_end_time" btree (end_time)
    "ad_log_player_and_start" btree (player_name, start_time)
    "ad_log_player_name" btree (player_name)
    "ad_log_start_time" btree (start_time)



The query I'm trying to speed up is below. In it, "field" can be one of
channel_name, player_name, or ad_name. I'm actually trying to return the
distinct values and I found GROUP BY to be slightly faster than using
DISTINCT. Also, any of those fields may be unspecified in the WHERE clauses,
in which case we use '%', but it seems Postgres optimizes that pretty well.

SELECT field FROM ad_log 
WHERE channel_name LIKE :channel_name
AND player_name LIKE :player_name 
AND ad_name LIKE :ad_name 
AND start_time BETWEEN :start_date AND (date(:end_date) + 1)
GROUP BY field ORDER BY field


A typical query is:

explain analyze SELECT channel_name FROM ad_log WHERE channel_name LIKE '%'
AND ad_name LIKE '%' AND start_time BETWEEN '2008-07-01' AND
(date('2008-07-28') + 1) GROUP BY channel_name ORDER BY channel_name;

with the result being:

                                   QUERY PLAN
------------------------------------------------------------------------------------------
 Sort  (cost=1163169.02..1163169.03 rows=5 width=10) (actual
time=75460.187..75460.192 rows=15 loops=1)
   Sort Key: channel_name
   Sort Method:  quicksort  Memory: 17kB
   ->  HashAggregate  (cost=1163168.91..1163168.96 rows=5 width=10) (actual
time=75460.107..75460.114 rows=15 loops=1)
         ->  Bitmap Heap Scan on ad_log  (cost=285064.30..1129582.84
rows=13434427 width=10) (actual time=8506.250..65771.597 rows=13701296
loops=1)
               Recheck Cond: ((start_time >= '2008-07-01
00:00:00'::timestamp without time zone) AND (start_time <=
'2008-07-29'::date))
               Filter: ((channel_name ~~ '%'::text) AND (ad_name ~~
'%'::text))
               ->  Bitmap Index Scan on ad_log_start_time
(cost=0.00..281705.70 rows=13434427 width=0) (actual time=8488.443..8488.443
rows=13701296 loops=1)
                     Index Cond: ((start_time >= '2008-07-01
00:00:00'::timestamp without time zone) AND (start_time <=
'2008-07-29'::date))
 Total runtime: 75460.361 ms


It seems to me there should be some way to create an index to speed this up,
but the various ones I've tried so far haven't helped. Any suggestions would
be greatly appreciated.


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] indexing for distinct search in timestamp based table

2008-08-28 Thread H. Hall

Rainer Mager wrote:

I'm looking for some help in speeding up searches. My table is pretty simple
(see below), but somewhat large, and continuously growing. Currently it has
about 50 million rows.
  


Regarding your use of LIKE:
(1) If you are able to specify the beginning character(s) of the
string you are searching for, you will have a better chance of your
statement using an index. If you specify a wildcard (%) before the search
string, the entire string in the column must be searched, therefore no
index will be used.
(2) Reorder your where clause to reduce the size of the set that LIKE
operates on. In your example below, put the BETWEEN before the LIKE.
(3) Consider the use of trigrams instead of LIKE. I have not used it but 
I notice that postgres supports trigrams:


The pg_trgm module provides functions and operators for determining the 
similarity of text based on trigram matching, as well as index operator 
classes that support fast searching for similar strings.


Here is the link: http://www.postgresql.org/docs/current/static/pgtrgm.html
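
A sketch of what that might look like (this assumes the contrib/pg_trgm
module has been installed into the database; on 8.3 only the GiST operator
class is available, and the index name here is made up):

CREATE INDEX ad_log_channel_name_trgm
    ON ad_log USING gist (channel_name gist_trgm_ops);

-- Similarity searches such as: WHERE channel_name % 'some channel'
-- can then use this index. A plain LIKE '%foo%' does not use it on 8.3.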

--cheers
HH


The table is (I know I have excessive indexes, I'm trying to get the
appropriate ones and drop the extras):
                                    Table "public.ad_log"
    Column    |            Type             |                         Modifiers
--------------+-----------------------------+------------------------------------------------------------
 ad_log_id    | integer                     | not null default nextval('ad_log_ad_log_id_seq'::regclass)
 channel_name | text                        | not null
 player_name  | text                        | not null
 ad_name      | text                        | not null
 start_time   | timestamp without time zone | not null
 end_time     | timestamp without time zone | not null
Indexes:
    "ad_log_pkey" PRIMARY KEY, btree (ad_log_id)
    "ad_log_channel_name_key" UNIQUE, btree (channel_name, player_name,
ad_name, start_time, end_time)
    "ad_log_ad_and_start" btree (ad_name, start_time)
    "ad_log_ad_name" btree (ad_name)
    "ad_log_all" btree (channel_name, player_name, start_time, ad_name)
    "ad_log_channel_name" btree (channel_name)
    "ad_log_end_time" btree (end_time)
    "ad_log_player_and_start" btree (player_name, start_time)
    "ad_log_player_name" btree (player_name)
    "ad_log_start_time" btree (start_time)



The query I'm trying to speed up is below. In it, "field" can be one of
channel_name, player_name, or ad_name. I'm actually trying to return the
distinct values and I found GROUP BY to be slightly faster than using
DISTINCT. Also, any of those fields may be unspecified in the WHERE clauses,
in which case we use '%', but it seems Postgres optimizes that pretty well.

SELECT field FROM ad_log
	WHERE channel_name LIKE :channel_name
	AND player_name LIKE :player_name
	AND ad_name LIKE :ad_name
	AND start_time BETWEEN :start_date AND (date(:end_date) + 1)
	GROUP BY field ORDER BY field


A typical query is:

explain analyze SELECT channel_name FROM ad_log WHERE channel_name LIKE '%'
AND ad_name LIKE '%' AND start_time BETWEEN '2008-07-01' AND
(date('2008-07-28') + 1) GROUP BY channel_name ORDER BY channel_name;

with the result being:

                                   QUERY PLAN
------------------------------------------------------------------------------------------
 Sort  (cost=1163169.02..1163169.03 rows=5 width=10) (actual
time=75460.187..75460.192 rows=15 loops=1)
   Sort Key: channel_name
   Sort Method:  quicksort  Memory: 17kB
   ->  HashAggregate  (cost=1163168.91..1163168.96 rows=5 width=10) (actual
time=75460.107..75460.114 rows=15 loops=1)
         ->  Bitmap Heap Scan on ad_log  (cost=285064.30..1129582.84
rows=13434427 width=10) (actual time=8506.250..65771.597 rows=13701296
loops=1)
               Recheck Cond: ((start_time >= '2008-07-01
00:00:00'::timestamp without time zone) AND (start_time <=
'2008-07-29'::date))
               Filter: ((channel_name ~~ '%'::text) AND (ad_name ~~
'%'::text))
               ->  Bitmap Index Scan on ad_log_start_time
(cost=0.00..281705.70 rows=13434427 width=0) (actual time=8488.443..8488.443
rows=13701296 loops=1)
                     Index Cond: ((start_time >= '2008-07-01
00:00:00'::timestamp without time zone) AND (start_time <=
'2008-07-29'::date))
 Total runtime: 75460.361 ms


It seems to me there should be some way to create an index to speed this up,
but the various ones I've tried so far haven't helped. Any suggestions would
be greatly appreciated.

--
H. Hall
ReedyRiver Group LLC
http://www.reedyriver.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] indexing for distinct search in timestamp based table

2008-08-28 Thread David Rowley
I once also had a similar performance problem when looking for all matching
rows between two timestamps. In fact that's why I'm here today. The problem
was with MySQL. I had some tables of around 10 million rows and all my
searching was timestamp based. MySQL didn't do what I wanted. I found
using a CLUSTERED index with postgresql to be lightning quick, yet mostly
the matching rows I was working with were not much over the 100k mark. I'm
wondering if clustering the table on ad_log_start_time will help cut down on
random reads.

That's if you can afford to block the users while postgresql clusters the
table.

If you're inserting in order of the start_time column (which I was) then the
cluster should almost maintain itself (I think), providing you're not
updating or deleting; anyway, I'd assume that's the case since it looks like
a log table.

David.

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Rainer Mager
Sent: 28 August 2008 09:06
To: pgsql-performance@postgresql.org
Subject: [PERFORM] indexing for distinct search in timestamp based table

I'm looking for some help in speeding up searches. My table is pretty simple
(see below), but somewhat large, and continuously growing. Currently it has
about 50 million rows.

The table is (I know I have excessive indexes, I'm trying to get the
appropriate ones and drop the extras):
                                    Table "public.ad_log"
    Column    |            Type             |                         Modifiers
--------------+-----------------------------+------------------------------------------------------------
 ad_log_id    | integer                     | not null default nextval('ad_log_ad_log_id_seq'::regclass)
 channel_name | text                        | not null
 player_name  | text                        | not null
 ad_name      | text                        | not null
 start_time   | timestamp without time zone | not null
 end_time     | timestamp without time zone | not null
Indexes:
    "ad_log_pkey" PRIMARY KEY, btree (ad_log_id)
    "ad_log_channel_name_key" UNIQUE, btree (channel_name, player_name,
ad_name, start_time, end_time)
    "ad_log_ad_and_start" btree (ad_name, start_time)
    "ad_log_ad_name" btree (ad_name)
    "ad_log_all" btree (channel_name, player_name, start_time, ad_name)
    "ad_log_channel_name" btree (channel_name)
    "ad_log_end_time" btree (end_time)
    "ad_log_player_and_start" btree (player_name, start_time)
    "ad_log_player_name" btree (player_name)
    "ad_log_start_time" btree (start_time)



The query I'm trying to speed up is below. In it, "field" can be one of
channel_name, player_name, or ad_name. I'm actually trying to return the
distinct values and I found GROUP BY to be slightly faster than using
DISTINCT. Also, any of those fields may be unspecified in the WHERE clauses,
in which case we use '%', but it seems Postgres optimizes that pretty well.

SELECT field FROM ad_log 
WHERE channel_name LIKE :channel_name
AND player_name LIKE :player_name 
AND ad_name LIKE :ad_name 
AND start_time BETWEEN :start_date AND (date(:end_date) + 1)
GROUP BY field ORDER BY field


A typical query is:

explain analyze SELECT channel_name FROM ad_log WHERE channel_name LIKE '%'
AND ad_name LIKE '%' AND start_time BETWEEN '2008-07-01' AND
(date('2008-07-28') + 1) GROUP BY channel_name ORDER BY channel_name;

with the result being:

                                   QUERY PLAN
------------------------------------------------------------------------------------------
 Sort  (cost=1163169.02..1163169.03 rows=5 width=10) (actual
time=75460.187..75460.192 rows=15 loops=1)
   Sort Key: channel_name
   Sort Method:  quicksort  Memory: 17kB
   ->  HashAggregate  (cost=1163168.91..1163168.96 rows=5 width=10) (actual
time=75460.107..75460.114 rows=15 loops=1)
         ->  Bitmap Heap Scan on ad_log  (cost=285064.30..1129582.84
rows=13434427 width=10) (actual time=8506.250..65771.597 rows=13701296
loops=1)
               Recheck Cond: ((start_time >= '2008-07-01
00:00:00'::timestamp without time zone) AND (start_time <=
'2008-07-29'::date))
               Filter: ((channel_name ~~ '%'::text) AND (ad_name ~~
'%'::text))
               ->  Bitmap Index Scan on ad_log_start_time
(cost=0.00..281705.70 rows=13434427 width=0) (actual time=8488.443..8488.443
rows=13701296 loops=1)
                     Index Cond: ((start_time >= '2008-07-01
00:00:00'::timestamp without time zone) AND (start_time <=
'2008-07-29'::date))
 Total runtime: 75460.361 ms


It seems to me there should be some way to create an index to speed this up,
but the various ones I've tried so far haven't helped. Any suggestions would
be greatly appreciated.


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] indexing for distinct search in timestamp based table

2008-08-28 Thread Scott Carey
Another suggestion is to partition the table by date ranges.  If most of the
range queries occur on particular batches of time, this will make all
queries more efficient, and improve locality and efficiency of all indexes
on the table.

This is more work than simply a table CLUSTER, especially in maintenance
overhead, but it will generally help a lot in cases like these.
Additionally, if these don't change much after some period of time, the
tables older than the modification window can be vacuumed, clustered, and
reindexed if needed to make them as efficient as possible and maintenance-
free after that point (other than backups and archives).

Another benefit of partitioning is in backup / restore.  You can
incrementally back up only the partitions that have changed -- for large
databases this reduces pg_dump and pg_restore times substantially. To do
this you combine regular expressions with the pg_dump exclude-tables or
include-tables flags.


On Thu, Aug 28, 2008 at 3:48 PM, David Rowley [EMAIL PROTECTED] wrote:

 I once also had a similar performance problem when looking for all matching
 rows between two timestamps. In fact that's why I'm here today. The problem
 was with MySQL. I had some tables of around 10 million rows and all my
 searching was timestamp based. MySQL didn't do what I wanted. I found
 using a CLUSTERED index with postgresql to be lightning quick, yet mostly
 the matching rows I was working with were not much over the 100k mark. I'm
 wondering if clustering the table on ad_log_start_time will help cut down
 on random reads.

 That's if you can afford to block the users while postgresql clusters the
 table.

 If you're inserting in order of the start_time column (which I was) then
 the cluster should almost maintain itself (I think), providing you're not
 updating or deleting; anyway, I'd assume that's the case since it looks
 like a log table.

 David.


 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Rainer Mager
 Sent: 28 August 2008 09:06
 To: pgsql-performance@postgresql.org
 Subject: [PERFORM] indexing for distinct search in timestamp based table

 I'm looking for some help in speeding up searches. My table is pretty
 simple
 (see below), but somewhat large, and continuously growing. Currently it has
 about 50 million rows.

 The table is (I know I have excessive indexes, I'm trying to get the
 appropriate ones and drop the extras):
                                     Table "public.ad_log"
     Column    |            Type             |                         Modifiers
 --------------+-----------------------------+------------------------------------------------------------
  ad_log_id    | integer                     | not null default nextval('ad_log_ad_log_id_seq'::regclass)
  channel_name | text                        | not null
  player_name  | text                        | not null
  ad_name      | text                        | not null
  start_time   | timestamp without time zone | not null
  end_time     | timestamp without time zone | not null
 Indexes:
     "ad_log_pkey" PRIMARY KEY, btree (ad_log_id)
     "ad_log_channel_name_key" UNIQUE, btree (channel_name, player_name,
 ad_name, start_time, end_time)
     "ad_log_ad_and_start" btree (ad_name, start_time)
     "ad_log_ad_name" btree (ad_name)
     "ad_log_all" btree (channel_name, player_name, start_time, ad_name)
     "ad_log_channel_name" btree (channel_name)
     "ad_log_end_time" btree (end_time)
     "ad_log_player_and_start" btree (player_name, start_time)
     "ad_log_player_name" btree (player_name)
     "ad_log_start_time" btree (start_time)



 The query I'm trying to speed up is below. In it, "field" can be one of
 channel_name, player_name, or ad_name. I'm actually trying to return the
 distinct values and I found GROUP BY to be slightly faster than using
 DISTINCT. Also, any of those fields may be unspecified in the WHERE
 clauses, in which case we use '%', but it seems Postgres optimizes that
 pretty well.

 SELECT field FROM ad_log
WHERE channel_name LIKE :channel_name
AND player_name LIKE :player_name
AND ad_name LIKE :ad_name
AND start_time BETWEEN :start_date AND (date(:end_date) + 1)
GROUP BY field ORDER BY field


 A typical query is:

 explain analyze SELECT channel_name FROM ad_log WHERE channel_name LIKE '%'
 AND ad_name LIKE '%' AND start_time BETWEEN '2008-07-01' AND
 (date('2008-07-28') + 1) GROUP BY channel_name ORDER BY channel_name;

 with the result being:

                                    QUERY PLAN
 ------------------------------------------------------------------------------------------
  Sort  (cost=1163169.02..1163169.03 rows=5 width=10) (actual
 time=75460.187..75460.192 rows=15 loops=1)
    Sort Key: channel_name
    Sort Method:  quicksort  Memory: 17kB
    ->  HashAggregate  (cost=1163168.91..1163168.96 rows=5 width=10) (actual
 time=75460.107..75460.114 rows=15 loops=1)
          ->  Bitmap Heap Scan