Re: [PERFORM] Postgres 8.x on Windows Server in production

2009-04-12 Thread Rainer Mager
We're running 8.3, but when we started this server about 2 years ago it was
an earlier 8.x; I don't remember which.

--Rainer

> -Original Message-
> From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-
> ow...@postgresql.org] On Behalf Of Scott Marlowe
> Sent: Monday, April 13, 2009 8:41 AM
> To: Rainer Mager
> Cc: Ognjen Blagojevic; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Postgres 8.x on Windows Server in production
> 
> On Sun, Apr 12, 2009 at 5:13 PM, Rainer Mager 
> wrote:
> > We use Postgres 8.x in production on Windows Server 2003. We have not
> done a
> > direct head-to-head comparison against any *nix environment, so I
> can't
> > really compare them, but I can still give a few comments.
> 
> Just wondering, what version are you actually running?  There are big
> differences between 8.0, 8.1, 8.2, 8.3, and soon 8.4.  For people taking
> your advice on running on Windows, it helps them decide whether or not
> to upgrade.
> 
> > First of all, it seems that some of the popular file systems in *nix are
> > more robust at preventing disk fragmentation than NTFS is. Because of this I
> > definitely recommend having some defragging solution. What we've settled on is
> 
> Linux file systems still fragment; they just don't tend to fragment as
> much.  As the drive gets closer to being full, fragmentation will
> become more of a problem.
> 




Re: [PERFORM] Postgres 8.x on Windows Server in production

2009-04-12 Thread Rainer Mager
We use Postgres 8.x in production on Windows Server 2003. We have not done a
direct head-to-head comparison against any *nix environment, so I can't
really compare them, but I can still give a few comments.

First of all, it seems that some of the popular file systems in *nix are
more robust at preventing disk fragmentation than NTFS is. Because of this I
definitely recommend having some defragging solution. What we've settled on
is the Server edition of O&O Defrag (O&O is the company name). Their
software has some nice features for defragging in the background while
monitoring system usage, so as to impact performance minimally.

Second, one big difficulty with running on Windows is that most of the
Postgres expertise seems to be around *nix environments. This means that
when you do need to investigate a performance issue it can be more difficult
to get direct advice. For example, perusing this mailing list will show lots
of tips suggesting various tools to measure I/O performance, etc. Well, on
Windows the toolset is different.


All in all we've been happy enough with Windows. Certainly we've never run
into difficulties with it that made us consider migrating to *nix.



--Rainer

> -Original Message-
> From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-
> ow...@postgresql.org] On Behalf Of Ognjen Blagojevic
> Sent: Friday, April 10, 2009 6:47 PM
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] Postgres 8.x on Windows Server in production
> 
> Hi all,
> 
> What are your experiences with Postgres 8.x in production use on
> Windows
> Server 2003/2008? Are there any limitations, trade-offs or quirks?
> 
> My client is accustomed to the Windows Server environment, but it seems
> hard to google good information about these types of installations.
> 
> Regards,
> Ognjen
> 




Re: [PERFORM] difficulties with time based queries

2009-04-09 Thread Rainer Mager
Thanks for all of the suggestions so far. I've been trying to reduce the
number of indices I have, but I'm running into a problem. I have a need to
do queries on this table with criteria applied to the date and possibly any
or all of the other key columns. As a reminder, here's my table:

 

                              Table "public.ad_log"
   Column   |            Type             |                         Modifiers
------------+-----------------------------+-------------------------------------------------------------
 ad_log_id  | integer                     | not null default nextval('ad_log_ad_log_id_seq'::regclass)
 channel    | integer                     | not null
 player     | integer                     | not null
 ad         | integer                     | not null
 start_time | timestamp without time zone |
 end_time   | timestamp without time zone |

So, I need indices that make it fast to query against start_time as well as
all possible combinations of channel, player, and ad. Below is a sample
query that uses all of these (note that I've removed the actual strings to
protect customer data). The result is fine in terms of speed, but since it's
using the ad_log_ad_date index I'm wondering what the best strategy is to
cover queries that don't specify an ad. Should I have 4 indices: one with
just start_time (for when no other columns are specified), and three more,
each combining start_time with one of the other criteria (channel, player,
and ad)? I'm currently experimenting with various options, but since it
takes a couple of hours to create a particular index this is taking a while.
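
In DDL terms, that four-index strategy would look something like the
following. The last three already exist per the table definition in my
earlier post; the bare date index is hypothetical, and the existing
ad_log_date_all index, which leads with date(start_time), may already cover
that case:

CREATE INDEX ad_log_date ON ad_log (date(start_time));  -- hypothetical name
CREATE INDEX ad_log_channel_date ON ad_log (channel, date(start_time));
CREATE INDEX ad_log_player_date  ON ad_log (player, date(start_time));
CREATE INDEX ad_log_ad_date      ON ad_log (ad, date(start_time));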

 

 

# explain analyze
SELECT ad_log.ad_log_id, channels.name AS channel_name,
       players.name AS player_name, ads.name AS ad_name,
       start_time, end_time, (data IS NOT NULL) AS has_screenshot
FROM channels, players, players_history, ads,
     ad_log LEFT OUTER JOIN ad_log_screenshot USING (ad_log_id)
WHERE channel = channels.id
  AND player = players_history.id
  AND players_history.player_instance = players.id
  AND ad = ads.id
  AND channels.name LIKE ''
  AND players.name LIKE ''
  AND ads.name LIKE ''
  AND date(start_time) BETWEEN '2009-01-20' AND date('2009-01-21')
ORDER BY channels.name, players.name, start_time, ads.name
LIMIT 100 OFFSET 100;

 

 
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=7425.26..7425.26 rows=1 width=120) (actual time=1256.116..1256.202 rows=39 loops=1)
   ->  Sort  (cost=7425.26..7425.26 rows=1 width=120) (actual time=1255.965..1256.068 rows=139 loops=1)
         Sort Key: channels.name, players.name, ad_log.start_time, ads.name
         Sort Method:  quicksort  Memory: 35kB
         ->  Nested Loop Left Join  (cost=0.01..7425.25 rows=1 width=120) (actual time=179.086..1255.451 rows=139 loops=1)
               ->  Nested Loop  (cost=0.01..7417.06 rows=1 width=88) (actual time=137.488..1212.531 rows=139 loops=1)
                     Join Filter: (ad_log.channel = channels.id)
                     ->  Nested Loop  (cost=0.01..7415.73 rows=1 width=60) (actual time=120.308..1192.867 rows=139 loops=1)
                           Join Filter: (players_history.id = ad_log.player)
                           ->  Nested Loop  (cost=0.00..36.92 rows=1 width=17) (actual time=21.960..23.405 rows=1 loops=1)
                                 Join Filter: (players.id = players_history.player_instance)
                                 ->  Seq Scan on players  (cost=0.00..11.80 rows=1 width=17) (actual time=5.981..6.083 rows=1 loops=1)
                                       Filter: (name ~~ ''::text)
                                 ->  Seq Scan on players_history  (cost=0.00..14.50 rows=850 width=8) (actual time=15.880..16.592 rows=850 loops=1)
                           ->  Nested Loop  (cost=0.01..7371.03 rows=622 width=51) (actual time=75.161..1156.076 rows=15600 loops=1)
                                 ->  Seq Scan on ads  (cost=0.00..72.79 rows=1 width=27) (actual time=15.776..31.975 rows=1 loops=1)
                                       Filter: (name ~~ ''::text)
                                 ->  Index Scan using ad_log_ad_date on ad_log  (cost=0.01..7267.77 rows=2438 width=32) (actual time=59.375..1095.229 rows=15600 loops=1)
                                       Index Cond: ((ad_log.ad = ads.id) AND (date(ad_log.start_time) >= '2009-01-20'::date) AND (date(ad_log.start_time) <= '2009-01-21'::date))
                     ->  Seq Scan on channels  (cost=0.00..1.31 rows=1 width=36) (actual time=0.128..0.132 rows=1 loops=139)
                           Filter: (channels.name ~~ ''::text)
               ->  Index Scan using ad_log_screenshot_pkey on ad_log_screenshot  (cost=0.00..8.18 rows=1 width=36) (actual time=0.304..0.304 rows=0 loops=139)
                     Index Cond: (ad_log.ad_log_id = ad_log_screenshot.ad_log_id)
 Total runtime

Re: [PERFORM] difficulties with time based queries

2009-04-08 Thread Rainer Mager
So, I defragged my disk and reran my original query; it got a little better,
but was still far slower than I'd like. I then rebuilt (dropped and
recreated) the ad_log_date_all index and reran the query, and it is quite a
bit better:

 

# explain analyze select * from ad_log where date(start_time) <
date('2009-03-31') and date(start_time) >= date('2009-03-30');

                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on ad_log  (cost=64770.21..3745596.62 rows=2519276 width=32) (actual time=1166.479..13862.107 rows=2275167 loops=1)
   Recheck Cond: ((date(start_time) < '2009-03-31'::date) AND (date(start_time) >= '2009-03-30'::date))
   ->  Bitmap Index Scan on ad_log_date_all  (cost=0.00..64140.39 rows=2519276 width=0) (actual time=1143.582..1143.582 rows=2275167 loops=1)
         Index Cond: ((date(start_time) < '2009-03-31'::date) AND (date(start_time) >= '2009-03-30'::date))
 Total runtime: 14547.885 ms

 

 

During the query the disk throughput peaked at 30 MB/s and was mostly around
20 MB/s -- much better.
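
For reference, the rebuild was simply a drop and re-create of the index as
defined in my original post; REINDEX is the rough single-statement
equivalent (locking behavior differs):

DROP INDEX ad_log_date_all;
CREATE INDEX ad_log_date_all ON ad_log (date(start_time), channel, player, ad);
-- or, in one statement:
REINDEX INDEX ad_log_date_all;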

 

So, a few questions:

 

What can I do to prevent the index from becoming bloated, or from getting
into whatever state it was in?

 

What else can I do to further improve queries on this table? Someone
suggested posting details of my conf file. Which settings are most likely to
be useful for this?

 

Any other suggestions?

 

 

Thanks,

 

--Rainer



Re: [PERFORM] difficulties with time based queries

2009-04-05 Thread Rainer Mager


> -Original Message-
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> "Rainer Mager"  writes:
> >> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> >> Hmm ... it's pretty unusual to see the index fetch portion of a bitmap
> >> scan take the bulk of the runtime.  Usually that part is fast and where
> >> the pain comes is in fetching from the heap.  I wonder whether that
> >> index has become bloated.  How big are the table and the index
> >> physically?  (Look at pg_class.relpages, or if you want a really
> >> accurate number try pg_relation_size().)
> 
> > Can you give me some more info on how to look at these stats?
> 
> Since you've got 8.3 it's easy: select pg_relation_size('tablename')
> (or indexname).  The result is in bytes, so you might want to
> divide by 1K or 1M to keep the number readable.

Ok, nice and simple...I like it:

The result for the table ad_log is 30,063 MB. The result for the index,
ad_log_date_all, is 17,151 MB. I guess this roughly makes sense since the
index covers 4 of the table's 6 fields.
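
(In query form, what I ran was essentially the following, scaled to MB:)

SELECT pg_relation_size('ad_log') / (1024 * 1024)          AS table_mb,
       pg_relation_size('ad_log_date_all') / (1024 * 1024) AS index_mb;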

For the particular query I'm trying to optimize at the moment I believe I
should be able to use an index that references only 2 fields, which, I
imagine, should reduce the time needed to read it. I'll play with this a bit
and see what happens.
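
For instance, something like this (hypothetical; the actual column pair
depends on the query being tuned):

CREATE INDEX ad_log_date_channel ON ad_log (date(start_time), channel);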

Any other suggestions?

--Rainer






Re: [PERFORM] difficulties with time based queries

2009-04-05 Thread Rainer Mager
Thanks for all the replies, I'll try to address the follow up questions:

> From: David Wilson [mailto:david.t.wil...@gmail.com]
> 
> The stats look good and it's using a viable index for your query. What
> kind of hardware is this on, and what are the relevant postgresql.conf
> lines? (Or, for that matter, what does iostat say while this query's
> running?)

I'm running on Windows, so I don't have iostat, but perfmon tells me my Avg.
Disk Queue Length went up to 1.2 during the query (versus a normal value of
about 0.02). Also disk throughput was at about 1.2 MB/s during the query. I
don't know how much of this is random versus linear.



> From: PFC [mailto:li...@peufeu.com]
>
>   With this quantity of rows, you want to try to make the disk accesses
> as linear as possible.
>   This means your table should be organized on disk by date, at least
> roughly.
>   If your data comes from an import that was sorted on some other column,
> this may not be the case.
>
>   What kind of bytes/s do you get from the drives ?

The data should be mostly ordered by date. It is all logged in
semi-realtime, such that 99% of rows are logged within an hour of the
timestamp. Also, as stated above, throughput during this query was about
1.2 MB/s, which I know isn't great. I admit this isn't the best hardware in
the world, but I would expect better than that for linear reads.

>   Do you UPDATE or DELETE a lot from this table ? Is it vacuum'd
> enough ?

No, this table has no UPDATEs or DELETEs. It is auto vacuum'd, but no manual
vacuuming.

With regard to clustering, I'm hesitant to do that unless I have no other
choice. My understanding is that I would need to do periodic re-clustering
to maintain it, and the table is very busy during the windows when that
work would have to happen.
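
For reference, the periodic maintenance I'm wary of would be something like
this, which takes an exclusive lock on the table while it rewrites it:

-- 8.3 syntax; rewrites ad_log in ad_log_date_all order
CLUSTER ad_log USING ad_log_date_all;
ANALYZE ad_log;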


> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Hmm ... it's pretty unusual to see the index fetch portion of a bitmap
> scan take the bulk of the runtime.  Usually that part is fast and where
> the pain comes is in fetching from the heap.   I wonder whether that
> index has become bloated.  How big are the table and the index
> physically?  (Look at pg_class.relpages, or if you want a really
> accurate number try pg_relation_size().)

Can you give me some more info on how to look at these stats? That is,
what/where is pg_class.relpages, etc. I'll also do some searching for this
info.

> What Postgres version is this, exactly?

8.3.3
 
> BTW, I think you've gone way overboard in your indexing of this table;
> those indexes are certainly consuming well more space than the table
> does, and a lot of them are redundant.

Agreed, I need to look carefully at all of the queries we do on this table
and reduce this.



--Rainer




[PERFORM] difficulties with time based queries

2009-04-05 Thread Rainer Mager
I have a somewhat large table (more than 100 million rows) that contains log
data with start_time and end_time columns. When I try to do queries on this
table I always find them slower than what I need and what I believe should
be possible.

 

For example, I limited the following query to just a single day and it is
still much slower than what I would expect. In reality I need to do queries
that span a few weeks.

 

explain analyze select * from ad_log where date(start_time) <
date('2009-03-31') and date(start_time) >= date('2009-03-30');

 

 

 Bitmap Heap Scan on ad_log  (cost=73372.57..3699152.24 rows=2488252 width=32) (actual time=49792.862..64611.255 rows=2268490 loops=1)
   Recheck Cond: ((date(start_time) < '2009-03-31'::date) AND (date(start_time) >= '2009-03-30'::date))
   ->  Bitmap Index Scan on ad_log_date_all  (cost=0.00..72750.51 rows=2488252 width=0) (actual time=49776.332..49776.332 rows=2268490 loops=1)
         Index Cond: ((date(start_time) < '2009-03-31'::date) AND (date(start_time) >= '2009-03-30'::date))
 Total runtime: 65279.352 ms

 

 

The definition of the table is:

 

   Column   |            Type             |                         Modifiers
------------+-----------------------------+-------------------------------------------------------------
 ad_log_id  | integer                     | not null default nextval('ad_log_ad_log_id_seq'::regclass)
 channel    | integer                     | not null
 player     | integer                     | not null
 ad         | integer                     | not null
 start_time | timestamp without time zone |
 end_time   | timestamp without time zone |
Indexes:
    "ad_log_pkey" PRIMARY KEY, btree (ad_log_id)
    "ad_log_unique" UNIQUE, btree (channel, player, ad, start_time, end_time)
    "ad_log_ad" btree (ad)
    "ad_log_ad_date" btree (ad, date(start_time))
    "ad_log_channel" btree (channel)
    "ad_log_channel_date" btree (channel, date(start_time))
    "ad_log_date_all" btree (date(start_time), channel, player, ad)
    "ad_log_player" btree (player)
    "ad_log_player_date" btree (player, date(start_time))
Foreign-key constraints:
    "ad_log_ad_fkey" FOREIGN KEY (ad) REFERENCES ads(id)
    "ad_log_channel_fkey" FOREIGN KEY (channel) REFERENCES channels(id)
    "ad_log_player_fkey" FOREIGN KEY (player) REFERENCES players_history(id)
Triggers:
    rollup_ad_logs_daily AFTER INSERT ON ad_log FOR EACH ROW EXECUTE PROCEDURE rollup_ad_logs_daily()

 

 

Any suggestions would be appreciated.

 

--Rainer



[PERFORM] speeding up table creation

2008-10-14 Thread Rainer Mager
I have an interesting performance improvement need. As part of the automatic
test suite we run in our development environment, we re-initialize our test
database a number of times in order to ensure it is clean before running a
test. We currently do this by dropping the public schema and then recreating
our tables (roughly 30 tables total). After that we do normal inserts, etc.,
but never with very much data. My question is, what settings can we tweak to
improve performance in this scenario? Specifically, if there were a way to
tell Postgres to keep all operations in memory, that would probably be
ideal.
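
To illustrate the kind of tweak I mean, I'm guessing at postgresql.conf
settings along these lines; they trade away crash safety, which should be
fine for a throwaway test database, but I'd appreciate confirmation:

fsync = off               # don't force WAL to disk; unsafe on crash
synchronous_commit = off  # 8.3+: commit without waiting for the WAL flush
full_page_writes = off    # less WAL volume; also crash-unsafe
checkpoint_segments = 32  # checkpoint less often during bursts of DDL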

 

We actually tried running Postgres off of a RAM disk and this did help a
reasonable amount, but we're running under Windows, where setting up the RAM
disk is a hassle, and all of our developers would need to do it.

 

 

Any tips would be appreciated.

 

 

--Rainer



[PERFORM] why does this use the wrong index?

2008-09-18 Thread Rainer Mager
I have two identical queries except for the date range. In the first case,
with the wider date range, the correct (I believe) index is used. In the
second case, where the date range is smaller, a different index is used and
a less efficient plan is chosen. In the second query the problem seems to be
CPU resources; while it is running, one CPU core is 100% busy.

Note, if I drop the ad_log_date index then this query is always fast, but
some other queries I do require that index.

So, what can I do to encourage Postgres to use the first index even when the
date range is smaller?




# explain analyze SELECT name FROM players AS foo WHERE EXISTS (SELECT 1
FROM ad_log WHERE player = foo.id AND  date(start_time) BETWEEN
E'2008-09-14' AND E'2008-09-18' LIMIT 1) ORDER BY name;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1573.74..1574.31 rows=230 width=13) (actual time=28.421..28.505 rows=306 loops=1)
   Sort Key: foo.name
   Sort Method:  quicksort  Memory: 28kB
   ->  Seq Scan on players foo  (cost=0.00..1564.72 rows=230 width=13) (actual time=0.104..27.876 rows=306 loops=1)
         Filter: (subplan)
         SubPlan
           ->  Limit  (cost=0.01..3.39 rows=1 width=0) (actual time=0.058..0.058 rows=1 loops=460)
                 ->  Index Scan using ad_log_player_date on ad_log  (cost=0.01..34571.03 rows=10228 width=0) (actual time=0.056..0.056 rows=1 loops=460)
                       Index Cond: ((player = $0) AND (date(start_time) >= '2008-09-14'::date) AND (date(start_time) <= '2008-09-18'::date))
 Total runtime: 28.623 ms
(10 rows)



# explain analyze SELECT name FROM players AS foo WHERE EXISTS (SELECT 1
FROM ad_log WHERE player = foo.id AND  date(start_time) BETWEEN
E'2008-09-18' AND E'2008-09-18' LIMIT 1) ORDER BY name;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using players_name_key on players foo  (cost=0.00..8376.84 rows=230 width=13) (actual time=813.695..143452.810 rows=301 loops=1)
   Filter: (subplan)
   SubPlan
     ->  Limit  (cost=0.01..18.14 rows=1 width=0) (actual time=311.846..311.846 rows=1 loops=460)
           ->  Index Scan using ad_log_date on ad_log  (cost=0.01..18.14 rows=1 width=0) (actual time=311.844..311.844 rows=1 loops=460)
                 Index Cond: ((date(start_time) >= '2008-09-18'::date) AND (date(start_time) <= '2008-09-18'::date))
                 Filter: (player = $0)
 Total runtime: 143453.100 ms
(8 rows)



Thanks,

--Rainer




Re: [PERFORM] best use of another drive

2008-09-08 Thread Rainer Mager
Thanks. I should have mentioned the existing disk and the new one are
already both mirrored (not together, though :-p). So we had 2 drives that
were mirrored and just added 2 more that are mirrored.

--Rainer

-Original Message-
From: Joshua Drake [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 09, 2008 1:16 PM
To: Scott Marlowe
Cc: Rainer Mager; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] best use of another drive

On Mon, 8 Sep 2008 22:11:25 -0600
"Scott Marlowe" <[EMAIL PROTECTED]> wrote:

> On Mon, Sep 8, 2008 at 8:19 PM, Rainer Mager <[EMAIL PROTECTED]>
> wrote:
> > I've recently installed another drive in my db server and was
> > wondering what the best use of it is. Some thoughts I have are:
> 
> Being a DBA, I'd tend to say make it a mirror of the first drive.
> 

+1

-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate





[PERFORM] best use of another drive

2008-09-08 Thread Rainer Mager
I've recently installed another drive in my db server and was wondering what
the best use of it is. Some thoughts I have are:


1. Move some of the databases to the new drive. If this is a good idea, is
there a way to do this without a dump/restore? I'd prefer to move the folder
if possible since that would be much faster. (See the tablespace sketch
below.)


2. Move some logs to the new drive. Again, if this is recommended I'd be
happy to, but some directions on the right procedures would be appreciated.


3. Other...any other ideas?
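
For option 1, the closest mechanism I've found so far is tablespaces. A
sketch, assuming a directory on the new drive; I don't know whether an
existing database can be moved wholesale this way on 8.x, and the table
names here are hypothetical:

-- the directory must already exist and be owned by the postgres account
CREATE TABLESPACE newdrive LOCATION 'E:/pgdata';

-- move an individual table (rewrites it and holds a lock meanwhile)
ALTER TABLE some_big_table SET TABLESPACE newdrive;

-- new databases can be placed there at creation time
CREATE DATABASE otherdb TABLESPACE newdrive;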


Thanks,

--Rainer




Re: [PERFORM] indexing for distinct search in timestamp based table

2008-09-07 Thread Rainer Mager
Thanks for the suggestion. This seems to work pretty well on 8.3, but not so
well on 8.2. We were planning on upgrading to 8.3 soon anyway; we just have
to move up our schedule a bit.

 

I think that this type of algorithm would make sense in core. I suspect
that, were it in there, further optimizations could be done that pl/pgsql
can't do.

 

 

--Rainer

 

From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Vladimir
Sitnikov
Sent: Saturday, September 06, 2008 12:11 AM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] indexing for distinct search in timestamp based table

 

You might get a great improvement for the '%' cases using an index on
(<column>, start_time), e.g. (channel_name, start_time), and a little bit
of pl/pgsql.

Basically, you need to implement the following algorithm:

 1) curr_<column> = ( select min(<column>) from ad_log )

 2) record_exists = ( select 1 from ad_log where <column> = curr_<column>
and _all_other_conditions limit 1 )

 3) if record_exists == 1 then add curr_<column> to the results

 4) curr_<column> = ( select min(<column>) from ad_log where <column> >
curr_<column> )

 5) if curr_<column> is not null then goto 2
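
For example, a minimal pl/pgsql sketch of the loop for <column> =
channel_name (illustrative only: the function name is mine, and it assumes
an index led by channel_name, e.g. (channel_name, start_time), so that each
min() probe and existence check is a cheap index lookup):

CREATE OR REPLACE FUNCTION distinct_channels(p_start date, p_end date)
RETURNS SETOF text AS $$
DECLARE
    curr text;
BEGIN
    -- 1) seed with the smallest value in the table
    SELECT min(channel_name) INTO curr FROM ad_log;
    WHILE curr IS NOT NULL LOOP
        -- 2)-3) emit the value if any row satisfies the other conditions
        IF EXISTS (SELECT 1 FROM ad_log
                   WHERE channel_name = curr
                     AND start_time BETWEEN p_start AND (p_end + 1)) THEN
            RETURN NEXT curr;
        END IF;
        -- 4) hop to the next distinct value via the index
        SELECT min(channel_name) INTO curr
          FROM ad_log
         WHERE channel_name > curr;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

-- usage: SELECT * FROM distinct_channels('2008-07-01', '2008-07-28');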

 

 

I believe it might make sense to implement this approach in the core (I
would call it "index distinct scan").

That could dramatically improve "select distinct <column> from <table>" and
"select <column> from <table> group by <column>" kinds of queries when there
exists an index on <column> and the column has a very small number of
distinct values.

For instance: say a table has 10'000'000 rows, while the column of interest
has only 20 distinct values. In that case, the database would be able to get
every one of those 20 values in virtually 20 index lookups.

What does the community think about that?



Re: [PERFORM] indexing for distinct search in timestamp based table

2008-08-29 Thread Rainer Mager
Thanks for the suggestions.

 

David's assumption is correct in that this is a log table with no updates or
deletes. I tried making it CLUSTERED in our test environment, but it didn't
seem to make any difference for my query. It did take about 6 hours to do
the transformation, so it would be difficult to find the time to do it in
production, but I'm sure we could work something out if it really looked
beneficial. Unfortunately, as I said, initial tests don't seem to indicate
any benefit.

 

I believe that my performance difficulty comes from the need for DISTINCT
(or GROUP BY) data. That is, the normal start_time index seems fine for
limiting the date range, but when I need to select DISTINCT data from the
date range it seems that Postgres still needs to scan the entire limited
date range.

 

Unfortunately, we support arbitrary date range queries on this table, so I
don't think the partitioning idea is an option for us.

 

 

What I'm playing with now is creating separate tables to hold the
channel_name, ad_name, and player_name data with PRIMARY KEY ids. Since
there are very few of these compared to the number of rows in the main
table, this will give me a quick way to get the DISTINCT values over the
entire data set. My problem then will be reducing that to the DISTINCT
values for a limited date range.
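
Roughly, the layout I'm experimenting with looks like this (a sketch; names
are illustrative):

CREATE TABLE channels (id serial PRIMARY KEY, name text NOT NULL UNIQUE);
CREATE TABLE players  (id serial PRIMARY KEY, name text NOT NULL UNIQUE);
CREATE TABLE ads      (id serial PRIMARY KEY, name text NOT NULL UNIQUE);
-- ad_log would then store integer foreign keys to these tables
-- instead of the channel_name/player_name/ad_name text columns.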

 

As a side benefit of this I expect the database to shrink considerably, as
these text fields, although not that long (roughly 20 to 50 characters), are
certainly longer than a simple foreign key reference.

 

 

--Rainer

 

 

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Scott Carey
Sent: Friday, August 29, 2008 8:02 AM
To: David Rowley
Cc: Rainer Mager; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] indexing for distinct search in timestamp based table

 

Another suggestion is to partition the table by date ranges.  If most of the
range queries occur on particular batches of time, this will make all
queries more efficient, and improve locality and efficiency of all indexes
on the table.

This is more work than simply a table CLUSTER, especially in maintenance
overhead, but it will generally help a lot in cases like these.
Additionally, if these don't change much after some period of time the
tables older than the modification window can be vacuumed, clustered, and
reindexed if needed to make them as efficient as possible and maintenance
free after that point (other than backups and archives).
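
For concreteness, range partitioning on 8.x is inheritance-based and looks
roughly like this (a sketch, not a complete setup):

-- one child table per date range, with a CHECK the planner can exclude on
CREATE TABLE ad_log_2008_08 (
    CHECK (start_time >= DATE '2008-08-01' AND start_time < DATE '2008-09-01')
) INHERITS (ad_log);
-- plus an INSERT trigger (or rule) on ad_log to route rows into the right
-- child, and constraint_exclusion = on so queries skip other partitions.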

Another benefit of partitioning is in backup / restore.  You can
incrementally back up only the individual partitions that have changed --
for large databases this reduces pg_dump and pg_restore times substantially.
To do this you combine regular expressions with the pg_dump "exclude tables"
or "include tables" flags.



On Thu, Aug 28, 2008 at 3:48 PM, David Rowley <[EMAIL PROTECTED]> wrote:

I once also had a similar performance problem when looking for all matching
rows between two timestamps. In fact that's why I'm here today. The problem
was with MySQL. I had some tables of around 10 million rows and all my
searching was timestamp based. MySQL didn't do what I wanted. I found that
using a CLUSTERED index with postgresql was lightning quick, yet mostly the
matching rows I was working with were not much over the 100k mark. I'm
wondering if clustering the table on ad_log_start_time will help cut down on
random reads.

That's if you can afford to block the users while postgresql clusters the
table.

If you're inserting in order of the start_time column (which I was) then the
cluster should almost maintain itself (I think), providing you're not
updating or deleting anyway; I'd assume that's the case, since it looks like
a log table.

David.



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Rainer Mager
Sent: 28 August 2008 09:06
To: pgsql-performance@postgresql.org
Subject: [PERFORM] indexing for distinct search in timestamp based table

I'm looking for some help in speeding up searches. My table is pretty simple
(see below), but somewhat large, and continuously growing. Currently it has
about 50 million rows.

The table is (I know I have excessive indexes, I'm trying to get the
appropriate ones and drop the extras):
                                 Table "public.ad_log"
    Column    |            Type             |                         Modifiers
--------------+-----------------------------+-------------------------------------------------------------
 ad_log_id    | integer                     | not null default nextval('ad_log_ad_log_id_seq'::regclass)
 channel_name | text                        | not null
 player_name  | text                        | not null
 ad_name      | text                        | not null
 start_time   | timestamp without time zone | not null
 end_time     | timestamp without time zone | not null

[PERFORM] indexing for distinct search in timestamp based table

2008-08-28 Thread Rainer Mager
I'm looking for some help in speeding up searches. My table is pretty simple
(see below), but somewhat large, and continuously growing. Currently it has
about 50 million rows.

The table is (I know I have excessive indexes, I'm trying to get the
appropriate ones and drop the extras):
                                 Table "public.ad_log"
    Column    |            Type             |                         Modifiers
--------------+-----------------------------+-------------------------------------------------------------
 ad_log_id    | integer                     | not null default nextval('ad_log_ad_log_id_seq'::regclass)
 channel_name | text                        | not null
 player_name  | text                        | not null
 ad_name      | text                        | not null
 start_time   | timestamp without time zone | not null
 end_time     | timestamp without time zone | not null
Indexes:
    "ad_log_pkey" PRIMARY KEY, btree (ad_log_id)
    "ad_log_channel_name_key" UNIQUE, btree (channel_name, player_name, ad_name, start_time, end_time)
    "ad_log_ad_and_start" btree (ad_name, start_time)
    "ad_log_ad_name" btree (ad_name)
    "ad_log_all" btree (channel_name, player_name, start_time, ad_name)
    "ad_log_channel_name" btree (channel_name)
    "ad_log_end_time" btree (end_time)
    "ad_log_player_and_start" btree (player_name, start_time)
    "ad_log_player_name" btree (player_name)
    "ad_log_start_time" btree (start_time)



The query I'm trying to speed up is below. In it the <field> tag can be one
of channel_name, player_name, or ad_name. I'm actually trying to return the
distinct values and I found GROUP BY to be slightly faster than using
DISTINCT. Also, any of those fields may be unspecified in the WHERE clauses,
in which case we use '%', but it seems Postgres optimizes that pretty well.

SELECT <field> FROM ad_log
WHERE channel_name LIKE :channel_name
AND player_name LIKE :player_name
AND ad_name LIKE :ad_name
AND start_time BETWEEN :start_date AND (date(:end_date) + 1)
GROUP BY <field> ORDER BY <field>


A typical query is:

explain analyze SELECT channel_name FROM ad_log WHERE channel_name LIKE '%'
AND ad_name LIKE '%' AND start_time BETWEEN '2008-07-01' AND
(date('2008-07-28') + 1) GROUP BY channel_name ORDER BY channel_name;

with the result being:
 
                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1163169.02..1163169.03 rows=5 width=10) (actual time=75460.187..75460.192 rows=15 loops=1)
   Sort Key: channel_name
   Sort Method:  quicksort  Memory: 17kB
   ->  HashAggregate  (cost=1163168.91..1163168.96 rows=5 width=10) (actual time=75460.107..75460.114 rows=15 loops=1)
         ->  Bitmap Heap Scan on ad_log  (cost=285064.30..1129582.84 rows=13434427 width=10) (actual time=8506.250..65771.597 rows=13701296 loops=1)
               Recheck Cond: ((start_time >= '2008-07-01 00:00:00'::timestamp without time zone) AND (start_time <= '2008-07-29'::date))
               Filter: ((channel_name ~~ '%'::text) AND (ad_name ~~ '%'::text))
               ->  Bitmap Index Scan on ad_log_start_time  (cost=0.00..281705.70 rows=13434427 width=0) (actual time=8488.443..8488.443 rows=13701296 loops=1)
                     Index Cond: ((start_time >= '2008-07-01 00:00:00'::timestamp without time zone) AND (start_time <= '2008-07-29'::date))
 Total runtime: 75460.361 ms


It seems to me there should be some way to create an index to speed this up,
but the various ones I've tried so far haven't helped. Any suggestions would
be greatly appreciated.

