Re: [GENERAL] efficient storing of urls

2004-03-01 Thread scott.marlowe
On Fri, 27 Feb 2004, Shane Wegner wrote:

> On Fri, Feb 27, 2004 at 06:00:36PM -0500, Sean Shanny wrote:
> > Shane,
> > 
> > Can you give an example of a query that has gotten slower due to the 
> > increasing size of the urls table with an explain analyze?
> 
> The database is a simple traffic monitoring tool so we have
> a hits table which gets a new row for every url accessed. 
> Very simple table
> 
> create table hits(
>     hit_date date not null,
>     hit_time time(0) without time zone not null,
>     url_id int references urls(id)
> );
> 
> A select to display the 100 most popular pages:
> explain analyze
> select count(*) as c, url
> from hits, urls
> where hit_date between '2004-01-01' and '2004-01-31'
>   and url_id = urls.id
> group by url
> order by c desc
> limit 100;
>
>                                                            QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=320189.71..320189.96 rows=100 width=68) (actual time=34156.080..34156.324 rows=100 loops=1)
>    ->  Sort  (cost=320189.71..320700.06 rows=204138 width=68) (actual time=34156.068..34156.208 rows=100 loops=1)
>          Sort Key: count(*)
>          ->  GroupAggregate  (cost=281214.19..283255.57 rows=204138 width=68) (actual time=32457.857..33584.861 rows=53460 loops=1)
>                ->  Sort  (cost=281214.19..281724.54 rows=204138 width=68) (actual time=32457.690..32873.446 rows=24 loops=1)
>                      Sort Key: urls.url
>                      ->  Merge Join  (cost=239594.05..244280.05 rows=204138 width=68) (actual time=21363.547..24385.213 rows=24 loops=1)
>                            Merge Cond: ("outer".url_id = "inner".id)
>                            ->  Sort  (cost=168400.38..168914.15 rows=205508 width=4) (actual time=14785.934..15156.772 rows=249350 loops=1)
>                                  Sort Key: hits.url_id
>                                  ->  Seq Scan on hits  (cost=0.00..148512.07 rows=205508 width=4) (actual time=40.265..12081.506 rows=249350 loops=1)
>                                        Filter: ((hit_date >= '2004-01-01'::date) AND (hit_date <= '2004-01-31'::date))
>                            ->  Sort  (cost=71193.67..72005.68 rows=324805 width=72) (actual time=6577.430..7422.945 rows=519307 loops=1)
>                                  Sort Key: urls.id
>                                  ->  Seq Scan on urls  (cost=0.00..7389.05 rows=324805 width=72) (actual time=0.110..1187.617 rows=324805 loops=1)
>  Total runtime: 34221.250 ms
> (16 rows)

Your single biggest jump in actual time seems to be coming from the seq 
scan on hits, which takes you from 40 ms to 12,000 ms.

Can you index the hit_date and see if that helps?
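Something along these lines ought to do it (untested, and the index name is only an example):

    create index hits_hit_date_idx on hits (hit_date);
    analyze hits;

Then re-run the explain analyze and see whether the planner switches from the seq scan to an index scan for the date range.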




Re: [GENERAL] efficient storing of urls

2004-02-28 Thread Chris Browne
[EMAIL PROTECTED] (Sean Shanny) writes:
> Can you give an example of a query that has gotten slower due to the
> increasing size of the urls table with an explain analyze?

There's a "known issue" in that URL strings commonly contain the prefix:

   http://www.

What you get, as a result, is that there's very little uniqueness
there, and indices are known to suffer.

There was a report last week that storing the URLs reversed, and putting a
functional index on the reversed form, led to greatly improved selectivity
of the index.

The approach being suggested here looks more like the "prefix splitting"
typical of Patricia tries; that's what the New Oxford English Dictionary
project used for building efficient text search indices.  It ought to be
pretty quick, but it adds a fair bit of complexity.

I suspect that doing the "reverse the URL" trick would be a cheaper
"fix."
-- 
"cbbrowne","@","ntlug.org"
http://www.ntlug.org/~cbbrowne/linuxxian.html
"This .signature is  shareware.  Send in $20 for  the fully registered
version..."
