Hi Hans,

> I have database with ~560.000 rows. One of the columns is an
> IP address (type:inet). Now, I want to only get the unique ip
> addresses out. That takes about 3minutes on a P4 2.66Ghz!

I don't know the PostgreSQL Code, I can only tell you from database
theory.

- a database uses an index if 'selectivness' is below 1-3% - above that
  it's cheaper to do a seqscan (linear reading vs random) - you have
  selectiveness of 100% that's why an index is ignored
  (selectivness = how many tupel of the table will probably be returned)
- UNIQUE requires sorting: you sort your table, then walk down the
  sorted table to get all unique keys (if previous = current ignore
  current)
- if sorting can't be done in memory a sort-merge approach is used: if
  you have 1 MB sort memory 1 MB is sorted and written to temp space;
  then the next 1 MB is sorted and merged with the previous 1 MB (using
  a pointer that walks down the sorted temp space - doesn't need memory);
  sort another 1 MB and merge with the 2 MB ... this is repeated until
  everything is sorted in temp space
  this leads to massive IO and that's very, very, very expensive

So you only have the chance to increase your sort memory (Pg config
switch).
If you need that more often you should change your database design.

> Here is the analysis of the query:
> ---------
> explain analyze SELECT distinct ip FROM rawdata WHERE sid='2';
> 
>  Unique  (cost=66182.54..69013.19 rows=125 width=11) (actual
> time=140856.389..183592.925 rows=125 loops=1)
>    ->  Sort  (cost=66182.54..67597.86 rows=566130 width=11) (actual
> time=140856.375..162975.490 rows=2232934 loops=1)
>          Sort Key: ip
>          ->  Seq Scan on rawdata  (cost=0.00..12086.62 rows=566130
> width=11) (actual time=0.041..51072.133 rows=2232934 loops=1)
>                Filter: (sid = 2::smallint)
>  Total runtime: 183836.671 ms
> (6 rows)
> 
> 
> And the description of the table:
> --------
>  \d rawdata
>                                        Table "public.rawdata"
>   Column   |            Type             |                       
> Modifiers
> -----------+-----------------------------+----------------------------------------------------------
>  rid       | bigint                      | not null default
> nextval('public.rawdata_rid_seq'::text)
>  sid       | smallint                    | not null
>  direction | smallint                    | not null
>  ip        | inet                        | not null
>  bytes     | integer                     | not null
>  timestamp | timestamp without time zone | not null
>  month     | smallint                    | default
> date_part('month'::text, now())
> Indexes:
>     "rawdata_sid_dir" btree (sid, direction)
>     "rawdata_sid_ip_dir" btree (sid, ip, direction)
> 
> 
> Now, as you can see the problem here is that pgsql first sorts all the
> 560k rows before applying the index to them in order to find out what
> is really relevant here.
> 
> Sorting takes up 2/3 of the time spent on this query. Now, any ideas
> to speed this up except having a separate table for the unique ip's?
> 
> Adding a separate index for only the ip field does not help, it does
> not get used.
> 
> -HK
> 
> _______________________________________________
> Dbmail-dev mailing list
> Dbmail-dev@dbmail.org
> http://twister.fastxs.net/mailman/listinfo/dbmail-dev

-- 
Thomas
-- 
http://www.tmueller.com for pgp key (95702B3B)

Reply via email to