Re: [PERFORM] merge join killing performance

2010-05-19 Thread Tom Lane
Matthew Wakeling <matt...@flymine.org> writes:
> On Tue, 18 May 2010, Scott Marlowe wrote:
>> Aggregate  (cost=902.41..902.42 rows=1 width=4)
>>     ->  Merge Join  (cost=869.97..902.40 rows=1 width=4)
>>         Merge Cond: (f.eid = ev.eid)
>>         ->  Index Scan using files_eid_idx on files f
>>             (cost=0.00..157830.39 rows=3769434 width=8)

> Okay, that's weird. How is the cost of the merge join only 902, when the
> cost of one of the branches is 157830, when there is no LIMIT?

It's apparently estimating (wrongly) that the merge join won't have to
scan very much of files before it can stop because it finds an eid
value larger than any eid in the other table.  So the issue here is an
inexact stats value for the max eid.

regards, tom lane
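
A quick way to see what the planner currently believes about the top end of
files.eid (a sketch, using only the table and column names from the plan
above) is to look at the statistics ANALYZE stores in pg_stats; the estimated
maximum is roughly the last histogram bound:

  SELECT null_frac, n_distinct, histogram_bounds
  FROM pg_stats
  WHERE tablename = 'files' AND attname = 'eid';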



Re: [PERFORM] merge join killing performance

2010-05-19 Thread Scott Marlowe
On Wed, May 19, 2010 at 10:53 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Matthew Wakeling <matt...@flymine.org> writes:
>> On Tue, 18 May 2010, Scott Marlowe wrote:
>>> Aggregate  (cost=902.41..902.42 rows=1 width=4)
>>>     ->  Merge Join  (cost=869.97..902.40 rows=1 width=4)
>>>         Merge Cond: (f.eid = ev.eid)
>>>         ->  Index Scan using files_eid_idx on files f
>>>             (cost=0.00..157830.39 rows=3769434 width=8)

>> Okay, that's weird. How is the cost of the merge join only 902, when the
>> cost of one of the branches is 157830, when there is no LIMIT?

> It's apparently estimating (wrongly) that the merge join won't have to
> scan very much of files before it can stop because it finds an eid
> value larger than any eid in the other table.  So the issue here is an
> inexact stats value for the max eid.

That's a big table.  I'll try cranking up the stats target for that
column and see what happens.  Thanks!
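
Raising the per-column statistics target and re-analyzing would look roughly
like this (a sketch; the table and column names are taken from the plan
above):

  ALTER TABLE files ALTER COLUMN eid SET STATISTICS 1000;
  ANALYZE files;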



Re: [PERFORM] merge join killing performance

2010-05-19 Thread Scott Marlowe
On Wed, May 19, 2010 at 10:53 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Matthew Wakeling <matt...@flymine.org> writes:
>> On Tue, 18 May 2010, Scott Marlowe wrote:
>>> Aggregate  (cost=902.41..902.42 rows=1 width=4)
>>>     ->  Merge Join  (cost=869.97..902.40 rows=1 width=4)
>>>         Merge Cond: (f.eid = ev.eid)
>>>         ->  Index Scan using files_eid_idx on files f
>>>             (cost=0.00..157830.39 rows=3769434 width=8)

>> Okay, that's weird. How is the cost of the merge join only 902, when the
>> cost of one of the branches is 157830, when there is no LIMIT?

> It's apparently estimating (wrongly) that the merge join won't have to
> scan very much of files before it can stop because it finds an eid
> value larger than any eid in the other table.  So the issue here is an
> inexact stats value for the max eid.

I changed stats target to 1000 for that field and still get the bad plan.



Re: [PERFORM] merge join killing performance

2010-05-19 Thread Scott Marlowe
On Wed, May 19, 2010 at 2:27 PM, Scott Marlowe <scott.marl...@gmail.com> wrote:
> On Wed, May 19, 2010 at 10:53 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Matthew Wakeling <matt...@flymine.org> writes:
>>> On Tue, 18 May 2010, Scott Marlowe wrote:
>>>> Aggregate  (cost=902.41..902.42 rows=1 width=4)
>>>>     ->  Merge Join  (cost=869.97..902.40 rows=1 width=4)
>>>>         Merge Cond: (f.eid = ev.eid)
>>>>         ->  Index Scan using files_eid_idx on files f
>>>>             (cost=0.00..157830.39 rows=3769434 width=8)

>>> Okay, that's weird. How is the cost of the merge join only 902, when the
>>> cost of one of the branches is 157830, when there is no LIMIT?

>> It's apparently estimating (wrongly) that the merge join won't have to
>> scan very much of files before it can stop because it finds an eid
>> value larger than any eid in the other table.  So the issue here is an
>> inexact stats value for the max eid.

> I changed stats target to 1000 for that field and still get the bad plan.

And of course ran analyze across the table...



Re: [PERFORM] merge join killing performance

2010-05-19 Thread Matthew Wakeling

On Wed, 19 May 2010, Scott Marlowe wrote:

>> It's apparently estimating (wrongly) that the merge join won't have to
>> scan very much of files before it can stop because it finds an eid
>> value larger than any eid in the other table.  So the issue here is an
>> inexact stats value for the max eid.

I wondered if it could be something like that, but I rejected that idea,
as it obviously wasn't the real-world case, and statistics should at least
get that right, if they are up to date.

> I changed stats target to 1000 for that field and still get the bad plan.

What do the stats say the max values are?

Matthew

--
Nog: Look! They've made me into an ensign!
O'Brien: I didn't know things were going so badly.
Nog: Frightening, isn't it?


Re: [PERFORM] merge join killing performance

2010-05-19 Thread Scott Marlowe
On Wed, May 19, 2010 at 7:46 PM, Matthew Wakeling <matt...@flymine.org> wrote:
> On Wed, 19 May 2010, Scott Marlowe wrote:

>>> It's apparently estimating (wrongly) that the merge join won't have to
>>> scan very much of files before it can stop because it finds an eid
>>> value larger than any eid in the other table.  So the issue here is an
>>> inexact stats value for the max eid.

> I wondered if it could be something like that, but I rejected that idea, as
> it obviously wasn't the real-world case, and statistics should at least get
> that right, if they are up to date.

>> I changed stats target to 1000 for that field and still get the bad plan.

> What do the stats say the max values are?

5277063,5423043,13843899 (I think).

# select count(distinct eid) from files;
 count
---
   365
(1 row)

# select count(*) from files;
  count
-
  3793748



Re: [PERFORM] merge join killing performance

2010-05-19 Thread Scott Marlowe
On Wed, May 19, 2010 at 8:04 PM, Scott Marlowe <scott.marl...@gmail.com> wrote:
> On Wed, May 19, 2010 at 7:46 PM, Matthew Wakeling <matt...@flymine.org> wrote:
>> On Wed, 19 May 2010, Scott Marlowe wrote:

>>>> It's apparently estimating (wrongly) that the merge join won't have to
>>>> scan very much of files before it can stop because it finds an eid
>>>> value larger than any eid in the other table.  So the issue here is an
>>>> inexact stats value for the max eid.

>> I wondered if it could be something like that, but I rejected that idea, as
>> it obviously wasn't the real-world case, and statistics should at least get
>> that right, if they are up to date.

>>> I changed stats target to 1000 for that field and still get the bad plan.

>> What do the stats say the max values are?

> 5277063,5423043,13843899 (I think).

> # select count(distinct eid) from files;
>  count
> ---
>    365
> (1 row)

> # select count(*) from files;
>   count
> -
>   3793748

A followup.  Of those rows,

select count(*) from files where eid is null;
  count
-
 3793215

are null.



Re: [PERFORM] merge join killing performance

2010-05-19 Thread Scott Marlowe
On Wed, May 19, 2010 at 8:06 PM, Scott Marlowe <scott.marl...@gmail.com> wrote:
> On Wed, May 19, 2010 at 8:04 PM, Scott Marlowe <scott.marl...@gmail.com> wrote:
>> On Wed, May 19, 2010 at 7:46 PM, Matthew Wakeling <matt...@flymine.org> wrote:
>>> On Wed, 19 May 2010, Scott Marlowe wrote:

>>>>> It's apparently estimating (wrongly) that the merge join won't have to
>>>>> scan very much of files before it can stop because it finds an eid
>>>>> value larger than any eid in the other table.  So the issue here is an
>>>>> inexact stats value for the max eid.

>>> I wondered if it could be something like that, but I rejected that idea, as
>>> it obviously wasn't the real-world case, and statistics should at least get
>>> that right, if they are up to date.

>>>> I changed stats target to 1000 for that field and still get the bad plan.

>>> What do the stats say the max values are?

>> 5277063,5423043,13843899 (I think).

>> # select count(distinct eid) from files;
>>  count
>> ---
>>    365
>> (1 row)

>> # select count(*) from files;
>>   count
>> -
>>   3793748

> A followup.  Of those rows,

> select count(*) from files where eid is null;
>   count
> -
>  3793215

> are null.

So, Tom, do you think it's possible that the planner isn't noticing
all those nulls and thinks it'll just take a row or two to get to the
value it needs to join on?
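
The fraction of NULLs the planner believes is in the column is recorded
alongside the other statistics, so it can be checked directly (same files.eid
assumption as above); with roughly 3.79 million of 3.79 million rows NULL,
null_frac should come out very close to 1.0:

  SELECT null_frac, n_distinct
  FROM pg_stats
  WHERE tablename = 'files' AND attname = 'eid';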



[PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-19 Thread David Jarvis
Hi,

I recently switched to PostgreSQL from MySQL so that I can use PL/R for data
analysis. The query in MySQL form (against a more complex table structure)
takes ~5 seconds to run. The PostgreSQL version I have yet to let finish, as
it takes over a minute. I think I have the correct table structure in place
(it is much simpler than the former structure in MySQL); however, the query
performs a full table scan against the parent table's 273 million rows.

*Questions*

What is the proper way to index the dates to avoid full table scans?

Options I have considered:

   - GIN
   - GiST
   - Rewrite the WHERE clause
   - Separate year_taken, month_taken, and day_taken columns to the tables

*Details*
The HashAggregate in the plan shows a cost of 10006220141.11, which is, I
suspect, on the astronomically huge side. A full table scan is being performed
on the measurement table (which itself holds neither data nor indexes). The
table aggregates 237 million rows from its child tables. The sluggishness
comes from this part of the query:

  m.taken BETWEEN
    /* Start date. */
    (extract( YEAR FROM m.taken )||'-01-01')::date AND
    /* End date. Calculated by checking to see if the end date wraps
       into the next year. If it does, then add 1 to the current year. */
    (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
      sign(
        (extract( YEAR FROM m.taken )||'-12-31')::date -
        (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
      ) AS text)||'-12-31')::date
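
Working the expression through by hand with an arbitrary sample date (the
value 1950-06-15 below is purely illustrative) shows that both bounds are
derived from the row's own taken value, so the planner has no constant range
to compare against an index or a CHECK constraint:

  SELECT d,
         (extract( YEAR FROM d )||'-01-01')::date AS start_date,
         (cast(extract( YEAR FROM d ) + greatest( -1 *
             sign(
               (extract( YEAR FROM d )||'-12-31')::date -
               (extract( YEAR FROM d )||'-01-01')::date ), 0
           ) AS text)||'-12-31')::date AS end_date
  FROM (SELECT DATE '1950-06-15' AS d) AS sample;
  -- start_date = 1950-01-01 and end_date = 1950-12-31 for any date in 1950.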

There are 72 child tables, each having a year index and a station index,
which are defined as follows:

CREATE TABLE climate.measurement_12_013 (
  -- Inherited from table climate.measurement_12_013:  id bigint NOT NULL DEFAULT nextval('climate.measurement_id_seq'::regclass),
  -- Inherited from table climate.measurement_12_013:  station_id integer NOT NULL,
  -- Inherited from table climate.measurement_12_013:  taken date NOT NULL,
  -- Inherited from table climate.measurement_12_013:  amount numeric(8,2) NOT NULL,
  -- Inherited from table climate.measurement_12_013:  category_id smallint NOT NULL,
  -- Inherited from table climate.measurement_12_013:  flag character varying(1) NOT NULL DEFAULT ' '::character varying,
  CONSTRAINT measurement_12_013_category_id_check CHECK (category_id = 7),
  CONSTRAINT measurement_12_013_taken_check CHECK (date_part('month'::text, taken)::integer = 12)
)
INHERITS (climate.measurement);

CREATE INDEX measurement_12_013_s_idx
  ON climate.measurement_12_013
  USING btree
  (station_id);
CREATE INDEX measurement_12_013_y_idx
  ON climate.measurement_12_013
  USING btree
  (date_part('year'::text, taken));

(Foreign key constraints to be added later.)
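
For reference, an expression index like measurement_12_013_y_idx is only
considered when a query compares that exact expression with constants; a
sketch of the kind of predicate those per-child year indexes can serve:

  EXPLAIN SELECT count(*)
  FROM climate.measurement_12_013
  WHERE date_part('year', taken) = 1950;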

The following query runs abysmally slow due to a full table scan:

SELECT
  count(1) AS measurements,
  avg(m.amount) AS amount
FROM
  climate.measurement m
WHERE
  m.station_id IN (
SELECT
  s.id
FROM
  climate.station s,
  climate.city c
WHERE
/* For one city... */
c.id = 5182 AND

/* Where stations are within an elevation range... */
s.elevation BETWEEN 0 AND 3000 AND

/* and within a specific radius... */
6371.009 * SQRT(
  POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
  (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
   POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
) <= 50
) AND

  /* Data before 1900 is shaky; insufficient after 2009. */
  extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND

  /* Whittled down by category... */
  m.category_id = 1 AND

  /* Between the selected days and years... */
  m.taken BETWEEN
    /* Start date. */
    (extract( YEAR FROM m.taken )||'-01-01')::date AND
    /* End date. Calculated by checking to see if the end date wraps
       into the next year. If it does, then add 1 to the current year. */
    (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
      sign(
        (extract( YEAR FROM m.taken )||'-12-31')::date -
        (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
      ) AS text)||'-12-31')::date
GROUP BY
  extract( YEAR FROM m.taken )

What are your thoughts?

Thank you!