Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Yeb Havinga

Hello David,
The table aggregates 237 million rows from its child tables. The 
sluggishness comes from this part of the query:


  m.taken BETWEEN
/* Start date. */
  (extract( YEAR FROM m.taken )||'-01-01')::date AND
/* End date. Calculated by checking to see if the end date wraps
  into the next year. If it does, then add 1 to the current year.
*/
(cast(extract( YEAR FROM m.taken ) + greatest( -1 *
  sign(
(extract( YEAR FROM m.taken )||'-12-31')::date -
(extract( YEAR FROM m.taken )||'-01-01')::date ), 0
) AS text)||'-12-31')::date
Either I had too less coffee and completely misunderstand this 
expression, or it is always true and can be omitted. Could you explain a 
bit what this part tries to do and maybe also show it's original 
counterpart in the source database?


regards,
Yeb Havinga


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Thom Brown
On 20 May 2010 06:06, David Jarvis thanga...@gmail.com wrote:
 Hi,

 I recently switched to PostgreSQL from MySQL so that I can use PL/R for data
 analysis. The query in MySQL form (against a more complex table structure)
 takes ~5 seconds to run. The query in PostgreSQL I have yet to let finish,
 as it takes over a minute. I think I have the correct table structure in
 place (it is much simpler than the former structure in MySQL), however the
 query executes a full table scan against the parent table's 273 million
 rows.

 Questions

 What is the proper way to index the dates to avoid full table scans?

 Options I have considered:

 GIN
 GiST
 Rewrite the WHERE clause
 Separate year_taken, month_taken, and day_taken columns to the tables

 Details

 The HashAggregate from the plan shows a cost of 10006220141.11, which is, I
 suspect, on the astronomically huge side. There is a full table scan on the
 measurement table (itself having neither data nor indexes) being performed.
 The table aggregates 237 million rows from its child tables. The
 sluggishness comes from this part of the query:

   m.taken BETWEEN
     /* Start date. */
   (extract( YEAR FROM m.taken )||'-01-01')::date AND
     /* End date. Calculated by checking to see if the end date wraps
   into the next year. If it does, then add 1 to the current year.
     */
     (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
   sign(
     (extract( YEAR FROM m.taken )||'-12-31')::date -
     (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
     ) AS text)||'-12-31')::date

 There are 72 child tables, each having a year index and a station index,
 which are defined as follows:

     CREATE TABLE climate.measurement_12_013 (
     -- Inherited from table climate.measurement_12_013:  id bigint NOT NULL
 DEFAULT nextval('climate.measurement_id_seq'::regclass),
     -- Inherited from table climate.measurement_12_013:  station_id integer
 NOT NULL,
     -- Inherited from table climate.measurement_12_013:  taken date NOT
 NULL,
     -- Inherited from table climate.measurement_12_013:  amount numeric(8,2)
 NOT NULL,
     -- Inherited from table climate.measurement_12_013:  category_id
 smallint NOT NULL,
     -- Inherited from table climate.measurement_12_013:  flag character
 varying(1) NOT NULL DEFAULT ' '::character varying,
   CONSTRAINT measurement_12_013_category_id_check CHECK (category_id =
 7),
   CONSTRAINT measurement_12_013_taken_check CHECK
 (date_part('month'::text, taken)::integer = 12)
     )
     INHERITS (climate.measurement)

     CREATE INDEX measurement_12_013_s_idx
   ON climate.measurement_12_013
   USING btree
   (station_id);
     CREATE INDEX measurement_12_013_y_idx
   ON climate.measurement_12_013
   USING btree
   (date_part('year'::text, taken));

 (Foreign key constraints to be added later.)

 The following query runs abysmally slow due to a full table scan:

     SELECT
   count(1) AS measurements,
   avg(m.amount) AS amount
     FROM
   climate.measurement m
     WHERE
   m.station_id IN (
     SELECT
   s.id
     FROM
   climate.station s,
   climate.city c
     WHERE
     /* For one city... */
     c.id = 5182 AND

     /* Where stations are within an elevation range... */
     s.elevation BETWEEN 0 AND 3000 AND

     /* and within a specific radius... */
     6371.009 * SQRT(
   POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
     (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
   POW(RADIANS(c.longitude_decimal - s.longitude_decimal),
 2))
     ) = 50
     ) AND

   /* Data before 1900 is shaky; insufficient after 2009. */
   extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND

   /* Whittled down by category... */
   m.category_id = 1 AND

   /* Between the selected days and years... */
   m.taken BETWEEN
    /* Start date. */
    (extract( YEAR FROM m.taken )||'-01-01')::date AND
     /* End date. Calculated by checking to see if the end date wraps
    into the next year. If it does, then add 1 to the current year.
     */
     (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
   sign(
     (extract( YEAR FROM m.taken )||'-12-31')::date -
     (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
     ) AS text)||'-12-31')::date
     GROUP BY
   extract( YEAR FROM m.taken )

 What are your thoughts?

 Thank you!



Could you provide the EXPLAIN output for that slow query?

Thom

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Matthew Wakeling

On Wed, 19 May 2010, David Jarvis wrote:

 extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND


That portion of the WHERE clause cannot use an index on m.taken. Postgres 
does not look inside functions (like extract) to see if something 
indexable is present. To get an index to work, you could create an index 
on (extract(YEAR FROM m.taken)).


Matthew

--
Here we go - the Fairy Godmother redundancy proof.
   -- Computer Science Lecturer

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Tom Lane
Matthew Wakeling matt...@flymine.org writes:
 On Wed, 19 May 2010, David Jarvis wrote:
 extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND

 That portion of the WHERE clause cannot use an index on m.taken. Postgres 
 does not look inside functions (like extract) to see if something 
 indexable is present. To get an index to work, you could create an index 
 on (extract(YEAR FROM m.taken)).

What you really need to do is not do date arithmetic using text-string
operations.  The planner has no intelligence about that whatsoever.
Convert the operations to something natural using real date or timestamp
types, and then look at what indexes you need.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] merge join killing performance

2010-05-20 Thread Tom Lane
Scott Marlowe scott.marl...@gmail.com writes:
 So, Tom, so you think it's possible that the planner isn't noticing
 all those nulls and thinks it'll just take a row or two to get to the
 value it needs to join on?

Could be.  I don't have time right now to chase through the code, but
that sounds like a plausible theory.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] merge join killing performance

2010-05-20 Thread Scott Marlowe
On Thu, May 20, 2010 at 8:28 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Scott Marlowe scott.marl...@gmail.com writes:
 So, Tom, so you think it's possible that the planner isn't noticing
 all those nulls and thinks it'll just take a row or two to get to the
 value it needs to join on?

 Could be.  I don't have time right now to chase through the code, but
 that sounds like a plausible theory.

K.  I think I'll try an index on that field where not null and see
if that helps.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
Hi,

I have posted an image of the user inputs here:

http://i.imgur.com/MUkuZ.png

The problem is that I am given a range of days (Dec 22 - Mar 22) over a
range of years (1900 - 2009) and the range of days can span from one year to
the next. This is not the same as saying Dec 22, 1900 to Mar 22, 2009, for
which I do not need date math.

What you really need to do is not do date arithmetic using text-string
 operations.  The planner has no intelligence about that whatsoever.
 Convert the operations to something natural using real date or timestamp
 types, and then look at what indexes you need.


Any suggestions on how to go about this?

Thanks again!

Dave


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Thom Brown
On 20 May 2010 17:36, David Jarvis thanga...@gmail.com wrote:
 Hi, Thom.

 The query is given two items:

 Range of years
 Range of days

 I need to select all data between the range of days (e.g., Dec 22 - Mar 22)
 over the range of years (e.g., 1950 - 1970), such as shown here:

 http://i.imgur.com/MUkuZ.png

 For Jun 1 to Jul 1 it would be no problem because they the same year. But
 for Dec 22 to Mar 22, it is difficult because Mar 22 is in the next year
 (relative to Dec 22).

 How do I do that without strings?

 Dave



Okay, get your app to convert the month-date to a day of year, so we
have year_start, year_end, day_of_year_start, day_of_year_end

and your where clause would something like this:

WHERE extract(YEAR from m.taken) BETWEEN year1 and year2
AND (
extract(DOY from m.taken) BETWEEN day_of_year_start AND day_of_year_end
OR (
extract(DOY from m.taken) = day_of_year_start OR extract(DOY 
from
m.taken) = day_of_year_end
)
)

... substituting the placeholders where they appear.

So if we had:

year1=1941
year2=1952
day_of_year_start=244 (based on input date of 1st September)
day_of_year_end=94 (based on 4th April)

We'd have:

WHERE extract(YEAR from m.taken) BETWEEN 1941 and 1952
AND (
extract(DOY from m.taken) BETWEEN 244 AND 94
OR (
extract(DOY from m.taken) = 244 OR extract(DOY from m.taken) 
= 94
)
)

Then you could add expression indexes for the YEAR and DOY extract parts, like:

CREATE INDEX idx_taken_doy ON climate.measurement (EXTRACT(DOY from taken));
CREATE INDEX idx_taken_year ON climate.measurement (EXTRACT(YEAR from taken));

Although maybe you don't need those, depending on how the date
datatype matching works in the planner with the EXTRACT function.

Regards

Thom

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Thom Brown
On 20 May 2010 19:36, Thom Brown thombr...@gmail.com wrote:
 On 20 May 2010 17:36, David Jarvis thanga...@gmail.com wrote:
 Hi, Thom.

 The query is given two items:

 Range of years
 Range of days

 I need to select all data between the range of days (e.g., Dec 22 - Mar 22)
 over the range of years (e.g., 1950 - 1970), such as shown here:

 http://i.imgur.com/MUkuZ.png

 For Jun 1 to Jul 1 it would be no problem because they the same year. But
 for Dec 22 to Mar 22, it is difficult because Mar 22 is in the next year
 (relative to Dec 22).

 How do I do that without strings?

 Dave



 Okay, get your app to convert the month-date to a day of year, so we
 have year_start, year_end, day_of_year_start, day_of_year_end

 and your where clause would something like this:

 WHERE extract(YEAR from m.taken) BETWEEN year1 and year2
 AND (
        extract(DOY from m.taken) BETWEEN day_of_year_start AND day_of_year_end
        OR (
                extract(DOY from m.taken) = day_of_year_start OR extract(DOY 
 from
 m.taken) = day_of_year_end
        )
 )

 ... substituting the placeholders where they appear.

 So if we had:

 year1=1941
 year2=1952
 day_of_year_start=244 (based on input date of 1st September)
 day_of_year_end=94 (based on 4th April)

 We'd have:

 WHERE extract(YEAR from m.taken) BETWEEN 1941 and 1952
 AND (
        extract(DOY from m.taken) BETWEEN 244 AND 94
        OR (
                extract(DOY from m.taken) = 244 OR extract(DOY from m.taken) 
 = 94
        )
 )

 Then you could add expression indexes for the YEAR and DOY extract parts, 
 like:

 CREATE INDEX idx_taken_doy ON climate.measurement (EXTRACT(DOY from taken));
 CREATE INDEX idx_taken_year ON climate.measurement (EXTRACT(YEAR from taken));

 Although maybe you don't need those, depending on how the date
 datatype matching works in the planner with the EXTRACT function.

 Regards

 Thom


Actually, you could change that last bit from:

 OR (
                extract(DOY from m.taken) = day_of_year_start OR
extract(DOY from m.taken) = day_of_year_end
       )

to

OR extract(DOY from m.taken) NOT BETWEEN day_of_year_end AND day_of_year_start

That would be tidier and simpler :)

Thom

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Tom Lane
Thom Brown thombr...@gmail.com writes:
 On 20 May 2010 17:36, David Jarvis thanga...@gmail.com wrote:
 Okay, get your app to convert the month-date to a day of year, so we
 have year_start, year_end, day_of_year_start, day_of_year_end

 and your where clause would something like this:

 WHERE extract(YEAR from m.taken) BETWEEN year1 and year2
 AND (
   extract(DOY from m.taken) BETWEEN day_of_year_start AND day_of_year_end
   OR (
   extract(DOY from m.taken) = day_of_year_start OR extract(DOY 
 from
 m.taken) = day_of_year_end
   )
 )

extract(DOY) seems a bit problematic here, because its day numbering is
going to be different between leap years and non-leap years, and David's
problem statement doesn't allow for off-by-one errors.  You could
certainly invent your own function that worked similarly but always
translated a given month/day to the same number.

The other thing that's messy here is the wraparound requirement.
Rather than trying an OR like the above (which I think doesn't quite
work anyway --- won't it select everything?), it would be better if
you can have the app distinguish wraparound from non-wraparound cases
and issue different queries in the two cases.  In the non-wrap case
(start_day  end_day) it's pretty easy, just
my_doy(m.taken) BETWEEN start_val AND end_val
The easy way to handle the wrap case is
my_doy(m.taken) = start_val OR my_doy(m.taken) = end_val
although I can't help feeling there should be a smarter way to do
this where you can use an AND range check on some modified expression
derived from the date.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Thom Brown
On 20 May 2010 20:02, Tom Lane t...@sss.pgh.pa.us wrote:
 Thom Brown thombr...@gmail.com writes:
 On 20 May 2010 17:36, David Jarvis thanga...@gmail.com wrote:
 Okay, get your app to convert the month-date to a day of year, so we
 have year_start, year_end, day_of_year_start, day_of_year_end

 and your where clause would something like this:

 WHERE extract(YEAR from m.taken) BETWEEN year1 and year2
 AND (
       extract(DOY from m.taken) BETWEEN day_of_year_start AND day_of_year_end
       OR (
               extract(DOY from m.taken) = day_of_year_start OR extract(DOY 
 from
 m.taken) = day_of_year_end
       )
 )

 extract(DOY) seems a bit problematic here, because its day numbering is
 going to be different between leap years and non-leap years, and David's
 problem statement doesn't allow for off-by-one errors.  You could
 certainly invent your own function that worked similarly but always
 translated a given month/day to the same number.

 The other thing that's messy here is the wraparound requirement.
 Rather than trying an OR like the above (which I think doesn't quite
 work anyway --- won't it select everything?)

No.  It only would if using BETWEEN SYMMETRIC.

Like if m.taken is '2003-02-03', using a start day of year as 11th Nov
and end as 17th Feb, it would match the 2nd part of the outer OR
expression.  If you changed the end day of year to 2nd Feb, it would
yield no result as nothing is between 11th Nov and 17th Feb as it's a
negative difference, and 2nd Feb is lower than the taken date so fails
to match the first half of the inner most OR expression.

 , it would be better if
 you can have the app distinguish wraparound from non-wraparound cases
 and issue different queries in the two cases.  In the non-wrap case
 (start_day  end_day) it's pretty easy, just
        my_doy(m.taken) BETWEEN start_val AND end_val
 The easy way to handle the wrap case is
        my_doy(m.taken) = start_val OR my_doy(m.taken) = end_val
 although I can't help feeling there should be a smarter way to do
 this where you can use an AND range check on some modified expression
 derived from the date.

                        regards, tom lane


Yes, I guess I agree that the app can run different queries depending
on which date is higher.  I hadn't factored leap years into the
equation.  Can't think of what could be done for those cases off the
top of my head.  What is really needed is a way to match against day
and month parts instead of day, month and year without resorting
to casting to text of course.

Thom

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] [BUGS] Query causing explosion of temp space with join involving partitioning

2010-05-20 Thread Tom Lane
Krzysztof Nienartowicz krzysztof.nienartowicz.c...@gmail.com writes:
 surveys- SELECT  t1.SURVEY_PK, t1.SOURCE_PK, t1.TSTYPE,  t1.VALS
 surveys-   FROM sources t0 ,TS t1 where
 surveys-   (t0.SURVEYID = 16 AND t0.SRCID = 203510110032281 AND
 t0.SRCID = 203520107001677 and t0.SURVEYID = t1.SURVEY_PK AND t0.SRCID =
 t1.SOURCE_PK ) ORDER BY t0.SURVEYID ASC, t0.SRCID ASC

We don't make any attempt to infer derived inequality conditions,
so no, those constraints on t0.srcid won't be propagated over to
t1.source_pk.  Sorry.  It's been suggested before, but it would be
a lot of new mechanism and expense in the planner, and for most
queries it'd just slow things down to try to do that.

 I have around 30 clients running the same query with different
 parameters, but the query always returns 1000 rows (boundary values
 are pre-calculated,so it's like traversal of the equiwidth histogram
 if it comes to srcid/source_pk) and the rows from parallel queries
 cannot be overlapping. Usually query returns within around a second.
 I noticed however there are some queries that hang for many hours and
 what is most curious some of them created several GB of temp files.

Can you show us the query plan for the slow cases?

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
When using MySQL, the performance was okay (~5 seconds per query) using:

  date( concat_ws( '-', y.year, m.month, d.day ) ) between
-- Start date.
date( concat_ws( '-', y.year, $P{Month1}, $P{Day1} ) ) AND
-- End date. Calculated by checking to see if the end date wraps
-- into the next year. If it does, then add 1 to the current year.
--
date(
  concat_ws( '-',
y.year + greatest( -1 *
  sign(
datediff(
  date(
concat_ws('-', y.year, $P{Month2}, $P{Day2} )
  ),
  date(
concat_ws('-', y.year, $P{Month1}, $P{Day1} )
  )
)
  ), 0
), $P{Month2}, $P{Day2}
  )
)

This calculated the correct start days and end days, including leap years.

With MySQL, I normalized the date into three different tables: year
references, month references, and day references. The days contained only
the day (of the month) the measurement was made and the measured value. The
month references contained the month number for the measurement. The year
references had the years and station. Each table had its own index on the
year, month, or day.

When I had proposed that solution to the mailing list, I was introduced to a
more PostgreSQL-way, which was to use indexes on the date field.

In PostgreSQL, I have a single measurement table for the data (divided
into 72 child tables), which includes the date and station. I like this
because it feels clean and it is easier to understand. So far, however, it
has not been fast.

I was thinking that I could add three more columns to the measurement table:

year_taken, month_taken, day_taken

Then index those. That should allow me to avoid extracting years, months,
and days from the *m.taken* date column.

What do you think?

Thanks again!
Dave


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Tom Lane
David Jarvis thanga...@gmail.com writes:
 I was thinking that I could add three more columns to the measurement table:
 year_taken, month_taken, day_taken
 Then index those. That should allow me to avoid extracting years, months,
 and days from the *m.taken* date column.

You could, but I don't think there's any advantage to that versus
putting indexes on extract(day from taken) etc.  The extra fields
eat more space in the table proper, and the functional index isn't
really any more expensive than a plain index.  Not to mention that
you can have bugs with changing the date and forgetting to update
the derived columns, etc etc.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
What if I were to have the application pass in two sets of date ranges?

For the condition of Dec 22 to Mar 22:

Dec 22 would become:

   - Dec 22 - Dec 31

Mar 22 would become:

   - Jan 1 - Mar 22

The first range would always be for the current year; the second range would
always be for the year following the current year.

Would that allow PostgreSQL to use the index?

Dave


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Tom Lane
David Jarvis thanga...@gmail.com writes:
 What if I were to have the application pass in two sets of date ranges?
 For the condition of Dec 22 to Mar 22:
 Dec 22 would become:
- Dec 22 - Dec 31
 Mar 22 would become:
- Jan 1 - Mar 22

I think what you're essentially describing here is removing the OR from
the query in favor of issuing two queries and then combining the results
in the app.  Yeah, you could do that, but one would hope that it isn't
faster ;-)

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
I was hoping to eliminate this part of the query:

(cast(extract( YEAR FROM m.taken ) + greatest( -1 *
  sign(
(extract( YEAR FROM m.taken )||'-12-31')::date -
(extract( YEAR FROM m.taken )||'-01-01')::date ), 0
) AS text)||'-12-31')::date

That uses functions to create the dates, which is definitely the problem.
I'd still have the query return all the results for both data sets. If
providing the query with two data sets won't work, what will?

Dave


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Tom Lane
David Jarvis thanga...@gmail.com writes:
 I was hoping to eliminate this part of the query:
 (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
   sign(
 (extract( YEAR FROM m.taken )||'-12-31')::date -
 (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
 ) AS text)||'-12-31')::date

 That uses functions to create the dates, which is definitely the problem.

Well, it's not the functions per se that's the problem, it's the lack of
a useful index on the expression.  But as somebody remarked upthread,
that expression doesn't look correct at all.  Doesn't the whole
greatest() subexpression reduce to a constant?

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
Hi,

I was still referring to the measurement table. You have an index on
 stationid, but still seem to be getting a sequential scan. Maybe the planner
 does not realise that you are selecting a small number of stations. Posting
 an EXPLAIN ANALYSE would really help here.


Here is the result from an *EXPLAIN ANALYZE*:

HashAggregate  (cost=5486752.27..5486756.27 rows=200 width=12) (actual
time=314328.657..314328.728 rows=110 loops=1)
  -  Hash Semi Join  (cost=1045.52..5451155.11 rows=4746289 width=12)
(actual time=197.950..313605.795 rows=463926 loops=1)
Hash Cond: (m.station_id = s.id)
-  Append  (cost=0.00..5343318.08 rows=4746289 width=16) (actual
time=74.411..306533.820 rows=42737997 loops=1)
  -  Seq Scan on measurement m  (cost=0.00..148.00 rows=1
width=20) (actual time=0.001..0.001 rows=0 loops=1)
Filter: ((category_id = 1) AND (date_part('year'::text,
(taken)::timestamp without time zone) = 1900::double precision) AND
(date_part('year'::text, (taken)::timestamp without time zone) =
2009::double precision) AND (taken = (((date_part('year'::text,
(taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND
(taken = date_part('year'::text, (taken)::timestamp without time zone)
+ GREATEST(((-1)::double precision * sign((date_part('year'::text,
(taken)::timestamp without time zone))::text || '-12-31'::text))::date -
(((date_part('year'::text, (taken)::timestamp without time zone))::text ||
'-01-01'::text))::date))::double precision)), 0::double precision)))::text
|| '-12-31'::text))::date))
  -  Seq Scan on measurement_01_001 m  (cost=0.00..438102.26
rows=389080 width=16) (actual time=74.409..24800.171 rows=3503256 loops=1)
Filter: ((category_id = 1) AND (date_part('year'::text,
(taken)::timestamp without time zone) = 1900::double precision) AND
(date_part('year'::text, (taken)::timestamp without time zone) =
2009::double precision) AND (taken = (((date_part('year'::text,
(taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND
(taken = date_part('year'::text, (taken)::timestamp without time zone)
+ GREATEST(((-1)::double precision * sign((date_part('year'::text,
(taken)::timestamp without time zone))::text || '-12-31'::text))::date -
(((date_part('year'::text, (taken)::timestamp without time zone))::text ||
'-01-01'::text))::date))::double precision)), 0::double precision)))::text
|| '-12-31'::text))::date))
  -  Seq Scan on measurement_02_001 m  (cost=0.00..399834.28
rows=354646 width=16) (actual time=29.217..22209.877 rows=3196631 loops=1)
Filter: ((category_id = 1) AND (date_part('year'::text,
(taken)::timestamp without time zone) = 1900::double precision) AND
(date_part('year'::text, (taken)::timestamp without time zone) =
2009::double precision) AND (taken = (((date_part('year'::text,
(taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND
(taken = date_part('year'::text, (taken)::timestamp without time zone)
+ GREATEST(((-1)::double precision * sign((date_part('year'::text,
(taken)::timestamp without time zone))::text || '-12-31'::text))::date -
(((date_part('year'::text, (taken)::timestamp without time zone))::text ||
'-01-01'::text))::date))::double precision)), 0::double precision)))::text
|| '-12-31'::text))::date))
  -  Seq Scan on measurement_03_001 m  (cost=0.00..438380.23
rows=389148 width=16) (actual time=15.915..24366.766 rows=3503937 loops=1)
Filter: ((category_id = 1) AND (date_part('year'::text,
(taken)::timestamp without time zone) = 1900::double precision) AND
(date_part('year'::text, (taken)::timestamp without time zone) =
2009::double precision) AND (taken = (((date_part('year'::text,
(taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND
(taken = date_part('year'::text, (taken)::timestamp without time zone)
+ GREATEST(((-1)::double precision * sign((date_part('year'::text,
(taken)::timestamp without time zone))::text || '-12-31'::text))::date -
(((date_part('year'::text, (taken)::timestamp without time zone))::text ||
'-01-01'::text))::date))::double precision)), 0::double precision)))::text
|| '-12-31'::text))::date))
  -  Seq Scan on measurement_04_001 m  (cost=0.00..432850.57
rows=384539 width=16) (actual time=15.852..24280.031 rows=3461931 loops=1)
Filter: ((category_id = 1) AND (date_part('year'::text,
(taken)::timestamp without time zone) = 1900::double precision) AND
(date_part('year'::text, (taken)::timestamp without time zone) =
2009::double precision) AND (taken = (((date_part('year'::text,
(taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND
(taken = date_part('year'::text, (taken)::timestamp without time zone)
+ GREATEST(((-1)::double precision * sign((date_part('year'::text,
(taken)::timestamp without time zone))::text || '-12-31'::text))::date -

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
The greatest() expression reduces to either the current year (year + 0) or
the next year (year + 1) by taking the sign of the difference in start/end
days. This allows me to derive an end date, such as:

Dec 22, 1900 to Mar 22, 1901

Then I check if the measured date falls between those two dates.

The expression might not be correct as I'm still quite new to PostgreSQL's
syntax.

Dave


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Stephen Frost
* David Jarvis (thanga...@gmail.com) wrote:
 I was hoping to eliminate this part of the query:
 (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
   sign(
 (extract( YEAR FROM m.taken )||'-12-31')::date -
 (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
 ) AS text)||'-12-31')::date

 That uses functions to create the dates, which is definitely the problem. 
[...]
 The greatest() expression reduces to either the current year (year + 0) or
 the next year (year + 1) by taking the sign of the difference in start/end
 days. This allows me to derive an end date, such as:
 
 Dec 22, 1900 to Mar 22, 1901

Something in here really smells fishy to me.  Those extract's above are
working on values which are from the table..  Why aren't you using these
functions to figure out how to construct the actual dates based on the
values provided by the *user*..?

Looking at your screenshot, I think you need to take those two date
values that the user provides, make them into actual dates (maybe you
need a CASE statement or something similar, that shouldn't be that hard,
and PG should just run that whole bit once, since to PG's point of view,
it's all constants), and then use those dates to query the tables.

Also, you're trying to do constraint_exclusion, but have you made sure
that it's turned on?  And have you made sure that those constraints are
really the right ones and that they make sense?  You're using a bunch of
extract()'s there too, why not just specify a CHECK constraint on the
date ranges which are allowed in the table..?

Maybe I've misunderstood the whole point here, but I don't think so.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Yeb Havinga

Tom Lane wrote:

David Jarvis thanga...@gmail.com writes:
  

I was hoping to eliminate this part of the query:
(cast(extract( YEAR FROM m.taken ) + greatest( -1 *
  sign(
(extract( YEAR FROM m.taken )||'-12-31')::date -
(extract( YEAR FROM m.taken )||'-01-01')::date ), 0
) AS text)||'-12-31')::date

That uses functions to create the dates, which is definitely the problem.



Well, it's not the functions per se that's the problem, it's the lack of
a useful index on the expression.  But as somebody remarked upthread,
that expression doesn't look correct at all.  Doesn't the whole
greatest() subexpression reduce to a constant?
  
That somebody was probably me. I still think the whole BETWEEN 
expression is a tautology. A small test did not provide a 
counterexample. In the select below everything but the select was 
copy/pasted.


create table m (taken timestamptz);
insert into m values (now());
insert into m values ('1900-12-31');
insert into m values ('2000-04-06');
select m.taken BETWEEN
   /* Start date. */
 (extract( YEAR FROM m.taken )||'-01-01')::date AND
   /* End date. Calculated by checking to see if the end date wraps
 into the next year. If it does, then add 1 to the current year.
   */
   (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
 sign(
   (extract( YEAR FROM m.taken )||'-12-31')::date -
   (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
   ) AS text)||'-12-31')::date from m;
 ?column?
--
t
t
t
(3 rows)

Another thing is that IF the climate measurements is partitioned on time 
(e.g each year?), then a function based index on the year part of 
m.taken is useless, pardon my french. I'm not sure if it is partitioned 
that way but it is an interesting thing to inspect, and perhaps rewrite 
the query to use constraint exclusion.


regards,
Yeb Havinga


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Stephen Frost
* David Jarvis (thanga...@gmail.com) wrote:
 There are 72 child tables, each having a year index and a station index,
 which are defined as follows:

S, my thoughts:

Partition by something that makes sense...  Typically, I'd say that you
would do it by the category id and when the measurement was taken.  Then
set up the appropriate check constraints on that so that PG can use
constraint_exclusion to identify what table it needs to actually go look
in.  How much data are we talking about, by the way? (# of rows)  If
you're not in the milions, partitioning at all is probably overkill and
might be part of the problem here..

create table climate.measurement_12_013 (
id bigint not null DEFAULT 
nextval('climate.measurement_id_seq'::regclass),
station_id integer not null,
taken date not null,
amount numeric(8,2) not null,
category_id integer not null,
flag varchar(1) not null default ' ',
check (category_id = 7),
check (taken = '1913-12-01' and taken = '1913-12-31')
)
inherits (climate.measurement);

CREATE INDEX measurement_12_013_s_idx
  ON climate.measurement_12_013
  USING btree
  (station_id);

CREATE INDEX measurement_12_013_d_idx
  ON climate.measurement_12_013
  USING btree
  (taken);

SELECT
  count(1) AS measurements,
  avg(m.amount) AS amount
FROM
  climate.measurement m
WHERE
  m.station_id IN (
SELECT
  s.id
FROM
  climate.station s,
  climate.city c
WHERE
/* For one city... */
c.id = 5182 AND

/* Where stations are within an elevation range... */
s.elevation BETWEEN 0 AND 3000 AND

/* and within a specific radius... */
-- Seriously, you should be using PostGIS here, that can
-- then use a GIST index to do this alot faster with a
-- bounding box...
6371.009 * SQRT(
  POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
(COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
  POW(RADIANS(c.longitude_decimal - s.longitude_decimal),
2))
) = 50
) AND

  /* Data before 1900 is shaky; insufficient after 2009. */
  -- I have no idea why this is here..  Aren't you forcing
  -- this already in your application code that's checking
  -- user input values?  Also, do you actually *have* any
  -- data outside this range?  If so, just pull out the
  -- tables with that data from the inheiritance
  -- m.taken = '1900-01-01' AND m.taken = '2009-12-31'
  -- extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND

  /* Whittled down by category... */
  m.category_id = 1 AND

  /* Between the selected days and years... */
   CASE
 WHEN (user_start_year || user_start_day = user_stop_year || 
user_stop) THEN
 m.taken BETWEEN user_start_year || user_start_day  AND 
user_stop_year || user_stop
 WHEN (user_start_year || user_start_day  user_stop_year || 
user_stop) THEN
 m.taken BETWEEN (user_start_year || user_start_day)::date  AND
 ((user_stop_year || user_stop)::date + '1
 year'::interval)::date
-- I don't think you need/want this..?
-- GROUP BY
--  extract( YEAR FROM m.taken )

Enjoy,

Stephen


signature.asc
Description: Digital signature


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Stephen Frost
* David Jarvis (thanga...@gmail.com) wrote:
 I was still referring to the measurement table. You have an index on
  stationid, but still seem to be getting a sequential scan. Maybe the planner
  does not realise that you are selecting a small number of stations. Posting
  an EXPLAIN ANALYSE would really help here.
 
 
 Here is the result from an *EXPLAIN ANALYZE*:

Yeah..  this is a horrible, horrible plan.  It does look like you've got
some serious data tho, at least.  Basically, PG is sequentially scanning
through all of the tables in your partitioning setup.  What is
constraint_exclusion set to?  What version of PG is this?  Do the
results og this query look at all correct to you?

Have you considered an index on elevation, btw?  How many records in
that city table are there and how many are actually in that range?

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
Hi,

~300 million measurements
~12000 stations (not 7 as I mentioned before)
~5500 cities

some serious data tho, at least.  Basically, PG is sequentially scanning
 through all of the tables in your partitioning setup.  What is
 constraint_exclusion set to?  What version of PG is this?  Do the
 results og this query look at all correct to you?


PG 8.4

show constraint_exclusion;
partition

With so much data, it is really hard to tell if the query looks okay without
having it visualized. I can't visualize it until I have the query set up
correctly. At the moment it looks like the query is wrong. :-(

Have you considered an index on elevation, btw?  How many records in
 that city table are there and how many are actually in that range?


I've since added a constraint on elevation; it'll help a bit:

CREATE INDEX station_elevation_idx
  ON climate.station
  USING btree
  (elevation);

Dave


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
Hi,

   check (taken = '1913-12-01' and taken = '1913-12-31')


I don't think I want to constrain by year, for a few reasons:

1. There are a lot of years -- over 110.
2. There will be more years added (both in the future for 2010 and in the
past as I get data from other sources).

Currently I have it constrained by month and category. Each table then has
about 3 million rows (which is 216 million, but some tables have more, which
brings it to 273 million).


  /* Data before 1900 is shaky; insufficient after 2009. */
   -- I have no idea why this is here..  Aren't you forcing


Mostly temporary. It is also constrained by the user interface; however that
will likely change in the future. It should not be present in the database
structure itself.



  /* Between the selected days and years... */

   CASE
 WHEN (user_start_year || user_start_day = user_stop_year ||
 user_stop) THEN
 m.taken BETWEEN user_start_year || user_start_day  AND
 user_stop_year || user_stop
 WHEN (user_start_year || user_start_day  user_stop_year ||
 user_stop) THEN
 m.taken BETWEEN (user_start_year || user_start_day)::date  AND
 ((user_stop_year || user_stop)::date + '1
 year'::interval)::date
-- I don't think you need/want this..?


User selects this:

1. Years: 1950 to 1974
2. Days: Dec 22 to Mar 22

This means that the query must average data between Dec 22 1950 and Mar 22
1951 for the year of 1950. For 1951, the range is Dec 22 1951 to Mar 22
1952, and so on. If we switch the calendar (or alter the seasons) so that
winter starts Jan 1st (or ends Dec 31), then I could simplify the query. ;-)

Dave


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
Hi,

Something in here really smells fishy to me.  Those extract's above are
 working on values which are from the table..  Why aren't you using these
 functions to figure out how to construct the actual dates based on the
 values provided by the *user*..?


Because I've only been using PostgreSQL for one week. For the last several
years I've been developing with Oracle on mid-sized systems (40 million
books, 5 million reservations per year, etc.). And even then, primarily on
the user-facing side of the applications.

Looking at your screenshot, I think you need to take those two date
 values that the user provides, make them into actual dates (maybe you
 need a CASE statement or something similar, that shouldn't be that hard,


So the user selects Dec 22 and Mar 22 for 1900 to 2009 and the system feeds
the report a WHERE clause that looks like:

  m.taken BETWEEN '22-12-1900'::date AND '22-03-1901'::date and
  m.taken BETWEEN '22-12-1901'::date AND '22-03-1902'::date and
  m.taken BETWEEN '22-12-1902'::date AND '22-03-1903'::date and ...

That tightly couples the report query to the code that sets the report
engine parameters. One of the parameters would be SQL code in the form of a
dynamically crafted WHERE clause. I'd rather keep the SQL code that is used
to create the report entirely with the report engine if at all possible.


 Also, you're trying to do constraint_exclusion, but have you made sure
 that it's turned on?  And have you made sure that those constraints are
 really the right ones and that they make sense?  You're using a bunch of
 extract()'s there too, why not just specify a CHECK constraint on the
 date ranges which are allowed in the table..?


I don't know what the date ranges are? So I can't partition them by year?

Right now I created 72 child tables by using the category and month. This
may have been a bad choice. But at least all the data is in the system now
so dissecting or integrating it back in different ways shouldn't take days.

Thanks everyone for all your help, I really appreciate the time you've taken
to guide me in the right direction to make the system as fast as it can be.

Dave


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
I took out the date conditions:

SELECT
  m.*
FROM
  climate.measurement m
WHERE
  m.category_id = 1 and
  m.station_id = 2043

This uses the station indexes:

Result  (cost=0.00..21781.18 rows=8090 width=28)
  -  Append  (cost=0.00..21781.18 rows=8090 width=28)
-  Seq Scan on measurement m  (cost=0.00..28.00 rows=1 width=38)
  Filter: ((category_id = 1) AND (station_id = 2043))
-  Bitmap Heap Scan on measurement_01_001 m  (cost=11.79..1815.67
rows=677 width=28)
  Recheck Cond: (station_id = 2043)
  Filter: (category_id = 1)
  -  Bitmap Index Scan on measurement_01_001_s_idx
(cost=0.00..11.62 rows=677 width=0)
Index Cond: (station_id = 2043)
-  Bitmap Heap Scan on measurement_02_001 m  (cost=14.47..1682.18
rows=627 width=28)
  Recheck Cond: (station_id = 2043)
  Filter: (category_id = 1)
  -  Bitmap Index Scan on measurement_02_001_s_idx
(cost=0.00..14.32 rows=627 width=0)
Index Cond: (station_id = 2043)

2500+ rows in 185 milliseconds.

That is pretty good (I'll need it to be better but for now it works).

Then combined the selection of the station:

SELECT
  m.*
FROM
  climate.measurement m,
  (SELECT
 s.id
   FROM
 climate.station s,
 climate.city c
   WHERE
 c.id = 5182 AND
 s.elevation BETWEEN 0 AND 3000 AND
 6371.009 * SQRT(
   POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
   (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
 ) = 25
   ) t
WHERE
  m.category_id = 1 and
  m.station_id = t.id

The station index is no longer used, resulting in full table scans:

Hash Join  (cost=1045.52..1341150.09 rows=14556695 width=28)
  Hash Cond: (m.station_id = s.id)
  -  Append  (cost=0.00..867011.99 rows=43670085 width=28)
-  Seq Scan on measurement m  (cost=0.00..25.00 rows=6 width=38)
  Filter: (category_id = 1)
-  Seq Scan on measurement_01_001 m  (cost=0.00..71086.96
rows=3580637 width=28)
  Filter: (category_id = 1)
-  Seq Scan on measurement_02_001 m  (cost=0.00..64877.40
rows=3267872 width=28)
  Filter: (category_id = 1)
-  Seq Scan on measurement_03_001 m  (cost=0.00..71131.44
rows=3582915 width=28)
  Filter: (category_id = 1)

How do I avoid the FTS?

(I know about PostGIS but I can only learn and do so much at once.) ;-)

Here's the station query:

SELECT
  s.id
FROM
  climate.station s,
  climate.city c
WHERE
  c.id = 5182 AND
  s.elevation BETWEEN 0 AND 3000 AND
  6371.009 * SQRT(
POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
(COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
  ) = 25

And its EXPLAIN:

Nested Loop  (cost=0.00..994.94 rows=4046 width=4)
  Join Filter: ((6371.009::double precision *
sqrt((pow(radians(((c.latitude_decimal - s.latitude_decimal))::double
precision), 2::double precision) + (cos((radians(((c.latitude_decimal +
s.latitude_decimal))::double precision) / 2::double precision)) *
pow(radians(((c.longitude_decimal - s.longitude_decimal))::double
precision), 2::double precision) = 25::double precision)
  -  Index Scan using city_pkey1 on city c  (cost=0.00..6.27 rows=1
width=16)
Index Cond: (id = 5182)
  -  Seq Scan on station s  (cost=0.00..321.08 rows=12138 width=20)
Filter: ((s.elevation = 0) AND (s.elevation = 3000))

I get a set of 78 rows returned in very little time.

Thanks again!
Dave


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread Matthew Wakeling

On Thu, 20 May 2010, David Jarvis wrote:

I took out the date conditions:

SELECT
 m.*
FROM
 climate.measurement m
WHERE
 m.category_id = 1 and
 m.station_id = 2043

This uses the station indexes:


Yes, because there is only one station_id selected. That's exactly what an 
index is for.



Then combined the selection of the station:
The station index is no longer used, resulting in full table scans:



Nested Loop  (cost=0.00..994.94 rows=4046 width=4)
  Join Filter: ((6371.009::double precision *
sqrt((pow(radians(((c.latitude_decimal - s.latitude_decimal))::double
precision), 2::double precision) + (cos((radians(((c.latitude_decimal +
s.latitude_decimal))::double precision) / 2::double precision)) *
pow(radians(((c.longitude_decimal - s.longitude_decimal))::double
precision), 2::double precision) = 25::double precision)
  -  Index Scan using city_pkey1 on city c  (cost=0.00..6.27 rows=1
width=16)
Index Cond: (id = 5182)
  -  Seq Scan on station s  (cost=0.00..321.08 rows=12138 width=20)
Filter: ((s.elevation = 0) AND (s.elevation = 3000))

I get a set of 78 rows returned in very little time.


(An EXPLAIN ANALYSE would be better here). Look at the expected number of 
stations returned. It expects 4046 which is a large proportion of the 
available stations. It therefore expects to have to touch a large 
proportion of the measurement table, therefore it thinks that it will be 
fastest to do a seq scan. In actual fact, for 78 stations, the index would 
be faster, but for 4046 it wouldn't.


If you will be querying by season quite regularly, had you considered 
partitioning by season?


Matthew

--
Geography is going places.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
Hi,

(An EXPLAIN ANALYSE would be better here). Look at the expected number of
 stations


Nested Loop  (cost=0.00..994.94 rows=4046 width=4) (actual
time=0.053..41.173 rows=78 loops=1)
  Join Filter: ((6371.009::double precision *
sqrt((pow(radians(((c.latitude_decimal - s.latitude_decimal))::double
precision), 2::double precision) + (cos((radians(((c.latitude_decimal +
s.latitude_decimal))::double precision) / 2::double precision)) *
pow(radians(((c.longitude_decimal - s.longitude_decimal))::double
precision), 2::double precision) = 25::double precision)
  -  Index Scan using city_pkey1 on city c  (cost=0.00..6.27 rows=1
width=16) (actual time=0.014..0.016 rows=1 loops=1)
Index Cond: (id = 5182)
  -  Seq Scan on station s  (cost=0.00..321.08 rows=12138 width=20)
(actual time=0.007..5.256 rows=12139 loops=1)
Filter: ((s.elevation = 0) AND (s.elevation = 3000))
Total runtime: 41.235 ms

expects to have to touch a large proportion of the measurement table,
 therefore it thinks that it will be fastest to do a seq scan. In actual
 fact, for 78 stations, the index would be faster, but for 4046 it wouldn't.


This is rather unexpected. I'd have figured it would use the actual number.


 If you will be querying by season quite regularly, had you considered
 partitioning by season?


I have no idea what the regular queries will be. The purpose of the system
is to open the data up to the public using a simple user interface so that
they can generate their own custom reports. That user interface allows
people to pick year intervals, day ranges, elevations, categories
(temperature, precipitation, snow depth, etc.), and lat/long perimeter
coordinates (encompassing any number of stations) or a city and radius.

Dave