[SQL] optimize self-join query

2011-10-27 Thread Ty Busby
I have a table that stores a very large starting number called 
epc_start_numeric and a quantity.  I've apparently built the most inefficient 
query possible for doing the job I need: find out if any records overlap.  
Imagine the epc_start_numeric + quantity representing a block of numbers.  I 
need to find out if any of these blocks overlap.

Here is the table:

CREATE TABLE ftp_epc_audit
(
 record_id serial NOT NULL,
 sent_filename text NOT NULL,
 pcid text NOT NULL,
 tsid text NOT NULL,
 user_sec_id text NOT NULL,
 format_set text NOT NULL,
 format text NOT NULL,
 epc_start text NOT NULL,
 quantity integer,
 epc_decimal_start numeric(29)
)
WITH OIDS;
ALTER TABLE ftp_epc_audit OWNER TO postgres;

And the query is currently this:

SELECT '', 0, a.*, '', 0, b.* FROM ftp_epc_audit_staging a,
ftp_epc_audit_staging b
WHERE a.sent_filename <> b.sent_filename
AND a.quantity > 0
AND b.quantity > 0
AND a.locked = 1
AND b.locked = 1
AND(( a.epc_decimal_start BETWEEN b.epc_decimal_start AND b.epc_decimal_start + 
b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 BETWEEN b.epc_decimal_start AND 
b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 < b.epc_decimal_start AND 
a.epc_decimal_start + a.quantity -1 > b.epc_decimal_start + b.quantity - 1 ))

The column sent_filename is the unique value used so that a record does not 
match itself.  I've got an index on sent_filename, quantity, locked and 
epc_decimal_start.

The query runs fine with a very small number of records.  However, I need to 
process 60,000 records and this is taking hours.  There must be a fundamental 
flaw in either the query or my overall design, because it doesn't seem like a 
challenging task.  

I've also tried a variant of this query which also takes several hours to run 
through 60,000 records.

SELECT * FROM ftp_epc_audit_staging a
WHERE a.quantity > 0
AND a.locked = 1
AND EXISTS ( SELECT TRUE FROM ftp_epc_audit_staging b WHERE b.locked = 1
AND b.quantity > 0 AND a.sent_filename <> b.sent_filename
AND(( a.epc_decimal_start BETWEEN b.epc_decimal_start AND b.epc_decimal_start + 
b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 BETWEEN b.epc_decimal_start AND 
b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 < b.epc_decimal_start AND 
a.epc_decimal_start + a.quantity -1 > b.epc_decimal_start + b.quantity - 1 )))

I would really appreciate any thoughts on this.
-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


[SQL] how to use explain analyze

2011-10-27 Thread alan
I'm new to postgres and was wondering how to use EXPLAIN ANALYZE.

Can I use the output from EXPLAIN ANALYZE to estimate or predict the
actual time it would take for a given query to return?

I ask because I'm writing a typical web app that allows the user to
build and submit a query to my DB. Since I don't know how "simple" or
"complex" the user-generated queries will be, I thought it might be
possible to use the EXPLAIN ANALYZE output to make a guesstimate
about the expected return time of the query.

I'd like to use this in my web app to determine whether to run the
query in real time (user waits for results) or queue up the query
(and notify the user once the query is finished). E.g.:
  if (the "Total runtime" reported by EXPLAIN ANALYZE is > n ms) {
  tell the user that his request was submitted for processing, and
  notify the user once results are available
  } else {
  run the query and wait for the results in real time.
  }

Thanks,
Alan



[SQL] How to obtain a coalesce aggregation in a GROUP BY query?

2011-10-27 Thread Federico Dal Maso
I need to write a query like this:

select coalesce_agg(col order by col asc) from some_table group by other_col

Clearly the col column is nullable.
What is the best way (in terms of performance too) to obtain this?

Is there any built-in aggregate function, or should I write a new aggregate
function with CREATE AGGREGATE?
Do you think coalesce_agg should be a built-in aggregate?
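
For reference, here is what a user-defined version might look like. This is a
minimal sketch (the function name coalesce_agg_sfunc is mine): the aggregate's
initial state is NULL, so COALESCE(state, value) latches onto the first
non-NULL value and ignores everything after it. The ORDER BY inside the
aggregate call requires PostgreSQL 9.0 or later.

```sql
-- State transition: keep the first non-NULL input seen.
CREATE FUNCTION coalesce_agg_sfunc(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE
AS $$ SELECT COALESCE($1, $2) $$;

CREATE AGGREGATE coalesce_agg(anyelement) (
    SFUNC = coalesce_agg_sfunc,
    STYPE = anyelement    -- initial state is NULL by default
);

-- Usage, as in the question: first non-NULL col per group, in order.
SELECT other_col, coalesce_agg(col ORDER BY col ASC)
FROM some_table
GROUP BY other_col;
```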

Thanks in advance

-- 
Federico Dal Maso


Re: [SQL] how to use explain analyze

2011-10-27 Thread Brent Dombrowski
On Oct 25, 2011, at 7:12 AM, alan wrote:

> I'm new to postgres and was wondering how to use EXPLAIN ANALYZE.
> 
> Can I use the output from EXPLAIN ANALYZE to estimate or predict the
> actual time it would take for a given query to return?
> 
> I ask because I'm writing a typical web app that allows the user to
> build and submit a query to my DB. Since I don't know how "simple" or
> "complex" the user-generated queries will be, I thought it might be
> possible to use the EXPLAIN ANALYZE output to make a guesstimate
> about the expected return time of the query.
> 
> I'd like to use this in my web app to determine whether to run the
> query in real time (user waits for results) or queue up the query
> (and notify the user once the query is finished). E.g.:
>  if (the "Total runtime" reported by EXPLAIN ANALYZE is > n ms) {
>  tell the user that his request was submitted for processing, and
>  notify the user once results are available
>  } else {
>  run the query and wait for the results in real time.
>  }
> 
> Thanks,
> Alan

Check out the docs on EXPLAIN, 
http://www.postgresql.org/docs/9.0/static/sql-explain.html

EXPLAIN will give you an estimate. How well the internal statistics match the 
actual data will determine how close the estimate is.

EXPLAIN ANALYZE will give you the estimate and the actuals. WARNING: EXPLAIN 
ANALYZE actually executes the query. If there are any data modifications in the 
query (UPDATE, INSERT, etc.) they will be carried out.

Note: ANALYZE is a separate command, and EXPLAIN is not a valid option for it. 
You'll want to look into that command if the estimates are way off.
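
Concretely (the table and predicate here are illustrative only):

```sql
-- Plain EXPLAIN: planner estimates only; the query is not executed.
EXPLAIN SELECT * FROM some_table WHERE id < 1000;

-- EXPLAIN ANALYZE: runs the query and prints actual times and row counts
-- next to the estimates, ending with a "Total runtime" line.
EXPLAIN ANALYZE SELECT * FROM some_table WHERE id < 1000;

-- If the query modifies data, wrap it in a transaction you roll back,
-- so the timing run leaves the table untouched:
BEGIN;
EXPLAIN ANALYZE UPDATE some_table SET flag = true WHERE id < 1000;
ROLLBACK;
```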

Brent.




Re: [SQL] optimize self-join query

2011-10-27 Thread Tom Lane
Ty Busby  writes:
> I have a table that stores a very large starting number called 
> epc_start_numeric and a quantity.  I've apparently built the most inefficient 
> query possible for doing the job I need: find out if any records overlap.  
> Imagine the epc_start_numeric + quantity representing a block of numbers.  I 
> need to find out if any of these blocks overlap.

Yeah, overlap is a hard problem.  Basically, Postgres doesn't have any
way to do your query short of comparing each row to each other row,
so the cost goes up as O(N^2).

If you know more than you've let on about the properties of the
intervals, you might be able to improve things.  For instance
if the intervals fall into nonoverlapping buckets then you could
add a constraint that the buckets of the two sides are equal.
Postgres is a lot better with equality join constraints than it
is with range constraints, so it would be able to match up rows
and only do the O(N^2) work within each bucket.
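
A rough sketch of that idea, purely for illustration: the bucket column,
the 1,000,000 bucket width, and the index name below are all made up, and
the whole approach only works if no interval ever crosses a bucket boundary.

```sql
-- Assumes no interval spans a bucket boundary -- verify this for your data!
ALTER TABLE ftp_epc_audit_staging ADD COLUMN bucket numeric;
UPDATE ftp_epc_audit_staging
   SET bucket = trunc(epc_decimal_start / 1000000);
CREATE INDEX ftp_epc_audit_staging_bucket_idx
    ON ftp_epc_audit_staging (bucket);

-- The equality constraint on bucket lets the planner use a hash or merge
-- join; the O(N^2) range comparison then happens only within each bucket.
-- The two inequalities are the standard half-open overlap test, equivalent
-- to the BETWEEN pairs in the original query.
SELECT a.*, b.*
  FROM ftp_epc_audit_staging a
  JOIN ftp_epc_audit_staging b
    ON a.bucket = b.bucket
   AND a.sent_filename <> b.sent_filename
   AND a.epc_decimal_start < b.epc_decimal_start + b.quantity
   AND b.epc_decimal_start < a.epc_decimal_start + a.quantity
 WHERE a.quantity > 0 AND b.quantity > 0
   AND a.locked = 1 AND b.locked = 1;
```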

In the long run we might have better answers --- Jeff Davis has been
working on range types for years now, and one of the long-range goals
of that is to have smarter support for this type of problem.  But for
now, it's going to be painful.
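
For the record, range types did land later, in PostgreSQL 9.2. On a version
that has them, the overlap test can be written directly with numrange and the
&& (overlaps) operator; the half-open upper bound below is an assumption that
matches start + quantity:

```sql
-- PostgreSQL 9.2+ only: numrange(lo, hi) is a half-open interval [lo, hi).
SELECT a.record_id, b.record_id
  FROM ftp_epc_audit a
  JOIN ftp_epc_audit b
    ON a.record_id < b.record_id   -- report each overlapping pair once
   AND numrange(a.epc_decimal_start,
                a.epc_decimal_start + a.quantity) &&
       numrange(b.epc_decimal_start,
                b.epc_decimal_start + b.quantity);
```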

regards, tom lane



[SQL] Different order by behaviour depending on where clause?

2011-10-27 Thread Jan Bakuwel
Hi,

I have a compound query with some grouping, having and order by's saved
as a view, say with name "myview".

A) select * from "myview" returns the results as expected in the correct
order (the order by is on fields "Category", "Year", "Month" and a few
other fields). The results are correctly ordered by these fields as
specified in the view.

B) select * from "myview" where "Year"=2011 and "Month"=1 also returns
the results as expected in the correct order (a subset of A).

however

C) select * from "myview" where "Year"=2011 and "Month" >= 1 and "Month"
<= 1 returns the same resultset as B but the order of the rows is not
correct (not even close; haven't been able to see any pattern).

Any idea how I can further analyse/diagnose this?

regards,
Jan




Re: [SQL] optimize self-join query

2011-10-27 Thread Lee Hachadoorian
On Tue, Oct 25, 2011 at 2:37 PM, Ty Busby  wrote:

> I have a table that stores a very large starting number called
> epc_start_numeric and a quantity.  I've apparently built the most
> inefficient query possible for doing the job I need: find out if any records
> overlap.  Imagine the epc_start_numeric + quantity representing a block of
> numbers.  I need to find out if any of these blocks overlap.
>
> Here is the table:
>
> CREATE TABLE ftp_epc_audit
> (
>  record_id serial NOT NULL,
>  sent_filename text NOT NULL,
>  pcid text NOT NULL,
>  tsid text NOT NULL,
>  user_sec_id text NOT NULL,
>  format_set text NOT NULL,
>  format text NOT NULL,
>  epc_start text NOT NULL,
>  quantity integer,
>  epc_decimal_start numeric(29)
> )
> WITH OIDS;
> ALTER TABLE ftp_epc_audit OWNER TO postgres;
>

Not having access to your dataset, I created the following test data:

DROP TABLE IF EXISTS ftp_epc_audit;
CREATE TABLE ftp_epc_audit (
 record_id serial NOT NULL,
 quantity integer,
 epc_decimal_start numeric(29)
)
WITH OIDS;
ALTER TABLE ftp_epc_audit ADD PRIMARY KEY(record_id);

INSERT INTO ftp_epc_audit (epc_decimal_start, quantity)
SELECT
generate_series(5, 10000, 5) AS epc_decimal_start,
round(ceiling(20 * random()) / ceiling(5 * random()))::int AS quantity
;

The strange final column creates something like a stepped Poisson
distribution: most of the values of quantity are < 5, but they can go as high
as 20. Since epc_decimal_start increases in steps of five, this ensures we
have many records with nonoverlapping ranges and many that overlap one,
two, or three adjacent ranges.


>
> And the query is currently this:
>
> SELECT '', 0, a.*, '', 0, b.* FROM ftp_epc_audit_staging a,
> ftp_epc_audit_staging b
> WHERE a.sent_filename <> b.sent_filename
> AND a.quantity > 0
> AND b.quantity > 0
> AND a.locked = 1
> AND b.locked = 1
> AND(( a.epc_decimal_start BETWEEN b.epc_decimal_start AND
> b.epc_decimal_start + b.quantity - 1 )
> OR ( a.epc_decimal_start + a.quantity - 1 BETWEEN b.epc_decimal_start AND
> b.quantity - 1 )
> OR ( a.epc_decimal_start + a.quantity - 1 < b.epc_decimal_start AND
> a.epc_decimal_start + a.quantity -1 > b.epc_decimal_start + b.quantity - 1
> ))
>

I rewrote this to run against my pared down 2000 record test table as:

SELECT
*
FROM
ftp_epc_audit a CROSS JOIN ftp_epc_audit b
WHERE
a.record_id <> b.record_id AND
(( a.epc_decimal_start BETWEEN b.epc_decimal_start AND b.epc_decimal_start +
b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 BETWEEN b.epc_decimal_start AND
b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 < b.epc_decimal_start AND
a.epc_decimal_start + a.quantity -1 > b.epc_decimal_start + b.quantity - 1
))
;

The conditions inside AND ((…)) look unnecessarily complex. In fact, the
last OR (…) clause can never evaluate to TRUE, yet it still hurts query
performance.

If x = b.epc_decimal_start
and y = b.epc_decimal_start + b.quantity - 1
and quantity is an integer restricted by an earlier clause to be > 0
Then x ≤ y
Therefore there is no z such that z < x and z > y

Testing this on the 2000 record table, dropping that clause alone made the
query time drop from ~20 seconds to ~12 seconds. I also confirmed that
dropping that clause returns the same number of records.

The final four lines can be further simplified to

…AND
a.epc_decimal_start < b.epc_decimal_start
AND b.epc_decimal_start < a.epc_decimal_start + a.quantity

This says, for each record, compare only the records with a greater range
start, and return any record whose range starts before my range stops. This
also returns the same number of records as the previous version and on the
2000 record table runs in ~2.6 seconds.

There are further speed gains from using a WITH clause so that the range
limits are only evaluated once (~2.0 seconds), and in using a window
function to ORDER the recordset by range start, then compare the rank
instead of the range start (~1.5 seconds). It ends up looking like this:

WITH rank_epc AS (
SELECT
record_id,
epc_decimal_start AS range_low,
epc_decimal_start + quantity AS range_high,
rank() OVER (ORDER BY epc_decimal_start)
FROM
ftp_epc_audit
)
SELECT
*
FROM
rank_epc a CROSS JOIN rank_epc b WHERE a.rank < b.rank AND b.range_low <
a.range_high
;

This ran on the 2000 record table in ~1.5 seconds and on a 20,000 record
dataset in 172 seconds.

Finally, depending on your use case and what you want to do once you
identify the problematic records, it might be useful to only compare
adjacent ranges to see if they overlap. If adjacent ranges are then "fixed"
so that they don't overlap anymore, there won't be any left that overlap two
or three or more ranges. This can be accomplished with:

WITH rank_epc AS (
SELECT
record_id,
epc_decimal_start AS range_low,
epc_decimal_start + quantity AS range_high,
rank() OVER (ORDER BY epc_decimal_start)
FROM
ftp_epc_audit
)
SELECT
*
FROM
rank_epc a JOIN rank_epc b ON (a.rank = b.rank - 1 AND b.range_low <
a.range_high)
;

This ran on the 20,00

Re: [SQL] Different order by behaviour depending on where clause?

2011-10-27 Thread Pavel Stehule
Hello

2011/10/28 Jan Bakuwel :
> Hi,
>
> I have a compound query with some grouping, having and order by's saved
> as a view, say with name "myview".
>
> A) select * from "myview" returns the results as expected in the correct
> order (the order by is on fields "Category", "Year", "Month" and a few
> other fields). The results are correctly ordered by these fields as
> specified in the view.
>
> B) select * from "myview" where "Year"=2011 and "Month"=1 also returns
> the results as expected in the correct order (a subset of A).
>
> however
>
> C) select * from "myview" where "Year"=2011 and "Month" >= 1 and "Month"
> <= 1 returns the same resultset as B but the order of the rows is not
> correct (not even close; haven't been able to see any pattern).
>
> Any idea how I can further analyse/diagnose this?
>
> regards,
> Jan
>

Look at the EXPLAIN output: these queries will have different execution plans.

http://www.postgresql.org/docs/8.4/static/sql-explain.html
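
SQL only guarantees row order when the outermost query has an ORDER BY; an
ORDER BY inside a view definition can be silently reordered whenever the plan
changes, which is exactly what a different WHERE clause can trigger. A sketch,
using the column names from your message:

```sql
-- Restating the ORDER BY outside the view is the only reliable fix.
SELECT *
  FROM "myview"
 WHERE "Year" = 2011 AND "Month" >= 1 AND "Month" <= 1
 ORDER BY "Category", "Year", "Month";
```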

Regards

Pavel Stehule

