Re: [PERFORM] Inefficient max query when using group by

2017-05-04 Thread David Rowley
On 4 May 2017 at 22:52,   wrote:
> I have a performance problem with my query. As a simplified example, I have
> a table called Book, which has three columns: id, released (timestamp) and
> author_id. I have a need to search for the latest books released by multiple
> authors, at a specific point in the history. This could be latest book
> between beginning of time and now, or latest book released last year etc. In
> other words, only the latest book for each author, in specific time window.
> I have also a combined index for released and author_id columns.
>
> First, I tried a simple query that selects maximum value of released and the
> author_id, which are grouped by the author_id (then later do a join by these
> author_id, released columns to get the whole rows).  Performance of this
> query is pretty bad (Execution time around 250-300ms for five authors). See
> query and query plan in the link below:
>
> https://gist.github.com/jehie/ca9fac16b6e3c19612d815446a0e1bc0
>
>
>
> The execution time seems to grow linearly when the number of author_ids
> increase (50ms per author_id). I don’t completely understand why it takes so
> long for this query to execute and why it does not use the directional index
> scan?
>
> I also tried second query using limit (where I can only ask for one
> author_id at a time, so cannot use this directly when searching for books of
> multiple author), which performs nicely (0.2ms):
>
> https://gist.github.com/jehie/284e7852089f6debe22e05c63e73027f
>
>
>
> So, any ideas how to make multiple-author lookups (like in the first query)
> perform better? Or any other ideas?

Yes, you could sidestep the whole issue by using a LATERAL join.

Something like:

EXPLAIN ANALYZE
SELECT b.released, b.author_id
FROM (VALUES('1'),('2'),('3'),('4'),('5')) a (author_id)
CROSS JOIN LATERAL (SELECT released, author_id
 FROM book
  WHERE author_id = a.author_id
AND released<=to_timestamp(2e9)
AND released>=to_timestamp(0)
ORDER BY released desc
 LIMIT 1) b;

or you could write a function which just runs that query. Although,
with the above or the function method, if you give this enough
authors, then it'll eventually become slower than the problem query.
Perhaps if you know the number of authors will not be too great, then
you'll be ok.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Inefficient max query when using group by

2017-05-04 Thread bricklen
On Thu, May 4, 2017 at 3:52 AM,  wrote:

> Hi,
>
> I have a performance problem with my query. As a simplified example, I
> have a table called Book, which has three columns: id, released (timestamp)
> and author_id. I have a need to search for the latest books released by
> multiple authors, at a specific point in the history. This could be latest
> book between beginning of time and now, or latest book released last year
> etc. In other words, only the latest book for each author, in specific time
> window. I have also a combined index for released and author_id columns.
>

As far as the query itself, I suspect you are paying a penalty for the
to_timestamp() calls. Try the same query with hard-coded timestamps:
"AND released<='2017-05-05 00:00:00' AND released>='1970-01-01 00:00:00'"
If you need these queries to be lightning fast then this looks like a good
candidate for using Materialized Views:
https://www.postgresql.org/docs/current/static/sql-creatematerializedview.html


[PERFORM] Inefficient max query when using group by

2017-05-04 Thread jesse.hietanen
Hi,
I have a performance problem with my query. As a simplified example, I have a 
table called Book, which has three columns: id, released (timestamp) and 
author_id. I have a need to search for the latest books released by multiple 
authors, at a specific point in the history. This could be latest book between 
beginning of time and now, or latest book released last year etc. In other 
words, only the latest book for each author, in specific time window. I have 
also a combined index for released and author_id columns.
First, I tried a simple query that selects maximum value of released and the 
author_id, which are grouped by the author_id (then later do a join by these 
author_id, released columns to get the whole rows).  Performance of this query 
is pretty bad (Execution time around 250-300ms for five authors). See query and 
query plan in the link below:
https://gist.github.com/jehie/ca9fac16b6e3c19612d815446a0e1bc0

The execution time seems to grow linearly when the number of author_ids 
increase (50ms per author_id). I don't completely understand why it takes so 
long for this query to execute and why it does not use the directional index 
scan?
I also tried second query using limit (where I can only ask for one author_id 
at a time, so cannot use this directly when searching for books of multiple 
author), which performs nicely (0.2ms):
https://gist.github.com/jehie/284e7852089f6debe22e05c63e73027f

So, any ideas how to make multiple-author lookups (like in the first query) 
perform better? Or any other ideas?

Here is the SQL to create the Table, Index, generate some test data and both 
queries:
https://gist.github.com/jehie/87665c03bee124f8a96de24cae798194

Thanks,
Jesse