[GENERAL] advice sought - general approaches to optimizing queries around event streams

2014-09-26 Thread Jonathan Vanasco

I have a growing database with millions of rows that track resources against an 
event stream.  

I have a few handfuls of queries that interact with this stream in a variety of 
ways, and I have managed to drop full scans from 70s down to 3.5s, with partial 
scans running at 0.05s.  

No matter how I restructure the queries, I can't seem to get around a few 
bottlenecks, and I wanted to know if the community has any tips/tricks on how 
to approach them.  

A simple form of my database would be:

--  1k of
create table stream (
    id int not null primary key
);

-- 1MM of
create table resource (
    id int not null primary key,
    col_a bool,
    col_b bool,
    col_c text
);

-- 10MM of
create table streamevent (
    id int not null,
    event_timestamp timestamp not null,
    stream_id int not null references stream(id)
);

-- 10MM of
create table resource_2_stream_event (
    resource_id int not null references resource(id),
    streamevent_id int not null references streamevent(id)
);

Everything is running off of indexes; there are no seq scans.

I've managed to optimize my queries by avoiding joins against the tables and 
turning the stream interaction into a subquery or CTE.  
Better performance has come from limiting the number of stream events (which 
are only the timestamp and resource_id off a joined table), roughly as in the 
sketch below.
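
As a sketch of the shape I mean (the query itself is a placeholder, not my 
real one; the stream id and limit are made-up values):

    -- Hypothetical sketch: bound the window of stream events in a CTE first,
    -- then join resources against only that window.
    WITH recent_events AS (
        SELECT se.id, se.event_timestamp
        FROM streamevent se
        WHERE se.stream_id = 123          -- assumed stream of interest
        ORDER BY se.event_timestamp DESC
        LIMIT 1000                        -- bound the window up front
    )
    SELECT r.id, re.event_timestamp
    FROM recent_events re
    JOIN resource_2_stream_event r2s ON r2s.streamevent_id = re.id
    JOIN resource r ON r.id = r2s.resource_id;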

The bottlenecks I've encountered have primarily been:

1.  When interacting with a stream, ordering by event_timestamp and 
deduplicating resources becomes an issue.
I've figured out a novel way to work with the most recent events, but 
distant events are troublesome:

    using no limit, the query takes 3500ms
    using a limit of 1, the query takes 320ms
    using a limit of 1000, the query takes 20ms

There is a dedicated index on event_timestamp (desc), and according to 
the planner it is being used. Finding all the records is fine; merging and 
sorting the aggregate to handle the deduplication of records in a stream 
seems to be the issue (either with DISTINCT or max + GROUP BY). A sketch of 
the deduplication shape follows.
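
For concreteness, here is a sketch of the DISTINCT-style deduplication, 
written with PostgreSQL's DISTINCT ON (a placeholder, not my actual query):

    -- Hypothetical sketch: one row per resource, keeping its newest event.
    SELECT DISTINCT ON (r2s.resource_id)
           r2s.resource_id,
           se.event_timestamp
    FROM streamevent se
    JOIN resource_2_stream_event r2s ON r2s.streamevent_id = se.id
    WHERE se.stream_id = 123              -- assumed stream of interest
    ORDER BY r2s.resource_id, se.event_timestamp DESC;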


2.  I can't figure out an effective way to search for a term against an 
entire stream (using a tsquery/GIN-based search).

I thought about limiting the query by finding matching resources first, 
then locking the result to an event stream, but scanning the entire table 
for a term takes about 10 seconds on an initial hit; subsequent queries for 
the same terms end up using the cache and complete within 20ms.

I get better search performance by calculating the event stream first and 
then searching it for matching documents, but I still have the performance 
issues related to limiting the window of events. A sketch of that second 
shape follows.
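
A placeholder sketch of that second shape (hypothetical: it assumes a 
tsvector column and GIN index on resource, which the schema above doesn't 
show):

    -- Hypothetical: assumes a tsvector column with a GIN index, e.g.
    --   ALTER TABLE resource ADD COLUMN col_c_tsv tsvector;
    --   CREATE INDEX resource_tsv_idx ON resource USING gin (col_c_tsv);
    WITH stream_resources AS (
        SELECT DISTINCT r2s.resource_id
        FROM streamevent se
        JOIN resource_2_stream_event r2s ON r2s.streamevent_id = se.id
        WHERE se.stream_id = 123          -- assumed stream of interest
    )
    SELECT r.id
    FROM resource r
    JOIN stream_resources sr ON sr.resource_id = r.id
    WHERE r.col_c_tsv @@ to_tsquery('english', 'searchterm');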

I didn't include my actual queries because I'm more concerned with the general 
approaches and ideas behind dealing with large data sets than I am with raw 
SQL right now; the snippets above are just placeholders for the shapes 
involved.  

I'm hoping someone can enlighten me into looking at new ways to solve these 
problems. I think I've learned more about postgres/sql in the past 48 hours 
than I have in the past 15 years, and I'm pretty sure that the improvements I 
need will come from new ways of querying data, rather than from optimizing 
the current queries.



Re: [GENERAL] advice sought - general approaches to optimizing queries around event streams

2014-09-26 Thread Gavin Flower

On 27/09/14 09:02, Jonathan Vanasco wrote:

[original message quoted in full; snipped]

Minor point: when specifying PRIMARY KEY, you don't need to also put NOT 
NULL (this should make no change to performance).


I notice that the 'id' of 'streamevent' is not marked as a PRIMARY KEY, 
so it will not have an index associated with it. In fact, PostgreSQL requires 
a unique constraint on a referenced column, so the foreign key in 
resource_2_stream_event that points at streamevent(id) should not even be 
accepted as written.
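
The fix would be something like this (PostgreSQL builds the backing unique 
index automatically when the primary key is added):

    -- Add the missing primary key; the implicit unique index also
    -- speeds up foreign-key lookups from resource_2_stream_event.
    ALTER TABLE streamevent ADD PRIMARY KEY (id);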



Cheers,
Gavin




Re: [GENERAL] advice sought - general approaches to optimizing queries around event streams

2014-09-26 Thread Brent Wood
We have a similar time-series database approaching 500M records.

We partition the main tables (much like your events) into one-year subsets, 
with a clustered index on timestamp for all but the live year (see the sketch 
after the links below).

https://blog.engineyard.com/2013/scaling-postgresql-performance-table-partitioning
http://www.postgresql.org/docs/9.3/static/ddl-partitioning.html
http://www.postgresql.org/docs/9.3/static/sql-cluster.html
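
For reference, a minimal sketch of the inheritance-style partitioning those 
pages describe, applied to your streamevent table (table names and year 
ranges are illustrative):

    -- Illustrative yearly partition (9.3-era inheritance partitioning).
    CREATE TABLE streamevent_2013 (
        CHECK (event_timestamp >= '2013-01-01' AND event_timestamp < '2014-01-01')
    ) INHERITS (streamevent);

    CREATE INDEX streamevent_2013_ts_idx
        ON streamevent_2013 (event_timestamp DESC);

    -- For closed (non-live) years, physically order rows by timestamp.
    CLUSTER streamevent_2013 USING streamevent_2013_ts_idx;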

As discussed here previously, you can also improve performance using hardware - 
SSDs vs. spindles. Also note that tablespaces, with indexes on your faster 
drives and data on slower ones, can improve performance.

http://www.postgresql.org/docs/9.3/static/manage-ag-tablespaces.html
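
A sketch of that approach (the path and names are illustrative):

    -- Illustrative: put a hot index on fast storage via a tablespace.
    CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pgdata';
    CREATE INDEX streamevent_ts_idx
        ON streamevent (event_timestamp DESC) TABLESPACE fast_ssd;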

Also make sure your db server is optimised for the database size and hardware 
configuration - perhaps allow fewer concurrent users but more resources 
per user, or see what pgtune recommends.
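
For example (an illustrative setting only - pgtune will suggest values for 
your hardware), a session doing the big dedup/sort can be given more sort 
memory:

    -- Illustrative only: let the big sort run in RAM instead of spilling.
    SET work_mem = '256MB';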

These should help in terms of underlying db efficiency and performance, 
rather than by tweaking your actual queries.

Brent Wood

Programme Leader - Environmental Information Delivery
Principal Technician - GIS and Spatial Data Management
NIWA | +64-4-386-0529 | 301 Evans Bay Parade, Greta Point, Wellington
www.niwa.co.nz

From: pgsql-general-ow...@postgresql.org on behalf of Jonathan Vanasco 
postg...@2xlp.com
Sent: Saturday, September 27, 2014 9:02 AM
To: PostgreSQL general
Subject: [GENERAL] advice sought - general approaches to optimizing queries 
around event streams

[original message quoted in full; snipped]