[GENERAL] advice sought - general approaches to optimizing queries around event streams
I have a growing database with millions of rows that track resources against an event stream. I have a few handfuls of queries that interact with this stream in a variety of ways, and I have managed to drop full scans from 70s down to 3.5s, with partial scans running in 0.05s. No matter how I restructure the queries, I can't seem to get around a few bottlenecks, and I wanted to know if the community has any tips/tricks on how to approach them.

A simplified form of my schema:

-- ~1k rows
create table stream (
    id int not null primary key
);

-- ~1MM rows
create table resource (
    id int not null primary key,
    col_a bool,
    col_b bool,
    col_c text
);

-- ~10MM rows
create table streamevent (
    id int not null,
    event_timestamp timestamp not null,
    stream_id int not null references stream(id)
);

-- ~10MM rows
create table resource_2_stream_event (
    resource_id int not null references resource(id),
    streamevent_id int not null references streamevent(id)
);

Everything is running off of indexes; there are no seq scans. I've managed to optimize my queries by avoiding joins against tables and turning the stream interaction into a subquery or CTE. Better performance has come from limiting the number of stream events (which are only the timestamp and resource_id off a joined table).

The bottlenecks I've encountered have primarily been:

1. When interacting with a stream, ordering by event_timestamp and deduplicating resources becomes an issue. I've figured out a novel way to work with the most recent events, but distant events are troublesome:

   - using no limit, the query takes 3500 ms
   - using a limit of 1, the query takes 320 ms
   - using a limit of 1000, the query takes 20 ms

   There is a dedicated index on event_timestamp (desc), and it is being used according to the planner. Finding all the records is fine; merging and sorting the aggregate to handle the deduplication of records in a stream seems to be the issue (with either DISTINCT or max + GROUP BY).

2. I can't figure out an effective way to search for a term against an entire stream (using a tsquery/GIN-based search). I thought about limiting the query by finding matching resources first, then locking it to an event stream, but scanning the entire table for a term takes about 10 seconds on an initial hit; subsequent queries for the same terms end up using the cache and complete within 20 ms. I get better search performance by calculating the event stream first, then searching it for matching documents, but I still have the performance issues related to limiting the window of events.

I didn't include example queries because I'm more concerned with the general approaches and ideas behind dealing with large data sets than I am with raw SQL right now. I'm hoping someone can enlighten me into looking at new ways to solve these problems. I think I've learned more about postgres/sql in the past 48 hours than I have in the past 15 years, and I'm pretty sure that the improvements I need will come from new ways of querying data, rather than optimizing the current queries.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
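For bottleneck 1, one common Postgres pattern worth comparing against DISTINCT/max+GROUP BY is DISTINCT ON, which keeps only the first row per resource under the inner ORDER BY. This is only a sketch against the schema as posted — the stream id, limit, and the assumed composite index on streamevent (stream_id, event_timestamp desc) are illustrative, and plan shapes will vary:

```sql
-- Hypothetical sketch: most recent event per resource within one stream,
-- deduplicated, then re-sorted newest-first for presentation.
SELECT resource_id, event_timestamp
FROM (
    -- DISTINCT ON keeps the first row for each resource_id under the
    -- ORDER BY below, i.e. that resource's most recent event.
    SELECT DISTINCT ON (r2se.resource_id)
           r2se.resource_id,
           se.event_timestamp
    FROM   streamevent se
    JOIN   resource_2_stream_event r2se
           ON r2se.streamevent_id = se.id
    WHERE  se.stream_id = 42                       -- illustrative stream
    ORDER  BY r2se.resource_id, se.event_timestamp DESC
) dedup
ORDER BY event_timestamp DESC
LIMIT 1000;
```

The point of the sketch is that deduplication happens per resource_id in index order rather than by sorting the whole aggregate; whether it beats max + GROUP BY here depends on how many events each resource has, so it would need an EXPLAIN ANALYZE comparison on the real data.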
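For bottleneck 2, the "calculate the event stream first, then search it" approach the post describes could look roughly like the sketch below. The post doesn't show where the tsvector lives, so the GIN index over resource.col_c, the 'english' configuration, the time window, and all names are assumptions for illustration:

```sql
-- Hypothetical: an expression GIN index so the tsquery can use an index
-- instead of computing to_tsvector per row.
CREATE INDEX resource_col_c_fts_idx
    ON resource
    USING gin (to_tsvector('english', col_c));

-- Restrict the search to resources appearing in a bounded window of one
-- stream's events, rather than scanning the whole resource table.
WITH stream_window AS (
    SELECT DISTINCT r2se.resource_id
    FROM   streamevent se
    JOIN   resource_2_stream_event r2se
           ON r2se.streamevent_id = se.id
    WHERE  se.stream_id = 42                                  -- illustrative
      AND  se.event_timestamp >= now() - interval '30 days'   -- illustrative
)
SELECT r.id, r.col_c
FROM   resource r
JOIN   stream_window w ON w.resource_id = r.id
WHERE  to_tsvector('english', r.col_c) @@ to_tsquery('english', 'term');
```

The 10-second cold hit is largely I/O on first touch of the GIN index and heap pages; narrowing the candidate set first, as above, bounds how much of either has to be read.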
Re: [GENERAL] advice sought - general approaches to optimizing queries around event streams
On 27/09/14 09:02, Jonathan Vanasco wrote:
> I have a growing database with millions of rows that track resources
> against an event stream. [... schema and questions snipped ...]

Minor point: when specifying PRIMARY KEY, you don't need to also put NOT NULL (this should make no difference to performance).

I notice that the 'id' of 'streamevent' is not marked as a PRIMARY KEY, so it will not have an index associated with it - hence referencing it as a foreign key might be slower than expected.

Cheers,
Gavin
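Concretely, the missing constraints Gavin points out could be added roughly as follows. This is a sketch against the schema as posted; the index names are illustrative, and note that Postgres indexes the referenced side of a foreign key (via the primary key) but never the referencing column automatically:

```sql
-- Give streamevent a real primary key, so foreign keys pointing at it
-- have a unique index to probe.
ALTER TABLE streamevent ADD PRIMARY KEY (id);

-- The join table as posted has no key or indexes at all. A composite
-- primary key covers resource_id lookups; a second index covers the
-- reverse direction (streamevent -> resources).
ALTER TABLE resource_2_stream_event
    ADD PRIMARY KEY (resource_id, streamevent_id);
CREATE INDEX r2se_streamevent_idx
    ON resource_2_stream_event (streamevent_id);

-- A composite index matching the common access path (one stream,
-- newest events first) may also help the dedup queries.
CREATE INDEX streamevent_stream_ts_idx
    ON streamevent (stream_id, event_timestamp DESC);
```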
Re: [GENERAL] advice sought - general approaches to optimizing queries around event streams
We have a similar timeseries database approaching 500M records. We partition the main tables (much like your events) into one-year subsets, with a clustered index on timestamp for all but the live year.

https://blog.engineyard.com/2013/scaling-postgresql-performance-table-partitioning
http://www.postgresql.org/docs/9.3/static/ddl-partitioning.html
http://www.postgresql.org/docs/9.3/static/sql-cluster.html

As discussed here previously, you can also improve performance using hardware - SSDs vs spindles. Also note that tablespaces, with indexes on your faster drives and data on slower ones, can improve performance.

http://www.postgresql.org/docs/9.3/static/manage-ag-tablespaces.html

Also make sure your db server is optimised for the database size and hardware configuration - perhaps allow fewer concurrent users but more resources per user, or see what pgtune recommends. That should help your performance in terms of underlying db efficiency, rather than tweaking your actual queries.

Brent Wood
Principal Technician - GIS and Spatial Data Management
Programme Leader - Environmental Information Delivery
NIWA
DDI: +64 (4) 3860529 | 301 Evans Bay Parade, Greta Point, Wellington | www.niwa.co.nz

From: Jonathan Vanasco <postg...@2xlp.com>
Sent: Saturday, September 27, 2014 9:02 AM
To: PostgreSQL general
Subject: [GENERAL] advice sought - general approaches to optimizing queries around event streams

> I have a growing database with millions of rows that track resources
> against an event stream. [... schema and questions snipped ...]
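The partitioning Brent links is the 9.3-era inheritance scheme (declarative partitioning arrived much later), which looks roughly like the sketch below; the yearly split and all names are illustrative:

```sql
-- Parent table holds no rows itself; queries go through it.
CREATE TABLE streamevent_part (
    id              int       NOT NULL,
    event_timestamp timestamp NOT NULL,
    stream_id       int       NOT NULL
);

-- One child per year, with a CHECK constraint the planner can use for
-- constraint exclusion (so queries touch only the relevant years).
CREATE TABLE streamevent_y2013 (
    CHECK (event_timestamp >= DATE '2013-01-01'
       AND event_timestamp <  DATE '2014-01-01')
) INHERITS (streamevent_part);

CREATE INDEX streamevent_y2013_ts_idx
    ON streamevent_y2013 (event_timestamp DESC);

-- Closed years can be physically reordered by timestamp once, as Brent
-- describes ("clustered index on timestamp for all but the live year"):
CLUSTER streamevent_y2013 USING streamevent_y2013_ts_idx;

-- An insert trigger on the parent routes rows to the right child
-- (omitted here; see the ddl-partitioning docs linked above).
```

Constraint exclusion must be enabled (`constraint_exclusion = partition`, the default) for the planner to skip non-matching children, and CLUSTER is a one-time physical sort, so it suits closed partitions rather than the live year.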