I have a growing database with millions of rows that track resources against an 
event stream.  

I have a few handfuls of queries that interact with this stream in a variety
of ways, and I have managed to drop full scans from 70s down to 3.5s, with
partial scans completing in 0.05s.

No matter how I restructure the queries, I can't seem to get around a few
bottlenecks, and I wanted to know if the community has any tips/tricks on how
to approach them.

A simple form of my database would be:

        --  1k of
        create table stream (
                id int not null primary key
        );

        -- 1MM of
        create table resource (
                id int not null primary key,
                col_a bool,
                col_b bool,
                col_c text
        );

        -- 10MM of
        create table streamevent (
                id int not null primary key,
                event_timestamp timestamp not null,
                stream_id int not null references stream(id)
        );

        -- 10MM of
        create table resource_2_stream_event(
                resource_id int not null references resource(id),
                streamevent_id int not null references streamevent(id)
        );

Everything is running off of indexes; there are no seq scans.
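
For context, the supporting indexes are shaped roughly like this (the names
and exact columns are illustrative, not my actual DDL):

        -- illustrative names; the point is a desc index on the timestamp
        -- plus plain btree indexes on the join/fk columns
        create index idx_streamevent_ts on streamevent (event_timestamp desc);
        create index idx_streamevent_stream on streamevent (stream_id);
        create index idx_r2se_event on resource_2_stream_event (streamevent_id);
        create index idx_r2se_resource on resource_2_stream_event (resource_id);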

I've managed to optimize my queries by avoiding joins against the full
tables and turning the stream interaction into a subquery or CTE. Better
performance has come from limiting the number of "stream events" (which are
only the timestamp and resource_id pulled off a joined table); a rough
sketch of that shape follows.
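
Something like this, where the stream id and the limit are placeholders
rather than my real parameters:

        -- a minimal sketch; only (event_timestamp, resource_id) is pulled
        with recent_events as (
                select se.event_timestamp, r2se.resource_id
                from streamevent se
                join resource_2_stream_event r2se
                        on r2se.streamevent_id = se.id
                where se.stream_id = 123
                order by se.event_timestamp desc
                limit 1000
        )
        select * from recent_events;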

The bottlenecks I've encountered have primarily been:

1.      When interacting with a stream, the ordering on event_timestamp and
the deduplication of resources become an issue.
        I've figured out a novel way to work with the most recent events, but
distant events are troublesome:

        using no limit, the query takes 3500 ms
        using a limit of 10000, the query takes 320 ms
        using a limit of 1000, the query takes 20 ms

        There is a dedicated index on event_timestamp (desc), and according
to the planner it is being used. Finding all the records is fine; merging
and sorting the aggregate to deduplicate resources within a stream seems to
be the issue (whether with DISTINCT or with max() + GROUP BY). A sketch of
this pattern appears after this list.


2.      I can't figure out an effective way to search for a term against an
entire stream (using a tsquery/GIN-based search). A sketch of the ordering
that works best for me also appears after this list.

        I thought about limiting the query by finding matching resources
first, then locking the results to an event stream, but:
                - scanning the entire table for a term takes about 10 seconds
on an initial hit; subsequent queries for the same terms end up using the
cache and complete within 20 ms.

        I get better search performance by calculating the event stream
first, then searching it for matching documents, but I still have the
performance issues related to limiting the window of events.
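
The deduplication pattern from bottleneck 1 looks roughly like this (a
minimal sketch; the stream id is a placeholder, and DISTINCT ON is just one
of the variants I tried, max() + GROUP BY being the other):

        -- one row per resource, keeping its most recent event in the stream;
        -- distinct on requires the order by to lead with resource_id
        select distinct on (r2se.resource_id)
                r2se.resource_id, se.event_timestamp
        from streamevent se
        join resource_2_stream_event r2se
                on r2se.streamevent_id = se.id
        where se.stream_id = 123
        order by r2se.resource_id, se.event_timestamp desc;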

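And the "calculate the stream first, then search it" ordering from
bottleneck 2 looks roughly like this (the to_tsvector expression over col_c
and the GIN index are assumptions about my setup; the search term is a
placeholder):

        -- assumed gin index over the text column; my real setup may differ
        create index idx_resource_fts
                on resource using gin (to_tsvector('english', col_c));

        -- compute the stream's resources first, then full-text search them
        with stream_resources as (
                select distinct r2se.resource_id
                from streamevent se
                join resource_2_stream_event r2se
                        on r2se.streamevent_id = se.id
                where se.stream_id = 123
        )
        select r.id
        from resource r
        join stream_resources sr on sr.resource_id = r.id
        where to_tsvector('english', r.col_c)
                @@ to_tsquery('english', 'searchterm');
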
I didn't include my real queries (only the rough sketches of their shape
above), because I'm more concerned with the general approaches and ideas
behind dealing with large data sets than I am with raw SQL right now.

I'm hoping someone can enlighten me into looking at new ways to solve these
problems. I think I've learned more about postgres/sql in the past 48 hours
than I have in the past 15 years, and I'm pretty sure that the improvements
I need will come from new ways of querying the data rather than from
optimizing the current queries.
