> On 19 Apr 2017, at 6:01, Samuel Williams <space.ship.travel...@gmail.com> 
> wrote:
> 
> Hi.
> 
> We have 400,000,000 records in a table (soon to be 800,000,000), here
> is the schema (\d+)
> 
> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
> 
> We want the following kinds of query to be fast:
> 
> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
> "user_event"."what" IN ('poll', 'location_change',
> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
> 175.0805140220076);
> 
> We have a btree index and it appears to be working. However, it's
> still pretty slow.
> 
> EXPLAIN ANALYZE gives the following:
> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314
> 
> I'm thinking that I need to do the following to help:
> 
> CLUSTER user_event ON index_user_event_for_visits_3 followed by
> analyze... Our data is mostly time series but sometimes we get some
> dumps with historical records.

It seems to me that a large part of the problem is that the server has to scan 
all index entries from that date onward and within those location bounds, only 
to find that the distinct user IDs in that set number about 114,000 out of the 
1.7M rows matching the selection criteria. If it could stop at the first 
location for each user, it would have to scan less than a tenth of the index 
entries it has to scan now...
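Purely as a sketch of that idea: the usual workaround for making the scan stop 
at the first row per user is a recursive CTE (sometimes called a loose index 
scan). The query below is only an illustration, not something from your schema; 
it assumes an index with user_id as its leading column, such as the (user_id, 
latitude, longitude) one suggested further down, and whether it actually beats 
the plain DISTINCT is something only EXPLAIN ANALYZE on your data can tell:

-- Each step jumps to the next user_id above the previous one that still
-- matches the filters, instead of visiting every matching row.
WITH RECURSIVE distinct_users AS (
    (SELECT user_id
       FROM user_event
      WHERE what IN ('poll', 'location_change', 'midnight_location')
        AND created_at >= '2016-04-19 01:23:55'
        AND latitude  > -37.03079375089291 AND latitude  < -36.67086424910709
        AND longitude > 174.6307139779924  AND longitude < 175.0805140220076
      ORDER BY user_id
      LIMIT 1)
    UNION ALL
    SELECT (SELECT user_id
              FROM user_event
             WHERE user_id > d.user_id
               AND what IN ('poll', 'location_change', 'midnight_location')
               AND created_at >= '2016-04-19 01:23:55'
               AND latitude  > -37.03079375089291 AND latitude  < -36.67086424910709
               AND longitude > 174.6307139779924  AND longitude < 175.0805140220076
             ORDER BY user_id
             LIMIT 1)
      FROM distinct_users d
     WHERE d.user_id IS NOT NULL
)
SELECT user_id FROM distinct_users WHERE user_id IS NOT NULL;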

How high is the correlation between user ID and location? That can probably be 
leveraged...
Perhaps you'd get better performance if you used multiple indices instead of 
stuffing everything into a single purpose-specific one? I would suggest one on 
(user_id, latitude, longitude) or (latitude, longitude, user_id), and one on 
(created_at, user_id) or perhaps (created_at, latitude, longitude). That 
approach could also reduce the number of indices you have on that table, as 
well as their sizes, making it all fit into cache a little more easily. Then 
again, additional operations will be required to combine them.
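In concrete terms, and just as a sketch (the index names are made up, and on a 
table this size you would want to build them CONCURRENTLY):

-- one index leading on user_id for the per-user / location side ...
CREATE INDEX CONCURRENTLY index_user_event_on_user_id_lat_long
    ON user_event (user_id, latitude, longitude);

-- ... and one leading on created_at for the time side
CREATE INDEX CONCURRENTLY index_user_event_on_created_at_user_id
    ON user_event (created_at, user_id);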

For a different approach: it may be possible to enrich your data with something 
that is easy to index and query and that correlates strongly with (latitude, 
longitude). That could also be used to partition on. Country seems a good 
candidate here, unless all your data points are in New Zealand like the above?
Then again, some countries are a lot larger, and more populous, than others, 
and populations can be highly concentrated (Delhi and Moscow, to name a few).
Another option is to divide the location space up into squares of a fixed size, 
with a partition for each square. About 80% of those squares are unpopulated 
though, being at sea.
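A minimal sketch of the squares idea, assuming 1-degree cells and a made-up 
column name (backfilling 400M rows in a single UPDATE is something you'd want 
to do in batches, of course):

-- derive a coarse grid-cell number from the coordinates (1-degree squares)
ALTER TABLE user_event ADD COLUMN geo_cell integer;

UPDATE user_event
   SET geo_cell = (floor(latitude) + 90)::integer * 360
                + (floor(longitude) + 180)::integer;

CREATE INDEX index_user_event_on_geo_cell ON user_event (geo_cell);

That cell number could then double as a partitioning key.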

Created_at is a very good candidate for partitioning too, especially if you 
don't intend to keep data older than a certain age. Truncating or dropping a 
partition that you no longer need is quick and easy.
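With inheritance-based partitioning and constraint exclusion, a minimal sketch 
would look like the below (table names and the monthly granularity are made up; 
inserts need to be routed to the right child via a trigger on the parent or by 
the application, and extensions like pg_partman can automate the bookkeeping):

-- one child table per month, attached to the parent via inheritance
CREATE TABLE user_event_2017_04 (
    CHECK (created_at >= '2017-04-01' AND created_at < '2017-05-01')
) INHERITS (user_event);

-- with constraint_exclusion = partition (the default), the planner skips
-- children whose CHECK constraint contradicts the created_at predicate.

-- retiring old data then becomes a metadata operation:
DROP TABLE user_event_2016_04;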

With data sets this large, I'd think you would want to partition on multiple 
dimensions, creating a matrix of partitions under a single master table. I 
don't think PG has a simple way of doing that (yet), though; perhaps it's 
possible by abusing multiple levels of inheritance, but that sounds like a bad 
idea.

And of course, create your partitions sufficiently coarse to prevent 
overburdening the system tables, which would slow down the query planner.

Hopefully there's something useful in my ramblings!

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



