Re: [GENERAL] Large data and slow queries

2017-04-27 Thread Samuel Williams
Citus is also now just an extension.

Sounds pretty awesome, I'll certainly consider your system if/when we
decide to make changes.

On 22 April 2017 at 08:41, Andrew Staller  wrote:
> Samuel,
>
> Short answer to your questions:  (1) TimescaleDB and CitusDB are focusing on
> solving different problems, and (2) TimescaleDB is an Apache 2-licensed
> extension to run in your Postgres database, not a fork or different system.
>
> Longer answer to your first question:
>
> From what we've read and the conversations we've had with Citus (awesome
> folks, btw), they're primarily solving a different problem -- OLAP queries
> and distributed transactions -- while we are focused on time-series data
> analysis.  As such, we haven't benchmarked against Citus and if we were to,
> it would require some unrealistic finagling in order to make it an
> apples-to-apples comparison.
>
> As an example, their partitioning is based on a single primary key, while
> all writes in their clustered version go to a cluster master (although their
> hosted version of Citus MX changes this a bit).  We perform two-dimensional
> partitioning using your primary key and time, with the partitions
> automatically created/closed on time intervals and data volume in order to
> keep the most recent partitioned chunks of data (and their B-trees) in
> memory to support fast ingest for time-series data. (***) We have only
> released our single-node version, but the clustering will allow you to write
> directly to any node.  Citus has also decided only to support the SQL
> queries they are optimized for, while we've made the decision to support
> "full SQL" (even if some queries might not be fast).
>
> *** We blogged yesterday about TimescaleDB's partitioning design choices in
> more depth, if you are interested:
> https://blog.timescale.com/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c
>
>
> On Thu, Apr 20, 2017 at 5:43 PM, Samuel Williams
>  wrote:
>>
>> Andrew, how would timescaledb compare to citus - and is timescaledb an
>> extension to postgres or is it an entirely separate system?
>>
>> On 21 April 2017 at 02:44, Andrew Staller  wrote:
>> > Awesome thread.
>> >
>> > Samuel,
>> >
>> > Just wanted you to be aware of the work we're doing at TimescaleDB
>> > (http://www.timescale.com/), a time-series database extension for
>> > PostgreSQL.
>> >
>> > Some of how we might help you:
>> > - automatic partitioning by space (primary key - like country_id, for
>> > instance) and time. This creates "chunks" of your data, right-sized by
>> > volume and time constraints (which makes inserts fast at much greater
>> > scale
>> > than Vanilla PostgreSQL - not sure if this is a problem for you)
>> > - this will also help if your queries are selective on time and
>> > country_id
>> > (per this example)
>> > - the partitioning by time allows you to DROP old chunks without the
>> > need
>> > for vacuums
>> >
>> > On Thu, Apr 20, 2017 at 8:30 AM, Vick Khera  wrote:
>> >>
>> >> I'm curious why you have so many partial indexes. Are you trying to
>> >> make
>> >> custom indexes per query? It seems to me you might want to consider
>> >> making
>> >> the indexes general, and remove the redundant ones (that have the same
>> >> prefix list of indexed fields).
>> >>
>> >> Secondly your table is 102Gb. Clearly there's a lot of data here. How
>> >> many
>> >> rows does that take? I would further suggest that you partition this
>> >> table
>> >> such that there are no more than about 10 million rows per partition
>> >> (I've
>> >> done this by using a id % 100 computation). Maybe in your case it makes
>> >> sense to partition it based on the "what" field, because it appears you
>> >> are
>> >> trying to do that with your partial indexes already.
>> >>
>> >> On Wed, Apr 19, 2017 at 10:11 PM, Samuel Williams
>> >>  wrote:
>> >>>
>> >>> Okay, so after changing longitude/latitude to float4, and
>> >>> re-organizing the table a bit, I got the query down from about 8
>> >>> minutes to 40 seconds.
>> >>>
>> >>> The details are in the gist comments:
>> >>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
>> >>>
>> >>> Now, just need to get performance another 2 orders of magnitude
>> >>> better. Can we make the index a bit more coarse grained, perhaps
>> >>> convert long/lat to integers or something, use a hilbert index, use a
>> >>> postgis index... ideas?
>> >>>
>> >>>
>> >>> --
>> >>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> >>> To make changes to your subscription:
>> >>> http://www.postgresql.org/mailpref/pgsql-general
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > TimescaleDB | Growth & Developer Evangelism
>> > c: 908.581.9509
>> >
>> > 335 Madison Ave.
>> > New York, NY 10017
>> > www.timescale.com
>> > https://github.com/timescale/timescaledb
>>
>>

Re: [GENERAL] Large data and slow queries

2017-04-21 Thread Andrew Staller
Samuel,

Short answer to your questions:  (1) TimescaleDB and CitusDB are focusing
on solving different problems, and (2) TimescaleDB is an Apache 2-licensed
extension to run in your Postgres database, not a fork or different system.

Longer answer to your first question:

From what we've read and the conversations we've had with Citus (awesome
folks, btw), they're primarily solving a different problem -- OLAP queries
and distributed transactions -- while we are focused on time-series data
analysis.  As such, we haven't benchmarked against Citus and if we were to,
it would require some unrealistic finagling in order to make it an
apples-to-apples comparison.

As an example, their partitioning is based on a single primary key, while
all writes in their clustered version go to a cluster master (although
their hosted version of Citus MX changes this a bit).  We perform
two-dimensional partitioning using your primary key and time, with the
partitions automatically created/closed on time intervals and data volume
in order to keep the most recent partitioned chunks of data (and their
B-trees) in memory to support fast ingest for time-series data. (***) We
have only released our single-node version, but the clustering will allow
you to write directly to any node.  Citus has also decided only to support
the SQL queries they are optimized for, while we've made the decision to
support "full SQL" (even if some queries might not be fast).

*** We blogged yesterday about TimescaleDB's partitioning design choices in
more depth, if you are interested:
https://blog.timescale.com/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c


On Thu, Apr 20, 2017 at 5:43 PM, Samuel Williams <
space.ship.travel...@gmail.com> wrote:

> Andrew, how would timescaledb compare to citus - and is timescaledb an
> extension to postgres or is it an entirely separate system?
>
> On 21 April 2017 at 02:44, Andrew Staller  wrote:
> > Awesome thread.
> >
> > Samuel,
> >
> > Just wanted you to be aware of the work we're doing at TimescaleDB
> > (http://www.timescale.com/), a time-series database extension for
> > PostgreSQL.
> >
> > Some of how we might help you:
> > - automatic partitioning by space (primary key - like country_id, for
> > instance) and time. This creates "chunks" of your data, right-sized by
> > volume and time constraints (which makes inserts fast at much greater
> scale
> > than Vanilla PostgreSQL - not sure if this is a problem for you)
> > - this will also help if your queries are selective on time and
> country_id
> > (per this example)
> > - the partitioning by time allows you to DROP old chunks without the need
> > for vacuums
> >
> > On Thu, Apr 20, 2017 at 8:30 AM, Vick Khera  wrote:
> >>
> >> I'm curious why you have so many partial indexes. Are you trying to make
> >> custom indexes per query? It seems to me you might want to consider
> making
> >> the indexes general, and remove the redundant ones (that have the same
> >> prefix list of indexed fields).
> >>
> >> Secondly your table is 102Gb. Clearly there's a lot of data here. How
> many
> >> rows does that take? I would further suggest that you partition this
> table
> >> such that there are no more than about 10 million rows per partition
> (I've
> >> done this by using a id % 100 computation). Maybe in your case it makes
> >> sense to partition it based on the "what" field, because it appears you
> are
> >> trying to do that with your partial indexes already.
> >>
> >> On Wed, Apr 19, 2017 at 10:11 PM, Samuel Williams
> >>  wrote:
> >>>
> >>> Okay, so after changing longitude/latitude to float4, and
> >>> re-organizing the table a bit, I got the query down from about 8
> >>> minutes to 40 seconds.
> >>>
> >>> The details are in the gist comments:
> >>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
> >>>
> >>> Now, just need to get performance another 2 orders of magnitude
> >>> better. Can we make the index a bit more coarse grained, perhaps
> >>> convert long/lat to integers or something, use a hilbert index, use a
> >>> postgis index... ideas?
> >>>
> >>>
> >>> --
> >>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> >>> To make changes to your subscription:
> >>> http://www.postgresql.org/mailpref/pgsql-general
> >>
> >>
> >
> >
> >
> > --
> > TimescaleDB | Growth & Developer Evangelism
> > c: 908.581.9509
> >
> > 335 Madison Ave.
> > New York, NY 10017
> > www.timescale.com
> > https://github.com/timescale/timescaledb
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>



-- 
TimescaleDB | Growth & Developer Evangelism
c: 908.581.9509

335 Madison Ave.
New York, NY 10017
www.timescale.com
https://github.com/timescale/timescaledb


Re: [GENERAL] Large data and slow queries

2017-04-20 Thread Samuel Williams
Scott, Vick, the vast majority of the data is generic. But there are
some specific events we need to look up quickly which are probably
less than a few 100,000 records. We did evaluate partial indexes vs
full indexes. The partial index speeds up our specific queries
significantly while only taking a very small amount of space (often <
100MB, compared to a full index on the DB which is up around 20Gb at
the moment).

On 21 April 2017 at 03:01, Scott Marlowe  wrote:
> On Thu, Apr 20, 2017 at 6:30 AM, Vick Khera  wrote:
>> I'm curious why you have so many partial indexes. Are you trying to make
>> custom indexes per query? It seems to me you might want to consider making
>> the indexes general, and remove the redundant ones (that have the same
>> prefix list of indexed fields).
>>
>> Secondly your table is 102Gb. Clearly there's a lot of data here. How many
>> rows does that take? I would further suggest that you partition this table
>> such that there are no more than about 10 million rows per partition (I've
>> done this by using a id % 100 computation). Maybe in your case it makes
>> sense to partition it based on the "what" field, because it appears you are
>> trying to do that with your partial indexes already.
>
> I would think a two field index might be just as effective and not
> require a lot of maintenance etc.
>
> --
> To understand recursion, one must first understand recursion.
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-20 Thread Samuel Williams
Andrew, how would timescaledb compare to citus - and is timescaledb an
extension to postgres or is it an entirely separate system?

On 21 April 2017 at 02:44, Andrew Staller  wrote:
> Awesome thread.
>
> Samuel,
>
> Just wanted you to be aware of the work we're doing at TimescaleDB
> (http://www.timescale.com/), a time-series database extension for
> PostgreSQL.
>
> Some of how we might help you:
> - automatic partitioning by space (primary key - like country_id, for
> instance) and time. This creates "chunks" of your data, right-sized by
> volume and time constraints (which makes inserts fast at much greater scale
> than Vanilla PostgreSQL - not sure if this is a problem for you)
> - this will also help if your queries are selective on time and country_id
> (per this example)
> - the partitioning by time allows you to DROP old chunks without the need
> for vacuums
>
> On Thu, Apr 20, 2017 at 8:30 AM, Vick Khera  wrote:
>>
>> I'm curious why you have so many partial indexes. Are you trying to make
>> custom indexes per query? It seems to me you might want to consider making
>> the indexes general, and remove the redundant ones (that have the same
>> prefix list of indexed fields).
>>
>> Secondly your table is 102Gb. Clearly there's a lot of data here. How many
>> rows does that take? I would further suggest that you partition this table
>> such that there are no more than about 10 million rows per partition (I've
>> done this by using a id % 100 computation). Maybe in your case it makes
>> sense to partition it based on the "what" field, because it appears you are
>> trying to do that with your partial indexes already.
>>
>> On Wed, Apr 19, 2017 at 10:11 PM, Samuel Williams
>>  wrote:
>>>
>>> Okay, so after changing longitude/latitude to float4, and
>>> re-organizing the table a bit, I got the query down from about 8
>>> minutes to 40 seconds.
>>>
>>> The details are in the gist comments:
>>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
>>>
>>> Now, just need to get performance another 2 orders of magnitude
>>> better. Can we make the index a bit more coarse grained, perhaps
>>> convert long/lat to integers or something, use a hilbert index, use a
>>> postgis index... ideas?
>>>
>>>
>>> --
>>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-general
>>
>>
>
>
>
> --
> TimescaleDB | Growth & Developer Evangelism
> c: 908.581.9509
>
> 335 Madison Ave.
> New York, NY 10017
> www.timescale.com
> https://github.com/timescale/timescaledb


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-20 Thread Scott Marlowe
On Thu, Apr 20, 2017 at 6:30 AM, Vick Khera  wrote:
> I'm curious why you have so many partial indexes. Are you trying to make
> custom indexes per query? It seems to me you might want to consider making
> the indexes general, and remove the redundant ones (that have the same
> prefix list of indexed fields).
>
> Secondly your table is 102Gb. Clearly there's a lot of data here. How many
> rows does that take? I would further suggest that you partition this table
> such that there are no more than about 10 million rows per partition (I've
> done this by using a id % 100 computation). Maybe in your case it makes
> sense to partition it based on the "what" field, because it appears you are
> trying to do that with your partial indexes already.

I would think a two field index might be just as effective and not
require a lot of maintenance etc.
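
For illustration only (the column choice is a guess based on the query
earlier in the thread), such a general two-column index might look like:

-- Hedged sketch of the kind of plain two-column index being suggested;
-- "what" and "created_at" are the columns from the thread's query.
CREATE INDEX index_user_event_what_created_at
    ON user_event (what, created_at);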

-- 
To understand recursion, one must first understand recursion.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-20 Thread Andrew Staller
Awesome thread.

Samuel,

Just wanted you to be aware of the work we're doing at TimescaleDB (
http://www.timescale.com/), a time-series database extension for PostgreSQL.

Some of how we might help you:
- automatic partitioning by space (primary key - like country_id, for
instance) and time. This creates "chunks" of your data, right-sized by
volume and time constraints (which makes inserts fast at much greater scale
than Vanilla PostgreSQL - not sure if this is a problem for you)
- this will also help if your queries are selective on time and country_id
(per this example)
- the partitioning by time allows you to DROP old chunks without the need
for vacuums
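
As a rough illustration of what this looks like in practice (a sketch only:
it assumes the TimescaleDB extension is installed and that create_hypertable()
takes a time column plus an optional space-partitioning column, so treat the
exact arguments, and whether an already-populated table can be converted, as
assumptions):

-- Hypothetical sketch: turn the table into a hypertable partitioned by
-- time (created_at) and by a space key such as country_id.
CREATE EXTENSION IF NOT EXISTS timescaledb;
SELECT create_hypertable('user_event', 'created_at', 'country_id', 4);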

On Thu, Apr 20, 2017 at 8:30 AM, Vick Khera  wrote:

> I'm curious why you have so many partial indexes. Are you trying to make
> custom indexes per query? It seems to me you might want to consider making
> the indexes general, and remove the redundant ones (that have the same
> prefix list of indexed fields).
>
> Secondly your table is 102Gb. Clearly there's a lot of data here. How many
> rows does that take? I would further suggest that you partition this table
> such that there are no more than about 10 million rows per partition (I've
> done this by using a id % 100 computation). Maybe in your case it makes
> sense to partition it based on the "what" field, because it appears you are
> trying to do that with your partial indexes already.
>
> On Wed, Apr 19, 2017 at 10:11 PM, Samuel Williams <
> space.ship.travel...@gmail.com> wrote:
>
>> Okay, so after changing longitude/latitude to float4, and
>> re-organizing the table a bit, I got the query down from about 8
>> minutes to 40 seconds.
>>
>> The details are in the gist comments:
>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
>>
>> Now, just need to get performance another 2 orders of magnitude
>> better. Can we make the index a bit more coarse grained, perhaps
>> convert long/lat to integers or something, use a hilbert index, use a
>> postgis index... ideas?
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>
>


-- 
TimescaleDB | Growth & Developer Evangelism
c: 908.581.9509

335 Madison Ave.
New York, NY 10017
www.timescale.com
https://github.com/timescale/timescaledb


Re: [GENERAL] Large data and slow queries

2017-04-20 Thread Vick Khera
I'm curious why you have so many partial indexes. Are you trying to make
custom indexes per query? It seems to me you might want to consider making
the indexes general, and remove the redundant ones (that have the same
prefix list of indexed fields).

Secondly, your table is 102 GB. Clearly there's a lot of data here. How many
rows does that take? I would further suggest that you partition this table
such that there are no more than about 10 million rows per partition (I've
done this by using an id % 100 computation). Maybe in your case it makes
sense to partition it based on the "what" field, because it appears you are
trying to do that with your partial indexes already.
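
On 9.6, where declarative partitioning does not exist yet, that kind of split
is usually done with table inheritance plus CHECK constraints. A minimal
sketch, with made-up child-table names:

-- Hypothetical sketch of inheritance-based partitioning on an id % 100 key.
-- Constraint exclusion only helps if queries filter on the same expression.
CREATE TABLE user_event_p00 (CHECK (id % 100 = 0)) INHERITS (user_event);
CREATE TABLE user_event_p01 (CHECK (id % 100 = 1)) INHERITS (user_event);
-- ...repeat for the remaining partitions, and add a trigger on user_event
-- (or application logic) to route inserts to the right child table.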

On Wed, Apr 19, 2017 at 10:11 PM, Samuel Williams <
space.ship.travel...@gmail.com> wrote:

> Okay, so after changing longitude/latitude to float4, and
> re-organizing the table a bit, I got the query down from about 8
> minutes to 40 seconds.
>
> The details are in the gist comments:
> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
>
> Now, just need to get performance another 2 orders of magnitude
> better. Can we make the index a bit more coarse grained, perhaps
> convert long/lat to integers or something, use a hilbert index, use a
> postgis index... ideas?
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Samuel Williams
Okay, so after changing longitude/latitude to float4, and
re-organizing the table a bit, I got the query down from about 8
minutes to 40 seconds.

The details are in the gist comments:
https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121

Now, just need to get performance another 2 orders of magnitude
better. Can we make the index a bit more coarse-grained, perhaps
convert long/lat to integers or something, use a Hilbert index, use a
PostGIS index... ideas?


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Alban Hertroys

> On 19 Apr 2017, at 12:58, Samuel Williams  
> wrote:
> 
> It's interesting you talk about using multiple indexes. In
> MySQL/MariaDB and derivatives, I've never seen the query planner
> consider using multiple indexes. So, it's possible that Postgres may
> use multiple indexes if it saves time? Or do you mean, doing something
> like manually joining the data and leveraging the different indexes
> explicitly?

PG is capable of doing bitmap heap scans to combine results from multiple 
indices, among other things.
Whether that will actually improve performance in this case I don't know, but 
it's worth a try I think.

> The correlation between user_id and location... well, it's somewhat
> temporally related.

So users are constantly moving around but happen to be at the same locations at 
regular intervals?

In my experience, people don't usually move around much, so you should 
certainly be able to pinpoint them mostly to a specific area, right? (Hence my 
suggestions for a country column or partitioning in squares)


> On 19 April 2017 at 22:50, Alban Hertroys  wrote:
>> 
>>> On 19 Apr 2017, at 6:01, Samuel Williams  
>>> wrote:
>>> 
>>> Hi.
>>> 
>>> We have 400,000,000 records in a table (soon to be 800,000,000), here
>>> is the schema (\d+)
>>> 
>>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
>>> 
>>> We want the following kinds of query to be fast:
>>> 
>>> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
>>> "user_event"."what" IN ('poll', 'location_change',
>>> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
>>> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
>>> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
>>> 175.0805140220076);
>>> 
>>> We have a btree index and it appears to be working. However, it's
>>> still pretty slow.
>>> 
>>> EXPLAIN ANALYZE gives the following:
>>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314
>>> 
>>> I'm thinking that I need to do the following to help:
>>> 
>>> CLUSTER user_event ON index_user_event_for_visits_3 followed by
>>> analyze... Our data is mostly time series but sometimes we get some
>>> dumps with historical records.
>> 
>> It seems to me that a large part of the problem is that the server has to 
>> scan all index entries from that date and within those location bounds to 
>> find that the distinct user id's in that set are about 114,000 out of 1.7M 
>> rows matching the selection-criteria. If it could stop at the first location 
>> for each user, it would have to scan less than a 10th of the index entries 
>> that it has to scan now...
>> 
>> How high is the correlation between user id and location? That can probably 
>> be leveraged…
>> Perhaps you'd get better performance if you'd use multiple indices instead 
>> of stuffing everything into a single purpose-specific one? I would suggest 
>> one on (user_id, latitude, longitude) or (latitude, longitude, user_id) and 
>> one on (created_at, user_id), or perhaps (created_at, latitude, longitude). 
>> That approach could also reduce the number of indices you have on that 
>> table, as well as their sizes, making it all fit into cache a little easier. 
>> Then again, additional operations will be required to combine them.
>> 
>> For a different approach; It may be possible to enrich your data with 
>> something that is easy to index and query, with a high correlation to 
>> (latitude, longitude). That could also be used to partition over. Country 
>> seems a good candidate here, unless all your data-points are in New Zealand 
>> like the above?
>> Then again, some countries are a lot larger, with a higher population, than 
>> others. And populations can be highly concentrated (Delhi, Moscow to name a 
>> few).
>> Another option is to divide the location space up into squares of a fixed 
>> size, with a partition for each square. About 80% of those squares are 
>> unpopulated though, being at sea.
>> 
>> Created_at is a very good candidate for partitioning too, especially if you 
>> don't intend to keep data older than a certain age. Truncating or dropping a 
>> partition that you no longer need is quick and easy.
>> 
>> With data-sets this large, I'd think you would want to partition on multiple 
>> dimensions, creating a matrix of partitions under a single master table. I 
>> don't think PG has a simple way of doing that (yet) though; perhaps it's 
>> possible by abusing multiple levels of inheritance, but that sounds like a 
>> bad idea.
>> 
>> And of course, create your partitions sufficiently course to prevent 
>> overburdening the system tables, which would slow down the query planner.
>> 
>> Hopefully there's something useful in my ramblings!
>> 
>> Alban Hertroys
>> --
>> If you can't see the forest for the trees,
>> cut the trees and you'll find there is no forest.
>> 
> 
> 

Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Samuel Williams
Ah right, yeah, it's insert only. So, it's never been vacuumed.

On 20 April 2017 at 01:25, Stephen Frost  wrote:
> Greetings,
>
> * Samuel Williams (space.ship.travel...@gmail.com) wrote:
>> Thanks for all the suggestions Stephen.
>>
>> > That explain analyze shows a whole ton of heap fetches.  When was the
>> last time a VACUUM was run on this table, to build the visibility map?
>> Without the visibility map being current, an Index-Only Scan, as is
>> happening here, can really suck.
>>
>> I'm using the default 9.6 config, I thought that auto-vacuum was on by 
>> default?
>
> Sure, but that doesn't mean the table gets vacuumed.  In particular,
> actual vacuums are only kicked off when the number of *updated* or
> *deleted* tuples passes the autovacuum threshold.  If no updates or
> deletes are done on the table (iow, it's essentially an insert-only or
> insert-mostly table), the autovacuum daemon nevers runs a vacuum on it
> (though it'll run analyze's).
>
> https://www.postgresql.org/docs/current/static/runtime-config-autovacuum.html#GUC-AUTOVACUUM-VACUUM-THRESHOLD
>
> Check pg_stat_all_tables to see when/if the table has actually been
> vacuumed.
>
> Thanks!
>
> Stephen


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Stephen Frost
Greetings,

* Samuel Williams (space.ship.travel...@gmail.com) wrote:
> Thanks for all the suggestions Stephen.
> 
> > That explain analyze shows a whole ton of heap fetches.  When was the
> last time a VACUUM was run on this table, to build the visibility map?
> Without the visibility map being current, an Index-Only Scan, as is
> happening here, can really suck.
> 
> I'm using the default 9.6 config, I thought that auto-vacuum was on by 
> default?

Sure, but that doesn't mean the table gets vacuumed.  In particular,
actual vacuums are only kicked off when the number of *updated* or
*deleted* tuples passes the autovacuum threshold.  If no updates or
deletes are done on the table (iow, it's essentially an insert-only or
insert-mostly table), the autovacuum daemon never runs a vacuum on it
(though it'll still run analyzes).

https://www.postgresql.org/docs/current/static/runtime-config-autovacuum.html#GUC-AUTOVACUUM-VACUUM-THRESHOLD

Check pg_stat_all_tables to see when/if the table has actually been
vacuumed.
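
For example (table name taken from the thread):

-- Check whether (auto)vacuum and analyze have ever run on the table.
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_all_tables
WHERE relname = 'user_event';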

Thanks!

Stephen




Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Samuel Williams
Thanks for all the suggestions Stephen.

> That explain analyze shows a whole ton of heap fetches.  When was the
last time a VACUUM was run on this table, to build the visibility map?
Without the visibility map being current, an Index-Only Scan, as is
happening here, can really suck.

I'm using the default 9.6 config; I thought that auto-vacuum was on by default?



On 20 April 2017 at 00:48, Stephen Frost  wrote:
> Greetings,
>
> * Samuel Williams (space.ship.travel...@gmail.com) wrote:
>> We want the following kinds of query to be fast:
>
> "kinds of query" isn't helpful, you should be reviewing exactly the
> queries you care about because statistics and your exact data set and
> what the exact query you're running is will all have an impact.
>
>> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
>> "user_event"."what" IN ('poll', 'location_change',
>> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
>> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
>> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
>> 175.0805140220076);
>
> Ugh.  You should really look at and consider PostGIS whenever you're
> working with geospatial data.
>
>> We have a btree index and it appears to be working. However, it's
>> still pretty slow.
>
> You're finding some 17M records and then reducing that with DISTINCT to
> only 114k.  Hard to say if it's faster to just brute-force your way
> through that with a HashAgg (as your explain analyze shows), or if a
> loose index scan would work better (check out how to do one in PG here:
> https://wiki.postgresql.org/wiki/Loose_indexscan).
>
>> EXPLAIN ANALYZE gives the following:
>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314
>
> That explain analyze shows a whole ton of heap fetches.  When was the
> last time a VACUUM was run on this table, to build the visibility map?
> Without the visibility map being current, an Index-Only Scan, as is
> happening here, can really suck.
>
>> CLUSTER user_event ON index_user_event_for_visits_3 followed by
>> analyze... Our data is mostly time series but sometimes we get some
>> dumps with historical records.
>>
>> Perhaps add a BRIN index on created_at
>
> BRIN would be helpful if you wanted a smaller index.  That doesn't seem
> to be the case here.
>
>> I'm wondering if... we can use an index to cache, all user_ids seen on
>> a given day. If we change our index to be more discrete, e.g.
>> created_at::date, would this help? The set union of user_ids for 365
>> days should be pretty fast?
>
> Materialized views are often useful, particularly when the results are
> (relatively) slow moving.
>
>> I'm open to any ideas or suggestions, ideally we can keep
>> optimisations within the database, rather than adding a layer of
>> caching on top.
>
> If you need to go the materialized view route, I'd definitely recommend
> doing that in the database rather than outside it.
>
> Thanks!
>
> Stephen


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Stephen Frost
Greetings,

* Samuel Williams (space.ship.travel...@gmail.com) wrote:
> We want the following kinds of query to be fast:

"kinds of query" isn't helpful, you should be reviewing exactly the
queries you care about because statistics and your exact data set and
what the exact query you're running is will all have an impact.

> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
> "user_event"."what" IN ('poll', 'location_change',
> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
> 175.0805140220076);

Ugh.  You should really look at and consider PostGIS whenever you're
working with geospatial data.
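
A minimal sketch of what that could look like, assuming the postgis extension
is available and using the column names from the query above (the geometry
column and index names are made up):

-- Hypothetical sketch: add a geometry column, backfill it, index it with
-- GiST, and query a bounding box instead of four range predicates.
CREATE EXTENSION IF NOT EXISTS postgis;
ALTER TABLE user_event ADD COLUMN geom geometry(Point, 4326);
UPDATE user_event
   SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);
CREATE INDEX index_user_event_on_geom ON user_event USING gist (geom);

SELECT DISTINCT user_id
FROM user_event
WHERE what IN ('poll', 'location_change', 'midnight_location')
  AND created_at >= '2016-04-19 01:23:55'
  AND geom && ST_MakeEnvelope(174.6307139779924, -37.03079375089291,
                              175.0805140220076, -36.67086424910709, 4326);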

> We have a btree index and it appears to be working. However, it's
> still pretty slow.

You're finding some 17M records and then reducing that with DISTINCT to
only 114k.  Hard to say if it's faster to just brute-force your way
through that with a HashAgg (as your explain analyze shows), or if a
loose index scan would work better (check out how to do one in PG here:
https://wiki.postgresql.org/wiki/Loose_indexscan).
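
Sketched against the user_event table (ignoring the other WHERE clauses for
brevity, and assuming an index whose leading column is user_id), the wiki's
recursive-CTE emulation looks roughly like:

-- Hedged sketch of a "loose index scan": walk distinct user_id values one
-- index probe at a time instead of scanning every matching row.
WITH RECURSIVE t AS (
    (SELECT user_id FROM user_event ORDER BY user_id LIMIT 1)
    UNION ALL
    SELECT (SELECT u.user_id
            FROM user_event u
            WHERE u.user_id > t.user_id
            ORDER BY u.user_id
            LIMIT 1)
    FROM t
    WHERE t.user_id IS NOT NULL
)
SELECT user_id FROM t WHERE user_id IS NOT NULL;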

> EXPLAIN ANALYZE gives the following:
> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314

That explain analyze shows a whole ton of heap fetches.  When was the
last time a VACUUM was run on this table, to build the visibility map?
Without the visibility map being current, an Index-Only Scan, as is
happening here, can really suck.

> CLUSTER user_event ON index_user_event_for_visits_3 followed by
> analyze... Our data is mostly time series but sometimes we get some
> dumps with historical records.
> 
> Perhaps add a BRIN index on created_at

BRIN would be helpful if you wanted a smaller index.  That doesn't seem
to be the case here.

> I'm wondering if... we can use an index to cache, all user_ids seen on
> a given day. If we change our index to be more discrete, e.g.
> created_at::date, would this help? The set union of user_ids for 365
> days should be pretty fast?

Materialized views are often useful, particularly when the results are
(relatively) slow moving.

> I'm open to any ideas or suggestions, ideally we can keep
> optimisations within the database, rather than adding a layer of
> caching on top.

If you need to go the materialized view route, I'd definitely recommend
doing that in the database rather than outside it.
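
A minimal sketch of such a materialized view (the view name and the day-level
rollup are assumptions, loosely following the created_at::date idea above):

-- Hypothetical rollup of users seen per day; refresh it on whatever
-- schedule matches how stale the results are allowed to be.
CREATE MATERIALIZED VIEW daily_user_visits AS
SELECT created_at::date AS day, user_id
FROM user_event
WHERE what IN ('poll', 'location_change', 'midnight_location')
GROUP BY 1, 2;

REFRESH MATERIALIZED VIEW daily_user_visits;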

Thanks!

Stephen




Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Martijn Tonies (Upscene Productions)

Hello Vinny, Samuel,



Perhaps I'm missing something, but I'd be interested in the reasoning
behind this.

For column 'what', it seems you have no index on all values, only
indices with specific values for 'what'.

How does this speed up the search? Will PostgreSQL use those indices,
instead of using a generic index on 'what' and optionally other
columns?




That's a "partial index", it only contains records that meet the 
requirements of the index definition.


https://www.postgresql.org/docs/9.5/static/indexes-partial.html

Basically; if you create an index on records where 'name = kees' then if
your query contains "where name=kees" the planner can just load that index
and know that the records in that index will not contain any other names,
saving the need to filter for 'name=kees'.


Martijn that is a good question. It's because we are only concerned
with a subset of events for this index and this particular query. The
query planner can recognise this and use the index correctly. By doing
this, we reduce the size of the index significantly. In the best case,
where we only wanted a few things, the index was reduced from 21GB to
8MB.



Thanks for the answers. I've seen such indices before, but used, for example,
on an UPPER(...) of all values; I never thought of using them to filter out
specific values in order to make the index smaller.


With regards,

Martijn Tonies
Upscene Productions
http://www.upscene.com

Database Workbench - developer tool for Oracle, MS SQL Server, PostgreSQL,
SQL Anywhere, MySQL, InterBase, NexusDB and Firebird. 




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread vinny

On 2017-04-19 13:25, Martijn Tonies (Upscene Productions) wrote:

Samuel, others,

Perhaps I'm missing something, but I'd be interested in the reasoning
behind this.

For column 'what', it seems you have no index on all values, only
indices with specific values for 'what'.

How does this speed up the search? Will PostgreSQL use those indices,
instead of using a generic index on 'what' and optionally other
columns?


With regards,

Martijn Tonies
Upscene Productions
http://www.upscene.com



That's a "partial index", it only contains records that meet the 
requirements of the index definition.


https://www.postgresql.org/docs/9.5/static/indexes-partial.html

Basically; if you create an index on records where 'name = kees' then if
your query contains "where name=kees" the planner can just load that index
and know that the records in that index will not contain any other names,
saving the need to filter for 'name=kees'.
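
Translated to the thread's table, a partial index of that shape might look
like this (a sketch; the index name is made up and the exact column list
depends on the query being served):

-- Hedged sketch: the WHERE clause keeps only the events the query cares
-- about, so the index stays small and the planner can still use it when
-- the query repeats the same predicate.
CREATE INDEX index_user_event_for_visits
    ON user_event (created_at, latitude, longitude, user_id)
    WHERE what IN ('poll', 'location_change', 'midnight_location');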


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Samuel Williams
Martijn that is a good question. It's because we are only concerned
with a subset of events for this index and this particular query. The
query planner can recognise this and use the index correctly. By doing
this, we reduce the size of the index significantly. In the best case,
where we only wanted a few things, the index was reduced from 21GB to
8MB.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Martijn Tonies (Upscene Productions)

Samuel, others,

Perhaps I'm missing something, but I'd be interested in the reasoning behind 
this.


For column 'what', it seems you have no index on all values, only indices 
with specific values for 'what'.


How does this speed up the search? Will PostgreSQL use those indices, 
instead of using a generic index on 'what' and optionally other columns?



With regards,

Martijn Tonies
Upscene Productions
http://www.upscene.com


-Original Message- 
From: Samuel Williams

Sent: Wednesday, April 19, 2017 6:01 AM
To: pgsql-general
Subject: [GENERAL] Large data and slow queries

Hi.

We have 400,000,000 records in a table (soon to be 800,000,000), here
is the schema (\d+)

https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121

We want the following kinds of query to be fast:

SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
"user_event"."what" IN ('poll', 'location_change',
'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
01:23:55') AND (latitude > -37.03079375089291 AND latitude <
-36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
175.0805140220076);

We have a btree index and it appears to be working. However, it's
still pretty slow.

EXPLAIN ANALYZE gives the following:
https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314

I'm thinking that I need to do the following to help:

CLUSTER user_event ON index_user_event_for_visits_3 followed by
analyze... Our data is mostly time series but sometimes we get some
dumps with historical records.

Perhaps add a BRIN index on created_at

I'm wondering if... we can use an index to cache, all user_ids seen on
a given day. If we change our index to be more discrete, e.g.
created_at::date, would this help? The set union of user_ids for 365
days should be pretty fast?

I'm open to any ideas or suggestions, ideally we can keep
optimisations within the database, rather than adding a layer of
caching on top.

Kind regards,
Samuel


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general 




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread vinny

On 2017-04-19 09:48, John R Pierce wrote:

On 4/19/2017 12:31 AM, vinny wrote:
Given the number of records, my first thought was either partitioning 
or partial-indexes.

The fewer rows are in the index, the quicker it will be to check,
and it's not a lot of work to create separate indexes for lat/long 
ranges or dates.


that only works if the planner can figure out which partitions to use
in advance, otherwise it ends up having to scan all the partitions.


--
john r pierce, recycling bits in santa cruz



True, but then again, the proposed queries are quite straight-forward
so I don't expect that to be a problem, really.

Worth a test, if only to see if it helps.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Samuel Williams
Thanks Alban, I appreciate your ideas and thoughts.

I'm a little reluctant to go down the partitioning route, as I think
we'll probably end up sharding out horizontally using Citus Data in
the near future, and naive Postgres partitioning may hamper that
effort.

It's interesting you talk about using multiple indexes. In
MySQL/MariaDB and derivatives, I've never seen the query planner
consider using multiple indexes. So, it's possible that Postgres may
use multiple indexes if it saves time? Or do you mean, doing something
like manually joining the data and leveraging the different indexes
explicitly?

The correlation between user_id and location... well, it's somewhat
temporally related.



On 19 April 2017 at 22:50, Alban Hertroys  wrote:
>
>> On 19 Apr 2017, at 6:01, Samuel Williams  
>> wrote:
>>
>> Hi.
>>
>> We have 400,000,000 records in a table (soon to be 800,000,000), here
>> is the schema (\d+)
>>
>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
>>
>> We want the following kinds of query to be fast:
>>
>> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
>> "user_event"."what" IN ('poll', 'location_change',
>> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
>> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
>> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
>> 175.0805140220076);
>>
>> We have a btree index and it appears to be working. However, it's
>> still pretty slow.
>>
>> EXPLAIN ANALYZE gives the following:
>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314
>>
>> I'm thinking that I need to do the following to help:
>>
>> CLUSTER user_event ON index_user_event_for_visits_3 followed by
>> analyze... Our data is mostly time series but sometimes we get some
>> dumps with historical records.
>
> It seems to me that a large part of the problem is that the server has to 
> scan all index entries from that date and within those location bounds to 
> find that the distinct user id's in that set are about 114,000 out of 1.7M 
> rows matching the selection-criteria. If it could stop at the first location 
> for each user, it would have to scan less than a 10th of the index entries 
> that it has to scan now...
>
> How high is the correlation between user id and location? That can probably 
> be leveraged…
> Perhaps you'd get better performance if you'd use multiple indices instead of 
> stuffing everything into a single purpose-specific one? I would suggest one 
> on (user_id, latitude, longitude) or (latitude, longitude, user_id) and one 
> on (created_at, user_id), or perhaps (created_at, latitude, longitude). That 
> approach could also reduce the number of indices you have on that table, as 
> well as their sizes, making it all fit into cache a little easier. Then 
> again, additional operations will be required to combine them.
>
> For a different approach; It may be possible to enrich your data with 
> something that is easy to index and query, with a high correlation to 
> (latitude, longitude). That could also be used to partition over. Country 
> seems a good candidate here, unless all your data-points are in New Zealand 
> like the above?
> Then again, some countries are a lot larger, with a higher population, than 
> others. And populations can be highly concentrated (Delhi, Moscow to name a 
> few).
> Another option is to divide the location space up into squares of a fixed 
> size, with a partition for each square. About 80% of those squares are 
> unpopulated though, being at sea.
>
> Created_at is a very good candidate for partitioning too, especially if you 
> don't intend to keep data older than a certain age. Truncating or dropping a 
> partition that you no longer need is quick and easy.
>
> With data-sets this large, I'd think you would want to partition on multiple 
> dimensions, creating a matrix of partitions under a single master table. I 
> don't think PG has a simple way of doing that (yet) though; perhaps it's 
> possible by abusing multiple levels of inheritance, but that sounds like a 
> bad idea.
>
> And of course, create your partitions sufficiently course to prevent 
> overburdening the system tables, which would slow down the query planner.
>
> Hopefully there's something useful in my ramblings!
>
> Alban Hertroys
> --
> If you can't see the forest for the trees,
> cut the trees and you'll find there is no forest.
>


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Alban Hertroys

> On 19 Apr 2017, at 6:01, Samuel Williams  
> wrote:
> 
> Hi.
> 
> We have 400,000,000 records in a table (soon to be 800,000,000), here
> is the schema (\d+)
> 
> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
> 
> We want the following kinds of query to be fast:
> 
> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
> "user_event"."what" IN ('poll', 'location_change',
> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
> 175.0805140220076);
> 
> We have a btree index and it appears to be working. However, it's
> still pretty slow.
> 
> EXPLAIN ANALYZE gives the following:
> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314
> 
> I'm thinking that I need to do the following to help:
> 
> CLUSTER user_event ON index_user_event_for_visits_3 followed by
> analyze... Our data is mostly time series but sometimes we get some
> dumps with historical records.

It seems to me that a large part of the problem is that the server has to scan 
all index entries from that date and within those location bounds to find that 
the distinct user id's in that set are about 114,000 out of 1.7M rows matching 
the selection-criteria. If it could stop at the first location for each user, 
it would have to scan less than a 10th of the index entries that it has to scan 
now...

How high is the correlation between user id and location? That can probably be 
leveraged…
Perhaps you'd get better performance if you'd use multiple indices instead of 
stuffing everything into a single purpose-specific one? I would suggest one on 
(user_id, latitude, longitude) or (latitude, longitude, user_id) and one on 
(created_at, user_id), or perhaps (created_at, latitude, longitude). That 
approach could also reduce the number of indices you have on that table, as 
well as their sizes, making it all fit into cache a little easier. Then again, 
additional operations will be required to combine them.
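
As a purely illustrative version of that suggestion:

-- Sketch of splitting the one purpose-specific index into a couple of
-- smaller, more general ones; the exact column orders are assumptions.
CREATE INDEX index_user_event_on_user_location
    ON user_event (user_id, latitude, longitude);
CREATE INDEX index_user_event_on_created_at_user
    ON user_event (created_at, user_id);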

For a different approach; It may be possible to enrich your data with something 
that is easy to index and query, with a high correlation to (latitude, 
longitude). That could also be used to partition over. Country seems a good 
candidate here, unless all your data-points are in New Zealand like the above?
Then again, some countries are a lot larger, with a higher population, than 
others. And populations can be highly concentrated (Delhi, Moscow to name a 
few).
Another option is to divide the location space up into squares of a fixed size, 
with a partition for each square. About 80% of those squares are unpopulated 
though, being at sea.

Created_at is a very good candidate for partitioning too, especially if you 
don't intend to keep data older than a certain age. Truncating or dropping a 
partition that you no longer need is quick and easy.

With data-sets this large, I'd think you would want to partition on multiple 
dimensions, creating a matrix of partitions under a single master table. I 
don't think PG has a simple way of doing that (yet) though; perhaps it's 
possible by abusing multiple levels of inheritance, but that sounds like a bad 
idea.

And of course, create your partitions sufficiently coarse to prevent
overburdening the system tables, which would slow down the query planner.

Hopefully there's something useful in my ramblings!

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread Samuel Williams
> Did that 50% performance gain come from just the datatype, or that fact that 
> the index became smaller?

How would one measure this?

On 19 April 2017 at 19:48, John R Pierce  wrote:
> On 4/19/2017 12:31 AM, vinny wrote:
>>
>> Given the number of records, my first thought was either partitioning or
>> partial-indexes.
>> The fewer rows are in the index, the quicker it will be to check,
>> and it's not a lot of work to create separate indexes for lat/long ranges
>> or dates.
>
>
> that only works if the planner can figure out which partitions to use in
> advance, otherwise it ends up having to scan all the partitions.
>
>
>
> --
> john r pierce, recycling bits in santa cruz
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread John R Pierce

On 4/19/2017 12:31 AM, vinny wrote:
Given the number of records, my first thought was either partitioning 
or partial-indexes.

The fewer rows are in the index, the quicker it will be to check,
and it's not a lot of work to create separate indexes for lat/long 
ranges or dates.


that only works if the planner can figure out which partitions to use in 
advance, otherwise it ends up having to scan all the partitions.



--
john r pierce, recycling bits in santa cruz



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-19 Thread vinny

On 2017-04-19 07:04, Samuel Williams wrote:

Thanks John. Yes, you are absolutely right, you want the index to be
bottom heavy so you can cull as much as possible at the top. I'm
familiar with that, once implementing a brute-force sudoku solver, it
has the same principle.

I've been working on this all afternoon. By reducing the longitude,
latitude columns to float4, in my test cases, I found about 50%
improvement in performance. It may also use less space. So part of the
problem was my choice of data type. We've computed that float4 has a
worst case precision of about 1.6m which we are okay with for
analytics data.

Another option we may consider is using a (signed) integer - e.g.
longitude = 180*(v/2^31) and latitude = 180*(v/2^31) as this has a
uniform error across all points, but it's a bit more cumbersome to
handle. Is there a rational datatype in postgres which works like
this?



On 19 April 2017 at 16:42, John R Pierce  wrote:

On 4/18/2017 9:01 PM, Samuel Williams wrote:


We want the following kinds of query to be fast:

SELECT ... AND (latitude > -37.03079375089291 AND latitude <
-36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
175.0805140220076);




I wonder if GIST would work better if you use the native POINT type, 
and

compared it like

mypoint <@ BOX
'((174.6307139779924,-37.03079375089291),(175.0805140220076,-36.67086424910709
))'

with a gist index on mypoint...

but, it all hinges on which clauses in your query are most selective, 
thats

where you want an index.

--
john r pierce, recycling bits in santa cruz



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Did that 50% performance gain come from just the datatype, or the fact
that the index became smaller?


Given the number of records, my first thought was either partitioning or 
partial-indexes.

The fewer rows are in the index, the quicker it will be to check,
and it's not a lot of work to create separate indexes for lat/long 
ranges or dates.




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-18 Thread Samuel Williams
Thanks John. Yes, you are absolutely right: you want the index to be
bottom-heavy so you can cull as much as possible at the top. I'm
familiar with that principle from once implementing a brute-force
sudoku solver.

I've been working on this all afternoon. By reducing the longitude,
latitude columns to float4, in my test cases, I found about 50%
improvement in performance. It may also use less space. So part of the
problem was my choice of data type. We've computed that float4 has a
worst case precision of about 1.6m which we are okay with for
analytics data.
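
For reference, the column change being described would be something along
these lines (a sketch; it rewrites the whole table, so it is not a free
operation on 400M rows):

-- float4 is "real" in SQL; if the columns were double precision this halves
-- per-coordinate storage, at the ~1.6 m worst-case precision noted above.
ALTER TABLE user_event
    ALTER COLUMN latitude  TYPE real,
    ALTER COLUMN longitude TYPE real;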

Another option we may consider is using a (signed) integer - e.g.
longitude = 180*(v/2^31) and latitude = 180*(v/2^31) as this has a
uniform error across all points, but it's a bit more cumbersome to
handle. Is there a rational datatype in postgres which works like
this?



On 19 April 2017 at 16:42, John R Pierce  wrote:
> On 4/18/2017 9:01 PM, Samuel Williams wrote:
>>
>> We want the following kinds of query to be fast:
>>
>> SELECT ... AND (latitude > -37.03079375089291 AND latitude <
>> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
>> 175.0805140220076);
>
>
>
> I wonder if GIST would work better if you use the native POINT type, and
> compared it like
>
> mypoint <@ BOX
> '((174.6307139779924,-37.03079375089291),(175.0805140220076,-36.67086424910709
> ))'
>
> with a gist index on mypoint...
>
> but, it all hinges on which clauses in your query are most selective, thats
> where you want an index.
>
> --
> john r pierce, recycling bits in santa cruz
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-18 Thread John R Pierce

On 4/18/2017 9:01 PM, Samuel Williams wrote:

We want the following kinds of query to be fast:

SELECT ... AND (latitude > -37.03079375089291 AND latitude <
-36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
175.0805140220076);



I wonder if GIST would work better if you use the native POINT type, and 
compared it like


mypoint <@ BOX '((174.6307139779924,-37.03079375089291),(175.0805140220076,-36.67086424910709))'


with a gist index on mypoint...

but it all hinges on which clauses in your query are most selective;
that's where you want an index.
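
Spelled out against the thread's columns (a sketch; the expression index
needs the extra parentheses, and whether the planner actually prefers it
over the btree would still need testing):

-- Hedged sketch: a GiST expression index over a point built from the
-- existing longitude/latitude columns, queried with the box literal above.
CREATE INDEX index_user_event_on_point
    ON user_event USING gist ((point(longitude, latitude)));

SELECT DISTINCT user_id
FROM user_event
WHERE point(longitude, latitude) <@ box
      '((174.6307139779924,-37.03079375089291),(175.0805140220076,-36.67086424910709))';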


--
john r pierce, recycling bits in santa cruz



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Large data and slow queries

2017-04-18 Thread Samuel Williams
Oh, I've also tried earth distance and ll_to_earth in a GIST index...
it was slower than the BTREE index on a small subset of data in my
tests.

On 19 April 2017 at 16:01, Samuel Williams
 wrote:
> Hi.
>
> We have 400,000,000 records in a table (soon to be 800,000,000), here
> is the schema (\d+)
>
> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
>
> We want the following kinds of query to be fast:
>
> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
> "user_event"."what" IN ('poll', 'location_change',
> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
> 175.0805140220076);
>
> We have a btree index and it appears to be working. However, it's
> still pretty slow.
>
> EXPLAIN ANALYZE gives the following:
> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314
>
> I'm thinking that I need to do the following to help:
>
> CLUSTER user_event ON index_user_event_for_visits_3 followed by
> analyze... Our data is mostly time series but sometimes we get some
> dumps with historical records.
>
> Perhaps add a BRIN index on created_at
>
> I'm wondering if... we can use an index to cache, all user_ids seen on
> a given day. If we change our index to be more discrete, e.g.
> created_at::date, would this help? The set union of user_ids for 365
> days should be pretty fast?
>
> I'm open to any ideas or suggestions, ideally we can keep
> optimisations within the database, rather than adding a layer of
> caching on top.
>
> Kind regards,
> Samuel


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general