> On Jan 5, 2017, at 10:38 AM, Paul Ramsey <pram...@cleverelephant.ca> wrote:
> 
> Yes, you did. You want a query that spits out a tupleset of geometries (one 
> for each wee segment), and then you can join that set to your main table 
> using st_dwithin() as the join clause.
> So start by ditching the main table and just work on a query that generates a 
> pile of wee segments.

Ahhh, I see you've done this sort of thing before 
(http://blog.cleverelephant.ca/2015/02/breaking-linestring-into-segments.html) :-)

So following that advice I came up with the following query:

-- Densify the route so no segment exceeds ~600 m (ST_Segmentize on geography
-- takes meters), dump the vertices, then re-pair consecutive vertices into
-- short two-point lines.
WITH dump AS (
    SELECT ST_DumpPoints(
        ST_Segmentize(
            ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
            600
        )::geometry
    ) AS pt
),
pts AS (
    SELECT (pt).geom, (pt).path[1] AS vert FROM dump
)
SELECT elevation
FROM data
INNER JOIN (
    SELECT ST_MakeLine(ARRAY[a.geom, b.geom]) AS short_line
    FROM pts a
    INNER JOIN pts b
    ON a.vert = b.vert - 1 AND b.vert > 1  -- consecutive vertex pairs
) segments
ON ST_DWithin(location, segments.short_line, 600)
ORDER BY elevation DESC LIMIT 1;

Which yields the following EXPLAIN ANALYZE (https://explain.depesz.com/s/RsTD):

                                                                                
                                 QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=11611706.90..11611706.91 rows=1 width=4) (actual time=1171.814..1171.814 rows=1 loops=1)
   CTE dump
     ->  Result  (cost=0.00..5.25 rows=1000 width=32) (actual time=0.024..1.989 rows=1939 loops=1)
   CTE pts
     ->  CTE Scan on dump  (cost=0.00..20.00 rows=1000 width=36) (actual time=0.032..4.071 rows=1939 loops=1)
   ->  Sort  (cost=11611681.65..11611768.65 rows=34800 width=4) (actual time=1171.813..1171.813 rows=1 loops=1)
         Sort Key: data.elevation DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Nested Loop  (cost=0.55..11611507.65 rows=34800 width=4) (actual time=0.590..1167.615 rows=28408 loops=1)
               ->  Nested Loop  (cost=0.00..8357.50 rows=1665 width=64) (actual time=0.046..663.475 rows=1938 loops=1)
                     Join Filter: (a.vert = (b.vert - 1))
                     Rows Removed by Join Filter: 3755844
                     ->  CTE Scan on pts b  (cost=0.00..22.50 rows=333 width=36) (actual time=0.042..0.433 rows=1938 loops=1)
                           Filter: (vert > 1)
                           Rows Removed by Filter: 1
                     ->  CTE Scan on pts a  (cost=0.00..20.00 rows=1000 width=36) (actual time=0.000..0.149 rows=1939 loops=1938)
               ->  Index Scan using location_gix on data  (cost=0.55..6968.85 rows=1 width=36) (actual time=0.085..0.256 rows=15 loops=1938)
                     Index Cond: (location && _st_expand((st_makeline(ARRAY[a.geom, b.geom]))::geography, '600'::double precision))
                     Filter: (((st_makeline(ARRAY[a.geom, b.geom]))::geography && _st_expand(location, '600'::double precision)) AND _st_dwithin(location, (st_makeline(ARRAY[a.geom, b.geom]))::geography, '600'::double precision, true))
                     Rows Removed by Filter: 7
 Planning time: 4.318 ms
 Execution time: 1171.994 ms
(22 rows)

So not bad. Went from 20+ seconds to a little over 1 second. Still noticeable 
for an end user, but definitely usable - and as mentioned, that's a worst-case 
scenario query. Thanks!

Of course, if you have any suggestions for further improvement, I'm all ears :-)
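
One idea I may try myself (untested sketch on my part, so the rewrite and any 
performance impact are purely my assumptions): the plan above shows ~3.7 million 
rows removed by the a.vert = b.vert - 1 join filter, plus the pts CTE being 
rescanned 1,938 times, so pairing each vertex with its successor via a lead() 
window function instead of the CTE self-join might trim that part of the cost:

-- Untested sketch (my rewrite, not from the thread)
WITH pts AS (
    SELECT dp.geom, dp.path[1] AS vert
    FROM ST_DumpPoints(
        ST_Segmentize(
            ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
            600
        )::geometry
    ) AS dp
),
segments AS (
    -- lead() pairs each vertex with the next one in path order,
    -- replacing the O(n^2) self-join and its join filter
    SELECT ST_MakeLine(geom, lead(geom) OVER (ORDER BY vert)) AS short_line
    FROM pts
)
SELECT elevation
FROM data
JOIN segments ON ST_DWithin(location, segments.short_line, 600)
WHERE segments.short_line IS NOT NULL  -- the last vertex has no successor
ORDER BY elevation DESC LIMIT 1;

No idea yet whether the planner actually does better with that - CTEs are still 
optimization fences on 9.6 - but at least it avoids scanning pts once per row.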
-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------

> 
> On Thu, Jan 5, 2017 at 11:36 AM, Israel Brewster <isr...@ravnalaska.net> wrote:
> On Jan 5, 2017, at 8:50 AM, Paul Ramsey <pram...@cleverelephant.ca> wrote:
>> 
>> The index filters using bounding boxes.  A long, diagonal route will have a 
>> large bounding box, relative to the area you actually care about (within a 
>> narrow strip of the route). Use ST_Segmentize() to add points to your route, 
>> ST_DumpPoints() to dump those out as points, and ST_MakeLine() to generate new 
>> lines from those points, each line very short. The maximum index 
>> effectiveness will come when your line length is close to your buffer width.
>> 
>> P
> 
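(Rereading this with the working query at the top in hand: the piece I was 
missing below is that the dumped points need to be re-paired into many short 
two-point lines, not collapsed back into a single line. A minimal segment 
generator - untested, with pts defined as in the query at the top - would be 
something like:

SELECT ST_MakeLine(ARRAY[a.geom, b.geom]) AS short_line
FROM pts a
JOIN pts b ON b.vert = a.vert + 1;  -- one row per consecutive vertex pair

That tupleset of short lines is what gets joined against the main table.)
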
> Ok, I think I understand the concept. So attempting to follow your advice, I 
> modified the query to be:
> 
> SELECT elevation
> FROM data
> WHERE
>     ST_DWithin(
>         location,
>         (SELECT ST_MakeLine(geom)::geography AS split_line
>          FROM (SELECT
>              (ST_DumpPoints(
>                  ST_Segmentize(
>                      ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
>                      600
>                  )::geometry
>              )).geom
>          ) s1),
>         600
>     )
> ORDER BY elevation DESC limit 1;
> 
> It took some fiddling to find a syntax that PostgreSQL would accept, but 
> eventually that's what I came up with. Unfortunately, far from improving 
> performance, it killed it - in running the query, it went from 22 seconds to 
> several minutes (EXPLAIN ANALYZE has yet to return a result). Looking at the 
> query execution plan shows, at least partially, why:
> 
>                                   QUERY PLAN                                  
> ------------------------------------------------------------------------------
>  Limit  (cost=17119748.98..17119748.98 rows=1 width=4)
>    InitPlan 1 (returns $0)
>      ->  Aggregate  (cost=17.76..17.77 rows=1 width=32)
>            ->  Result  (cost=0.00..5.25 rows=1000 width=32)
>    ->  Sort  (cost=17119731.21..17171983.43 rows=20900890 width=4)
>          Sort Key: data.elevation DESC
>          ->  Seq Scan on data  (cost=0.00..17015226.76 rows=20900890 width=4)
>                Filter: st_dwithin(location, $0, '600'::double precision)
> (8 rows)
> 
> So apparently it is now doing a sequential scan on data rather than using the 
> index. And, of course, sorting 20 million rows is not trivial either. Did I 
> do something wrong with forming the query?
> 
> -----------------------------------------------
> Israel Brewster
> Systems Analyst II
> Ravn Alaska
> 5245 Airport Industrial Rd
> Fairbanks, AK 99709
> (907) 450-7293
> -----------------------------------------------
> 
>> 
>> On Thu, Jan 5, 2017 at 9:45 AM, Israel Brewster <isr...@ravnalaska.net> wrote:
>> I have a database (PostgreSQL 9.6.1) containing 62,702,675 rows of latitude 
>> (numeric), longitude (numeric), and elevation (integer) data, along with a 
>> PostGIS (2.3.0) geometry column (location), running on a CentOS 6.8 box with 
>> 64GB RAM and a RAID10 SSD data drive. I'm trying to get the maximum elevation 
>> along a path, for which purpose I've come up with the following query (for 
>> one particular path example):
>> 
>> SELECT elevation
>> FROM data
>> WHERE ST_DWithin(location, ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'), 600)
>> ORDER BY elevation DESC LIMIT 1;
>> 
>> The EXPLAIN ANALYZE output of this particular query 
>> (https://explain.depesz.com/s/heZ) shows:
>> 
>>                                   QUERY PLAN
>> ------------------------------------------------------------------------------
>>  Limit  (cost=4.83..4.83 rows=1 width=4) (actual time=22653.840..22653.842 rows=1 loops=1)
>>    ->  Sort  (cost=4.83..4.83 rows=1 width=4) (actual time=22653.837..22653.837 rows=1 loops=1)
>>          Sort Key: elevation DESC
>>          Sort Method: top-N heapsort  Memory: 25kB
>>          ->  Index Scan using location_gix on data  (cost=0.42..4.82 rows=1 width=4) (actual time=15.786..22652.041 rows=11081 loops=1)
>>                Index Cond: (location && '0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography)
>>                Filter: (('0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography && _st_expand(location, '600'::double precision)) AND _st_dwithin(location, '0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography, '600'::double precision, true))
>>                Rows Removed by Filter: 4934534
>>  Planning time: 0.741 ms
>>  Execution time: 22653.906 ms
>> (10 rows)
>> 
>> So it is using the index properly, but still takes a good 22 seconds to run, 
>> most of which appears to be in the Index Scan.
>> 
>> Is there any way to improve this, or is this going to be about as good as it 
>> gets with the number of rows being dealt with? I was planning to use this 
>> for a real-time display - punch in a couple of points, get some information 
>> about the route between them, including maximum elevation - but at 22 seconds 
>> for the longer routes, that doesn't make for the best user experience.
>> 
>> It's perhaps worth noting that the example above is most likely a worst-case 
>> scenario. I would expect the vast majority of routes to be significantly 
>> shorter, and I want to say the shorter routes query much faster [testing 
>> needed]. That said, the faster the better, even for short routes :-)
>> -----------------------------------------------
>> Israel Brewster
>> Systems Analyst II
>> Ravn Alaska
>> 5245 Airport Industrial Rd
>> Fairbanks, AK 99709
>> (907) 450-7293
>> -----------------------------------------------
>> 