Re: [GENERAL] Improve PostGIS performance with 62 million rows?

Israel Brewster Thu, 05 Jan 2017 11:37:32 -0800

On Jan 5, 2017, at 8:50 AM, Paul Ramsey <pram...@cleverelephant.ca> wrote:
> 
> The index filters using bounding boxes.  A long, diagonal route will have a 
> large bounding box, relative to the area you actually care about (within a 
> narrow strip of the route). Use ST_Segmentize() to add points to your route, 
> ST_DumpPoints() to dump those out as point and ST_MakeLine to generate new 
> lines from those points, each line very short. The maximum index 
> effectiveness will come when your line length is close to your buffer width.
> 
> P


Ok, I think I understand the concept. So attempting to follow your advice, I 
modified the query to be:

SELECT elevation
FROM data
WHERE
    ST_DWithin(
        location,
        (SELECT ST_MakeLine(geom)::geography as split_line
         FROM (SELECT
        (ST_DumpPoints(
            ST_Segmentize(
                ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 
61.179167,-156.77 71.285833)'),
                600
            )::geometry
        )).geom
    ) s1),
        600
    )
ORDER BY elevation DESC limit 1;

It took some fiddling to find a syntax that Postgresql would accept, but 
eventually that's what I came up with. Unfortunately, far from improving 
performance, it killed it - in running the query, it went from 22 seconds to 
several minutes (EXPLAIn ANALYZE has yet to return a result). Looking at the 
query execution plan shows, at least partially, why:

                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Limit  (cost=17119748.98..17119748.98 rows=1 width=4)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=17.76..17.77 rows=1 width=32)
           ->  Result  (cost=0.00..5.25 rows=1000 width=32)
   ->  Sort  (cost=17119731.21..17171983.43 rows=20900890 width=4)
         Sort Key: data.elevation DESC
         ->  Seq Scan on data  (cost=0.00..17015226.76 rows=20900890 width=4)
               Filter: st_dwithin(location, $0, '600'::double precision)
(8 rows)

So apparently it is now doing a sequential scan on data rather than using the 
index. And, of course, sorting 20 million rows is not trivial either. Did I do 
something wrong with forming the query?

-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------

> 
> On Thu, Jan 5, 2017 at 9:45 AM, Israel Brewster <isr...@ravnalaska.net 
> <mailto:isr...@ravnalaska.net>> wrote:
> I have a database (PostgreSQL 9.6.1) containing 62,702,675 rows of latitude 
> (numeric), longitude(numeric), elevation(integer) data, along with a PostGIS 
> (2.3.0) geometry column (location), running on a CentOS 6.8 box with 64GB RAM 
> and a RAID10 SSD data drive. I'm trying to get the maximum elevation along a 
> path, for which purpose I've come up with the following query (for one 
> particular path example):
> 
> SELECT elevation FROM data                                                    
>                                                                               
>                                                                               
>                                                                               
>                                                                               
>     WHERE ST_DWithin(location, 
> ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 
> 71.285833)'), 600)                                                            
>                                                                               
>                                                                               
>                                                       ORDER BY elevation 
> LIMIT 1;
> 
> The EXPLAIN ANALYZE output of this particular query 
> (https://explain.depesz.com/s/heZ <https://explain.depesz.com/s/heZ>) shows:
> 
>                                                                               
>                                                                               
>           QUERY PLAN                                                          
>                                                                               
>                               
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=4.83..4.83 rows=1 width=4) (actual time=22653.840..22653.842 
> rows=1 loops=1)
>    ->  Sort  (cost=4.83..4.83 rows=1 width=4) (actual 
> time=22653.837..22653.837 rows=1 loops=1)
>          Sort Key: elevation DESC
>          Sort Method: top-N heapsort  Memory: 25kB
>          ->  Index Scan using location_gix on data  (cost=0.42..4.82 rows=1 
> width=4) (actual time=15.786..22652.041 rows=11081 loops=1)
>                Index Cond: (location && 
> '0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography)
>                Filter: 
> (('0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography
>  && _st_expand(location, '600'::double precision)) AND _st_dwithin(location, 
> '0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography,
>  '600'::double precision, true))
>                Rows Removed by Filter: 4934534
>  Planning time: 0.741 ms
>  Execution time: 22653.906 ms
> (10 rows)
> 
> So it is using the index properly, but still takes a good 22 seconds to run, 
> most of which appears to be in the Index Scan.
> 
> Is there any way to improve this, or is this going to be about as good as it 
> gets with the number of rows being dealt with? I was planning to use this for 
> a real-time display - punch in a couple of points, get some information about 
> the route between, including maximum elevation - but with it taking 22 
> seconds for the longer routes at least, that doesn't make for the best user 
> experience.
> 
> It's perhaps worth noting that the example above is most likely a worst case 
> scenario. I would expect the vast majority of routes to be significantly 
> shorter, and I want to say the shorter routes query much faster [testing 
> needed]. That said, the faster the better, even for short routes :-)
> -----------------------------------------------
> Israel Brewster
> Systems Analyst II
> Ravn Alaska
> 5245 Airport Industrial Rd
> Fairbanks, AK 99709
> (907) 450-7293 <tel:(907)%20450-7293>
> -----------------------------------------------
> 
> 
> 
> 
> 
>

Re: [GENERAL] Improve PostGIS performance with 62 million rows?

Reply via email to