[postgis-users] Non-linear time cost; please suggest a better way to structure the query

Rory Meyer Wed, 23 Feb 2022 06:14:23 -0800

Afternoon all,

I've got a database full of GPS points (along with associated data like speed, 
bearing, time, GPS_ID, class etc) and I'm trying to do complex aggregations 
with the data. I'm trying to build up a "heatmap" of the data by first creating 
a grid of polygons using ST_HexagonGrid and then using a window function to 
overlay the lines (created from a lead/lag window of each GPS point and the 
next one from the same GPS_ID) over the grid. I'd like to get the the number of 
seconds that gps carrying vehicles spend in each hex cell, grouped by class, 
speed, date, direction etc etc. The end goal would be to query a lon/lat and 
get a bunch of aggregated data for different classes, speed/bearing 
distributions.


Here's a simplified look at the SQL (sorry, it's not really simple...):

'''
SELECT
    grid.gid,
    grid.geom,
    avg(traj.bearing, 511.0) AS avg_bearing,
    avg(traj.time_delta) AS avg_time_delta,
    sum(((st_length(st_intersection(traj.traj, grid.geom)) * traj.time_delta) / 
traj.traj_dist)) AS cum_time_in_grid
FROM (
(my_hex_grid AS grid LEFT JOIN ( SELECT
                                                         subquery.gps_id,
                                                         subquery.event_date,
                                                         subquery.bearing,
                                                         subquery.time_delta,
                                                         
st_makeline(subquery.pos, subquery.pos2) AS traj,
                                                         
st_distance(subquery.pos, subquery.pos2) AS traj_dist
                                                         FROM (
                                                         SELECT
                                                         gps.mmsi,
                                                         
date_part('epoch'::text, (lead(gps.event_time) OVER time_order - 
gps.event_time)) AS                                                             
   time_delta,
                                                         gps.geom,
                                                         gps.bearing,
                                                         lead(gps.geom) OVER 
time_order AS geom2
                                                          FROM gps
                                                         WHERE ((gps.event_time 
>= '<Start Time>') AND (gps.event_time <= '<End Time>'))
                                                         WINDOW time_order AS 
(PARTITION BY gps.gps_id ORDER BY gps.event_time)) as subquery
                                  ON (st_intersects(gps.traj, grid.geom)))
  GROUP BY grid.gid, grid.geom
'''

My issue is that I've got a non-linear increase in time that the query takes to 
complete. If <Start Time> to <End Time> is a couple of hours then it's takes a 
couple of seconds to run. If it's for a day, it takes a couple minutes to run. 
If it's for a week it takes a couple of hours.

I'd like to run this for over a year of data but that won't be feasible at this 
rate.

Is there some way to avoid this non-linear increase in time or would it be best 
to just write some python code to loop through smaller chunks of data and write 
the results somewhere?

Regards,
Rory

_______________________________________________
postgis-users mailing list
postgis-users@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users

[postgis-users] Non-linear time cost; please suggest a better way to structure the query

Reply via email to