Re: [postgis-users] Improve 3D BB query on large set of lines

2017-01-20 Thread Tom Kazimiers

Hi Rémi,

Thanks a lot for your thorough reply. Especially scenario A.1 has helped 
me to improve performance by over an order of magnitude for our most 
common use case. My reply got a bit longer, sorry; I wanted to provide a 
clearer picture of my setup. For further tests I changed to a different 
system, which is an exact mirror of my previous test setup, except that 
it has SSDs. The generated query plan is the same, but the timing is of 
course a little bit better. For completeness, this is my original query 
now:


 SELECT te.id
 FROM treenode_edge te
 WHERE te.edge &&& 'LINESTRINGZ( -537284.0 912464.0 131705.0,
 1456444.0  -340912.0 131670.0)'
  AND ST_3DDWithin(te.edge, ST_MakePolygon(ST_GeomFromText(
   'LINESTRING( -537284.0 -340912.0 131687.5, 1456444.0 -340912.0 131687.5,
1456444.0  912464.0 131687.5, -537284.0  912464.0 131687.5,
-537284.0 -340912.0 131687.5)')), 17.5)
  AND te.project_id = 1;

Like before, I use &&& to get all edges whose bounding box intersects
the query bounding box. ST_3DDWithin is used to lower the number of
false positives, i.e. edges whose bounding box intersects the query box
even though the edge itself does not. The query returns 6799 rows and
runs in about 600ms:

 Bitmap Heap Scan on treenode_edge te  (cost=77.04..6555.06 rows=112 width=8) 
(actual time=595.209..614.640 rows=6799 loops=1)
   Recheck Cond: (edge &&& 
'0102800200886520C1A0D82B41C81300413C393641C0CE14C1B0120041'::geometry)
   Filter: ((edge && 
'01038001000500AB6520C106CF14C1B0120041AB6520C1C3D82B41B012004100804D393641C3D82B41C813004100804D39364106CF14C1C8130041AB6520C106CF14C1B0120041'::geometry)
 AND (project_id = 1) AND 
('01038001000500886520C1C0CE14C13C1300413C393641C0CE14C13C1300413C393641A0D82B413C130041886520C1A0D82B413C130041886520C1C0CE14C13C130041'::geometry
 && st_expand(edge, '17.5'::double precision)) AND _st_3ddwithin(edge, 
'01038001000500886520C1C0CE14C13C1300413C393641C0CE14C13C1300413C393641A0D82B413C130041886520C1A0D82B413C130041886520C1C0CE14C13C130041'::geometry,
 '17.5'::double precision))
   Rows Removed by Filter: 26
   Heap Blocks: exact=4120
   Buffers: shared hit=24703
   ->  Bitmap Index Scan on treenode_edge_gix  (cost=0.00..77.01 rows=1679 
width=0) (actual time=594.444..594.444 rows=6825 loops=1)
 Index Cond: (edge &&& 
'0102800200886520C1A0D82B41C81300413C393641C0CE14C1B0120041'::geometry)
 Buffers: shared hit=20583
 Planning time: 0.661 ms
 Execution time: 615.102 ms

The bitmap index scan on the ND GIST index is still the major problem.

On Thu, Jan 19, 2017 at 03:40:47PM +0100, Rémi Cura wrote:

- hardware : are your tables on SSD, does your index fit in RAM


Our production and test systems have the following setup; they are mainly 
used for the database and a web server (no other heavy processes, barely any load).


- 24 x Intel Xeon CPU E5-2630 v2 @ 2.60GHz
- 128GB RAM (all tables and indices fit into it)
- LSI MegaRAID SAS 2208, RAID 1, 1TB total, built from 2 x 1 TB Samsung 
 SSD 850 PRO 1TB (this is where the Postgres data directory is)


The Postgres data directory has a size of 40GB, and a complete dump of 
the database of interest is currently about 5GB. Of course we made sure 
the partition is mounted with the noatime option, and we use the ext4 
filesystem.



- postgres tuning : you have customised postgresql.conf to fit your
hardware


Yes, we modified it a bit. This is mainly what we changed:

max_connections = 250
shared_buffers = 16GB
work_mem = 256MB
maintenance_work_mem = 2GB
wal_buffers = 16MB
effective_cache_size = 16GB
default_statistics_target = 1000

These settings have worked well for us on most queries so far. But I wonder 
if fewer connections and more work_mem might be useful.
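
For instance, a hedged trade-off could look like the following sketch (the
exact numbers here are guesses for illustration, not recommendations):

```sql
-- Hypothetical: fewer connections, more per-sort/hash memory. work_mem is
-- per operation per backend, so max_connections bounds the worst-case total.
ALTER SYSTEM SET max_connections = '100';  -- needs a server restart
ALTER SYSTEM SET work_mem = '512MB';       -- picked up on reload
SELECT pg_reload_conf();
```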


For the sake of completeness, we also tried the parallel worker feature of
Postgres 9.6, but couldn't see any query using it (since no sequential 
scans were used in the first place, I believe).



- query writing : your query is optimally written (looks fine here)


Thanks, this improves my confidence that I didn't miss anything obvious 
on the query level. :-)



- data model (detailed after)


I add some more details below, but first I want to give some more
context: typically we deal with individual nodes, stored in a separate
treenode table (the result of the PostGIS query is then joined into it).
Each node has a parent (except the root node), and the edge between them
is an actual part of the neuron.

Re: [postgis-users] Improve 3D BB query on large set of lines

2017-01-19 Thread Rémi Cura
Hey,
This is an interesting problem for sure.
As always in the "slow query" topic,
improvement can come from a number of places:
 - hardware : are your tables on SSD, does your index fit in RAM
 - postgres tuning : you have customised postgresql.conf to fit your
hardware
 - query writing : your query is optimally written (looks fine here)
 - data model (detailed after)

If you want fast results, you need good indexing, and this is possible if
you exploit the sparseness of your data.
So where is the sparseness?




 * Scenario A.1: your usual slices affect a majority of trees
  - you don't really need an index on X,Y, but mainly on Z.
The idea is to separate your index into index(X,Y) + index(Z).
 Try adding an index on GIST(numrange(ST_ZMin, ST_ZMax)) (you could add
it anyway)
 and an index on GIST(geom) (i.e. 2D only).
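
As a sketch of that split-index idea (hedged: the numeric casts, the
inclusive range bounds, and the index names are my assumptions; the
coordinate values are just the ones from the query above):

```sql
-- Z-only range index: GiST over the edge's Z extent as a numrange
CREATE INDEX treenode_edge_z_range_gix
    ON treenode_edge
    USING gist (numrange(ST_ZMin(edge)::numeric, ST_ZMax(edge)::numeric, '[]'));

-- 2D-only index on the geometry (default 2D operator class)
CREATE INDEX treenode_edge_2d_gix
    ON treenode_edge
    USING gist (edge);

-- Query sketch combining both: Z range overlap plus 2D bounding box overlap.
-- The numrange expression must match the index expression exactly to be used.
SELECT te.id
FROM treenode_edge te
WHERE numrange(ST_ZMin(te.edge)::numeric, ST_ZMax(te.edge)::numeric, '[]')
      && numrange(131670.0, 131705.0, '[]')
  AND te.edge && ST_MakeEnvelope(-537284.0, -340912.0, 1456444.0, 912464.0);
```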
 * Scenario A.2: your usual slices affect few trees
  - you could easily perform a two-part query, where the first part uses an
index on tree bounding boxes, and the second performs the query on edges. This
requires creating and maintaining a tree_bounding_box table with appropriate
indexes, which is quite easy with triggers. Example (not optimised):
   WITH filtering_by_tree AS (
 SELECT DISTINCT tree_id
 FROM my_sync_table_of_tree_bounding_boxes AS tree
 WHERE tree.bbox &&& your_slice
   )
   SELECT edge_id
 FROM treenode_edge AS te
 WHERE te.geom &&& your_slice
  AND EXISTS (SELECT 1 FROM filtering_by_tree AS ft WHERE ft.tree_id =
te.tree_id)
  - you could use pgpointcloud to similar effect by constructing a patch
per tree.
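
A trigger to keep such a tree_bounding_box table in sync could be sketched as
below. This is only a sketch under assumptions: it presumes a tree_id column
on treenode_edge (not shown in the schema later in this thread), and it stores
each tree's 3D bounding box as its diagonal via ST_BoundingDiagonal, so the
&&& test above works directly against it.

```sql
-- Sketch only: tree_id on treenode_edge is hypothetical here.
CREATE TABLE tree_bounding_box (
    tree_id bigint PRIMARY KEY,
    bbox geometry(LineStringZ)  -- diagonal of the tree's 3D bounding box
);
CREATE INDEX tree_bounding_box_gix
    ON tree_bounding_box USING gist (bbox gist_geometry_ops_nd);

CREATE OR REPLACE FUNCTION sync_tree_bbox() RETURNS trigger AS $$
BEGIN
    INSERT INTO tree_bounding_box (tree_id, bbox)
    VALUES (NEW.tree_id, ST_BoundingDiagonal(NEW.edge))
    ON CONFLICT (tree_id) DO UPDATE
        -- grow the stored box to also cover the new edge
        SET bbox = ST_BoundingDiagonal(
                       ST_Collect(tree_bounding_box.bbox, EXCLUDED.bbox));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tree_bbox_sync
    AFTER INSERT ON treenode_edge
    FOR EACH ROW EXECUTE PROCEDURE sync_tree_bbox();
```

Deletes and updates would need shrinking logic (e.g. recomputing the box for
the affected tree), which is omitted in this sketch.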

Now something is very ambiguous in your data, and it could have major
consequences:
 * Are the edges between nodes that represent a neuron __only logical__ (they
are straight lines, always the same Zmax-Zmin, and represent a conceptual
graph), or are they polylines that represent the __actual__ neuron path?

 * Scenario B.1.1 : edges are only logical and have a constant Zmax-Zmin,
i.e. nodes have quantized Z :
  - you don't need edges at all for indexing, you should work only on nodes
(i.e. index nodes in Z + (X,Y), find nodes, then find relevant edges). The
great advantage in this case is that you can use the full power of
pgpointcloud to create meaningful patches (of variable size, for instance;
see 3.1, "Adapt patch grouping rules"), which makes scaling into the range
of billions of points possible.

  - you could try to think of using the dual of your current graph (edges
become nodes, nodes become edges).

 * Scenario B.1.2 : edges are only logical, but don't have a constant
Zmax-Zmin : you could insert (virtual) nodes so you are again in the same
case as scenario B.1.1.

 * Scenario B.2 : edges are actual neuron paths.
  - you could construct intermediary objects (groups of neurons) of
variable size (those within cubes), so as to again have a two-step query :
first look for groups of neurons, then for the neurons inside.

 * Scenario C : your data doesn't have obvious sparseness.
  - use the same strategy you currently use, but scale by partitioning
your data into numerous tables.
   You can see the partitioning entry of the Postgres manual.
   This would be much easier to create and maintain if you could partition
on the Z dimension only.
   Note that you will need to duplicate (and then de-duplicate) some edges.
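
On PostgreSQL 9.6 this would be inheritance-based partitioning with CHECK
constraints; a minimal sketch follows (the Z boundary of 100000 and the table
names are arbitrary example values):

```sql
-- Sketch: Z-partitioning via inheritance (9.6 has no declarative partitioning).
-- Edges crossing the boundary would need to go into both children (the
-- duplication mentioned above), routed by an INSERT trigger omitted here.
CREATE TABLE treenode_edge_z_lo (
    CHECK (ST_ZMin(edge) < 100000)
) INHERITS (treenode_edge);

CREATE TABLE treenode_edge_z_hi (
    CHECK (ST_ZMax(edge) >= 100000)
) INHERITS (treenode_edge);

-- Each partition gets its own ND GiST index, like the parent's.
CREATE INDEX treenode_edge_z_lo_gix
    ON treenode_edge_z_lo USING gist (edge gist_geometry_ops_nd);
CREATE INDEX treenode_edge_z_hi_gix
    ON treenode_edge_z_hi USING gist (edge gist_geometry_ops_nd);
```

One caveat: constraint exclusion only prunes partitions when the query's WHERE
clause provably implies the CHECK expressions, which is restrictive for
function-based constraints like these.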


It's hard to recommend anything without further information on your
data / your requirements / your resources.

Cheers,
Rémi-C

2017-01-17 22:43 GMT+01:00 Tom Kazimiers :

[postgis-users] Improve 3D BB query on large set of lines

2017-01-17 Thread Tom Kazimiers

Hi all,

I am using Postgres 9.6 with PostGIS 2.3 and want to improve the 
performance of a bounding box intersection query on a large set of 
lines. I asked the same question two months ago on 
gis.stackexchange.com, but didn't get any reply. Therefore, please 
excuse the cross-posting, but I figured this list would be a good 
place to ask. This is the original post:


 https://gis.stackexchange.com/questions/215762

Problem: We store about 6 million lines/edges in 3D space and use an
axis-aligned bounding box to find all intersecting (and included) lines. This
works well, but we want to improve the query time for larger point sets.

Context: Tree-like 3D structures represent neurons; each node has a parent node
or it is the root. At the moment we deal with about 15 million nodes, grouped
into 15 trees (many > 1 nodes). I want to improve existing performance
bottlenecks with bigger result sets, plus I plan to scale this setup to 10-100x
the nodes. Typical query bounding boxes are slices rather than boxes, 
i.e. they span an arbitrary range in XY but only a short distance in Z, 
though in principle any dimension could be the "flat" one.


Setup: The table storing edges looks like this:

 =>\d+ treenode_edge
                 Table "public.treenode_edge"
    Column    |          Type         | Modifiers
  ------------+-----------------------+-----------
   id         | bigint                | not null
   project_id | integer               | not null
   edge       | geometry(LineStringZ) |
  Indexes:
   "treenode_edge_pkey" PRIMARY KEY, btree (id)
   "treenode_edge_gix" gist (edge gist_geometry_ops_nd)
   "treenode_edge_project_id_index" btree (project_id)

Note that there is a 3D index for the edge column in place (treenode_edge_gix).

Current timing: To request 2000 nodes with a typical bounding box size I use
the following query:

 SELECT te.id
 FROM treenode_edge te
 -- left bottom z2, right top z1
 WHERE te.edge &&& 'LINESTRINGZ( -537284.0  699984.0 84770.0,
                                 1456444.0 -128432.0 84735.0)'
   -- left top halfz, right top halfz,
   -- right bottom halfz, left bottom halfz,
   -- left top halfz; halfz (distance)
   AND ST_3DDWithin(te.edge, ST_MakePolygon(ST_GeomFromText(
    'LINESTRING( -537284.0 -128432.0 84752.5, 1456444.0 -128432.0 84752.5,
                 1456444.0  699984.0 84752.5, -537284.0  699984.0 84752.5,
                 -537284.0 -128432.0 84752.5)')), 17.5)
   AND te.project_id = 1
 LIMIT 2000;

This takes about 900ms; statistics are up to date. This is not so bad, 
but I am looking for strategies to make this much faster still. The &&& 
operator already filters out most of the existing edges by bounding box 
test (using the index), so that ST_3DDWithin only needs to check the 
edges that are most likely part of the result.


The query plan looks like this:

 Limit  (cost=48.26..4311.24 rows=70 width=8) (actual time=856.261..864.208 
rows=2000 loops=1)
 Buffers: shared hit=20470
 ->  Bitmap Heap Scan on treenode_edge te  (cost=48.26..4311.24 rows=70 
width=8) (actual time=856.257..863.974 rows=2000 loops=1)
   Recheck Cond: (edge &&& 
'0102800200886520C1A05C254120B2F4403C393641005BFFC0F0AFF440'::geometry)
   Filter: ((edge && 
'01038001000500AB6520C1185CFFC0F0AFF440AB6520C1C35C2541F0AFF44000804D393641C35C254120B2F44000804D393641185CFFC020B2F440AB6520C1185CFFC0F0AFF440'::geometry)
 AND (project_id = 1) AND 
('01038001000500886520C1005BFFC008B1F4403C393641005BFFC008B1F4403C393641A05C254108B1F440886520C1A05C254108B1F440886520C1005BFFC008B1F440'::geometry
 && st_expand(edge, '17.5'::double precision)) AND _st_3ddwithin(edge, 
'01038001000500886520C1005BFFC008B1F4403C393641005BFFC008B1F4403C393641A05C254108B1F440886520C1A05C254108B1F440886520C1005BFFC008B1F440'::geometry,
 '17.5'::double precision))
   Heap Blocks: exact=1816
   Buffers: shared hit=20470
   ->  Bitmap Index Scan on treenode_edge_gix  
(cost=0.00..48.25 rows=1044 width=0) (actual time=855.795..855.795 rows=2856 
loops=1)
 Index Cond: (edge &&& 
'0102800200886520C1A05C254120B2F4403C393641005BFFC0F0AFF440'::geometry)
 Buffers: shared hit=18654
  Planning time: 3.467 ms