GitHub user jsell-rh edited a discussion: How to Optimize Query Performance With Large Edge Label Count?
## Background We're building an experimental "property graph as a service" called [Kartograph](https://github.com/openshift-hyperfleet/kartograph). We've been building around Apache AGE as our graph database, and have followed the recommended approach for applying indexes to ids, properties, and start/end ids [for edges]. Given the nature of the application, the schema of the graph(s) cannot be known ahead of time. Further, we want to enable users to build graphs that suit their use-case, even if they're not optimal in terms of providing good graph query performance. ## The Problem We've noticed that as we've gotten to this scale, ~50k nodes ~250k edges ~3k edge labels query performance for multi-hop, broad (no edge label specified) graph traversals is quite slow (10-30 seconds). For example, ```SQL EXPLAIN ANALYZE SELECT * FROM cypher('kartograph_graph', $$ MATCH (n)-[r1]->(m)-[r2]->(o) RETURN { start: labels(n), hop1: type(r1), middle: labels(m), hop2: type(r2), end: labels(o) } LIMIT 5 $$) AS (r agtype); -[ RECORD 1 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | Limit (cost=0.00..0.49 rows=5 width=32) (actual time=22.066..63.774 rows=5 loops=1) -[ RECORD 2 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | -> Nested Loop (cost=0.00..1402544173936438673408.00 rows=14215878455608488230912 width=32) (actual time=22.064..63.769 rows=5 loops=1) -[ RECORD 3 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | -> Nested Loop (cost=0.00..85282993755341904.00 rows=1453842801965249792 width=240) (actual time=21.633..50.280 rows=5 loops=1) -[ RECORD 4 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | -> Nested Loop (cost=0.00..8723524158570.23 rows=148682960354960 width=186) (actual time=16.357..36.170 rows=5 loops=1) -[ RECORD 5 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | -> Nested Loop (cost=0.00..2615046062.49 rows=15205648554 width=132) (actual time=10.891..21.698 rows=5 loops=1) -[ RECORD 6 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | -> Append (cost=0.00..80532.41 rows=3020495 width=66) (actual time=0.035..0.150 rows=3 loops=1) -[ RECORD 7 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | -> Seq Scan on _ag_label_edge r1_1 (cost=0.00..0.00 rows=1 width=56) (actual time=0.026..0.026 rows=0 loops=1) -[ RECORD 8 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | -> Seq Scan on "HAS_PROVIDER_CONFIG" r1_2 (cost=0.00..19.70 rows=970 width=56) (actual time=0.008..0.010 rows=3 loops=1) -[ RECORD 9 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | -> Seq Scan on "GRANTED_BY" r1_3 (cost=0.00..5.10 rows=110 width=192) (never executed) ... -[ RECORD 22516 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | -> Index Scan using "idx_kartograph_graph_StorageProtocol_id_btree" on "StorageProtocol" o_1690 (cost=0.15..0.26 rows=6 width=40) (actual time=0.002..0.002 rows=0 loops=4) -[ RECORD 22517 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | Index Cond: (id = r2.end_id) -[ RECORD 22518 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | -> Index Scan using "idx_kartograph_graph_ErrorHandlingMode_id_btree" on "ErrorHandlingMode" o_1691 (cost=0.15..0.26 rows=6 width=40) (actual time=0.001..0.001 rows=0 loops=4) -[ RECORD 22519 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | Index Cond: (id = r2.end_id) -[ RECORD 22520 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | -> Index Scan using "idx_kartograph_graph_ContainerfileInstruction_id_btree" on "ContainerfileInstruction" o_1692 (cost=0.15..0.26 rows=6 width=40) (actual time=0.002..0.002 rows=0 loops=4) -[ RECORD 22521 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | Index Cond: (id = r2.end_id) -[ RECORD 22522 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | Planning Time: 15306.662 ms -[ RECORD 22523 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | Execution Time: 169.827 ms ``` Note that the planning time is the long pole in the tent: ```sql QUERY PLAN | Planning Time: 15306.662 ms -[ RECORD 22523 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- QUERY PLAN | Execution Time: 169.827 ms ``` ## The Question(s) (I know there are a lot here, but I wanted to capture what has been floating around my head. Some questions may have false premises, feel free to point these out!) **Primary concern:** - With >=3k edge label tables, is Apache AGE's `label=table` architecture fundamentally limiting for our scale? - Does the above-mentioned performance track with what you've experienced? **Architecture & Design:** - Did AGE consider alternative label storage patterns (e.g., label as property with indexes)? I'm curious to learn more about the trade-offs. - Does AGE perform any in-memory query optimization or caching beyond standard Postgres? **Optimization strategies:** - Any ideas on how we might optimized for these types of queries? - We're considering: - Views/caching layers that can provide the labels connected to a certain node label so that queries can quickly specify only the connected labels. - Modifying the graph creation methodology to prioritize minimal label counts. GitHub link: https://github.com/apache/age/discussions/2314 ---- This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
