GitHub user jsell-rh edited a discussion: How to Optimize Query Performance 
With Large Edge Label Count?

## Background
We're building an experimental "property graph as a service" called 
[Kartograph](https://github.com/openshift-hyperfleet/kartograph). We've been 
building around Apache AGE as our graph database, and have followed the 
recommended approach for applying indexes to ids, properties, and start/end ids 
[for edges].

Given the nature of the application, the schema of the graph(s) cannot be known 
ahead of time. Further, we want to enable users to build graphs that suite 
their use-case, even if they're not optimal in terms of providing good graph 
query performance.

## The Problem
We've noticed that as we've gotten to this scale,
~50k nodes
~250k edges
~4k edge labels

query performance for multi-hop, broad (no edge label specified) graph 
traversals is quite slow (10-30 seconds). For example,
```SQL
EXPLAIN ANALYZE SELECT * FROM cypher('kartograph_graph', $$
  MATCH (n)-[r1]->(m)-[r2]->(o)
  RETURN {
    start: labels(n),
    hop1: type(r1),
    middle: labels(m),
    hop2: type(r2),
    end: labels(o)
  }
  LIMIT 5
$$) AS (r agtype);

-[ RECORD 1 
]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Limit  (cost=0.00..0.49 rows=5 width=32) (actual 
time=22.066..63.774 rows=5 loops=1)
-[ RECORD 2 
]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN |   ->  Nested Loop  (cost=0.00..1402544173936438673408.00 
rows=14215878455608488230912 width=32) (actual time=22.064..63.769 rows=5 
loops=1)
-[ RECORD 3 
]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN |         ->  Nested Loop  (cost=0.00..85282993755341904.00 
rows=1453842801965249792 width=240) (actual time=21.633..50.280 rows=5 loops=1)
-[ RECORD 4 
]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN |               ->  Nested Loop  (cost=0.00..8723524158570.23 
rows=148682960354960 width=186) (actual time=16.357..36.170 rows=5 loops=1)
-[ RECORD 5 
]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN |                     ->  Nested Loop  (cost=0.00..2615046062.49 
rows=15205648554 width=132) (actual time=10.891..21.698 rows=5 loops=1)
-[ RECORD 6 
]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN |                           ->  Append  (cost=0.00..80532.41 
rows=3020495 width=66) (actual time=0.035..0.150 rows=3 loops=1)
-[ RECORD 7 
]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN |                                 ->  Seq Scan on _ag_label_edge 
r1_1  (cost=0.00..0.00 rows=1 width=56) (actual time=0.026..0.026 rows=0 
loops=1)
-[ RECORD 8 
]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN |                                 ->  Seq Scan on 
"HAS_PROVIDER_CONFIG" r1_2  (cost=0.00..19.70 rows=970 width=56) (actual 
time=0.008..0.010 rows=3 loops=1)
-[ RECORD 9 
]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN |                                 ->  Seq Scan on "GRANTED_BY" r1_3  
(cost=0.00..5.10 rows=110 width=192) (never executed)

...

-[ RECORD 22516 
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN |               ->  Index Scan using 
"idx_kartograph_graph_StorageProtocol_id_btree" on "StorageProtocol" o_1690  
(cost=0.15..0.26 rows=6 width=40) (actual time=0.002..0.002 rows=0 loops=4)
-[ RECORD 22517 
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN |                     Index Cond: (id = r2.end_id)
-[ RECORD 22518 
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN |               ->  Index Scan using 
"idx_kartograph_graph_ErrorHandlingMode_id_btree" on "ErrorHandlingMode" o_1691 
 (cost=0.15..0.26 rows=6 width=40) (actual time=0.001..0.001 rows=0 loops=4)
-[ RECORD 22519 
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN |                     Index Cond: (id = r2.end_id)
-[ RECORD 22520 
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN |               ->  Index Scan using 
"idx_kartograph_graph_ContainerfileInstruction_id_btree" on 
"ContainerfileInstruction" o_1692  (cost=0.15..0.26 rows=6 width=40) (actual 
time=0.002..0.002 rows=0 loops=4)
-[ RECORD 22521 
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN |                     Index Cond: (id = r2.end_id)
-[ RECORD 22522 
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Planning Time: 15306.662 ms
-[ RECORD 22523 
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Execution Time: 169.827 ms
```

Note that the planning time is the long pole in the tent: 
```sql
QUERY PLAN | Planning Time: 15306.662 ms
-[ RECORD 22523 
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Execution Time: 169.827 ms
```

## The Question(s)

(I know there are a lot here, but I wanted to capture what has been floating 
around my head. Some questions may have false premises, feel free to point 
these out!)

- Is there a practical upper-limit recommended for label counts in Apache AGE, 
given the underlying `label=table` architecture?
- Does the above-mentioned performance track with what you've experienced?
- Any ideas on how we might optimized for these types of queries? 
  -  We're considering:
      -  Views/caching layers that can provide the labels connected to a 
certain node label so that queries can quickly specify only the connected 
labels.
      - Modifying the graph creation methodology to prioritize minimal label 
counts.
- Does Apache AGE do any in-memory graph operations on top of what Postgres 
does? 
- Did Apache AGE consider another label storage pattern, one that is less 
table-intensive? Curious what other patterns were explored, and what trade-offs 
were considered/made.

GitHub link: https://github.com/apache/age/discussions/2314

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to