Re: [I] Major Performance Difference: SQL vs. Cypher for Aggregation/Ordering [age]

via GitHub Mon, 21 Jul 2025 12:37:38 -0700


MuhammadTahaNaveed commented on issue #2194:
URL: https://github.com/apache/age/issues/2194#issuecomment-3098181976


   @serdarmicrosoft In your SQL query, the tables are specified, but in Cypher 
query, the target label is not specified which causes age to scan all the 
vertex labels.
   
   **SQL query plan:**
   ```
   db=# EXPLAIN select count(*), u.id from issue_2194."User" u, 
issue_2194."HAS_INTERACTION" h, issue_2194."Book" b where u.id = h.start_id and 
b.id = h.end_id GROUP BY u.id ORDER BY 1 DESC LIMIT 10;
                                                      QUERY PLAN                
                                    
   
-----------------------------------------------------------------------------------------------------------------
   ..........................
                              ->  Seq Scan on "User" u  (cost=0.00..22.00 
rows=1200 width=8)
                              ...........
                                          ->  Seq Scan on "HAS_INTERACTION" h  
(cost=0.00..19.70 rows=970 width=16)
                                          .........
                                          ->  Seq Scan on "Book" b  
(cost=0.00..22.00 rows=1200 width=8)
   ```
   
   **Cypher query plan:**
   ```
   db=# SELECT * FROM cypher('issue_2194', $$EXPLAIN MATCH 
(u:User)-[:HAS_INTERACTION]->() RETURN u.user_id, count(*) ORDER BY count(*) 
DESC LIMIT 10 $$) as (user_id agtype, count agtype);
                                                                                
       QUERY PLAN                                                               
                      
   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   ............................................
                                                ->  Parallel Append  
(cost=0.00..78.24 rows=2001 width=8)
                                                      ->  Parallel Seq Scan on 
"Book" _age_default_alias_1_2  (cost=0.00..17.06 rows=706 width=8)
                                                      ->  Parallel Seq Scan on 
"Author" _age_default_alias_1_3  (cost=0.00..17.06 rows=706 width=8)
                                                      ->  Parallel Seq Scan on 
"Review" _age_default_alias_1_4  (cost=0.00..17.06 rows=706 width=8)
                                                      ->  Parallel Seq Scan on 
"User" _age_default_alias_1_5  (cost=0.00..17.06 rows=706 width=8)
                                                      ->  Parallel Seq Scan on 
_ag_label_vertex _age_default_alias_1_1  (cost=0.00..0.00 rows=1 width=8)
                                                      .....
                                                            ->  Seq Scan on 
"HAS_INTERACTION" _age_default_alias_0  (cost=0.00..19.70 rows=970 
                                                            ..........
                                                            ->  Seq Scan on 
"User" u  (cost=0.00..22.00 rows=1200 width=40)
   
   ```
   
   Potential optimization is to **explicitly include the label on the target 
node**:
   
   ```
   SELECT * FROM cypher('graph', $$ 
       EXPLAIN 
       MATCH (u:User)-[:HAS_INTERACTION]->(:Book) 
       RETURN u.user_id, count(*) 
       ORDER BY count(*) DESC 
       LIMIT 10 
   $$) AS (user_id agtype, count agtype);
   ```
   
   That said, I believe AGE should ideally be smart enough to infer and apply 
such label constraints automatically to reduce unnecessary scans when label 
filters are present elsewhere in the pattern.
   
   Let us know if that improves some performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] Major Performance Difference: SQL vs. Cypher for Aggregation/Ordering [age]

Reply via email to