[GitHub] [age] rafsun42 commented on issue #1004: Research multi-label

via GitHub Thu, 03 Aug 2023 14:56:25 -0700


rafsun42 commented on issue #1004:
URL: https://github.com/apache/age/issues/1004#issuecomment-1664693751


   # Benchmark test of multi-label (as an array of label ids)
   
   ## Structure of vertex table
   
   ```sql
   CREATE TABLE intarray_test.vertex_default
   (
       id bigint,
       label_ids int4[] NOT NULL
       -- property column not included
   );
   ```
   
   ## Populating the table
   
   * Number of vertices (rows) inserted is: 10,000,000 (ten million).
   
   * Number of labels each vertex have is: between 1 to 4.
   
   * Label IDs range from 1 to 9.
   
   * Sample of table content:
   
   ```
   (pg-13.9) agedev=*# SELECT * FROM intarray_test.vertex_default LIMIT 10;
      id    | label_ids 
   ---------+-----------
    9891835 | {8,3}
    9891836 | {1,3,6}
    9891837 | {2}
    9891838 | {6,4,1,5}
    9891839 | {9,5,4,1}
    9891840 | {9,4,5}
    9891841 | {4,8}
    9891842 | {9,8,4}
    9891843 | {1,4}
    9891844 | {5,7,4,1}
   (10 rows)
   ```
   
   ## Queries that are tested
   
   1. `MATCH (:1)`,
   Consider 1 as both label ID and label name.
   
   2. `MATCH (:1|2)`.
   Matches 1 or 2. Similar to union.
   
   3. `MATCH(:1:2)`.
   Matches 1 and 2. Similar to intersection.
   
   4. Matches exactly 1 and 2. Similar to equality.
   
   ```sql
   EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT *
   FROM intarray_test.vertex_default WHERE '{1}'::int4[] <@ label_ids;
   
   EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT *
   FROM intarray_test.vertex_default WHERE '{1}'::int4[] <@ label_ids OR 
'{2}'::int4[] <@ label_ids ;
   
   EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT *
   FROM intarray_test.vertex_default WHERE '{1,2}'::int4[] <@ label_ids;
   
   EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT *
   FROM intarray_test.vertex_default WHERE '{1,2}'::int4[] <@ label_ids AND 
label_ids <@ '{1,2}'::int4[];
   ```
   
   ## Use of index and cluster
   
   ### Index
   
   A table `vertex_indexed` is created. `Gist` index is added on `label_ids`.
   
   ### Cluster
   
   A table `vertex_clustered` is created. `Gist` index is also added on 
`label_ids`.
   
   Additionally, the content of the table is clustered (ordering of physical 
table) by label ids:
   
   * Smaller length of array label_ids goes first.
   
   * For similar length of array label_ids, order by the content of the array 
in descending order.
   
   For example, physical order of label_id column would look like this in the 
clustered table:
   ```
   {1}
   {1}
   {2}
   {2}
   {3}
   {1,2}
   {1,2}
   {1,2}
   {1,3}
   {1,3}
   {2,4}
   {3,8}
   {1,2,8}
   {1,4,5}
   {4,5,6}
   ...
   ```
   
   ## Benchmark result
   
   The above queries are run on the 3 tables.
   
   Note: the `intarray` module needs to be installed to run the queries.
   
   Execution time retrieved from the EXPLAIN ANALYZE command is provided below:
   
   |   | vertex_default  | vertex_indexed  | vertex_clustered  |
   |---|---|---|---|
   |  Query 1 | 11198.419 ms  | 8218.173 ms  | 4909.334 ms  |
   |  Query 2 |  18695.878 ms | 15470.613 ms (seq scan) | 9666.337 ms  |
   |  Query 3 | 11850.516 ms  | 7518.450 ms  | 1760.783 ms  |
   |  Query 4 |  5664.825 ms (para seq scan) |  9095.456 ms |  1746.895 ms |
   
   
   ## Conclusion
   
   Clustered table with Gist index performs better other two tables.
   
   Caveat is clustering process takes time. But, it is useful for analytical 
databases where data is read-only. So, the table needs to be clustered only 
once. On the other hand, transactional databases, where CREATE and SET are 
frequent, re-clustering after every insert is not a usable solution.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@age.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [age] rafsun42 commented on issue #1004: Research multi-label

Reply via email to