panosfol commented on issue #995:
URL: https://github.com/apache/age/issues/995#issuecomment-1637134723
> **3. filter_vertices_on_label_id**
>
> This function is used internally by the following query:
>
> ```sql
> MATCH (:Person)-[e:IN]->(:Title{primaryTitle:'Inception'}) RETURN e
> ```
>
> The QPT is:
>
> ```
> Gather  (cost=569028.74..853951.48 rows=105 width=32)
>   Workers Planned: 2
>   ->  Parallel Hash Join  (cost=568028.74..852940.98 rows=44 width=32)
>         Hash Cond: (e.end_id = _age_default_alias_0.id)
>         ->  Parallel Seq Scan on "IN" e  (cost=0.00..284748.30 rows=43610 width=29)
>               Filter: ((_extract_label_id(start_id))::integer = 4)
>         ->  Parallel Hash  (cost=567965.30..567965.30 rows=5075 width=8)
>               ->  Parallel Seq Scan on "Title" _age_default_alias_0  (cost=0.00..567965.30 rows=5075 width=8)
>                     Filter: (properties @> agtype_build_map('primaryTitle'::text, '"Inception"'::agtype))
> ```
>
> This function adds a filter condition to the query plan. In the QPT above, the line `Filter: ((_extract_label_id(start_id))::integer = 4)` is built by it.
>
> Because `Person` is filtered only by label (i.e. `(:Person)`) and no property filter or variable is used, the `Person` table is not joined with the `IN` table internally. `_extract_label_id` can tell which label `start_id` belongs to, which eliminates the join.
>
> In order to drop the concept of `graphid`, we will need to stop using the function `_extract_label_id`. One alternative is to actually perform the join, except not with the `Person` table itself: a duplicate of the `Person` table can be used, trimmed to only the ID column and indexed strategically to reduce the join time.
> So, the `Person` table will have a duplicate, `Person_hash`, containing only the ID column. Dropping the properties column makes the join faster, since each disk read can now load more rows. The ID column is then indexed with the hash method so that a hash join can be performed. With less data to load from disk plus the hash index, this join can be made faster than a regular join (like the one performed on `Title`).
>
> @panosfol @Zainab-Saad @WendelLana @CapnSpek What do you guys think of this solution? Can the other two functions that you researched adopt it?
From my understanding, this solution would require us to know the `label_name` in order to find the correct duplicate table. I've therefore researched which of the above 3 functions (`entity_exists()`, `get_label_name()`, `filter_vertices_on_label_id()`) have access to the `label_name`, and in what context they are called.
First, `entity_exists()` is called by `merge_vertex()` and `create_vertex()`, both of which take a `cypher_target_node` struct as an argument. That struct has a `char *label_name` field, so we can in fact access the `label_name` in the context where `entity_exists()` is called.
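For context, `merge_vertex()` (and with it `entity_exists()`) is exercised by queries like the following illustrative example, where `entity_exists()` checks whether the matched vertex already exists before `MERGE` decides to create it (the graph name and property are assumptions for illustration):

```sql
-- MERGE goes through merge_vertex(), which calls entity_exists() to decide
-- whether to reuse the matched vertex or create a new one.
SELECT * FROM cypher('graph_name', $$
    MERGE (p:Person {name: 'Christopher Nolan'})
    RETURN p
$$) AS (p agtype);
```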
`filter_vertices_on_label_id()` takes `char *label` as an argument, so we already have the `label_name` inside the function. Finally, `get_label_name()` is called by `age_startnode()` and `age_endnode()` in the executor stage, and there is no way for it to access the `label_name` without using the `graphid`. The problem is that `age_startnode()` and `age_endnode()` are not internal C functions: they take an `edge` and return either its start node or its end node. The only way to accomplish that right now is through the node's `id`, because that is the only information the `edge` holds about its two vertices.
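For reference, `age_startnode()` and `age_endnode()` back the user-facing `startNode()`/`endNode()` Cypher functions, so the lookup in question happens in queries like this (the graph name is an illustrative assumption):

```sql
-- startNode()/endNode() resolve the edge's start_id/end_id into full vertices;
-- today that resolution relies on the graphid-based get_label_name() lookup.
SELECT * FROM cypher('graph_name', $$
    MATCH ()-[e:IN]->()
    RETURN startNode(e), endNode(e)
$$) AS (start_v agtype, end_v agtype);
```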
To remove the `graphid` utility completely, we need to come up with a mechanism specifically for the `edge` structure, because `get_label_name()` is called by the two functions (`age_startnode()`, `age_endnode()`) that take only an edge as an argument. Alternatively, those two functions could be changed or removed entirely.
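For concreteness, the quoted duplicate-table idea could be sketched in plain SQL along these lines (a rough sketch: the schema name `g`, the table name `Person_hash`, and the index name are assumptions for illustration, not existing AGE objects):

```sql
-- Hypothetical duplicate of the "Person" vertex table, trimmed to the id column.
CREATE TABLE g."Person_hash" AS
    SELECT id FROM g."Person";

-- Index the id column with the hash method for fast equality lookups.
CREATE INDEX person_hash_id_idx ON g."Person_hash" USING hash (id);

-- The join that would replace the _extract_label_id() filter on "IN".start_id:
SELECT e.*
FROM g."IN" e
JOIN g."Person_hash" p ON p.id = e.start_id;
```

Whether the planner actually uses the hash index or builds an in-memory hash (as in the QPT above) depends on row counts, but the narrow duplicate table reduces the disk I/O either way.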