A couple things...

I recommend using longer names than 1 letter for your aliases, what you save in 
typing you lose a couple times over again when wondering what "r" is or why "t" 
has anything to do with "nodes"

In your CTE you're doing a 3 table join. There's no need to include the nodes 
table in there at all, you can get the node ID from the edge table.
...union all select e.child, top from r, edges as e where e.parent = r.id)...

The big thing though is in the where clause.
where...and id not in (select parent from edges where parent = id)...

The "where parent = id" bit turns that into a <correlated> sub query, meaning 
it runs it once for each row.
It's kind of saying "where X is in the bucket of (X's that are in the bucket of 
Y's)"

If you get rid of that...
where...and id not in (select parent from edges)...
It becomes a non-correlated sub query which only has to be run once. "where X 
is in the bucket of Y's"


Making those 2 tweaks to your query let's look at the plans. You can see we 
drop the search of the nodes table, and subquery 4 is no longer correlated.

Old:
sqlite> explain query plan with recursive r (id, top) as (select id, id from 
nodes union all select t.id, top from nodes as t, edges as e, r where e.parent 
= r.id and t.id = e.child) select top, count(*) from r where top != id and id 
not in (select parent from edges where parent = id) group by top;
selectid|order|from|detail
2|0|0|SCAN TABLE nodes
3|0|1|SCAN TABLE edges AS e
3|1|0|SEARCH TABLE nodes AS t USING INTEGER PRIMARY KEY (rowid=?)
3|2|2|SCAN TABLE r
1|0|0|COMPOUND SUBQUERIES 0 AND 0 (UNION ALL)
0|0|0|SCAN SUBQUERY 1
0|0|0|EXECUTE CORRELATED LIST SUBQUERY 4
4|0|0|SCAN TABLE edges
0|0|0|USE TEMP B-TREE FOR GROUP BY


New:
sqlite> explain query plan with recursive r (id, top) as (select id, id from 
nodes union all select e.child, top from edges as e, r where e.parent = r.id) 
select top, count(*) from r where top != id and id not in (select parent from 
edges) group by top;
selectid|order|from|detail
2|0|0|SCAN TABLE nodes
3|0|0|SCAN TABLE edges AS e
3|1|1|SCAN TABLE r
1|0|0|COMPOUND SUBQUERIES 0 AND 0 (UNION ALL)
0|0|0|SCAN SUBQUERY 1
0|0|0|EXECUTE LIST SUBQUERY 4
4|0|0|SCAN TABLE edges
0|0|0|USE TEMP B-TREE FOR GROUP BY


Now give your modified query a go and let me know how it compares to what I 
came up with.


-----Original Message-----
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of Shane Dev
Sent: Wednesday, January 03, 2018 12:45 AM
To: SQLite mailing list
Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.

Hi David,

Nice work! your query is far quicker than mine- even without the
reverseEdges index. I think you are right about the problem of potentially
double counting leaves. There weren't any multi-parent nodes in my test
data so I didn't notice this mistake.

Could you please explain why your query is so much faster?

On 2 January 2018 at 17:50, David Raymond <david.raym...@tomtom.com> wrote:

> I think you need a union there instead of a union all. Otherwise you're
> double (or more) counting leaves where there is more than 1 path to get to
> the leaf.
>
> I don't have a large dataset to test it on, but how about something like:
>
> create table nodes
> (
>   id integer primary key,
>   description text
> );
>
> create table edges
> (
>   parent int not null references nodes,
>   child int not null references nodes,
>   primary key (parent, child),
>   check (parent != child)
> ) without rowid;
> create index reverseEdges on edges (child, parent);
>
> create view leafCounts as with recursive
> leaves (id) as (
>   select nodes.id
>   from nodes left outer join edges
>   on nodes.id = edges.parent
>   where edges.parent is null
> ),
> paths (parent, child) as (
>   select parent, child from edges
>   union
>   select paths.parent, edges.child
>   from paths inner join edges
>   on paths.child = edges.parent
> )
> select parent, count(*) as leafCount
> from paths
> where child in leaves
> group by parent;
>
>
> -----Original Message-----
> From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org]
> On Behalf Of Shane Dev
> Sent: Monday, January 01, 2018 11:14 AM
> To: SQLite mailing list
> Subject: [sqlite] Efficient query to count number of leaves in a DAG.
>
> Hi,
>
> I want to the count the number of leaves (descendants without children) for
> each node in a DAG
>
> DAG definition -
>
> CREATE TABLE nodes(id integer primary key, description text);
> CREATE TABLE edges(parent not null references nodes, child not null
> references nodes, primary key(parent, child));
>
> My query -
>
> CREATE VIEW v_count_leaves as with recursive r(id, top) as (
> select id, id from nodes
> union all
> select t.id, top from nodes as t, edges as e, r where e.parent=r.id and
> t.id
> =e.child)
> select top, count(*) from r where top<>id and id not in (select parent from
> edges where parent=id) group by top;
>
> It seems to work but is complex to understand and debug despite my aim to
> keep it simple as possible, but more importantly - it is very slow when
> there are more than a few thousand nodes and edges.
>
> It there a more efficient (and ideally simpler) way?
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to