Hi David, Nice work! your query is far quicker than mine- even without the reverseEdges index. I think you are right about the problem of potentially double counting leaves. There weren't any multi-parent nodes in my test data so I didn't notice this mistake.
Could you please explain why your query is so much faster? On 2 January 2018 at 17:50, David Raymond <david.raym...@tomtom.com> wrote: > I think you need a union there instead of a union all. Otherwise you're > double (or more) counting leaves where there is more than 1 path to get to > the leaf. > > I don't have a large dataset to test it on, but how about something like: > > create table nodes > ( > id integer primary key, > description text > ); > > create table edges > ( > parent int not null references nodes, > child int not null references nodes, > primary key (parent, child), > check (parent != child) > ) without rowid; > create index reverseEdges on edges (child, parent); > > create view leafCounts as with recursive > leaves (id) as ( > select nodes.id > from nodes left outer join edges > on nodes.id = edges.parent > where edges.parent is null > ), > paths (parent, child) as ( > select parent, child from edges > union > select paths.parent, edges.child > from paths inner join edges > on paths.child = edges.parent > ) > select parent, count(*) as leafCount > from paths > where child in leaves > group by parent; > > > -----Original Message----- > From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] > On Behalf Of Shane Dev > Sent: Monday, January 01, 2018 11:14 AM > To: SQLite mailing list > Subject: [sqlite] Efficient query to count number of leaves in a DAG. > > Hi, > > I want to the count the number of leaves (descendants without children) for > each node in a DAG > > DAG definition - > > CREATE TABLE nodes(id integer primary key, description text); > CREATE TABLE edges(parent not null references nodes, child not null > references nodes, primary key(parent, child)); > > My query - > > CREATE VIEW v_count_leaves as with recursive r(id, top) as ( > select id, id from nodes > union all > select t.id, top from nodes as t, edges as e, r where e.parent=r.id and > t.id > =e.child) > select top, count(*) from r where top<>id and id not in (select parent from > edges where parent=id) group by top; > > It seems to work but is complex to understand and debug despite my aim to > keep it simple as possible, but more importantly - it is very slow when > there are more than a few thousand nodes and edges. > > It there a more efficient (and ideally simpler) way? > _______________________________________________ > sqlite-users mailing list > sqlite-users@mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > _______________________________________________ > sqlite-users mailing list > sqlite-users@mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users