Hi David, I started from scratch with a new database and confirmed your findings - v_count_leaves_new is actually faster than leafCounts.
My error in the original database was neglecting to specify type of the columns in the EDGES table - CREATE TABLE edges(parent not null references nodes, child not null references nodes, primary key(parent, child)) without rowid; I had assumed the EDGES columns would inherit the affinity of the referenced table key (NODES.id). However after correcting the table definition to CREATE TABLE edges(parent *int* not null references nodes, child *int* not null references nodes, primary key(parent, child)) without rowid; queries over 10000 nodes and edges dropped from 45 to 0.1 sec. A valuable lesson for me! On 6 January 2018 at 00:28, David Raymond <david.raym...@tomtom.com> wrote: > Something is seriously funky here. I'm getting the opposite, where your > query appears to be going faster than mine. I used your queries there to > populate nodes and edges, based on 1,000,000 nodes. I even added in the > extra index which turns out isn't used anyway. With it all in memory my > version is taking 59 seconds, whereas your _new version is taking 28 > seconds and the old version is only 34. So apparently I should be taking > query advice from you. > > If I change my union into a union all it goes down to 31 seconds, so > closer to yours. > If I change your union all into a union the time jumps to 156 seconds. > I think I was thinking of a graph with possible loops or multiple paths to > get from A to B, which is why I went with the union. > > So my next question is: what SQLite version are you using, and what > hardware are you on? > > Are you query plans looking like what I'm seeing here? > > sqlite> select * from v_count_leaves_new where top = 777; > --EQP-- 3,0,0,SCAN TABLE nodes > --EQP-- 4,0,1,SCAN TABLE r > --EQP-- 4,1,0,SEARCH TABLE edges AS e USING COVERING INDEX > sqlite_autoindex_edges_1 (parent=?) > --EQP-- 2,0,0,COMPOUND SUBQUERIES 0 AND 0 (UNION ALL) > --EQP-- 1,0,0,SCAN SUBQUERY 2 > --EQP-- 0,0,0,USING INDEX sqlite_autoindex_edges_1 FOR IN-OPERATOR > --EQP-- 0,0,0,SCAN SUBQUERY 1 > top|count(*) > 777|314 > Run Time: real 28.502 user 28.454582 sys 0.000000 > > --now with union all > sqlite> select * from leafCounts2 where parent = 777; > --EQP-- 3,0,0,SCAN TABLE edges > --EQP-- 4,0,0,SCAN TABLE paths > --EQP-- 4,1,1,SEARCH TABLE edges USING COVERING INDEX > sqlite_autoindex_edges_1 (parent=?) > --EQP-- 2,0,0,COMPOUND SUBQUERIES 0 AND 0 (UNION ALL) > --EQP-- 1,0,0,SCAN SUBQUERY 2 > --EQP-- 1,0,0,EXECUTE LIST SUBQUERY 5 > --EQP-- 5,0,0,SCAN TABLE nodes > --EQP-- 5,1,1,SEARCH TABLE edges USING COVERING INDEX > sqlite_autoindex_edges_1 (parent=?) > --EQP-- 0,0,0,SCAN SUBQUERY 1 > parent|leafCount > 777|314 > Run Time: real 31.590 user 31.434202 sys 0.000000 > > > -----Original Message----- > From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] > On Behalf Of Shane Dev > Sent: Friday, January 05, 2018 4:57 PM > To: SQLite mailing list > Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG. > > Hi David, > > According to https://sqlite.org/lang_with.html, "Optimization note: ...if > the example had used UNION instead of UNION ALL, then SQLite would have had > to keep around all previously generated content in order to check for > duplicates. For this reason, programmers should strive to use UNION ALL > instead of UNION when feasible." > > Despite that, your RCTE with UNION is much faster than mine. > > sqlite> select count(*) from nodes; > count(*) > 10000 > sqlite> select count(*) from edges; > count(*) > 9990 > > Here is how create my test data - > > sqlite> .sch v_generate_nodes > -- Generates an infinite series of x, 'nodex' records where x = 1, 2, 3 ... > CREATE VIEW v_generate_nodes as with recursive rcte(id, description) as > (select 1, 'node1' union all select id+1, 'node'||(id+1) from rcte) select > * from rcte; > sqlite> insert into nodes select from v_generate_nodes limit 10000; > > sqlite> .sch v_generate_edges > -- Randomly generates edges between entries in the nodes table. > ---Assumption : node ids are 1, 2, 3...n without gaps > -- Each node will have 0 or 1 parents and 0, 1, 2, ... children > CREATE VIEW v_generate_edges as with rcte(parent, child) as (select > cast(abs(random())/9223372036854775808 as integer), 1 union all select > cast(abs(random())/9223372036854775808*(child+1) as integer), child+1 from > rcte where child <= (select count(*) from nodes) limit (select count(*) > from nodes)) select * from rcte where parent>0; > sqlite> insert into edges select * from v_generate_edges; > > > > On 5 January 2018 at 18:32, David Raymond <david.raym...@tomtom.com> > wrote: > > > Hmm. Maybe try yours with union instead of union all? Though if there's > > only 1 path between any pair of nodes that shouldn't make too much > > difference. Otherwise I'm getting low on ideas. > > > > What're the record counts for nodes and edges? > > > > -----Original Message----- > > From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] > > On Behalf Of Shane Dev > > Sent: Thursday, January 04, 2018 5:20 PM > > To: SQLite mailing list > > Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG. > > > > Hi David > > > > I recommend using longer names than 1 letter for your aliases, what you > > > save in typing you lose a couple times over again when wondering what > "r" > > > is or why "t" has anything to do with "nodes" > > > > > > > Fair enough. I tend to use shorts names to reduce the risk of typos. My > > original node table was called "tasks". I tried to simplify the query for > > this forum post but neglected to change the alias. > > > > > > > > In your CTE you're doing a 3 table join. There's no need to include the > > > nodes table in there at all, you can get the node ID from the edge > table. > > > ...union all select e.child, top from r, edges as e where e.parent = > > r.id > > > )... > > > > > > > You're right in this case. My original node table "tasks" had more > columns > > which I wanted in the final result set. > > > > > > > > The big thing though is in the where clause. > > > where...and id not in (select parent from edges where parent = id)... > > > > > > > That was a sloppy mistake, I changed it to ..and id not in (select > parent > > from edges)... but it was still very slow > > > > > > > > Old: > > > sqlite> explain query plan with recursive r (id, top) as (select id, id > > > from nodes union all select t.id, top from nodes as t, edges as e, r > > > where e.parent = r.id and t.id = e.child) select top, count(*) from r > > > where top != id and id not in (select parent from edges where parent = > > id) > > > group by top; > > > > > > > CREATE VIEW v_count_leaves as with recursive r (id, top) as (select id, > id > > from nodes union all select t.id, top from nodes as t, edges as e, r > where > > e.parent = r.id and t.id = e.child) select top, count(*) from r where > top > > != id and id not in (select parent from edges where parent = id) group by > > top; > > > > sqlite> select * from v_count_leaves where top=679; > > top count(*) > > 679 2 > > Run Time: real 73.365 user 73.328125 sys 0.000000 > > > > > > > New: > > > sqlite> explain query plan with recursive r (id, top) as (select id, id > > > from nodes union all select e.child, top from edges as e, r where > > e.parent > > > = r.id) select top, count(*) from r where top != id and id not in > > (select > > > parent from edges) group by top; > > > Now give your modified query a go and let me know how it compares to > what > > > I came up with. > > > > > > > CREATE VIEW v_count_leaves_new as with recursive r (id, top) as (select > id, > > id from nodes union all select e.child, top from edges as e, r where > > e.parent = r.id) select top, count(*) from r where top != id and id not > in > > (select parent from edges) group by top; > > > > sqlite> select * from v_count_leaves_new where top=679; > > top count(*) > > 679 2 > > Run Time: real 45.099 user 45.093750 sys 0.000000 > > > > faster, but about 8 times slower than your query - > > > > sqlite> select * from leafcounts where parent=679; > > parent leafCount > > 679 2 > > Run Time: real 5.639 user 5.640625 sys 0.000000 > > > > and that is without the reverseEdges index. > > > > I still don't understand why "leafcounts" is so much faster than > > "v_count_leaves_new" > > > > > > > > > > > -----Original Message----- > > > From: sqlite-users [mailto:sqlite-users-bounces@ > mailinglists.sqlite.org] > > > On Behalf Of Shane Dev > > > Sent: Wednesday, January 03, 2018 12:45 AM > > > To: SQLite mailing list > > > Subject: Re: [sqlite] Efficient query to count number of leaves in a > DAG. > > > > > > Hi David, > > > > > > Nice work! your query is far quicker than mine- even without the > > > reverseEdges index. I think you are right about the problem of > > potentially > > > double counting leaves. There weren't any multi-parent nodes in my test > > > data so I didn't notice this mistake. > > > > > > Could you please explain why your query is so much faster? > > > > > > On 2 January 2018 at 17:50, David Raymond <david.raym...@tomtom.com> > > > wrote: > > > > > > > I think you need a union there instead of a union all. Otherwise > you're > > > > double (or more) counting leaves where there is more than 1 path to > get > > > to > > > > the leaf. > > > > > > > > I don't have a large dataset to test it on, but how about something > > like: > > > > > > > > create table nodes > > > > ( > > > > id integer primary key, > > > > description text > > > > ); > > > > > > > > create table edges > > > > ( > > > > parent int not null references nodes, > > > > child int not null references nodes, > > > > primary key (parent, child), > > > > check (parent != child) > > > > ) without rowid; > > > > create index reverseEdges on edges (child, parent); > > > > > > > > create view leafCounts as with recursive > > > > leaves (id) as ( > > > > select nodes.id > > > > from nodes left outer join edges > > > > on nodes.id = edges.parent > > > > where edges.parent is null > > > > ), > > > > paths (parent, child) as ( > > > > select parent, child from edges > > > > union > > > > select paths.parent, edges.child > > > > from paths inner join edges > > > > on paths.child = edges.parent > > > > ) > > > > select parent, count(*) as leafCount > > > > from paths > > > > where child in leaves > > > > group by parent; > > > > > > > > > > > > -----Original Message----- > > > > From: sqlite-users [mailto:sqlite-users-bounces@ > > mailinglists.sqlite.org] > > > > On Behalf Of Shane Dev > > > > Sent: Monday, January 01, 2018 11:14 AM > > > > To: SQLite mailing list > > > > Subject: [sqlite] Efficient query to count number of leaves in a DAG. > > > > > > > > Hi, > > > > > > > > I want to the count the number of leaves (descendants without > children) > > > for > > > > each node in a DAG > > > > > > > > DAG definition - > > > > > > > > CREATE TABLE nodes(id integer primary key, description text); > > > > CREATE TABLE edges(parent not null references nodes, child not null > > > > references nodes, primary key(parent, child)); > > > > > > > > My query - > > > > > > > > CREATE VIEW v_count_leaves as with recursive r(id, top) as ( > > > > select id, id from nodes > > > > union all > > > > select t.id, top from nodes as t, edges as e, r where e.parent=r.id > > and > > > > t.id > > > > =e.child) > > > > select top, count(*) from r where top<>id and id not in (select > parent > > > from > > > > edges where parent=id) group by top; > > > > > > > > It seems to work but is complex to understand and debug despite my > aim > > to > > > > keep it simple as possible, but more importantly - it is very slow > when > > > > there are more than a few thousand nodes and edges. > > > > > > > > It there a more efficient (and ideally simpler) way? > > > > _______________________________________________ > > > > sqlite-users mailing list > > > > sqlite-users@mailinglists.sqlite.org > > > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > > > > _______________________________________________ > > > > sqlite-users mailing list > > > > sqlite-users@mailinglists.sqlite.org > > > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > > > > > > > _______________________________________________ > > > sqlite-users mailing list > > > sqlite-users@mailinglists.sqlite.org > > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > > > _______________________________________________ > > > sqlite-users mailing list > > > sqlite-users@mailinglists.sqlite.org > > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > > > > > _______________________________________________ > > sqlite-users mailing list > > sqlite-users@mailinglists.sqlite.org > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > > _______________________________________________ > > sqlite-users mailing list > > sqlite-users@mailinglists.sqlite.org > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > > > _______________________________________________ > sqlite-users mailing list > sqlite-users@mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > _______________________________________________ > sqlite-users mailing list > sqlite-users@mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users