Re: [sqlite] Efficient query to count number of leaves in a DAG.

Shane Dev Sun, 07 Jan 2018 11:48:59 -0800

Hi David,

I started from scratch with a new database and confirmed your findings -
v_count_leaves_new is actually faster than  leafCounts.


My error in the original database was neglecting to specify type of the
columns in the EDGES table -

CREATE TABLE edges(parent not null references nodes, child not null
references nodes, primary key(parent, child)) without rowid;

I had assumed the EDGES columns would inherit the affinity of the
referenced table key (NODES.id). However after correcting the table
definition to

CREATE TABLE edges(parent *int* not null references nodes, child *int* not
null references nodes, primary key(parent, child)) without rowid;

queries over 10000 nodes and edges dropped from 45 to 0.1 sec. A valuable
lesson for me!

On 6 January 2018 at 00:28, David Raymond <david.raym...@tomtom.com> wrote:

> Something is seriously funky here. I'm getting the opposite, where your
> query appears to be going faster than mine. I used your queries there to
> populate nodes and edges, based on 1,000,000 nodes. I even added in the
> extra index which turns out isn't used anyway. With it all in memory my
> version is taking 59 seconds, whereas your _new version is taking 28
> seconds and the old version is only 34. So apparently I should be taking
> query advice from you.
>
> If I change my union into a union all it goes down to 31 seconds, so
> closer to yours.
> If I change your union all into a union the time jumps to 156 seconds.
> I think I was thinking of a graph with possible loops or multiple paths to
> get from A to B, which is why I went with the union.
>
> So my next question is: what SQLite version are you using, and what
> hardware are you on?
>
> Are you query plans looking like what I'm seeing here?
>
> sqlite> select * from v_count_leaves_new where top = 777;
> --EQP-- 3,0,0,SCAN TABLE nodes
> --EQP-- 4,0,1,SCAN TABLE r
> --EQP-- 4,1,0,SEARCH TABLE edges AS e USING COVERING INDEX
> sqlite_autoindex_edges_1 (parent=?)
> --EQP-- 2,0,0,COMPOUND SUBQUERIES 0 AND 0 (UNION ALL)
> --EQP-- 1,0,0,SCAN SUBQUERY 2
> --EQP-- 0,0,0,USING INDEX sqlite_autoindex_edges_1 FOR IN-OPERATOR
> --EQP-- 0,0,0,SCAN SUBQUERY 1
> top|count(*)
> 777|314
> Run Time: real 28.502 user 28.454582 sys 0.000000
>
> --now with union all
> sqlite> select * from leafCounts2 where parent = 777;
> --EQP-- 3,0,0,SCAN TABLE edges
> --EQP-- 4,0,0,SCAN TABLE paths
> --EQP-- 4,1,1,SEARCH TABLE edges USING COVERING INDEX
> sqlite_autoindex_edges_1 (parent=?)
> --EQP-- 2,0,0,COMPOUND SUBQUERIES 0 AND 0 (UNION ALL)
> --EQP-- 1,0,0,SCAN SUBQUERY 2
> --EQP-- 1,0,0,EXECUTE LIST SUBQUERY 5
> --EQP-- 5,0,0,SCAN TABLE nodes
> --EQP-- 5,1,1,SEARCH TABLE edges USING COVERING INDEX
> sqlite_autoindex_edges_1 (parent=?)
> --EQP-- 0,0,0,SCAN SUBQUERY 1
> parent|leafCount
> 777|314
> Run Time: real 31.590 user 31.434202 sys 0.000000
>
>
> -----Original Message-----
> From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org]
> On Behalf Of Shane Dev
> Sent: Friday, January 05, 2018 4:57 PM
> To: SQLite mailing list
> Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.
>
> Hi David,
>
> According to https://sqlite.org/lang_with.html, "Optimization note: ...if
> the example had used UNION instead of UNION ALL, then SQLite would have had
> to keep around all previously generated content in order to check for
> duplicates. For this reason, programmers should strive to use UNION ALL
> instead of UNION when feasible."
>
> Despite that, your RCTE with UNION is much faster than mine.
>
> sqlite> select count(*) from nodes;
> count(*)
> 10000
> sqlite> select count(*) from edges;
> count(*)
> 9990
>
> Here is how create my test data -
>
> sqlite> .sch v_generate_nodes
> -- Generates an infinite series of x, 'nodex' records where x = 1, 2, 3 ...
> CREATE VIEW v_generate_nodes as with recursive rcte(id, description) as
> (select 1, 'node1' union all select id+1, 'node'||(id+1) from rcte) select
> * from rcte;
> sqlite> insert into nodes select from v_generate_nodes limit 10000;
>
> sqlite> .sch v_generate_edges
> -- Randomly generates edges between entries in the nodes table.
> ---Assumption : node ids are 1, 2, 3...n without gaps
> -- Each node will have 0 or 1 parents and 0, 1, 2, ... children
> CREATE VIEW v_generate_edges as with rcte(parent, child) as (select
> cast(abs(random())/9223372036854775808 as integer), 1 union all select
> cast(abs(random())/9223372036854775808*(child+1) as integer), child+1 from
> rcte where child <= (select count(*) from nodes) limit (select count(*)
> from nodes)) select * from rcte where parent>0;
> sqlite> insert into edges select * from v_generate_edges;
>
>
>
> On 5 January 2018 at 18:32, David Raymond <david.raym...@tomtom.com>
> wrote:
>
> > Hmm. Maybe try yours with union instead of union all? Though if there's
> > only 1 path between any pair of nodes that shouldn't make too much
> > difference. Otherwise I'm getting low on ideas.
> >
> > What're the record counts for nodes and edges?
> >
> > -----Original Message-----
> > From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org]
> > On Behalf Of Shane Dev
> > Sent: Thursday, January 04, 2018 5:20 PM
> > To: SQLite mailing list
> > Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.
> >
> > Hi David
> >
> > I recommend using longer names than 1 letter for your aliases, what you
> > > save in typing you lose a couple times over again when wondering what
> "r"
> > > is or why "t" has anything to do with "nodes"
> > >
> >
> > Fair enough. I tend to use shorts names to reduce the risk of typos. My
> > original node table was called "tasks". I tried to simplify the query for
> > this forum post but neglected to change the alias.
> >
> > >
> > > In your CTE you're doing a 3 table join. There's no need to include the
> > > nodes table in there at all, you can get the node ID from the edge
> table.
> > > ...union all select e.child, top from r, edges as e where e.parent =
> > r.id
> > > )...
> > >
> >
> > You're right in this case. My original node table "tasks" had more
> columns
> > which I wanted in the final result set.
> >
> > >
> > > The big thing though is in the where clause.
> > > where...and id not in (select parent from edges where parent = id)...
> > >
> >
> > That was a sloppy mistake, I changed it to  ..and id not in (select
> parent
> > from edges)... but it was still very slow
> >
> > >
> > > Old:
> > > sqlite> explain query plan with recursive r (id, top) as (select id, id
> > > from nodes union all select t.id, top from nodes as t, edges as e, r
> > > where e.parent = r.id and t.id = e.child) select top, count(*) from r
> > > where top != id and id not in (select parent from edges where parent =
> > id)
> > > group by top;
> > >
> >
> > CREATE VIEW v_count_leaves as with recursive r (id, top) as (select id,
> id
> > from nodes union all select t.id, top from nodes as t, edges as e, r
> where
> > e.parent = r.id and t.id = e.child) select top, count(*) from r where
> top
> > != id and id not in (select parent from edges where parent = id) group by
> > top;
> >
> > sqlite> select * from v_count_leaves where top=679;
> > top     count(*)
> > 679     2
> > Run Time: real 73.365 user 73.328125 sys 0.000000
> >
> >
> > > New:
> > > sqlite> explain query plan with recursive r (id, top) as (select id, id
> > > from nodes union all select e.child, top from edges as e, r where
> > e.parent
> > > = r.id) select top, count(*) from r where top != id and id not in
> > (select
> > > parent from edges) group by top;
> > > Now give your modified query a go and let me know how it compares to
> what
> > > I came up with.
> > >
> >
> > CREATE VIEW v_count_leaves_new as with recursive r (id, top) as (select
> id,
> > id from nodes union all select e.child, top from edges as e, r where
> > e.parent = r.id) select top, count(*) from r where top != id and id not
> in
> > (select parent from edges) group by top;
> >
> > sqlite> select * from v_count_leaves_new where top=679;
> > top     count(*)
> > 679     2
> > Run Time: real 45.099 user 45.093750 sys 0.000000
> >
> > faster, but about 8 times slower than your query -
> >
> > sqlite> select * from leafcounts where parent=679;
> > parent  leafCount
> > 679     2
> > Run Time: real 5.639 user 5.640625 sys 0.000000
> >
> > and that is without the reverseEdges index.
> >
> > I still don't understand why "leafcounts" is so much faster than
> > "v_count_leaves_new"
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: sqlite-users [mailto:sqlite-users-bounces@
> mailinglists.sqlite.org]
> > > On Behalf Of Shane Dev
> > > Sent: Wednesday, January 03, 2018 12:45 AM
> > > To: SQLite mailing list
> > > Subject: Re: [sqlite] Efficient query to count number of leaves in a
> DAG.
> > >
> > > Hi David,
> > >
> > > Nice work! your query is far quicker than mine- even without the
> > > reverseEdges index. I think you are right about the problem of
> > potentially
> > > double counting leaves. There weren't any multi-parent nodes in my test
> > > data so I didn't notice this mistake.
> > >
> > > Could you please explain why your query is so much faster?
> > >
> > > On 2 January 2018 at 17:50, David Raymond <david.raym...@tomtom.com>
> > > wrote:
> > >
> > > > I think you need a union there instead of a union all. Otherwise
> you're
> > > > double (or more) counting leaves where there is more than 1 path to
> get
> > > to
> > > > the leaf.
> > > >
> > > > I don't have a large dataset to test it on, but how about something
> > like:
> > > >
> > > > create table nodes
> > > > (
> > > >   id integer primary key,
> > > >   description text
> > > > );
> > > >
> > > > create table edges
> > > > (
> > > >   parent int not null references nodes,
> > > >   child int not null references nodes,
> > > >   primary key (parent, child),
> > > >   check (parent != child)
> > > > ) without rowid;
> > > > create index reverseEdges on edges (child, parent);
> > > >
> > > > create view leafCounts as with recursive
> > > > leaves (id) as (
> > > >   select nodes.id
> > > >   from nodes left outer join edges
> > > >   on nodes.id = edges.parent
> > > >   where edges.parent is null
> > > > ),
> > > > paths (parent, child) as (
> > > >   select parent, child from edges
> > > >   union
> > > >   select paths.parent, edges.child
> > > >   from paths inner join edges
> > > >   on paths.child = edges.parent
> > > > )
> > > > select parent, count(*) as leafCount
> > > > from paths
> > > > where child in leaves
> > > > group by parent;
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: sqlite-users [mailto:sqlite-users-bounces@
> > mailinglists.sqlite.org]
> > > > On Behalf Of Shane Dev
> > > > Sent: Monday, January 01, 2018 11:14 AM
> > > > To: SQLite mailing list
> > > > Subject: [sqlite] Efficient query to count number of leaves in a DAG.
> > > >
> > > > Hi,
> > > >
> > > > I want to the count the number of leaves (descendants without
> children)
> > > for
> > > > each node in a DAG
> > > >
> > > > DAG definition -
> > > >
> > > > CREATE TABLE nodes(id integer primary key, description text);
> > > > CREATE TABLE edges(parent not null references nodes, child not null
> > > > references nodes, primary key(parent, child));
> > > >
> > > > My query -
> > > >
> > > > CREATE VIEW v_count_leaves as with recursive r(id, top) as (
> > > > select id, id from nodes
> > > > union all
> > > > select t.id, top from nodes as t, edges as e, r where e.parent=r.id
> > and
> > > > t.id
> > > > =e.child)
> > > > select top, count(*) from r where top<>id and id not in (select
> parent
> > > from
> > > > edges where parent=id) group by top;
> > > >
> > > > It seems to work but is complex to understand and debug despite my
> aim
> > to
> > > > keep it simple as possible, but more importantly - it is very slow
> when
> > > > there are more than a few thousand nodes and edges.
> > > >
> > > > It there a more efficient (and ideally simpler) way?
> > > > _______________________________________________
> > > > sqlite-users mailing list
> > > > sqlite-users@mailinglists.sqlite.org
> > > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > > > _______________________________________________
> > > > sqlite-users mailing list
> > > > sqlite-users@mailinglists.sqlite.org
> > > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > > >
> > > _______________________________________________
> > > sqlite-users mailing list
> > > sqlite-users@mailinglists.sqlite.org
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > > _______________________________________________
> > > sqlite-users mailing list
> > > sqlite-users@mailinglists.sqlite.org
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > >
> > _______________________________________________
> > sqlite-users mailing list
> > sqlite-users@mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > _______________________________________________
> > sqlite-users mailing list
> > sqlite-users@mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Efficient query to count number of leaves in a DAG.

Reply via email to