Re: [sqlite] Efficient query to count number of leaves in a DAG.

Shane Dev Fri, 05 Jan 2018 13:57:39 -0800

Hi David,

According to https://sqlite.org/lang_with.html, "Optimization note: ...if
the example had used UNION instead of UNION ALL, then SQLite would have had
to keep around all previously generated content in order to check for
duplicates. For this reason, programmers should strive to use UNION ALL
instead of UNION when feasible."


Despite that, your RCTE with UNION is much faster than mine.

sqlite> select count(*) from nodes;
count(*)
10000
sqlite> select count(*) from edges;
count(*)
9990

Here is how create my test data -

sqlite> .sch v_generate_nodes
-- Generates an infinite series of x, 'nodex' records where x = 1, 2, 3 ...
CREATE VIEW v_generate_nodes as with recursive rcte(id, description) as
(select 1, 'node1' union all select id+1, 'node'||(id+1) from rcte) select
* from rcte;
sqlite> insert into nodes select from v_generate_nodes limit 10000;

sqlite> .sch v_generate_edges
-- Randomly generates edges between entries in the nodes table.
---Assumption : node ids are 1, 2, 3...n without gaps
-- Each node will have 0 or 1 parents and 0, 1, 2, ... children
CREATE VIEW v_generate_edges as with rcte(parent, child) as (select
cast(abs(random())/9223372036854775808 as integer), 1 union all select
cast(abs(random())/9223372036854775808*(child+1) as integer), child+1 from
rcte where child <= (select count(*) from nodes) limit (select count(*)
from nodes)) select * from rcte where parent>0;
sqlite> insert into edges select * from v_generate_edges;



On 5 January 2018 at 18:32, David Raymond <david.raym...@tomtom.com> wrote:

> Hmm. Maybe try yours with union instead of union all? Though if there's
> only 1 path between any pair of nodes that shouldn't make too much
> difference. Otherwise I'm getting low on ideas.
>
> What're the record counts for nodes and edges?
>
> -----Original Message-----
> From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org]
> On Behalf Of Shane Dev
> Sent: Thursday, January 04, 2018 5:20 PM
> To: SQLite mailing list
> Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.
>
> Hi David
>
> I recommend using longer names than 1 letter for your aliases, what you
> > save in typing you lose a couple times over again when wondering what "r"
> > is or why "t" has anything to do with "nodes"
> >
>
> Fair enough. I tend to use shorts names to reduce the risk of typos. My
> original node table was called "tasks". I tried to simplify the query for
> this forum post but neglected to change the alias.
>
> >
> > In your CTE you're doing a 3 table join. There's no need to include the
> > nodes table in there at all, you can get the node ID from the edge table.
> > ...union all select e.child, top from r, edges as e where e.parent =
> r.id
> > )...
> >
>
> You're right in this case. My original node table "tasks" had more columns
> which I wanted in the final result set.
>
> >
> > The big thing though is in the where clause.
> > where...and id not in (select parent from edges where parent = id)...
> >
>
> That was a sloppy mistake, I changed it to  ..and id not in (select parent
> from edges)... but it was still very slow
>
> >
> > Old:
> > sqlite> explain query plan with recursive r (id, top) as (select id, id
> > from nodes union all select t.id, top from nodes as t, edges as e, r
> > where e.parent = r.id and t.id = e.child) select top, count(*) from r
> > where top != id and id not in (select parent from edges where parent =
> id)
> > group by top;
> >
>
> CREATE VIEW v_count_leaves as with recursive r (id, top) as (select id, id
> from nodes union all select t.id, top from nodes as t, edges as e, r where
> e.parent = r.id and t.id = e.child) select top, count(*) from r where top
> != id and id not in (select parent from edges where parent = id) group by
> top;
>
> sqlite> select * from v_count_leaves where top=679;
> top     count(*)
> 679     2
> Run Time: real 73.365 user 73.328125 sys 0.000000
>
>
> > New:
> > sqlite> explain query plan with recursive r (id, top) as (select id, id
> > from nodes union all select e.child, top from edges as e, r where
> e.parent
> > = r.id) select top, count(*) from r where top != id and id not in
> (select
> > parent from edges) group by top;
> > Now give your modified query a go and let me know how it compares to what
> > I came up with.
> >
>
> CREATE VIEW v_count_leaves_new as with recursive r (id, top) as (select id,
> id from nodes union all select e.child, top from edges as e, r where
> e.parent = r.id) select top, count(*) from r where top != id and id not in
> (select parent from edges) group by top;
>
> sqlite> select * from v_count_leaves_new where top=679;
> top     count(*)
> 679     2
> Run Time: real 45.099 user 45.093750 sys 0.000000
>
> faster, but about 8 times slower than your query -
>
> sqlite> select * from leafcounts where parent=679;
> parent  leafCount
> 679     2
> Run Time: real 5.639 user 5.640625 sys 0.000000
>
> and that is without the reverseEdges index.
>
> I still don't understand why "leafcounts" is so much faster than
> "v_count_leaves_new"
>
>
>
>
> > -----Original Message-----
> > From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org]
> > On Behalf Of Shane Dev
> > Sent: Wednesday, January 03, 2018 12:45 AM
> > To: SQLite mailing list
> > Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.
> >
> > Hi David,
> >
> > Nice work! your query is far quicker than mine- even without the
> > reverseEdges index. I think you are right about the problem of
> potentially
> > double counting leaves. There weren't any multi-parent nodes in my test
> > data so I didn't notice this mistake.
> >
> > Could you please explain why your query is so much faster?
> >
> > On 2 January 2018 at 17:50, David Raymond <david.raym...@tomtom.com>
> > wrote:
> >
> > > I think you need a union there instead of a union all. Otherwise you're
> > > double (or more) counting leaves where there is more than 1 path to get
> > to
> > > the leaf.
> > >
> > > I don't have a large dataset to test it on, but how about something
> like:
> > >
> > > create table nodes
> > > (
> > >   id integer primary key,
> > >   description text
> > > );
> > >
> > > create table edges
> > > (
> > >   parent int not null references nodes,
> > >   child int not null references nodes,
> > >   primary key (parent, child),
> > >   check (parent != child)
> > > ) without rowid;
> > > create index reverseEdges on edges (child, parent);
> > >
> > > create view leafCounts as with recursive
> > > leaves (id) as (
> > >   select nodes.id
> > >   from nodes left outer join edges
> > >   on nodes.id = edges.parent
> > >   where edges.parent is null
> > > ),
> > > paths (parent, child) as (
> > >   select parent, child from edges
> > >   union
> > >   select paths.parent, edges.child
> > >   from paths inner join edges
> > >   on paths.child = edges.parent
> > > )
> > > select parent, count(*) as leafCount
> > > from paths
> > > where child in leaves
> > > group by parent;
> > >
> > >
> > > -----Original Message-----
> > > From: sqlite-users [mailto:sqlite-users-bounces@
> mailinglists.sqlite.org]
> > > On Behalf Of Shane Dev
> > > Sent: Monday, January 01, 2018 11:14 AM
> > > To: SQLite mailing list
> > > Subject: [sqlite] Efficient query to count number of leaves in a DAG.
> > >
> > > Hi,
> > >
> > > I want to the count the number of leaves (descendants without children)
> > for
> > > each node in a DAG
> > >
> > > DAG definition -
> > >
> > > CREATE TABLE nodes(id integer primary key, description text);
> > > CREATE TABLE edges(parent not null references nodes, child not null
> > > references nodes, primary key(parent, child));
> > >
> > > My query -
> > >
> > > CREATE VIEW v_count_leaves as with recursive r(id, top) as (
> > > select id, id from nodes
> > > union all
> > > select t.id, top from nodes as t, edges as e, r where e.parent=r.id
> and
> > > t.id
> > > =e.child)
> > > select top, count(*) from r where top<>id and id not in (select parent
> > from
> > > edges where parent=id) group by top;
> > >
> > > It seems to work but is complex to understand and debug despite my aim
> to
> > > keep it simple as possible, but more importantly - it is very slow when
> > > there are more than a few thousand nodes and edges.
> > >
> > > It there a more efficient (and ideally simpler) way?
> > > _______________________________________________
> > > sqlite-users mailing list
> > > sqlite-users@mailinglists.sqlite.org
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > > _______________________________________________
> > > sqlite-users mailing list
> > > sqlite-users@mailinglists.sqlite.org
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > >
> > _______________________________________________
> > sqlite-users mailing list
> > sqlite-users@mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > _______________________________________________
> > sqlite-users mailing list
> > sqlite-users@mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Efficient query to count number of leaves in a DAG.

Reply via email to