[sqlite] Diff two tables as fast as diff(1) (of sorted files)

Richard Hipp Wed, 18 Nov 2015 21:11:13 -0500

On 11/18/15, Nico Williams <nico at cryptonector.com> wrote:
>
> Now, diff(1) (i.e., the Unix command) can do this very efficiently
> because it just moves a "cursor" in both files looking for different
> records (lines).
>


Umm..  Diff is a little more complicated than that.  See
https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
further niformation.  Basically, any efficient implementation of
diff(1) needs to have both files completely in memory, because the
computation does lots of jumping around and searching within both
files.  I wrote the diff implementation for Fossil, from scratch, so I
have some familiarity with these kinds of things.  Would that it were
so simple as a couple of moving cursors...

That said, given sorted tables like this:

CREATE TABLE t1(a,b,c,PRIMARY KEY(a,b));
CREATE TABLE t2(x,y,z,PRIMARY KEY(x,y));

Then the following query:

SELECT a, b FROM t1 EXCEPT SELECT x, y FROM t2 ORDER BY 1, 2;

Works by running both SELECT statements in parallel - walking cursors
down each table - and "merging" the results together in a single pass.
Sadly, SQLite currrently does *not* do this if you omit the ORDER BY
clause:

SELECT a, b FROM t1 EXCEPT SELECT x, y FROM t2;

So here is a case where (paradoxically) adding an ORDER BY clause
makes a query go faster.  I noticed this a few weeks ago and was
thinking that I need to do something about it.

-- 
D. Richard Hipp
drh at sqlite.org

[sqlite] Diff two tables as fast as diff(1) (of sorted files)

Reply via email to