[sqlite] Diff two tables as fast as diff(1) (of sorted files)

Nico Williams Wed, 18 Nov 2015 21:53:08 -0600

On Wed, Nov 18, 2015 at 09:11:13PM -0500, Richard Hipp wrote:
> On 11/18/15, Nico Williams <nico at cryptonector.com> wrote:
> >
> > Now, diff(1) (i.e., the Unix command) can do this very efficiently
> > because it just moves a "cursor" in both files looking for different
> > records (lines).
> >
> 
> Umm..  Diff is a little more complicated than that.  See


Yes, quite right, but for this particular case (only the PK columns
selected) the algorithm I described suffices and is fast because the
two table sources are already sorted.  I know I neglected to mention
explicitly that these are PK columns.

> https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
> further niformation.  Basically, any efficient implementation of
> diff(1) needs to have both files completely in memory, because the
> computation does lots of jumping around and searching within both
> files.  I wrote the diff implementation for Fossil, from scratch, so I
> have some familiarity with these kinds of things.  Would that it were
> so simple as a couple of moving cursors...

I quite believe that.  This isn't a generic diff.

> That said, given sorted tables like this:
> 
> CREATE TABLE t1(a,b,c,PRIMARY KEY(a,b));
> CREATE TABLE t2(x,y,z,PRIMARY KEY(x,y));
> 
> Then the following query:
> 
> SELECT a, b FROM t1 EXCEPT SELECT x, y FROM t2 ORDER BY 1, 2;
> 
> Works by running both SELECT statements in parallel - walking cursors
> down each table - and "merging" the results together in a single pass.
> Sadly, SQLite currrently does *not* do this if you omit the ORDER BY
> clause:
> 
> SELECT a, b FROM t1 EXCEPT SELECT x, y FROM t2;
> 
> So here is a case where (paradoxically) adding an ORDER BY clause
> makes a query go faster.  I noticed this a few weeks ago and was
> thinking that I need to do something about it.

The data is already sorted (in the primary key index), thus the ORDER BY
costs nothing.  Then the ORDER BY is just a clue that SQLite3 needed?

The difference (for those following) is this:

1|0|0|SCAN TABLE t1 USING COVERING INDEX sqlite_autoindex_t1_1
2|0|0|SCAN TABLE t2 USING COVERING INDEX sqlite_autoindex_t2_1
0|0|0|COMPOUND SUBQUERIES 1 AND 2 USING TEMP B-TREE (EXCEPT)

vs.

1|0|0|SCAN TABLE t1 USING COVERING INDEX sqlite_autoindex_t1_1
2|0|0|SCAN TABLE t2 USING COVERING INDEX sqlite_autoindex_t2_1
0|0|0|COMPOUND SUBQUERIES 1 AND 2 (EXCEPT)

The improvement is on the EXCEPT, which gets one half the diff (either
the additions or the deletions).  That's nice, and it's not nothing.

The same query plan would work for the UNION ALL of the two EXCEPT
queries.  This is partly why I think that either having a DIFF operator
or the optimizer recognize the diff-style query would help.  And if
there was such an operator/optimization, then the above optimization
would indeed be icing on the cake.

Anyways, this is good, this makes the SQL diff only twice as slow as it
could be.

I also noticed that UNIQUE indices aren't given the same treatment as
PKs.  That seems strange (though maybe not a big deal, not for me
anyways).

Thanks,

Nico
--

[sqlite] Diff two tables as fast as diff(1) (of sorted files)

Reply via email to