[sqlite] Diff two tables as fast as diff(1) (of sorted files)

Stadin, Benjamin Wed, 18 Nov 2015 23:38:53 +0000

I've written a Sqlite diff tool that does this fast, but using a different 
approach. I'm adding a column named md5checksum to each to be diffed table, and 
add a index to it. 
A small tool then generates checksums for all rows in the table. Instead of 
comparing all columns, only the difference of checksums are considered when 
generating a diff of changed rows. The set of inserted and deleted rows is then 
applied again in a optimized way to the old db.


Ben

Von meinem iPad gesendet

> Am 18.11.2015 um 21:11 schrieb Nico Williams <nico at cryptonector.com>:
> 
> Consider two tables with the same columns and primary keys, and then
> consider this query:
> 
> SELECT 'added., a.* FROM
>  (SELECT a.* FROM a EXCEPT SELECT b.* FROM b) a
> UNION ALL
> SELECT 'deleted', d.* FROM
>  (SELECT b.* FROM b EXCEPT SELECT a.* FROM a) d;
> 
> In other words, the SQL analog of a straightforward diff(1) of two
> [sorted] files.  I think this is the canonical way to do this in SQL,
> though there are variants.
> 
> Now, diff(1) (i.e., the Unix command) can do this very efficiently
> because it just moves a "cursor" in both files looking for different
> records (lines).
> 
> SQLite3, on the other hand, does four table scans and further scans two
> sub-queries' result row-sets.  If only the optimizer could recognize
> this pattern, then it could surely do as well as diff(1).  This sort of
> query strikes me as a common enough kind of query that implementing such
> an optimization might be worthwhile, though the SQLite3 team will have
> better knowledge of that.  I'd certainly appreciate such an
> optimization.  (No, Postgres 9.4.4 doesn't seem to optimize this
> substantially better either.)
> 
> A difference UDF would be one way to handle this.  Table-valued
> functions are half the battle, but one still needs table-valued
> arguments to UDFs to make it clean looking:
> 
>  -- Outputs two-column rows where the first column denotes addition vs
>  -- deletion and the second is a row-valued column with the result
>  -- columns of the corresponding row in the old or new table source:
>  SELECT diff((SELECT ...), (SELECT ...));
> 
> Using "eval" and JSON one could define a diff() UDF that takes two
> strings, compile the two strings as SQL statements and execute them,
> stepping over the results of each just like diff(1) would:
> 
>  -- Outputs JSON arrays of addition/deletion marker and column array or
>  -- object.
>  SELECT diff('SELECT ...', 'SELECT ...');
> 
> and/or
> 
>  SELECT diff('old_view_or_table', 'new_view_or_table');
> 
> Perhaps SQL could grow a DIFFERENCE operator for this, though there are
> syntactic/semantic issues to deal with.  Maybe:
> 
>  SELECT d.diff, o.c0, o.c1, .., o.cN
>  FROM old_source o
>  DIFFERENCE (/*deletion*/ '<', '>' /*addition*/)) d
>  SELECT d.diff, n.c0, n.c1, .., n.cN
>  FROM new_source n;
> 
> One might get a comm(1) equivalent too:
> 
>  SELECT d.diff, o.c0, o.c1, .., o.cN
>  FROM old_source o
>  DIFFERENCE (/*deletion*/ '<', '>' /*addition*/), '=' /*same*/) d
>  SELECT d.diff, n.c0, n.c1, .., n.cN
>  FROM new_source n;
> 
> Nico
> -- 
> _______________________________________________
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] Diff two tables as fast as diff(1) (of sorted files)

Reply via email to