Re: [HACKERS] MD5 aggregate

Marko Kreen Mon, 17 Jun 2013 04:54:20 -0700

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
> On 15 June 2013 10:22, Dean Rasheed <[email protected]> wrote:
> > There seem to be 2 separate directions that this could go, which
> > really meet different requirements:
> >
> > 1). Produce an unordered sum for SQL to compare 2 tables regardless of
> > the order in which they are scanned. A possible approach to this might
> > be something like an aggregate
> >
> > md5_total(text/bytea) returns text
> >
> > that returns the sum of the md5 values of each input value, treating
> > each md5 value as an unsigned 128-bit integer, and then producing the
> > hexadecimal representation of the final sum. This should out-perform a
> > solution based on numeric addition, and in typical cases, the result
> > wouldn't be much longer than a regular md5 sum, and so would be easy
> > to eyeball for differences.
> >
> 
> I've been playing around with the idea of an aggregate that computes
> the sum of the md5 hashes of each of its inputs, which I've called
> md5_total() for now, although I'm not particularly wedded to that
> name. Comparing it with md5_agg() on a 100M row table (see attached
> test script) produces interesting results:
> 
> SELECT md5_agg(foo.*::text)
>   FROM (SELECT * FROM foo ORDER BY id) foo;
> 
>  50bc42127fb9b028c9708248f835ed8f
> 
> Time: 92960.021 ms
> 
> SELECT md5_total(foo.*::text) FROM foo;
> 
>  02faea7fafee4d253fc94cfae031afc43c03479c
> 
> Time: 96190.343 ms
> 
> Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
> result is longer) but it seems like it would be very useful for
> quickly comparing data in SQL, since its value is not dependent on the
> row-order making it easier to use and better performing if there is no
> usable index for ordering.
> 
> Note, however, that if there is an index that can be used for
> ordering, the performance is not necessarily better than md5_agg(), as
> this example shows. There is a small additional overhead per row for
> initialising the MD5 sums, and adding the results to the total, but I
> think the biggest factor is that md5_total() is processing more data.
> The reason is that MD5 works on 64-byte blocks, so the total amount of
> data going through the core MD5 algorithm is each row's size is
> rounded up to a multiple of 64. In this simple case it ends up
> processing around 1.5 times as much data:
> 
> SELECT sum(length(foo.*::text)) AS md5_agg,
>        sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
> 
>   md5_agg   |  md5_total
> ------------+-------------
>  8103815438 | 12799909248
> 
> although of course that overhead won't be as large on wider tables,
> and even in this case the overall performance is still on a par with
> md5_agg().
> 
> ISTM that both aggregates are potentially useful in different
> situations. I would probably typically use md5_total() because of its
> simplicity/order-independence and consistent performance, but
> md5_agg() might also be useful when comparing with external data.


Few notes:

- Index-scan over whole table is *very* bad for larger tables (few times
  bigger than available RAM).  If you want to promote such use you should
  also warn against use on loaded server.

- It's pointless to worry about overflow on 128-bit ints.  Just let it
  happen.  Adding complexity for that does not bring any advantage.

- Using some faster 128-bit hash may be useful - check out CityHash
  or SpookyHash.  You can get C implementation from pghashlib.

-- 
marko



-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] MD5 aggregate

Reply via email to