On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: > On 15 June 2013 10:22, Dean Rasheed <dean.a.rash...@gmail.com> wrote: > > There seem to be 2 separate directions that this could go, which > > really meet different requirements: > > > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of > > the order in which they are scanned. A possible approach to this might > > be something like an aggregate > > > > md5_total(text/bytea) returns text > > > > that returns the sum of the md5 values of each input value, treating > > each md5 value as an unsigned 128-bit integer, and then producing the > > hexadecimal representation of the final sum. This should out-perform a > > solution based on numeric addition, and in typical cases, the result > > wouldn't be much longer than a regular md5 sum, and so would be easy > > to eyeball for differences. > > > > I've been playing around with the idea of an aggregate that computes > the sum of the md5 hashes of each of its inputs, which I've called > md5_total() for now, although I'm not particularly wedded to that > name. Comparing it with md5_agg() on a 100M row table (see attached > test script) produces interesting results: > > SELECT md5_agg(foo.*::text) > FROM (SELECT * FROM foo ORDER BY id) foo; > > 50bc42127fb9b028c9708248f835ed8f > > Time: 92960.021 ms > > SELECT md5_total(foo.*::text) FROM foo; > > 02faea7fafee4d253fc94cfae031afc43c03479c > > Time: 96190.343 ms > > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its > result is longer) but it seems like it would be very useful for > quickly comparing data in SQL, since its value is not dependent on the > row-order making it easier to use and better performing if there is no > usable index for ordering. > > Note, however, that if there is an index that can be used for > ordering, the performance is not necessarily better than md5_agg(), as > this example shows. There is a small additional overhead per row for > initialising the MD5 sums, and adding the results to the total, but I > think the biggest factor is that md5_total() is processing more data. > The reason is that MD5 works on 64-byte blocks, so the total amount of > data going through the core MD5 algorithm is each row's size is > rounded up to a multiple of 64. In this simple case it ends up > processing around 1.5 times as much data: > > SELECT sum(length(foo.*::text)) AS md5_agg, > sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo; > > md5_agg | md5_total > ------------+------------- > 8103815438 | 12799909248 > > although of course that overhead won't be as large on wider tables, > and even in this case the overall performance is still on a par with > md5_agg(). > > ISTM that both aggregates are potentially useful in different > situations. I would probably typically use md5_total() because of its > simplicity/order-independence and consistent performance, but > md5_agg() might also be useful when comparing with external data.
Few notes: - Index-scan over whole table is *very* bad for larger tables (few times bigger than available RAM). If you want to promote such use you should also warn against use on loaded server. - It's pointless to worry about overflow on 128-bit ints. Just let it happen. Adding complexity for that does not bring any advantage. - Using some faster 128-bit hash may be useful - check out CityHash or SpookyHash. You can get C implementation from pghashlib. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers