On 15 June 2013 10:22, Dean Rasheed <dean.a.rash...@gmail.com> wrote: > There seem to be 2 separate directions that this could go, which > really meet different requirements: > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of > the order in which they are scanned. A possible approach to this might > be something like an aggregate > > md5_total(text/bytea) returns text > > that returns the sum of the md5 values of each input value, treating > each md5 value as an unsigned 128-bit integer, and then producing the > hexadecimal representation of the final sum. This should out-perform a > solution based on numeric addition, and in typical cases, the result > wouldn't be much longer than a regular md5 sum, and so would be easy > to eyeball for differences. >
I've been playing around with the idea of an aggregate that computes the sum of the md5 hashes of each of its inputs, which I've called md5_total() for now, although I'm not particularly wedded to that name. Comparing it with md5_agg() on a 100M row table (see attached test script) produces interesting results: SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo; 50bc42127fb9b028c9708248f835ed8f Time: 92960.021 ms SELECT md5_total(foo.*::text) FROM foo; 02faea7fafee4d253fc94cfae031afc43c03479c Time: 96190.343 ms Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its result is longer) but it seems like it would be very useful for quickly comparing data in SQL, since its value is not dependent on the row-order making it easier to use and better performing if there is no usable index for ordering. Note, however, that if there is an index that can be used for ordering, the performance is not necessarily better than md5_agg(), as this example shows. There is a small additional overhead per row for initialising the MD5 sums, and adding the results to the total, but I think the biggest factor is that md5_total() is processing more data. The reason is that MD5 works on 64-byte blocks, so the total amount of data going through the core MD5 algorithm is each row's size is rounded up to a multiple of 64. In this simple case it ends up processing around 1.5 times as much data: SELECT sum(length(foo.*::text)) AS md5_agg, sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo; md5_agg | md5_total ------------+------------- 8103815438 | 12799909248 although of course that overhead won't be as large on wider tables, and even in this case the overall performance is still on a par with md5_agg(). ISTM that both aggregates are potentially useful in different situations. I would probably typically use md5_total() because of its simplicity/order-independence and consistent performance, but md5_agg() might also be useful when comparing with external data. Regards, Dean
md5_agg_v2.patch
Description: Binary data
md5-100m-row-test.sql
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers