Re: Review [was Re: [HACKERS] MD5 aggregate]

2013-06-23 Thread Dean Rasheed
On 21 June 2013 21:04, David Fetter da...@fetter.org wrote:
 On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote:
 On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
  On 15 June 2013 10:22, Dean Rasheed dean.a.rash...@gmail.com wrote:
   There seem to be 2 separate directions that this could go, which
   really meet different requirements:
  
   1). Produce an unordered sum for SQL to compare 2 tables regardless of
   the order in which they are scanned. A possible approach to this might
   be something like an aggregate
  
   md5_total(text/bytea) returns text
  
   that returns the sum of the md5 values of each input value, treating
   each md5 value as an unsigned 128-bit integer, and then producing the
   hexadecimal representation of the final sum. This should out-perform a
   solution based on numeric addition, and in typical cases, the result
   wouldn't be much longer than a regular md5 sum, and so would be easy
   to eyeball for differences.
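   For illustration, roughly those semantics can be sketched today with
   nothing but existing built-ins, taking foo as a stand-in for the table
   being summarised: lift each row's md5 into numeric as an unsigned
   128-bit value and add them up. The result is order-independent, but it
   is also exactly the numeric-addition approach that a native aggregate
   should beat, and the total is left as a numeric rather than rendered
   back into hex:

   WITH c AS (SELECT 4294967296::numeric AS b),              -- b = 2^32
        per_row AS (SELECT md5(foo.*::text) AS h FROM foo)   -- one md5 per row
   SELECT sum( ('x' || lpad(substr(h,  1, 8), 16, '0'))::bit(64)::bigint::numeric * b*b*b
             + ('x' || lpad(substr(h,  9, 8), 16, '0'))::bit(64)::bigint::numeric * b*b
             + ('x' || lpad(substr(h, 17, 8), 16, '0'))::bit(64)::bigint::numeric * b
             + ('x' || lpad(substr(h, 25, 8), 16, '0'))::bit(64)::bigint::numeric
             ) AS md5_numeric_total
   FROM per_row CROSS JOIN c;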
  
 
  I've been playing around with the idea of an aggregate that computes
  the sum of the md5 hashes of each of its inputs, which I've called
  md5_total() for now, although I'm not particularly wedded to that
  name. Comparing it with md5_agg() on a 100M row table (see attached
  test script) produces interesting results:
 
  SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;
 
   50bc42127fb9b028c9708248f835ed8f
 
  Time: 92960.021 ms
 
  SELECT md5_total(foo.*::text) FROM foo;
 
   02faea7fafee4d253fc94cfae031afc43c03479c
 
  Time: 96190.343 ms
 
  Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
  result is longer), but it seems like it would be very useful for
  quickly comparing data in SQL, since its value does not depend on the
  row order, which makes it easier to use and better performing when
  there is no usable index for ordering.
 
  Note, however, that if there is an index that can be used for
  ordering, the performance is not necessarily better than md5_agg(), as
  this example shows. There is a small additional overhead per row for
  initialising the MD5 sums, and adding the results to the total, but I
  think the biggest factor is that md5_total() is processing more data.
  The reason is that MD5 works on 64-byte blocks, so the total amount of
  data going through the core MD5 algorithm is each row's size
  rounded up to a multiple of 64. In this simple case it ends up
  processing around 1.5 times as much data:
 
  SELECT sum(length(foo.*::text)) AS md5_agg,
 sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
 
    md5_agg   |  md5_total
  ------------+-------------
   8103815438 | 12799909248
 
  although of course that overhead won't be as large on wider tables,
  and even in this case the overall performance is still on a par with
  md5_agg().
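  As a quick cross-check of the "around 1.5 times" figure, the two sums
  above imply a ratio of roughly 1.58:

  SELECT round(12799909248::numeric / 8103815438, 2);   -- 1.58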
 
  ISTM that both aggregates are potentially useful in different
  situations. I would probably typically use md5_total() because of its
  simplicity/order-independence and consistent performance, but
  md5_agg() might also be useful when comparing with external data.
 
  Regards,
  Dean

 Performance review (skills needed: ability to time performance)

 Does the patch slow down simple tests?

 Yes.  For an MD5 checksum of some random data, here are
 results from master:

 shackle@postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,1)),
 postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,1))
 postgres-# select md5(a::text) FROM t2;
 shackle@postgres:5493=# \timing
 Timing is on.
 shackle@postgres:5493=# \g
 Time: 955.393 ms
 shackle@postgres:5493=# \g
 Time: 950.909 ms
 shackle@postgres:5493=# \g
 Time: 947.955 ms
 shackle@postgres:5493=# \g
 Time: 946.193 ms
 shackle@postgres:5493=# \g
 Time: 950.591 ms
 shackle@postgres:5493=# \g
 Time: 952.346 ms
 shackle@postgres:5493=# \g
 Time: 948.623 ms
 shackle@postgres:5493=# \g
 Time: 939.819 ms

 and here from master + the patch:

 WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,1)),
 t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,1))
 select md5(a::text) FROM t2;
 Time: 1129.178 ms
 shackle@postgres:5494=# \g
 Time: 1134.013 ms
 shackle@postgres:5494=# \g
 Time: 1124.387 ms
 shackle@postgres:5494=# \g
 Time: 1119.733 ms
 shackle@postgres:5494=# \g
 Time: 1104.906 ms
 shackle@postgres:5494=# \g
 Time: 1137.055 ms
 shackle@postgres:5494=# \g
 Time: 1128.977 ms
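 Averaging the runs above (roughly 949 ms on master vs 1125 ms with the
 patch), the slowdown on this md5() micro-benchmark works out to a little
 under 19%:

 SELECT round((1125.46 - 948.98) / 948.98 * 100, 1) AS pct_slowdown;   -- 18.6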
 

Review [was Re: [HACKERS] MD5 aggregate]

2013-06-21 Thread David Fetter
On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
 On 15 June 2013 10:22, Dean Rasheed dean.a.rash...@gmail.com wrote:
  There seem to be 2 separate directions that this could go, which
  really meet different requirements:
 
  1). Produce an unordered sum for SQL to compare 2 tables regardless of
  the order in which they are scanned. A possible approach to this might
  be something like an aggregate
 
  md5_total(text/bytea) returns text
 
  that returns the sum of the md5 values of each input value, treating
  each md5 value as an unsigned 128-bit integer, and then producing the
  hexadecimal representation of the final sum. This should out-perform a
  solution based on numeric addition, and in typical cases, the result
  wouldn't be much longer than a regular md5 sum, and so would be easy
  to eyeball for differences.
 
 
 I've been playing around with the idea of an aggregate that computes
 the sum of the md5 hashes of each of its inputs, which I've called
 md5_total() for now, although I'm not particularly wedded to that
 name. Comparing it with md5_agg() on a 100M row table (see attached
 test script) produces interesting results:
 
 SELECT md5_agg(foo.*::text)
   FROM (SELECT * FROM foo ORDER BY id) foo;
 
  50bc42127fb9b028c9708248f835ed8f
 
 Time: 92960.021 ms
 
 SELECT md5_total(foo.*::text) FROM foo;
 
  02faea7fafee4d253fc94cfae031afc43c03479c
 
 Time: 96190.343 ms
 
 Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
 result is longer), but it seems like it would be very useful for
 quickly comparing data in SQL, since its value does not depend on the
 row order, which makes it easier to use and better performing when
 there is no usable index for ordering.
 
 Note, however, that if there is an index that can be used for
 ordering, the performance is not necessarily better than md5_agg(), as
 this example shows. There is a small additional overhead per row for
 initialising the MD5 sums, and adding the results to the total, but I
 think the biggest factor is that md5_total() is processing more data.
 The reason is that MD5 works on 64-byte blocks, so the total amount of
 data going through the core MD5 algorithm is each row's size
 rounded up to a multiple of 64. In this simple case it ends up
 processing around 1.5 times as much data:
 
 SELECT sum(length(foo.*::text)) AS md5_agg,
sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
 
   md5_agg   |  md5_total
 ------------+-------------
  8103815438 | 12799909248
 
 although of course that overhead won't be as large on wider tables,
 and even in this case the overall performance is still on a par with
 md5_agg().
 
 ISTM that both aggregates are potentially useful in different
 situations. I would probably typically use md5_total() because of its
 simplicity/order-independence and consistent performance, but
 md5_agg() might also be useful when comparing with external data.
 
 Regards,
 Dean

Submission review:

Is the patch in a patch format which has context? (eg: context diff format)

Yes.

Does it apply cleanly to the current git master?

Yes.

Does it include reasonable tests, necessary doc patches, etc? 

Yes.

Usability review:

Does the patch actually implement that?

Yes.

Do we want that?

Enough do.

Do we already have it?

No.

Does it follow SQL spec, or the community-agreed behavior?

No, and yes, respectively.

Does it include pg_dump support (if applicable)?

N/A

Are there dangers?

Patch changes the MD5 implementation, which could
theoretically result in backward incompatibility if the
changes are not 100% backward-compatible.
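One cheap, independent spot-check (not a substitute for the patch's own
regression tests) is to confirm that the refactored code still reproduces
the RFC 1321 test vectors; both comparisons should return true on master
and on master + patch:

SELECT md5('')    = 'd41d8cd98f00b204e9800998ecf8427e' AS empty_ok,
       md5('abc') = '900150983cd24fb0d6963f7d28e17f72' AS abc_ok;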

Have all the bases been covered? 

Yes.

Feature test:

Does the feature work as advertised?

Yes.

Are there corner cases the author has failed to consider?

Not that I've found so far.

Are there any assertion failures or crashes?

No.

Performance review (skills needed: ability to time performance)

Does the patch slow down simple tests?

Yes.  For an MD5 checksum of some random data, here are
results from master:

shackle@postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,1)),
postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,1))
postgres-# select md5(a::text) FROM t2;
shackle@postgres:5493=# \timing 
Timing is on.
shackle@postgres:5493=# \g
Time: 955.393 ms
shackle@postgres:5493=# \g
Time: 950.909 ms
shackle@postgres:5493=# \g
Time: 947.955 ms
shackle@postgres:5493=# \g
Time: 946.193 ms
shackle@postgres:5493=# \g
Time: 950.591 ms
shackle@postgres:5493=# \g
Time: 952.346 ms
  

Re: Review [was Re: [HACKERS] MD5 aggregate]

2013-06-21 Thread David Fetter
On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote:
 On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
  On 15 June 2013 10:22, Dean Rasheed dean.a.rash...@gmail.com wrote:
   There seem to be 2 separate directions that this could go, which
   really meet different requirements:
  
   1). Produce an unordered sum for SQL to compare 2 tables regardless of
   the order in which they are scanned. A possible approach to this might
   be something like an aggregate
  
   md5_total(text/bytea) returns text
  
   that returns the sum of the md5 values of each input value, treating
   each md5 value as an unsigned 128-bit integer, and then producing the
   hexadecimal representation of the final sum. This should out-perform a
   solution based on numeric addition, and in typical cases, the result
   wouldn't be much longer than a regular md5 sum, and so would be easy
   to eyeball for differences.
  
  
  I've been playing around with the idea of an aggregate that computes
  the sum of the md5 hashes of each of its inputs, which I've called
  md5_total() for now, although I'm not particularly wedded to that
  name. Comparing it with md5_agg() on a 100M row table (see attached
  test script) produces interesting results:
  
  SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;
  
   50bc42127fb9b028c9708248f835ed8f
  
  Time: 92960.021 ms
  
  SELECT md5_total(foo.*::text) FROM foo;
  
   02faea7fafee4d253fc94cfae031afc43c03479c
  
  Time: 96190.343 ms
  
  Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
  result is longer), but it seems like it would be very useful for
  quickly comparing data in SQL, since its value does not depend on the
  row order, which makes it easier to use and better performing when
  there is no usable index for ordering.
  
  Note, however, that if there is an index that can be used for
  ordering, the performance is not necessarily better than md5_agg(), as
  this example shows. There is a small additional overhead per row for
  initialising the MD5 sums, and adding the results to the total, but I
  think the biggest factor is that md5_total() is processing more data.
  The reason is that MD5 works on 64-byte blocks, so the total amount of
  data going through the core MD5 algorithm is each row's size
  rounded up to a multiple of 64. In this simple case it ends up
  processing around 1.5 times as much data:
  
  SELECT sum(length(foo.*::text)) AS md5_agg,
 sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
  
    md5_agg   |  md5_total
  ------------+-------------
   8103815438 | 12799909248
  
  although of course that overhead won't be as large on wider tables,
  and even in this case the overall performance is still on a par with
  md5_agg().
  
  ISTM that both aggregates are potentially useful in different
  situations. I would probably typically use md5_total() because of its
  simplicity/order-independence and consistent performance, but
  md5_agg() might also be useful when comparing with external data.
  
  Regards,
  Dean

 Performance review (skills needed: ability to time performance)
 
 Does the patch slow down simple tests?
 
 Yes.  For an MD5 checksum of some random data, here are
 results from master:
 
 shackle@postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,1)),
 postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,1))
 postgres-# select md5(a::text) FROM t2;
 shackle@postgres:5493=# \timing 
 Timing is on.
 shackle@postgres:5493=# \g
 Time: 955.393 ms
 shackle@postgres:5493=# \g
 Time: 950.909 ms
 shackle@postgres:5493=# \g
 Time: 947.955 ms
 shackle@postgres:5493=# \g
 Time: 946.193 ms
 shackle@postgres:5493=# \g
 Time: 950.591 ms
 shackle@postgres:5493=# \g
 Time: 952.346 ms
 shackle@postgres:5493=# \g
 Time: 948.623 ms
 shackle@postgres:5493=# \g
 Time: 939.819 ms
 
 and here from master + the patch:
 
 WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,1)),
 t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,1))
 select md5(a::text) FROM t2;
 Time: 1129.178 ms
 shackle@postgres:5494=# \g
 Time: 1134.013 ms
 shackle@postgres:5494=# \g
 Time: 1124.387 ms
 shackle@postgres:5494=# \g
 Time: 1119.733 ms
 shackle@postgres:5494=# \g
 Time: 1104.906 ms
 shackle@postgres:5494=# \g
 Time: 1137.055 ms
 shackle@postgres:5494=# \g
 Time: 1128.977 ms
 shackle@postgres:5494=# \g